Prerequisites

Before attempting this problem, you should be comfortable with:

  • Graph Traversal (DFS/BFS) - Web pages form a graph where URLs are nodes and links are edges; traversal explores all reachable pages
  • Hash Sets - Used to track visited URLs and prevent infinite loops from cycles
  • String Parsing - Extracting hostnames from URLs to filter which pages to crawl

Intuition

Web crawling is naturally a graph traversal problem: URLs are nodes and links between pages are edges. The key insight is that we need to explore every URL reachable from the starting URL while staying within the same hostname. DFS follows links as deep as possible before backtracking, using a visited set to avoid infinite loops on cycles.

Algorithm

  1. Create a helper function that extracts the hostname from a URL by splitting on slashes and taking the third element (index 2, since index 0 is "http:" and index 1 is the empty string between the two slashes).
  2. Extract the starting hostname and initialize an empty visited set.
  3. Define a recursive dfs function that marks the current URL as visited.
  4. For each URL returned by htmlParser.getUrls(), check if it has the same hostname and has not been visited.
  5. Recursively call dfs on unvisited URLs with matching hostname.
  6. Return the URLs collected in the visited set, converted to a list.

from typing import List

class Solution:
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        def get_hostname(url):
            # Splitting "http://example.org/foo/bar" on '/' yields
            # ["http:", "", "example.org", "foo", "bar"], so the hostname
            # is element 2 (0-indexed).
            return url.split('/')[2]

        start_hostname = get_hostname(startUrl)
        visited = set()

        def dfs(url):
            visited.add(url)
            for next_url in htmlParser.getUrls(url):
                # Follow a link only if it stays on the starting hostname
                # and has not been crawled yet.
                if get_hostname(next_url) == start_hostname and next_url not in visited:
                    dfs(next_url)

        dfs(startUrl)
        return list(visited)

Time & Space Complexity

  • Time complexity: O(m · l)
  • Space complexity: O(m · l)

Where m is the number of edges in the graph and l is the maximum length of a URL (urls[i].length). Each edge is processed once, and every hostname extraction, comparison, and set operation on a URL costs O(l).


Intuition

BFS provides an alternative traversal that explores URLs level by level, visiting all URLs at distance 1 before those at distance 2, and so on. It uses a queue instead of recursion and naturally discovers URLs in order of their distance from the starting URL.

Algorithm

  1. Create a helper function to extract the hostname from a URL.
  2. Extract the starting hostname, initialize a queue with the start URL, and create a visited set containing the start URL.
  3. While the queue is not empty, dequeue a URL.
  4. For each URL returned by htmlParser.getUrls(), check if it has the same hostname and has not been visited.
  5. If valid, add the URL to both the queue and the visited set.
  6. After the queue is exhausted, return the URLs collected in the visited set, converted to a list.

import collections
from typing import List

class Solution:
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        def get_hostname(url):
            # Splitting "http://example.org/foo/bar" on '/' yields
            # ["http:", "", "example.org", "foo", "bar"], so the hostname
            # is element 2 (0-indexed).
            return url.split('/')[2]

        start_hostname = get_hostname(startUrl)
        q = collections.deque([startUrl])
        visited = {startUrl}

        while q:
            url = q.popleft()
            for next_url in htmlParser.getUrls(url):
                # Mark as visited at enqueue time so each URL enters
                # the queue at most once.
                if get_hostname(next_url) == start_hostname and next_url not in visited:
                    visited.add(next_url)
                    q.append(next_url)

        return list(visited)

Time & Space Complexity

  • Time complexity: O(m · l)
  • Space complexity: O(n · l)

Where m is the number of edges in the graph, l is the maximum length of a URL (urls[i].length), and n is the total number of URLs (urls.length). The queue and the visited set together hold at most n URLs, each of length up to l.


Common Pitfalls

Incorrect Hostname Extraction

Extracting the hostname incorrectly is a frequent mistake. URLs in this problem follow the format http://hostname/path, so splitting on / puts the hostname at index 2, the third element (index 0 is "http:" and index 1 is the empty string between the two slashes). An off-by-one index silently breaks every hostname comparison, and the chosen index must also work for URLs with no path at all, such as http://example.org.
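
A quick, self-contained sanity check of the indexing; the URLs here are illustrative only:

def get_hostname(url):
    # Index 2 holds the hostname; index 0 is "http:" and index 1 is "".
    return url.split('/')[2]

assert get_hostname("http://example.org/foo/bar") == "example.org"
assert get_hostname("http://example.org") == "example.org"  # no path: index 2 still works
assert "http://example.org".split('/')[1] == ""  # the usual off-by-one landing spot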

Not Marking URLs as Visited Before Adding to Queue

In BFS, you must mark a URL as visited when adding it to the queue, not when processing it. If you wait until dequeuing, the same URL can be enqueued by several different parent pages, leading to duplicate queue entries, redundant htmlParser.getUrls() calls, and, in densely linked graphs, a queue that grows far beyond the number of distinct URLs.
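
A minimal sketch of the correct ordering, using a hypothetical links dictionary in place of htmlParser.getUrls() (all names here are made up for illustration):

import collections

# Hypothetical link graph: both "b" and "c" link to "d".
links = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

# Correct ordering: mark at enqueue time, so "d" enters the queue once.
visited = {"a"}
q = collections.deque(["a"])
enqueues = 0
while q:
    url = q.popleft()
    for nxt in links[url]:
        if nxt not in visited:
            visited.add(nxt)  # mark before processing, not after dequeuing
            q.append(nxt)
            enqueues += 1

assert enqueues == 3  # b, c, d each enqueued exactly once

# Had we marked only at dequeue time, "d" would have been enqueued by both
# "b" and "c", and its links would have been fetched twice.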

Visiting URLs Outside the Starting Hostname

The crawler must visit only URLs with the same hostname as the starting URL. Failing to filter by hostname before enqueuing or recursing lets the solution wander into unrelated domains, which violates the problem constraints and can cause timeouts or incorrect results.
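
As a small illustration with made-up URLs, exact string comparison on the extracted hostname keeps the crawl in scope; any difference in the hostname string, including a subdomain, puts a URL out of scope:

start_hostname = "example.org"  # illustrative
candidates = [
    "http://example.org/about",       # same hostname: crawl
    "http://example.com/about",       # different hostname: skip
    "http://news.example.org/story",  # subdomain, so a different hostname: skip
]

in_scope = [u for u in candidates if u.split('/')[2] == start_hostname]
assert in_scope == ["http://example.org/about"]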