Before attempting this problem, you should be comfortable with graph traversal (DFS and BFS), hash sets for tracking visited nodes, and basic string manipulation for parsing URLs.
Web crawling maps naturally onto graph traversal: URLs are nodes and the links between them are edges. The key insight is that we need to explore every URL reachable from the starting URL while staying within the same hostname. DFS lets us follow links deeply before backtracking, using a visited set to avoid infinite loops caused by cycles.
1. Initialize an empty visited set.
2. Define a dfs function that marks the current URL as visited.
3. For each URL returned by htmlParser.getUrls(), check if it has the same hostname and has not been visited.
4. Recursively call dfs on unvisited URLs with matching hostname.
5. Return the visited set containing all crawled URLs.

class Solution:
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        def get_hostname(url):
            # split url by slashes
            # for instance, "http://example.org/foo/bar" will be split into
            # "http:", "", "example.org", "foo", "bar"
            # the hostname is the 2-nd (0-indexed) element
            return url.split('/')[2]

        start_hostname = get_hostname(startUrl)
        visited = set()

        def dfs(url, htmlParser):
            visited.add(url)
            for next_url in htmlParser.getUrls(url):
                if get_hostname(next_url) == start_hostname and next_url not in visited:
                    dfs(next_url, htmlParser)

        dfs(startUrl, htmlParser)
        return visited

The time complexity is O(m · L), where m is the number of edges in the graph and L is the maximum length of a URL (urls[i].length).
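Outside the LeetCode judge (which supplies HtmlParser and pre-imports List), the DFS solution above can be exercised with a minimal, hypothetical stub; FakeHtmlParser and the link map below are invented purely for illustration, and from typing import List must appear above the Solution class when running locally.

from typing import List

class FakeHtmlParser:
    # Hypothetical stand-in for the judge's HtmlParser interface.
    def __init__(self, link_map):
        self.link_map = link_map                 # url -> list of urls it links to

    def getUrls(self, url: str) -> List[str]:
        return self.link_map.get(url, [])

# Made-up link structure: two pages on example.org and one external page.
links = {
    "http://example.org": ["http://example.org/a", "http://other.com"],
    "http://example.org/a": ["http://example.org"],
}
parser = FakeHtmlParser(links)
print(sorted(Solution().crawl("http://example.org", parser)))
# ['http://example.org', 'http://example.org/a']  (http://other.com is excluded)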
BFS provides an alternative traversal that explores URLs level by level, visiting all URLs at distance 1 before distance 2, and so on. This approach uses a queue instead of recursion and naturally discovers URLs in order of their distance from the starting URL.
1. Initialize a queue containing the start URL and a visited set containing the start URL.
2. While the queue is not empty, dequeue a URL; for each URL returned by htmlParser.getUrls(), check if it has the same hostname and has not been visited.
3. Add each qualifying URL to the queue and to the visited set.
4. Return the visited set containing all crawled URLs after the queue is exhausted.

class Solution:
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        def get_hostname(url):
            # split url by slashes
            # for instance, "http://example.org/foo/bar" will be split into
            # "http:", "", "example.org", "foo", "bar"
            # the hostname is the 2-nd (0-indexed) element
            return url.split('/')[2]

        start_hostname = get_hostname(startUrl)
        q = collections.deque([startUrl])
        visited = set([startUrl])

        while q:
            url = q.popleft()
            for next_url in htmlParser.getUrls(url):
                if get_hostname(next_url) == start_hostname and next_url not in visited:
                    q.append(next_url)
                    visited.add(next_url)
        return visited

The time complexity is O(m · L) and the space complexity is O(n · L), where m is the number of edges in the graph, L is the maximum length of a URL (urls[i].length), and n is the total number of URLs (urls.length).
Extracting the hostname incorrectly is a frequent mistake. URLs follow the format http://hostname/path, so after splitting on "/" the hostname sits at index 2 (the third element). Off-by-one errors, or not accounting for URLs without a path, can cause the crawler to include or exclude the wrong domains.
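As a quick sanity check of the split-based helper, the snippet below also shows urllib.parse.urlparse as a more robust alternative; the URLs here are made up purely for illustration.

from urllib.parse import urlparse

def get_hostname(url):
    # "http://example.org/foo/bar".split('/') -> ["http:", "", "example.org", "foo", "bar"]
    # Index 2 holds the hostname, and this also works when there is no path,
    # since "http://example.org".split('/') -> ["http:", "", "example.org"].
    return url.split('/')[2]

assert get_hostname("http://example.org/foo/bar") == "example.org"
assert get_hostname("http://example.org") == "example.org"            # no path
assert get_hostname("http://news.example.org/a") == "news.example.org"

# urlparse handles the same cases without manual index arithmetic.
assert urlparse("http://example.org/foo/bar").netloc == "example.org"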
In BFS, mark a URL as visited when you add it to the queue, not when you dequeue it. If you wait until dequeuing, the same URL can be appended to the queue many times before it is ever marked, leading to redundant work and a queue that can grow far beyond the number of distinct URLs.
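A minimal sketch of the correct ordering, assuming same_host is a hypothetical predicate standing in for the hostname comparison and htmlParser follows the judge's interface:

import collections

def bfs_crawl(startUrl, htmlParser, same_host):
    q = collections.deque([startUrl])
    visited = {startUrl}                   # marked at enqueue time
    while q:
        url = q.popleft()
        for next_url in htmlParser.getUrls(url):
            if same_host(next_url) and next_url not in visited:
                visited.add(next_url)      # mark before processing; marking only
                                           # after popleft() lets the same URL be
                                           # enqueued many times first
                q.append(next_url)
    return list(visited)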
The crawler must only visit URLs with the same hostname as the starting URL. Failing to filter URLs by hostname before crawling will cause the solution to explore unrelated domains, violating the problem constraints and potentially causing timeouts or incorrect results.
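For instance, reusing the get_hostname helper from the solutions above with made-up URLs in the spirit of the problem's examples, only the same-hostname candidates survive the filter:

start_hostname = get_hostname("http://news.yahoo.com/news/topics/")   # "news.yahoo.com"

candidates = [
    "http://news.yahoo.com/us",      # same hostname  -> crawled
    "http://news.yahoo.com",         # same hostname  -> crawled
    "http://news.google.com",        # different host -> skipped
]
to_crawl = [u for u in candidates if get_hostname(u) == start_hostname]
print(to_crawl)   # ['http://news.yahoo.com/us', 'http://news.yahoo.com']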