Distributed web crawling

Distributed web crawling is a distributed computing technique whereby Internet search engines employ many computers to index the Internet via web crawling. Such systems may allow users to voluntarily offer their own computing and bandwidth resources towards crawling web pages. By spreading the load of these tasks across many computers, costs that would otherwise be spent on maintaining large computing clusters are avoided.[1]

Types

Cho and Garcia-Molina (2002) studied two types of assignment policies:

Dynamic assignment

With this type of policy, a central server assigns new URLs to different crawlers dynamically. This allows the central server to, for instance, dynamically balance the load of each crawler.

With dynamic assignment, systems can typically also add or remove downloader processes. The central server may become the bottleneck, so for large crawls most of the workload must be transferred to the distributed crawling processes.

Shkapenyuk and Suel (2002) described two configurations of crawling architectures with dynamic assignment:

  • A small crawler configuration, in which there is a central DNS resolver and central queues per Web site, and distributed downloaders.
  • A large crawler configuration, in which the DNS resolver and the queues are also distributed.
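
A minimal Python sketch may help illustrate the small crawler configuration under dynamic assignment; the class and method names below are purely illustrative assumptions, not taken from Shkapenyuk and Suel's system. A central server keeps one queue per website and hands a URL to whichever downloader process asks for work, which is what lets it balance the load dynamically.

    import collections
    from urllib.parse import urlparse

    class CentralDispatcher:
        """Central server of a small-crawler configuration: one queue per
        website, with distributed downloader processes pulling work from it."""

        def __init__(self):
            self.site_queues = collections.defaultdict(collections.deque)

        def add_url(self, url):
            # Queue the URL under its website so per-site ordering is preserved.
            self.site_queues[urlparse(url).netloc].append(url)

        def next_url(self):
            """Dynamically assign the next URL to the requesting downloader,
            taking it from the website with the largest backlog."""
            if not self.site_queues:
                return None
            host = max(self.site_queues, key=lambda h: len(self.site_queues[h]))
            url = self.site_queues[host].popleft()
            if not self.site_queues[host]:
                del self.site_queues[host]
            return url

        def report(self, discovered_urls):
            """Downloaders send newly discovered links back to the server,
            which queues them for later assignment."""
            for url in discovered_urls:
                self.add_url(url)

Because every new URL and every request for work passes through this single object, the same sketch also shows why the central server can become a bottleneck as the crawl grows.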

Static assignment

With this type of policy, there is a fixed rule stated from the beginning of the crawl that defines how to assign new URLs to the crawlers.

For static assignment, a hashing function can be used to transform URLs (or, even better, complete website names) into a number that corresponds to the index of the corresponding crawling process. Because external links will point from a website assigned to one crawling process to a website assigned to a different one, some exchange of URLs must occur.
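
For example, a static assignment can be computed by hashing the website name and taking the result modulo the number of crawling processes. The function below is a sketch of that idea; the choice of SHA-1 and the function name are illustrative assumptions.

    import hashlib
    from urllib.parse import urlparse

    def assigned_process(url, num_processes):
        """Return the index of the crawling process responsible for this URL.
        Hashing the website name (rather than the full URL) keeps all pages
        of a site on the same process."""
        host = urlparse(url).netloc
        digest = hashlib.sha1(host.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % num_processes

A crawling process keeps a URL only if its own index equals the assigned index; otherwise the URL must be handed to the responsible process, which is the exchange described next.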

To reduce the overhead of exchanging URLs between crawling processes, the exchange should be done in batches of several URLs at a time, and the most cited URLs in the collection should be known by all crawling processes before the crawl (e.g., using data from a previous crawl) (Cho and Garcia-Molina, 2002).
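
A sketch of such batching is shown below; the send callback, the batch size, and the class name are illustrative assumptions, and the well-known URLs from a previous crawl are simply held in a set and never exchanged.

    class URLExchanger:
        """Buffers URLs destined for other crawling processes and forwards
        them in batches, so the exchange cost is paid per batch, not per URL."""

        def __init__(self, my_index, assign, send, batch_size=1000,
                     well_known_urls=frozenset()):
            self.my_index = my_index
            self.assign = assign              # maps a URL to a process index
            self.send = send                  # send(dest_index, list_of_urls)
            self.batch_size = batch_size
            self.well_known_urls = well_known_urls
            self.buffers = {}                 # dest_index -> pending URLs

        def route(self, url):
            """Return the URL if this process should crawl it; otherwise buffer
            it for the responsible process and flush once the batch is full."""
            dest = self.assign(url)
            if dest == self.my_index:
                return url
            if url not in self.well_known_urls:
                batch = self.buffers.setdefault(dest, [])
                batch.append(url)
                if len(batch) >= self.batch_size:
                    self.send(dest, batch)
                    self.buffers[dest] = []
            return None

        def flush(self):
            """Send any partially filled batches, e.g. at the end of a round."""
            for dest, batch in self.buffers.items():
                if batch:
                    self.send(dest, batch)
            self.buffers = {}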

An effective assignment function must have three main properties: each crawling process should get approximately the same number of hosts (balancing property); if the number of crawling processes grows, the number of hosts assigned to each process must shrink (contravariance property); and the assignment must be able to add and remove crawling processes dynamically. To achieve all of these properties, Boldi et al. (2004) propose the use of consistent hashing, which replicates the buckets so that adding or removing a bucket does not require re-hashing the whole table.
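
A compact sketch of consistent hashing with replicated buckets follows; the number of replicas per process and the use of SHA-1 are illustrative choices rather than the parameters used by Boldi et al., and process identifiers are assumed to be strings.

    import bisect
    import hashlib

    def _point(value):
        # Map a string to a position on the hash ring.
        return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")

    class ConsistentHashRing:
        """Assigns hosts to crawling processes.  Each process owns several
        replicated buckets on the ring, so adding or removing a process only
        moves the hosts that fall into its own buckets."""

        def __init__(self, replicas=100):
            self.replicas = replicas
            self.ring = []    # sorted list of (point, process_id) pairs

        def add_process(self, process_id):
            for r in range(self.replicas):
                bisect.insort(self.ring, (_point(f"{process_id}#{r}"), process_id))

        def remove_process(self, process_id):
            self.ring = [entry for entry in self.ring if entry[1] != process_id]

        def assign(self, host):
            """Return the process responsible for this host: the owner of the
            first bucket at or after the host's position on the ring."""
            if not self.ring:
                raise ValueError("no crawling processes registered")
            index = bisect.bisect_left(self.ring, (_point(host), ""))
            return self.ring[index % len(self.ring)][1]

When a process leaves the ring, only the hosts that hashed to its buckets move to the next buckets clockwise; every other host keeps its crawler, which is what makes adding and removing crawling processes cheap.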

Implementations

As of 2003, most modern commercial search engines use this technique. Google and Yahoo use thousands of individual computers to crawl the Web.

Newer projects are attempting to use a less structured, more ad hoc form of collaboration by enlisting volunteers to join the effort using, in many cases, their home or personal computers. LookSmart is the largest search engine to use this technique, which powers its Grub distributed web-crawling project.

This solution uses computers that are connected to the Internet to crawl Internet addresses in the background. After crawled web pages are downloaded, they are compressed and sent back, together with a status flag (e.g., changed, new, down, redirected), to the powerful central servers. The servers, which manage a large database, send out new URLs to clients for testing.
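
In outline, one crawl step on a volunteer's machine could look like the sketch below. The status flags follow the examples given above (plus an assumed "unchanged" flag), and the checksum-based change detection is an illustrative assumption, not a description of Grub's actual protocol.

    import urllib.request
    import zlib

    def crawl_once(url, previous_checksum=None):
        """Fetch one URL assigned by the central servers and return a status
        flag plus a compressed copy of the page ready for upload."""
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                final_url = response.geturl()
                body = response.read()
        except Exception:
            return "down", None                   # the page could not be fetched

        if final_url != url:
            status = "redirected"
        elif previous_checksum is None:
            status = "new"
        elif zlib.crc32(body) != previous_checksum:
            status = "changed"
        else:
            status = "unchanged"                  # assumed flag; not in the text

        return status, zlib.compress(body)        # compressed page for the servers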

Drawbacks

According to the FAQ of Nutch, an open-source search engine, the savings in bandwidth from distributed web crawling are not significant, since "A successful search engine requires more bandwidth to upload query result pages than its crawler needs to download pages...".

Sources

  1. Chen Ding et al. (2010). Network and Parallel Computing: IFIP International Conference, NPC 2010. p. 91.
