Focused crawler

A focused crawler, or topical crawler, is a web crawler that attempts to download only web pages relevant to a predefined topic or set of topics. Topical crawling generally assumes that only the topic is given, while focused crawling also assumes that some labeled examples of relevant and non-relevant pages are available. Topical crawling was first introduced by Menczer [Menczer, F. (1997). ARACHNID: Adaptive Retrieval Agents Choosing Heuristic Neighborhoods for Information Discovery (http://informatics.indiana.edu/fil/Papers/ICML.ps). In D. Fisher (ed.), Proceedings of the 14th International Conference on Machine Learning (ICML97). Morgan Kaufmann.] [Menczer, F. and Belew, R.K. (1998). Adaptive Information Agents in Distributed Textual Environments (http://informatics.indiana.edu/fil/Papers/AA98.ps). In K. Sycara and M. Wooldridge (eds.), Proceedings of the 2nd International Conference on Autonomous Agents (Agents '98). ACM Press.]. Focused crawling was first introduced by Chakrabarti et al. [Chakrabarti, S., van den Berg, M., and Dom, B. (1999). Focused crawling: A new approach to topic-specific web resource discovery (http://citeseer.ist.psu.edu/293642.html). Computer Networks, 31(11–16):1623–1640.].

Strategies

Ideally, a focused crawler would download only web pages that are relevant to a particular topic and avoid downloading all others. To do so, it must predict how likely the page behind a link is to be relevant before actually downloading that page. One possible predictor is the anchor text of links; this was the approach taken by Pinkerton [Pinkerton, B. (1994). Finding what people want: Experiences with the WebCrawler (http://www.thinkpink.com/bp/WebCrawler/WWW94.html). In Proceedings of the First World Wide Web Conference, Geneva, Switzerland.] in a crawler developed in the early days of the Web. In a review of topical crawling algorithms, Menczer et al. [Menczer, F., Pant, G., and Srinivasan, P. (2004). Topical Web Crawlers: Evaluating Adaptive Algorithms (http://doi.acm.org/10.1145/1031114.1031117). ACM Trans. on Internet Technology 4(4): 378–419.] show that such simple strategies are very effective for short crawls, while more sophisticated techniques such as reinforcement learning and evolutionary adaptation can give the best performance over longer crawls. Diligenti et al. [Diligenti, M., Coetzee, F., Lawrence, S., Giles, C. L., and Gori, M. (2000). Focused crawling using context graphs (http://nautilus.dii.unisi.it/pubblicazioni/files/conference/2000-Diligenti-VLDB.pdf). In Proceedings of the 26th International Conference on Very Large Databases (VLDB), pages 527–534, Cairo, Egypt.] propose using the complete content of the pages already visited to infer the similarity between the driving query and the pages that have not yet been visited. The performance of a focused crawler depends mostly on the richness of links in the specific topic being searched, and focused crawling usually relies on a general web search engine to provide starting points.
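
The anchor-text strategy described above can be illustrated with a small best-first crawl. The following is only a minimal sketch under stated assumptions, not the method of any of the cited systems: the scoring rule (fraction of topic terms found in the anchor text), the function names `anchor_score` and `focused_crawl`, and the in-memory `TOY_WEB` graph that stands in for real HTTP downloads are all hypothetical choices made for illustration.

```python
import heapq
import re
from itertools import count


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def anchor_score(anchor_text, topic_terms):
    """Assumed relevance estimate: fraction of topic terms present in the anchor text."""
    if not topic_terms:
        return 0.0
    return len(tokenize(anchor_text) & topic_terms) / len(topic_terms)


def focused_crawl(seeds, fetch_links, topic, max_pages):
    """Best-first crawl: always visit the queued URL with the highest anchor score.

    fetch_links(url) must return a list of (anchor_text, target_url) pairs.
    """
    topic_terms = tokenize(topic)
    tie = count()  # insertion counter, breaks ties between equal scores
    # heapq is a min-heap, so scores are stored negated; seeds get top priority.
    frontier = [(-1.0, next(tie), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, _, url = heapq.heappop(frontier)
        visited.append(url)
        for anchor, target in fetch_links(url):
            if target not in seen:
                seen.add(target)
                score = anchor_score(anchor, topic_terms)
                heapq.heappush(frontier, (-score, next(tie), target))
    return visited


# Hypothetical toy link graph used in place of real page downloads.
TOY_WEB = {
    "seed": [
        ("cooking recipes", "food"),
        ("web crawler design", "crawl"),
        ("sports news", "sport"),
    ],
    "crawl": [("focused web crawler", "focused"), ("contact us", "contact")],
}


def toy_fetch(url):
    """Return the outgoing (anchor_text, target_url) links of a toy page."""
    return TOY_WEB.get(url, [])


order = focused_crawl(["seed"], toy_fetch, "focused web crawler", max_pages=3)
print(order)  # ['seed', 'crawl', 'focused']
```

With a budget of three pages, the sketch skips the off-topic links and follows the two whose anchor text shares terms with the topic. A real crawler would replace `toy_fetch` with an HTTP fetch and HTML link extraction, and could replace the term-overlap score with a trained classifier or an adaptive method of the kind discussed above.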

See also

*Spider trap
*Web crawler


Wikimedia Foundation. 2010.
