Web harvesting

Web harvesting is an implementation of a Web crawler that uses human expertise or machine guidance to direct the crawler to URLs which compose a specialized collection or set of knowledge. Web harvesting can be thought of as focused or directed Web crawling.

Purpose

Web harvesting allows Web-based search and retrieval applications, commonly referred to as search engines, to index content that is pertinent to the audience for which the harvest is intended. Such content is thus virtually integrated and made searchable as a separate Web application. General-purpose search engines, such as Google and Yahoo!, index all possible links they encounter from the origin of their crawl. In contrast, search engines based on Web harvesting index only the URLs to which they are directed. This strategy yields a searchable application that is faster, owing to the reduced size of the index, and one that provides higher-quality, more selective results, since the indexed URLs are pre-filtered for the topic or domain of interest. In effect, harvesting makes otherwise isolated islands of information searchable as if they were an integrated whole.

Process

Web harvesting begins by identifying and specifying, as input to a computer program, a list of URLs that defines a specialized collection or set of knowledge. The program then downloads the content at these URLs. Embedded hyperlinks that are encountered can be either followed or ignored, depending on human or machine guidance. A key difference between Web harvesting and general-purpose Web crawling is that in Web harvesting the crawl depth is defined in advance, and the crawl need not recursively follow URLs until all links have been exhausted. The downloaded content is then indexed by the search engine application and offered to information customers as a searchable Web application. Information customers can then search the Web application and follow hyperlinks to the original URLs that meet their search criteria.
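The process above — seed URLs as input, a follow-or-ignore decision for each embedded hyperlink, and a bounded crawl depth — can be sketched as a small harvester. This is a minimal illustration, not a production crawler; the `follow` predicate stands in for the human or machine guidance the article describes.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
import urllib.request


class LinkParser(HTMLParser):
    """Collects the href targets of <a> tags in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def harvest(seed_urls, follow, max_depth=2):
    """Breadth-first harvest: download each seed URL, then follow
    embedded hyperlinks only while follow(url) approves them and
    the crawl stays within max_depth (the pre-defined crawl depth)."""
    seen = set(seed_urls)
    queue = deque((url, 0) for url in seed_urls)
    pages = {}  # url -> raw HTML, ready to hand to an indexer
    while queue:
        url, depth = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # unreachable pages are simply skipped
        pages[url] = html
        if depth >= max_depth:
            continue  # bounded depth: links are NOT exhausted recursively
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)
            if link not in seen and follow(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

The returned `pages` mapping is what a search-engine application would then index; the `follow` callable is where a curator's whitelist or an automated relevance test plugs in.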

Focused web harvesting

Focused web harvesting is similar to targeted web crawling. Instead of letting a general-purpose crawler harvest the web indiscriminately, the mechanism operates under certain pre-defined conditions that specify the information to be collected [L.T. Handoko, "A new approach for scientific data dissemination in developing countries: a case of Indonesia", Proceedings of the UN/ESA/NASA Workshops on Basic Space Science and the International Heliophysical Year, [http://arxiv.org/abs/0711.2842 arXiv:0711.2842] (2007).] [Z. Akbar and L.T. Handoko, "A Simple Mechanism for Focused Web-harvesting", Proceedings of the International Conference on Advanced Computational Intelligence and Its Applications, [http://arxiv.org/abs/0809.0723 arXiv:0809.0723] (2008).]. In particular, this mechanism is intended to realize indirect data integration. The first implementation of this kind of data integration can be found at the Indonesian Scientific Index (ISI) [ [http://www.isi.lipi.go.id Indonesian Scientific Index] ], which integrates all information related to science and technology in Indonesia.
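A "pre-defined condition" of the kind described above can be as simple as a predicate combining a host whitelist with a topic test. The hosts and terms below are hypothetical examples, not taken from the cited papers:

```python
from urllib.parse import urlparse


def make_condition(allowed_hosts, topic_terms):
    """Build a pre-defined harvesting condition (illustrative):
    accept a page only if its URL is on an approved host and its
    text mentions at least one of the topic terms."""
    def condition(url, text):
        host = urlparse(url).netloc
        lowered = text.lower()
        return host in allowed_hosts and any(
            term in lowered for term in topic_terms
        )
    return condition
```

A harvester for a national science index might then use something like `make_condition({"www.example.org"}, ["physics", "chemistry"])` to decide which downloaded pages enter the integrated collection.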

References

See also

*Web crawler
*Search engine


Wikimedia Foundation. 2010.
