Nutch
- Developer(s): Apache Software Foundation
- Stable release: 1.3 / June 7, 2011
- Development status: Active
- Written in: Java
- Operating system: Cross-platform
- Type: Search engine
- License: Apache License 2.0
- Website: nutch.apache.org
Nutch is an effort to build an open source web search engine that uses Lucene Java for its search and index components.
Features
Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.
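The plug-in idea can be illustrated with a minimal sketch. It is written in Python for brevity and is not Nutch's actual extension mechanism: the `PARSERS` registry, the `register_parser` decorator and both parser functions are hypothetical names, and the HTML handling is deliberately crude. It only shows how parsers for new media types can be added without changing the dispatching code.

```python
import re
from typing import Callable, Dict

# Hypothetical registry mapping a media type to a parser function.
# These names are illustrative only and are not part of Nutch's API.
PARSERS: Dict[str, Callable[[bytes], str]] = {}


def register_parser(media_type: str):
    """Return a decorator that registers a parser plug-in for one media type."""
    def decorator(func: Callable[[bytes], str]) -> Callable[[bytes], str]:
        PARSERS[media_type] = func
        return func
    return decorator


@register_parser("text/plain")
def parse_plain(content: bytes) -> str:
    # Plain text only needs decoding.
    return content.decode("utf-8", errors="replace")


@register_parser("text/html")
def parse_html(content: bytes) -> str:
    # Crude tag stripping, for illustration only; a real parser would do far more.
    text = content.decode("utf-8", errors="replace")
    return re.sub(r"<[^>]+>", "", text)


def parse(media_type: str, content: bytes) -> str:
    """Dispatch to whichever plug-in is registered for the media type."""
    try:
        parser = PARSERS[media_type]
    except KeyError:
        raise ValueError(f"no parser plug-in for {media_type}") from None
    return parser(content)
```

Under these assumptions, supporting another media type means registering one more function; `parse` itself stays unchanged.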
The fetcher ("robot" or "web crawler") has been written from scratch specifically for this project.
History
Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella.
In June 2003, a successful 100-million-page demonstration system was developed. To meet the multimachine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. The two facilities were later spun out into their own subproject, called Hadoop.
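As a rough illustration of why a MapReduce facility suits indexing, the following single-process sketch builds an inverted index in map, shuffle and reduce steps. It is not Nutch or Hadoop code: the function names and the whitespace tokenizer are assumptions made for the example, and a real system would distribute each phase across machines.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(term, doc_ids):
    """Collapse all document ids for one term into a sorted posting list."""
    return term, sorted(doc_ids)


def build_index(docs):
    """Run the three steps in one process over a dict of doc_id -> text."""
    pairs = [
        pair
        for doc_id, text in docs.items()
        for pair in map_phase(doc_id, text)
    ]
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())
```

Because each document is mapped independently and each term is reduced independently, both phases can be split across many machines, which is the property the crawl and index tasks rely on.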
In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April 2010, Nutch has been an independent, top-level project of the Apache Software Foundation.[1]
Advantages [2]
Some of the advantages of Nutch when compared to a simple fetcher:
- Highly scalable and relatively feature-rich crawler
- Politeness: the crawler obeys robots.txt rules
- Robust and scalable: Nutch can run on a cluster of 100 machines
- Quality: the crawl can be biased to fetch “important” pages first
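Two of the points above, politeness and biasing the crawl toward important pages, can be sketched together as a small crawl frontier. This is a Python illustration under stated assumptions, not Nutch's implementation: the `Frontier` class, the numeric `score` and the `ExampleBot` user agent are hypothetical, robots.txt content is passed in as text rather than fetched, and hosts without robots.txt data are treated as allowed. Only the standard-library `urllib.robotparser` and `heapq` modules are used.

```python
import heapq
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class Frontier:
    """Hypothetical crawl frontier: robots.txt filtering plus score ordering."""

    def __init__(self, user_agent, robots_by_host):
        self.user_agent = user_agent
        self.robots = {}
        # Parse the robots.txt text supplied for each host (no network access).
        for host, text in robots_by_host.items():
            parser = RobotFileParser()
            parser.parse(text.splitlines())
            self.robots[host] = parser
        self.heap = []
        self.counter = 0  # tie-breaker that keeps insertion order stable

    def allowed(self, url):
        """Check the URL against robots.txt rules for its host, if known."""
        parser = self.robots.get(urlparse(url).netloc)
        # Assumption: a host with no robots.txt data is treated as allowed.
        return parser is None or parser.can_fetch(self.user_agent, url)

    def add(self, url, score):
        """Queue a URL unless robots.txt disallows it; return whether it was queued."""
        if not self.allowed(url):
            return False
        # heapq is a min-heap, so negate the score to pop high scores first.
        heapq.heappush(self.heap, (-score, self.counter, url))
        self.counter += 1
        return True

    def next_url(self):
        """Return the highest-scored queued URL, or None when the queue is empty."""
        if not self.heap:
            return None
        return heapq.heappop(self.heap)[2]
```

A short usage example: with the rules `User-agent: *` and `Disallow: /private/` for `example.com`, a URL under `/private/` is rejected by `add`, and the remaining URLs come back from `next_url` in descending score order. Per-host request delays, another part of politeness, are left out of this sketch.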
Scalability
IBM Research studied the performance[3] of Nutch/Lucene as part of its Commercial Scale Out (CSO) project.[4] It found that a scale-out system, such as Nutch/Lucene, could achieve a performance level on a cluster of blades that was not achievable on any scale-up computer such as the Power5.
The ClueWeb09 dataset (used in e.g. TREC) was gathered using Nutch, with an average speed of 755.31 documents per second.[5]
Related projects
- Hadoop - Java framework that supports distributed applications running on large clusters
- nutchWAX - Uses Nutch to search a web archive
- Sixearch - An unstructured peer network application, which provides a complementary way for users to actively and collaboratively share their own document collections.
Search engines built with Nutch
- Creative Commons Search - launched 2004, Nutch implementation replaced 2006[6][7][8]
- DiscoverEd - Open educational resources search prototype developed by Creative Commons[9]
- Krugle
- mozDex
- Wikia Search - launched 2008, closed down 2009[10][11]
- search2.net
See also
- Faceted Search
- Information Extraction
- Enterprise Search
References
1. Nutch News
2. Using Nutch with Solr
3. Scalability of the Nutch search engine
4. Base Operating System Provisioning and Bringup for a Commercial Supercomputer
5. http://boston.lti.cs.cmu.edu/crawler/crawlerstats.html
6. "Our Updated Search". Creative Commons. 2004-09-03. http://creativecommons.org/weblog/entry/4388.
7. "Creative Commons Unique Search Tool Now Integrated into Firefox 1.0". Creative Commons. 2004-11-22. http://creativecommons.org/press-releases/entry/5064.
8. "New CC search UI". Creative Commons. 2006-08-02. http://creativecommons.org/weblog/entry/6002.
9. DiscoverEd home page
10. Where can I get the source code for Wikia Search?
11. Update on Wikia – doing more of what’s working
Bibliography
- Shoberg, J (October 26, 2006). Building Search Applications with Lucene and Nutch (1st ed.). Apress. pp. 350. ISBN 978-1590596876. http://www.apress.com/book/view/9781590596876.
External links
- Official website
- Official wiki
- Building Nutch: Open Source Search (2004) - ACM Queue vol. 2, no. 2
- An article about Nutch (2003) - Search Engine Watch
- Another article about Nutch (2003) - Tech News World
- Official page of the Hadoop project
Wikimedia Foundation. 2010.