Nutch

Nutch: Lucene Nutch

Screenshot: Nutch Web Interface Search

Developer(s): Apache Software Foundation
Stable release: 1.3 / June 7, 2011
Development status: Active
Written in: Java
Operating system: Cross-platform
Type: Search engine
License: Apache License 2.0
Website: nutch.apache.org

Nutch is an open source web search engine project that uses Lucene Java for its search and index components.

Contents

1 Features

2 History

3 Advantages ^[2]

4 Scalability

5 Related projects

6 Search engines built with Nutch

7 See also

8 References

9 Bibliography

10 External links

Features

Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.
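To illustrate the general idea of a plug-in point for media-type parsing, here is a minimal sketch in Java. It is not Nutch's actual plug-in API: the names `ParserRegistry`, `ContentParser`, `register` and `lookup` are assumptions made up for this example, and a real system would also handle plug-in discovery and configuration.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: these names are illustrative and are NOT Nutch's API.
public class ParserRegistry {

    /** A parser plug-in turns the raw bytes of one media type into plain text. */
    public interface ContentParser {
        String parse(byte[] content);
    }

    // Maps a lower-cased media type (e.g. "text/html") to its plug-in.
    private final Map<String, ContentParser> parsers = new HashMap<>();

    /** Registers a plug-in for a media type; a later registration replaces an earlier one. */
    public void register(String mediaType, ContentParser parser) {
        parsers.put(mediaType.toLowerCase(Locale.ROOT), parser);
    }

    /** Looks up the plug-in for a media type, if one was registered. */
    public Optional<ContentParser> lookup(String mediaType) {
        return Optional.ofNullable(parsers.get(mediaType.toLowerCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        ParserRegistry registry = new ParserRegistry();
        // A toy plain-text parser that simply decodes the bytes as UTF-8.
        registry.register("text/plain",
                content -> new String(content, StandardCharsets.UTF_8));

        String text = registry.lookup("text/plain")
                .map(p -> p.parse("hello".getBytes(StandardCharsets.UTF_8)))
                .orElse("(no parser registered)");
        System.out.println(text);
    }
}
```

The core code only depends on the small `ContentParser` interface, so support for a new media type can be added by registering another implementation without touching the rest of the system.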

The fetcher ("robot" or "web crawler") has been written from scratch specifically for this project.

History

Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella.

In June 2003, a successful 100-million-page demonstration system was developed. To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. These two facilities were later spun out into their own subproject, called Hadoop.

In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April 2010, Nutch has been an independent, top-level project of the Apache Software Foundation.^[1]

Advantages ^[2]

Some of the advantages of Nutch compared to a simple fetcher:

a highly scalable and relatively feature-rich crawler

politeness features, such as obeying robots.txt rules

robustness and scalability: Nutch can run on a cluster of 100 machines

quality: the crawl can be biased to fetch “important” pages first
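The politeness and quality points above can be illustrated with a small Java sketch of a crawl frontier. This is not how Nutch implements them: `CrawlFrontier`, `disallow`, `offer` and `next` are hypothetical names, the robots.txt handling is reduced to plain `Disallow:` path prefixes (no user-agent groups, `Allow` rules or wildcards), and the importance score is assumed to be supplied by the caller.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: class and method names are illustrative, not Nutch APIs.
public class CrawlFrontier {

    /** A URL path plus a caller-supplied score estimating the page's importance. */
    private static final class Candidate {
        final String path;
        final double score;

        Candidate(String path, double score) {
            this.path = path;
            this.score = score;
        }
    }

    // Simplified robots.txt state: only "Disallow:" path prefixes are modeled.
    private final List<String> disallowedPrefixes = new ArrayList<>();

    // Highest score first, so "important" pages are fetched before others.
    private final PriorityQueue<Candidate> queue = new PriorityQueue<>(
            Comparator.comparingDouble((Candidate c) -> c.score).reversed());

    /** Records one Disallow prefix; an empty prefix means "nothing disallowed". */
    public void disallow(String prefix) {
        if (!prefix.isEmpty()) {
            disallowedPrefixes.add(prefix);
        }
    }

    /** Returns true when no recorded Disallow prefix matches the path. */
    public boolean isAllowed(String path) {
        for (String prefix : disallowedPrefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    /** Queues a path unless the robots rules forbid it; returns whether it was queued. */
    public boolean offer(String path, double score) {
        if (!isAllowed(path)) {
            return false;
        }
        queue.add(new Candidate(path, score));
        return true;
    }

    /** Returns the highest-scored queued path, or null when the frontier is empty. */
    public String next() {
        Candidate c = queue.poll();
        return c == null ? null : c.path;
    }

    public static void main(String[] args) {
        CrawlFrontier frontier = new CrawlFrontier();
        frontier.disallow("/private/");
        frontier.offer("/about", 0.2);
        frontier.offer("/", 0.9);
        frontier.offer("/private/report", 0.8); // rejected by the Disallow rule
        System.out.println(frontier.next());
        System.out.println(frontier.next());
    }
}
```

A priority queue keeps the fetch order tied to the score rather than to discovery order, which is one simple way to bias a crawl toward pages judged more important.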

Scalability

IBM Research studied the performance^[3] of Nutch/Lucene as part of its Commercial Scale Out (CSO) project.^[4] It found that a scale-out system, such as Nutch/Lucene, could achieve a performance level on a cluster of blades that was not achievable on any scale-up computer such as the Power5.

The ClueWeb09 dataset (used, for example, in TREC) was gathered using Nutch, with an average speed of 755.31 documents per second.^[5]

Related projects

Hadoop - Java framework that supports distributed applications running on large clusters

NutchWAX - Uses Nutch to search a web archive

Sixearch - An unstructured peer network application, which provides a complementary way for users to actively and collaboratively share their own document collections.

Search engines built with Nutch

Creative Commons Search - launched 2004, Nutch implementation replaced 2006^[6]^[7]^[8]

DiscoverEd - Open educational resources search prototype developed by Creative Commons^[9]

Krugle

mozDex

Wikia Search - launched 2008, closed down 2009^[10]^[11]

search2.net

See also

Faceted Search

Information Extraction

Enterprise Search

References

[1] Nutch News

[2] Using Nutch with Solr

[3] Scalability of the Nutch search engine

[4] Base Operating System Provisioning and Bringup for a Commercial Supercomputer

[5] http://boston.lti.cs.cmu.edu/crawler/crawlerstats.html

[6] "Our Updated Search". Creative Commons. 2004-09-03. http://creativecommons.org/weblog/entry/4388.

[7] "Creative Commons Unique Search Tool Now Integrated into Firefox 1.0". Creative Commons. 2004-11-22. http://creativecommons.org/press-releases/entry/5064.

[8] "New CC search UI". Creative Commons. 2006-08-02. http://creativecommons.org/weblog/entry/6002.

[9] DiscoverEd home page

[10] Where can I get the source code for Wikia Search?

[11] Update on Wikia – doing more of what’s working

Bibliography

Shoberg, J (October 26, 2006). Building Search Applications with Lucene and Nutch (1st ed.). Apress. pp. 350. ISBN 978-1590596876. http://www.apress.com/book/view/9781590596876.

External links

Official website

Official wiki

Building Nutch: Open Source Search (2004) - ACM Queue vol. 2, no. 2

An article about Nutch (2003) - Search Engine Watch

Another article about Nutch (2003) - Tech News World

Official page of the Hadoop project
