Search engine technology
Modern web search engines are complex software systems built on technology that has evolved over many years. The largest search engines, such as Google and Yahoo!, utilize tens or hundreds of thousands of computers to process billions of web pages and return results for thousands of searches per second. The high volume of queries and text processing requires the software to run in a highly distributed environment with a high degree of redundancy. Modern search engines have the following main components:

Crawl
The first step in preparing web pages for search is to find and index them. In the past, search engines started with a small list of URLs as a seed list, fetched the content, parsed those pages for links, and fetched the web pages pointed to by those links, which in turn provided new links; the cycle continued until enough pages were found. Most modern search engines instead use a continuous crawl method, which is an extension of the discovery method in which there is no seed list because the crawl never stops. The current list of pages is visited at regular intervals, and new pages are found when links are added to or deleted from those pages. Many search engines use sophisticated scheduling algorithms to decide when to revisit a particular page. These algorithms range from a constant visit interval with higher priority for more frequently changing pages, to an adaptive visit interval based on several criteria such as frequency of change, popularity and overall quality of the site, the speed of the web server serving the page, and resource constraints such as the amount of hardware and the bandwidth of the Internet connection. Search engines crawl many more pages than they make available for searching, because crawlers find many duplicate-content pages on the web and many pages contain no useful content. Duplicate and useless content often represents more than half the pages available for indexing.
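As an illustration of the adaptive approach, the following Python sketch keeps a priority queue of pages ordered by their next due visit, shrinking the revisit interval for pages that changed and growing it for pages that did not. The `fetch` and `extract_links` callables, the interval bounds and the growth factors are illustrative assumptions, not the behaviour of any particular engine.

```python
import heapq
import time
from dataclasses import dataclass, field

# Illustrative sketch of an adaptive revisit scheduler (not any particular
# engine's implementation): pages that change more often are revisited sooner,
# within fixed lower and upper bounds on the interval.
MIN_INTERVAL = 3600          # never revisit more often than once an hour
MAX_INTERVAL = 30 * 86400    # always revisit at least once a month

@dataclass(order=True)
class ScheduledPage:
    next_visit: float
    url: str = field(compare=False)
    interval: float = field(compare=False, default=86400.0)

def reschedule(page: ScheduledPage, changed: bool, now: float) -> None:
    """Shrink the interval when the page changed, grow it when it did not."""
    if changed:
        page.interval = max(MIN_INTERVAL, page.interval / 2)
    else:
        page.interval = min(MAX_INTERVAL, page.interval * 1.5)
    page.next_visit = now + page.interval

def crawl_loop(frontier: list[ScheduledPage], fetch, extract_links) -> None:
    """Continuous crawl: pop the page that is due next, fetch it, enqueue new links.

    `fetch(url)` is assumed to return (content, changed_since_last_visit) and
    `extract_links(content)` to return the URLs linked from the page.
    """
    heapq.heapify(frontier)
    known = {p.url for p in frontier}
    while frontier:                              # the crawl never stops
        page = heapq.heappop(frontier)
        now = time.time()
        if page.next_visit > now:
            time.sleep(page.next_visit - now)
            now = time.time()
        content, changed = fetch(page.url)
        for link in extract_links(content):
            if link not in known:                # newly discovered page joins the frontier
                known.add(link)
                heapq.heappush(frontier, ScheduledPage(next_visit=now, url=link))
        reschedule(page, changed, now)
        heapq.heappush(frontier, page)           # page stays in the cycle
```

In this sketch there is no seed list in steady state: the frontier simply keeps cycling, with new pages entering whenever links to them appear.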
Link Map

Pages discovered by crawlers are fed into an (often distributed) service that creates a link map of the pages. A link map is a graph structure in which pages are represented as nodes connected by the links among those pages. This data is stored in data structures that allow fast access by algorithms which compute a popularity score for pages on the web, essentially based on how many links point to a web page and the quality of those links. One such algorithm, PageRank, proposed by Google founders Larry Page and Sergey Brin, is well known and has attracted a lot of attention. The idea of doing link analysis to compute a popularity rank is older than PageRank, and many variants of the same idea are currently in use. These ideas can be grouped into three main categories: rank of individual pages, rank of web sites, and the nature of web site content (Jon Kleinberg's HITS algorithm). Search engines often differentiate between internal links and external links, on the assumption that links pointing to other pages on the same site are less valuable, because they are often created by web site owners to artificially increase the rank of their own sites and pages. Link map data structures typically also store the anchor text embedded in the links, because anchor text often provides a good short summary of a web page's content.
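The following is a minimal, textbook-style power-iteration sketch of link-based scoring in the spirit of PageRank, run over a small in-memory link map (a dictionary from page to outgoing links). The damping factor and convergence tolerance are the usual illustrative defaults, not any engine's production settings.

```python
# Power-iteration sketch of PageRank-style scoring over a link map.
def pagerank(link_map: dict[str, list[str]], damping: float = 0.85,
             tol: float = 1e-6, max_iter: int = 100) -> dict[str, float]:
    pages = list(link_map)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in link_map.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    if target in new_rank:       # ignore links to unknown pages
                        new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        if sum(abs(new_rank[p] - rank[p]) for p in pages) < tol:
            return new_rank
        rank = new_rank
    return rank

# Example link map: nodes are pages, edges are hyperlinks between them.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(graph))
```

Pages that attract more links, and links from higher-ranked pages, end up with higher scores, which is the core intuition behind link-based popularity ranking.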
Index

Indexing is the process of extracting text from web pages, tokenizing it and then creating an index structure (an inverted index) that can be used to quickly find which pages contain a particular word. Search engines differ considerably in their tokenization process. The issues involved in tokenization include: detecting the encoding used for the page, determining the language of the content (some pages use multiple languages), finding word, sentence and paragraph boundaries, combining multiple adjacent words into one phrase, changing the case of the text, and stemming words into their roots (lower-casing and stemming apply only to some languages). This phase also decides which sections of a page to index and how much text from very large pages (such as technical manuals) to index. Search engines also differ in the document formats they interpret and the ways they extract text from them. Some search engines go through the indexing process every few weeks and refresh the complete index used for web search requests, while others keep updating small fragments of the index continuously.
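A stripped-down sketch of tokenization and inverted-index construction is shown below; real engines additionally handle encoding detection, language identification, phrase formation and stemming, all of which are omitted here. The URLs and sample texts are purely illustrative.

```python
import re
from collections import defaultdict

def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each token to the set of page URLs that contain it (an inverted index)."""
    index: dict[str, set[str]] = defaultdict(set)
    for url, text in pages.items():
        for token in tokenize(text):
            index[token].add(url)
    return index

pages = {
    "http://example.com/a": "Search engines build inverted indexes.",
    "http://example.com/b": "An inverted index maps words to pages.",
}
index = build_index(pages)
print(index["inverted"])   # the pages containing the word "inverted"
```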
Before web pages can be indexed, an algorithm decides which node (a server in a distributed service) will index any given page and makes that information available as metadata for other components in the search engine. The index structure is complex and typically employs some form of compression. The choice of compression algorithm involves a trade-off between on-disk storage space and the speed of decompression when the index is read to satisfy search requests. The largest search engines use thousands of computers to index pages in parallel.
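As one hypothetical example of that trade-off, the sketch below compresses a posting list by storing the gaps between sorted document IDs with variable-byte encoding, a common technique in the indexing literature; the source does not say which scheme any particular engine uses.

```python
# Generic illustration of posting-list compression with gap + variable-byte
# encoding; smaller on-disk postings are traded against a small decoding cost.
def vbyte_encode(doc_ids: list[int]) -> bytes:
    """Encode sorted document IDs as variable-byte-coded gaps."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        chunk = []
        while True:
            chunk.append(gap & 0x7F)
            gap >>= 7
            if gap == 0:
                break
        chunk[0] |= 0x80                  # mark the last (low-order) byte
        out.extend(reversed(chunk))       # emit high-order bytes first
    return bytes(out)

def vbyte_decode(data: bytes) -> list[int]:
    """Invert vbyte_encode back to the original document IDs."""
    doc_ids, value, prev = [], 0, 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:                   # high bit set: last byte of this gap
            prev += value
            doc_ids.append(prev)
            value = 0
    return doc_ids

postings = [3, 7, 21, 150, 151, 4000]
encoded = vbyte_encode(postings)
assert vbyte_decode(encoded) == postings
print(len(encoded), "bytes instead of", len(postings) * 4)
```

Decoding is a simple linear scan, which keeps query-time cost low while the encoded postings take far less space than fixed-width integers.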
See also

* Search engine
* Web crawler
* Search engine indexing