Inverted index

Inverted index

In information technology, an inverted index (also referred to as postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents, in this case allowing full text search. The inverted file may be the database file itself, rather than its index. It is the most popular data structure used in document retrieval systems. [Harvnb |Zobel|Moffat|Ramamohanarao|1998| Ref=none ] Several significant general-purpose mainframe-based database management systems have used inverted list architectures, including ADABAS, DATACOM/DB, and Model 204.

There are two main variants of inverted indexes: A record level inverted index (or inverted file index or just inverted file) contains a list of references to documents for each word. A word level inverted index (or full inverted index or inverted list) additionally contains the positions of each word within a document.Harvnb |Baeza-Yates|Ribeiro-Neto|1999| p=192 | Ref=BYR99 ] The latter form offers more functionality (like phrase searches), but needs more time and space to be created.

Example

Given the texts T_0="it is what it is",T_1="what is it" andT_2="it is a banana", we have the following inverted file index (where the integers in the set notation brackets refer to the subscripts of the text symbols, T_0, T_1 etc.):

"a": {2} "banana": {2} "is": {0, 1, 2} "it": {0, 1, 2} "what": {0, 1}

A term search for the terms"what", "is" and "it" would give the set{0,1} cap {0,1,2} cap {0,1,2} = {0,1}.

With the same texts, we get the following full inverted index, where the pairs are document numbers and local word numbers. Like the document numbers, local word numbers also begin with zero. So, "banana": {(2, 3)} means the word "banana" is in the third document (T_2), and it is the fourth word in that document (position 3).

"a": {(2, 2)} "banana": {(2, 3)} "is": {(0, 1), (0, 4), (1, 1), (2, 1)} "it": {(0, 0), (0, 3), (1, 2), (2, 0)} "what": {(0, 2), (1, 0)}

If we run a phrase search for "what is it" we get hits for all the words in both document 0 and 1. But the terms occur consecutively only in document 1.

Applications

The inverted index data structure is a central component of a typical search engine indexing algorithm. A goal of a search engine implementation is to optimize the speed of the query: find the documents where word X occurs. Once a forward index is developed, which stores lists of words per document, it is next inverted to develop an inverted index. Querying the forward index would require sequential iteration through each document and to each word to verify a matching document. The time, memory, and processing resources to perform such a query are not always technically realistic. Instead of listing the words per document in the forward index, the inverted index data structure is developed which lists the documents per word.

With the inverted index created, the query can now be resolved by jumping to the word id (via random access) in the inverted index. Random access is generally regarded as being faster than sequential access.

In pre-computer times, concordances to important books were manually assembled. These were effectively inverted indexes with a small amount of accompanying commentary, that required a tremendous amount of effort to produce.

ee also

*Vector space model
*Index (search engine)

Bibliography

*cite book |last= Knuth |first= D. E. |authorlink= Donald Knuth |title= The Art of Computer Programming |publisher= Addison-Wesley |edition= Third Edition |year= 1997 |origyear= 1973 |location= Reading, Massachusetts |isbn= 0-201-89685-0 |ref= Knu97

*cite journal|last= Zobel |first= Justin |coauthors= Moffat, Alistair; Ramamohanarao, Kotagiri |year= 1998 |month= December |title= Inverted files versus signature files for text indexing |journal= ACM Transactions on Database Systems |volume= 23 |issue= 4 |pages=pp. 453–490 |publisher= Association for Computing Machinery |location= New York |issn= 0362-5915 |doi= 10.1145/296854.277632 |url= |accessdate= 2005-11-10

*cite journal|last= Zobel |first= Justin RMIT University, Australia |coauthors= Moffat, Alistair The University of Melbourne, Australia |year= 2006 |month= July |title= Inverted Files for Text Search Engines |journal= ACM Computing Surveys |volume= 38 |issue= 2 |pages= 6|publisher= Association for Computing Machinery |location= New York |issn= 0360-0300 |doi= 10.1145/1132956.1132959 |url= |accessdate= 2007-01-26

*cite book |last= Baeza-Yates | first = Ricardo |coauthors=Ribeiro-Neto, Berthier |title= Modern information retrieval |publisher= Addison-Wesley Longman |location= Reading, Massachusetts |year= 1999 |pages= p. 192 |isbn= 0-201-39829-X |oclc= |doi= |ref= BYR99

References

*

External links

* [http://www.nist.gov/dads/HTML/invertedIndex.html NIST's Dictionary of Algorithms and Data Structures: inverted index]
* [http://mg4j.dsi.unimi.it Managing Gigabytes for Java] a free full-text search engine for large document collections written in Java.
* [http://lucene.apache.org/java/docs/] Lucene - Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.


Wikimedia Foundation. 2010.

Игры ⚽ Поможем написать курсовую

Look at other dictionaries:

  • inverted index — noun An indexing algorithm which indexes the document being searched based on the keywords being searched …   Wiktionary

  • inverted — index disordered, inverse, opposite Burton s Legal Thesaurus. William C. Burton. 2006 …   Law dictionary

  • Index (search engine) — Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics, and… …   Wikipedia

  • Inverted snub dodecadodecahedron — Type Uniform star polyhedron Elements F = 84, E = 150 V = 60 (χ = −6) Faces by sides 60{3}+12{5}+12{5/2} …   Wikipedia

  • Index (typography) — ☞ Index Punctuation apostrophe ( ’ …   Wikipedia

  • Index of chess articles — Contents 1 Books 2 General articles 2.1 0–9 2.2 A …   Wikipedia

  • Reverse index — The term reverse index has more than one meaning. Search Engines When a search engine tabulates all documents that contain a given word, that is called a reverse index. This is in contrast to a regular index, which contains the locations of all… …   Wikipedia

  • Giant Inverted Boomerang — An overview of Goliath, when it was at Six Flags Magic Mountain. Status In production Cost €20 …   Wikipedia

  • Great inverted snub icosidodecahedron — Type Uniform star polyhedron Elements F = 92, E = 150 V = 60 (χ = 2) Faces by sides (20+60){3}+12{5/2} …   Wikipedia

  • Medial inverted pentagonal hexecontahedron — Type Star polyhedron Elements F = 60, E = 150 V = 84 (χ = −6) Symmetry group …   Wikipedia

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”