Compressed suffix array

Compressed suffix array

The Compressed Suffix Array[1][2] is a compressed data structure for pattern matching. Given a text T of n characters from an alphabet Σ, the compressed suffix array support searching for arbitrary patterns in T. For an input pattern P of m characters, the search time is equal to n times the higher-order entropy of the text T, plus some extra bits to store the empirical statistical model plus o(n).

The original instantiation of the compressed suffix array[1] solved a long-standing open problem by showing that fast pattern matching was possible using only a linear-space data structure, namely, one proportional to the size of the text T, which takes O(n \, {\log |\Sigma|}) bits. The conventional suffix array and suffix tree use \Omega(n \, {\log n}) bits, which is substantially larger. The basis for the data structure is a recursive decomposition using the "neighbor function," which allows a suffix array to be represented by one of half its length. The construction is repeated multiple times until the resulting suffix array uses a linear number of bits. Following work showed that the actual storage space was related to the zeroth-order entropy and that the index supports self-indexing.[3] The space bound was further improved using adaptive coding with longer contexts to achieve the ultimate goal of higher-order entropy.[2] The space usage is extremely competitive in practice with other state-of-the-art compressors,[4] and it also supports fast pattern matching.

The memory accesses made by compressed suffix arrays and other compressed data structures for pattern matching are typically not localized, and thus these data structures have been notoriously hard to design efficiently for use in external memory. Recent progress using geometric duality takes advantage of the block access provided by disks to speed up the I/O time significantly[5]

References

  1. ^ a b R. Grossi and J. S. Vitter, Compressed Suffix Arrays and Suffix Trees, with Applications to Text Indexing and String Matching], SIAM Journal on Computing, 35(2), 2005, 378-407. An earlier version appeared in Proceedings of the 32nd ACM Symposium on Theory of Computing, May 2000, 397-406.
  2. ^ a b R. Grossi, A. Gupta, and J. S. Vitter, High-Order Entropy-Compressed Text Indexes, Proceedings of the 14th Annual SIAM/ACM Symposium on Discrete Algorithms, January 2003, 841-850.
  3. ^ K. Sadakane, Compressed Text Databases with Efficient Query Algorithms Based on the Compressed Suffix Arrays, Proceedings of the International Symposium on Algorithms and Computation, Lecture Notes in Computer Science, vol. 1969, Springer, December 2000, 410-421.
  4. ^ L. Foschini, R. Grossi, A. Gupta, and J. S. Vitter, Indexing Equals Compression: Experiments on Suffix Arrays and Trees, ACM Transactions on Algorithms, 2(4), 2006, 611-639.
  5. ^ W.-K. Hon, R. Shah, S. V. Thankachan, and J. S. Vitter, On Entropy-Compressed Text Indexing in External Memory, Proceedings of the Conference on String Processing and Information Retrieval, August 2009.

Wikimedia Foundation. 2010.

Игры ⚽ Поможем написать реферат

Look at other dictionaries:

  • Suffix array — In computer science, a suffix array is an array giving the suffixes of a string in lexicographical order.DetailsConsider the string abracadabra , of length 11. It has eleven suffixes: abracadabra , bracadabra , racadabra , and so on down to a .… …   Wikipedia

  • Compressed data structure — The term compressed data structure arises in the computer science subfields of algorithms, data structures, and theoretical computer science. It refers to a data structure whose operations are roughly as fast as those of a conventional data… …   Wikipedia

  • Suffix tree — In computer science, a suffix tree (also called suffix trie, PAT tree or, in an earlier form, position tree) is a data structure that presents the suffixes of a given string in a way that allows for a particularly fast implementation of many… …   Wikipedia

  • Substring index — A substring index is a data structure which gives substring search in a text or text collection in sublinear time. If you have a document S of length n, or a set of documents D={S^1,S^2, dots, S^d} of total length n, you can locate all… …   Wikipedia

  • Index (search engine) — Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics, and… …   Wikipedia

  • Trie — A trie for keys A , to , tea , ted , ten , i , in , and inn . In computer science, a trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search… …   Wikipedia

  • North American P-51 variants — Main article: North American P 51 Mustang Further information: North American A 36, F 82 Twin Mustang The North American P 51 Mustang was an American long range single seat fighter aircraft that entered service with Allied air forces in the… …   Wikipedia

  • Physical Sciences — ▪ 2009 Introduction Scientists discovered a new family of superconducting materials and obtained unique images of individual hydrogen atoms and of a multiple exoplanet system. Europe completed the Large Hadron Collider, and China and India took… …   Universalium

  • Afro-Asiatic languages — formerly Hamito Semitic languages Family of about 250 languages spoken in North Africa, parts of sub Saharan African, and the Middle East. It includes such languages as Arabic, Hebrew, Amharic, and Hausa. The total number of speakers is estimated …   Universalium

  • star — starless, adj. /stahr/, n., adj., v., starred, starring. n. 1. any of the heavenly bodies, except the moon, appearing as fixed luminous points in the sky at night. 2. Astron. any of the large, self luminous, heavenly bodies, as the sun, Polaris,… …   Universalium

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”