Suffix array

Suffix array

In computer science, a suffix array is an array giving the suffixes of a string in lexicographical order.

Details

Consider the string "abracadabra", of length 11. It has eleven suffixes: "abracadabra", "bracadabra", "racadabra", and so on down to "a". Sorted into lexicographical order, these suffixes are

a abra abracadabra acadabra adabra bra bracadabra cadabra dabra ra racadabra

If the original string is available, each suffix can be completely specified by the index of its first character. The suffix array is the array of these indices in lexicographical order. For the string "abracadabra", using one-based indexing, the suffix array is {11,8,1,4,6,9,2,5,7,10,3}, because the suffix "a" begins at the 11th character, "abra" begins at the 8th character, and so forth.

One may also treat the string "abracadabra" as having a twelfth suffix, the empty string (with index 12). But since the empty string will always sort lexicographically before every other suffix, we lose no information by omitting it.

Algorithms

The easiest way to construct a suffix array is to use an efficient comparison sort algorithm. This requires O(n log n) suffix comparisons, but a suffix comparison requires O(n) time, so the overall runtime of this approach is O(n^2 log n). More sophisticated algorithms improve this to O(n log n) by exploiting the results of partial sorts to avoid redundant comparisons. Several Theta(n) algorithms have also been developed which provide faster construction and have space usage of O(n) with low constants.Fact|date=June 2008

Applications

The suffix array of a string can be used as an index to quickly locate every occurrence of a substring within the string. Finding every occurrence of the substring is equivalent to finding every suffix that begins with the substring. Thanks to the lexicographical ordering, these suffixes will be grouped together in the suffix array, and can be found efficiently with a binary search. If implemented straightforwardly, this binary search takes O(m log n) time, where m is the length of the substring. To avoid redoing comparisons, extra data structures giving information about the longest common prefixes (LCPs) of suffixes are constructed, giving O(m + log n) search time.

Suffix sorting algorithms can be used to perform the Burrows-Wheeler transform (BWT). Technically the BWT requires sorting cyclic permutations of a string, not suffixes. We can fix this by appending to the string a special end-of-string character which sorts lexicographically before every other character. Sorting cyclic permutations is then equivalent to sorting suffixes.

History

Suffix arrays were originally developed by Gene Myers and Udi Manber to reduce memory consumption compared to a suffix tree. This began the trend towards compressed suffix arrays and BWT-based compressed full-text indices.

ee also

* Suffix tree

References

* Udi Manber and Gene Myers (1991). "Suffix arrays: a new method for on-line string searches". "SIAM Journal on Computing", Volume 22, Issue 5 (October 1993), pp. 935-948.
* Pang Ko and Srinivas Aluru (2003). "Space efficient linear time construction of suffix arrays." In "Combinatorial Pattern Matching (CPM 03)". LNCS 2676, Springer, 2003, pp 203-210.
* Juha Kärkkäinen and Peter Sanders (2003). "Simple linear work suffix array construction." In "Proc. 30th International Colloquium on Automata, Languages and Programming (ICALP '03)". LNCS 2719, Springer, 2003, pp. 943-955.
* Klaus-Bernd Schürmann and Jens Stoye (2005). "An incomplex algorithm for fast suffix array construction". In "Proceedings of the 7th Workshop on Algorithm Engineering and Experiments and the 2nd Workshop on Analytic Algorithmics and Combinatorics (ALENEX/ANALCO 2005)", 2005, pp. 77-85.

External links

* [http://geocities.com/malbrain/bwtsort_c.html Suffix sorting module for BWT in C code]
* [http://www.codeodor.com/index.cfm/2007/12/24/The-Suffix-Array/1845 Suffix Array Implementation in Ruby]
* [http://sary.sourceforge.net/ Suffix array library and tools]


Wikimedia Foundation. 2010.

Игры ⚽ Поможем написать реферат

Look at other dictionaries:

  • Suffix-Array — In der Informatik ist ein Suffixarray ein Array, das die Suffixe eines Strings in lexikographischer Reihenfolge angibt. Inhaltsverzeichnis 1 Details 2 Algorithmen 3 Anwendungen 4 Geschichte …   Deutsch Wikipedia

  • Compressed suffix array — The Compressed Suffix Array[1][2] is a compressed data structure for pattern matching. Given a text T of n characters from an alphabet Σ, the compressed suffix array support searching for arbitrary patterns in T. For an input pattern P of m… …   Wikipedia

  • Suffix tree — In computer science, a suffix tree (also called suffix trie, PAT tree or, in an earlier form, position tree) is a data structure that presents the suffixes of a given string in a way that allows for a particularly fast implementation of many… …   Wikipedia

  • Generalised suffix tree — A generalised suffix tree is a suffix tree for a set of strings. Given the set of strings D=S 1,S 2,dots,S d of total length n, it is a Patricia trie containing all n suffixes of the strings. It is mostly used in bioinformaticsref|BRCR.… …   Wikipedia

  • List of terms relating to algorithms and data structures — The [http://www.nist.gov/dads/ NIST Dictionary of Algorithms and Data Structures] is a reference work maintained by the U.S. National Institute of Standards and Technology. It defines a large number of terms relating to algorithms and data… …   Wikipedia

  • Список терминов, относящихся к алгоритмам и структурам данных —   Это служебный список статей, созданный для координации работ по развитию темы.   Данное предупреждение не устанавливается на информационные списки и глоссарии …   Википедия

  • Список терминов — Список терминов, относящихся к алгоритмам и структурам данных   Это сл …   Википедия

  • Suffixarray — In der Informatik ist ein Suffixarray ein Array, das die Suffixe eines Strings in lexikographischer Reihenfolge angibt. Inhaltsverzeichnis 1 Details 2 Algorithmen 3 Anwendungen …   Deutsch Wikipedia

  • Joshua MT Tool — is an open source tool for statistical machine translation which is parsing based. The toolkit achieves state of the art translation performance on the French English translation task. Contenido 1 History 2 Goals 3 Main functions implemented in… …   Wikipedia Español

  • Compressed data structure — The term compressed data structure arises in the computer science subfields of algorithms, data structures, and theoretical computer science. It refers to a data structure whose operations are roughly as fast as those of a conventional data… …   Wikipedia

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”