Hirschberg's algorithm

Hirschberg's algorithm

Hirschberg's algorithm, named after its inventor, Dan Hirschberg, is a dynamic programming algorithm that finds the least cost sequence alignment between two
strings, where cost is measured as Levenshtein edit distance, defined to be the sumof the costs of insertions, replacements, deletions, and null actions needed to changeone string to the other. Hirschberg's algorithm is simply described as a divide and conquer version of the Needleman-Wunsch algorithm [ [http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Hirsch/ Hirschberg's algorithm ] ] . Hirschberg's algorithm is commonly used in computational biology to find maximal global alignments of DNA and protein sequences.

Algorithm information

Hirschberg's algorithm is a generally applicablealgorithm for finding an optimal sequence alignment. BLAST and FASTA are suboptimal heuristics. If "x" and "y" are strings, where |"x"| = "n" and |"y"| = "m", the Needleman-Wunsch algorithm finds an optimal alignment in O("nm") time, using O("nm") space. Hirschberg's algorithm is a clever modification of the Needleman-Wunsch Algorithm which takes O("nm") time, but needs only O(min{"m","n"}) space.

One application of the algorithm is finding sequence alignments of DNA or protein sequences.


=Computation of the Levenshtein edit distance in linear space=

The edit distance Edit("x","y") is the least cost of changing "x" into "y" by using the operations Insert, Substitute, and Delete, where the costof each kind of operation is given. We write Ins("a") for the cost ofinserting a symbol "a" into a string, Sub("a","b") for the cost of substituting a symbol "b" for "a" in a string, and Del("a") for the cost of deleting a symbol "a" from a string. The "standard" choice of those costs is Ins("a") = Del("a") = 1 for each symbol "a", Sub("a","a") = 0, and Sub("a","b") = 1 if "a" is not equal to "b".The Needleman-Wunsch algorithm computes Edit("x","y") by computingEdit(Pref ["x","i"] ,Pref ["y","j"] ) for all "i" and "j" dynamically, where Pref ["x","i"] denotes the prefix of "x" of length "i". That algorithm requires O("nm") time and O("nm") space, where "n" = |"x"| and "m" = |"y"|.

Algorithm organization

To understand Hirschberg's algorithm, it is important to first understandthat edit distances can be computed using linear space.

What we call the "forward subprogram" computes the values ofEdit(Pref ["x","i"] ,Pref ["y","j"] ) for all "i" and "j", just as the Needleman-Wunschand returns the array {Edit("x",Pref ["y","j"] )}0 ≤ j ≤ m. The"backward subprogram" is similar, except that the dynamic programis done in the opposite direction, i.e., starting from the right endsof the strings. It returns the array {Edit("x",Suff ["y","j"] )}0 ≤ j ≤ m,where Suff ["y","j"] is the suffix of "y" of length "j".

The forward and backward subprograms compute values in the matrix by usingpreviously computed values, but save space by erasing values that will nolonger be needed during that execution of the subprogram. Unfortunately,the erased values will need to be recomputed later; thus, Hirschberg'salgorithm takes roughly twice as much time as Needleman-WunschFact|date=April 2007.

Pseudocode

01 High-level description of the forwards subprogram 02 03 Forwards [x,y] is 04 05 1. n = |x|; m = |y
06 2. For all i from 1 to n: 07 Edit [Pref [x,i] ,Pref [y,0] = 0 08 3. For all j from 1 to m: 09 A. Edit [Pref [x,0] ,Pref [y,j] = Edit [Pref [x,0] ,Pref [y,j-1] + Ins(y_j) 10 B. For all i from 1 to n, execute the following steps: 11 i. Edit [Pref [x,i] ,Pref [y,j] = 12 min{Edit [Pref [x,i-1] ,Pref [y,j] + Del(x_i), 13 Edit [Pref [x,i-1] ,Pref [y,j-1] + Sub(x_i,y_j), 14 Edit [Pref [x,i] ,Pref [y,j-1] + Ins(y_j)} 15 ii. Erase Edit [Pref [x,i-1] ,Pref [y,j-1] 16 C. Erase Edit [Pref [x,i-1] ,Pref [y,j] 17 4. Return {Edit [Pref [x,n] ,Pref [y,j] } %% an array of length m+1

01 High-level description of the backwards subprogram 02 03 Backwards [x,y] is 04 05 1. n = |x|; m = |y
06 2. For all i from 1 to n: 07 Edit [Suff [x,i] ,Suff [y,0] = 0 08 3. For all j from 1 to m: 09 A. Edit [Suff [x,0] ,Suff [y,j] = Edit [Suff [x,n] ,Suff [y,j-1] + Ins(y_{m-j+1}) 10 B. For all i from 1 to n: 11 i. Edit [Suff [x,i] ,Suff [y,j] = 12 min{Edit [Suff [x,i-1] ,Suff [y,j] + Del(x_{n-i-1}), 13 Edit [Suff [x,i-1] ,Suff [y,j-1] + Sub(x_{n-i-1},y_{m-j+1}), 14 Edit [Suff [x,i] ,Suff [y,j-1] + Ins(y_{m-j+1})} 15 ii. Erase Edit [Suff [x,i-1] ,Suff [y,j-1] 16 C. Erase Edit [Suff [x,i-1] ,Suff [y,j] 17 4. RETURN {Edit [Pref [x,n] ,Pref [y,j] } %% an array of length m+1

01 High level description of Hirschberg's algorithm: 02 03 Hirschberg(x,y) is 04 05 1. n = |x|; m = |y
06 2. If n <= 1 or m <= 1: 07 OUTPUT Alignment(x,y) using Needleman-Wunsch. 08 Else: 09 A. mid = floor(n/2) 10 B. x^left = Pref [x,mid] 11 C. x^right = Suff [x,n-mid] 12 D. {Edit [x^left,Pref [y,j] } = Forwards(x^left,y) %% an array of length m+1 13 E. {Edit [x^right,Suff [y,j] } = Backwards(x^right,y) %% an array of length m+1 14 F. cut = ArgMin{Edit [x^left,Pref [y,cut] + Edit [x^right,Suff [y,n-cut] } 15 G. Hirschberg(x^left,Pref [y,cut] ) 16 H. Hirschberg(x^right,Suff [y,n-cut] )

ee also

* Needleman-Wunsch algorithm
* Smith Waterman algorithm
* Levenshtein distance

References


Wikimedia Foundation. 2010.

Игры ⚽ Нужно сделать НИР?

Look at other dictionaries:

  • Hirschberg-Algorithmus — Der Hirschberg Algorithmus berechnet das paarweise Sequenzalignment und hat einen zur Eingabe linearen Speicherbedarf. Der in 1970er Jahren von Dan Hirschberg entwickelte Algorithmus verwendet die Methode der Dynamischen Programmierung und das… …   Deutsch Wikipedia

  • Dan Hirschberg — Daniel S. Hirschberg Dan Hirschberg Institutions University of California, Irvine …   Wikipedia

  • Needleman–Wunsch algorithm — The Needleman–Wunsch algorithm performs a global alignment on two sequences (called A and B here). It is commonly used in bioinformatics to align protein or nucleotide sequences. The algorithm was published in 1970 by Saul B. Needleman and… …   Wikipedia

  • Needleman-Wunsch algorithm — The Needleman–Wunsch algorithm performs a global alignment on two sequences (called A and B here). It is commonly used in bioinformatics to align protein or nucleotide sequences. The algorithm was published in 1970 by Saul Needleman and Christian …   Wikipedia

  • Package-merge algorithm — The Package Merge Algorithm is an O(nL) time algorithm for finding anoptimal length limited Huffman code for a given distribution on a given alphabet of size n , where no code word is longer than L . It is a greedy algorithm, and a generalization …   Wikipedia

  • HS algorithm — The HS Algorithm is named after Dan Hirschberg and J. B. Sinclair. It is a distributed algorithm designed for the Leader Election problem in a Synchronous Ring.The algorithm requires the use of unique IDs (UID) for each process. The algorithm… …   Wikipedia

  • Longest common subsequence problem — Not to be confused with longest common substring problem. The longest common subsequence (LCS) problem is to find the longest subsequence common to all sequences in a set of sequences (often just two). Note that subsequence is different from a… …   Wikipedia

  • List of algorithms — The following is a list of the algorithms described in Wikipedia. See also the list of data structures, list of algorithm general topics and list of terms relating to algorithms and data structures.If you intend to describe a new algorithm,… …   Wikipedia

  • Levenshtein distance — In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences. The term edit distance is often used to refer specifically to Levenshtein distance. The… …   Wikipedia

  • List of mathematics articles (H) — NOTOC H H cobordism H derivative H index H infinity methods in control theory H relation H space H theorem H tree Haag s theorem Haagerup property Haaland equation Haar measure Haar wavelet Haboush s theorem Hackenbush Hadamard code Hadamard… …   Wikipedia

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”