Smith-Waterman algorithm

Smith-Waterman algorithm

The Smith-Waterman algorithm is a well-known algorithm for performing local sequence alignment; that is, for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure.

Background

The algorithm was first proposed by Temple Smith and Michael Waterman in 1981.cite journal |author=Smith TF, Waterman MS |title=Identification of Common Molecular Subsequences |journal=Journal of Molecular Biology |volume=147 |pages=195–197 |year=1981 |url=http://gel.ym.edu.tw/~chc/AB_papers/03.pdf |doi=10.1016/0022-2836(81)90087-5] Like the Needleman-Wunsch algorithm, of which it is a variation, Smith-Waterman is a dynamic programming algorithm. As such, it has the desirable property that it is guaranteed to find the optimal local alignment with respect to the scoring system being used (which includes the substitution matrix and the gap-scoring scheme). The main difference to the Needleman-Wunsch algorithm is that negative scoring matrix cells are set to zero, which renders the (thus positively scoring) local alignments visible. Backtracing starts at the highest scoring matrix cell and proceeds until a cell with score zero is encountered, yielding the highest scoring local alignment.

One motivation for local alignment is the difficulty of obtaining correct alignments in regions of low similarity between distantly related biological sequences, because mutations have added too much 'noise' over evolutionary time to allow for a meaningful comparison of those regions. Local alignment avoids such regions altogether and focuses on those with a positive score, i.e. those with an evolutionary conserved signal of similarity. A prerequisite for local alignment is a negative expectation score. The expectation score is defined as the average score that the scoring system (substitution matrix and gap penalties) would yield for a random sequence.

Another motivation for using local alignments is that there is a reliable statistical model (developed by Karlin and Altschul) for optimal local alignments. The alignment of unrelated sequences tends to produce optimal local alignment scores which follow an extreme value distribution. This property allows programs to produce an expectation value for the optimal local alignment of two sequences, which is a measure of how often two unrelated sequences would produce an optimal local alignment whose score is greater than or equal to the observed score. Very low expectation values indicate that the two sequences in question might be homologous, meaning they might share a common ancestor.

However, the Smith-Waterman algorithm is fairly demanding of time and memory resources: in order to align two sequences of lengths "m" and "n", "O(mn)" time and space are required. As a result, it has largely been replaced in practical use by the BLAST algorithm; although not guaranteed to find optimal alignments, BLAST is much more efficient.

An implementation of the Smith-Waterman Algorithm, SSEARCH, is available in the FASTA sequence analysis package from [http://fasta.bioch.virginia.edu/fasta_www2/fasta_down.shtml] . This implementation includes Altivec accelerated code for PowerPC G4 and G5 processors that speeds up comparisons 10 - 20-fold, using a modification of the Wozniak, 1997 approachcite journal |author=Wozniak A |title=Using video-oriented instructions to speed up sequence comparison |journal=Comput Appl Biosci |volume=13 |issue=2 |pages=145–50 |year=1997 |url=http://bioinformatics.oxfordjournals.org/cgi/reprint/13/2/145.pdf] , and an SSE2 vectorization developed by Farrar cite journal |author=Farrar M |title=Striped Smith–Waterman speeds database searches six times over other SIMD implementations |journal=Bioinformatics |volume=23 |pages=156–161 |year=2007 |url=http://bioinformatics.oxfordjournals.org/cgi/reprint/23/2/156.pdf |doi=10.1093/bioinformatics/btl582 |pmid=17110365] making optimal protein database searches quite practical.

Accelerated versions

FPGA

Other recent work developed by Cray demonstrates acceleration of the Smith-Waterman algorithm using a reconfigurable computing platform based on FPGA chips.Cray Computer Corp, " [http://www.cray.com/downloads/SmithWaterman.pdf Smith-Waterman Solution for Life Sciences] ".] The results show up to 28x speed-up over standard microprocessor-based solutions. An FPGA based version of the Smith-Waterman algorithm shows FPGA (Virtex-4) speedups up to 100x [FPGA 100x Papers: [http://ft.ornl.gov/~olaf/pubs/OlafRSSI2July07.pdf] , [http://ft.ornl.gov/~olaf/pubs/CUG07Olaf17M07.pdf] , and [http://ft.ornl.gov/~olaf/pubs/RSSIOlafDave.pdf] ] over a 2.2 GHz Opteron processor.Progeniq Pte Ltd, " [http://www.progeniq.com/news/BioBoost%20White%20Paper.pdf White Paper - Accelerating Intensive Applications at 10x-50x Speedup to Remove Bottlenecks in Computational Workflows] ".]

GPU

Recent work developed at Lawrence Livermore National Laboratory and the US Department of Energy's Joint Genome Institute accelerates Smith-Waterman local sequence alignment searches using graphics processing units (GPUs) with preliminary results showing a 2x speed-up over software implementations. A similar method has already been implemented in the Biofacet software since 1997, with the same speed-up factor. [cite web |url=http://www.genomequest.com/contact-bioinformatics-ht.html |title=Bioinformatics High Throughput Sequence Seatch and Analysis (white paper) |accessdate=2008-05-09 |publisher=GenomeQuest]

A GPGPU implementation of the algorithm in the CUDA language by NVIDIA is also available.cite journal |author=Manavski SA, Valle G |title=CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment |journal=BMC Bioinformatics |volume=9 |issue=Suppl 2:S10 |year=2008 |url=http://www.biomedcentral.com/1471-2105/9/S2/S10 |doi=10.1186/1471-2105-9-S2-S10 |pages=S10] The performance tests on this solution show up to a 30 fold speed increase when compared to reference CPU-based implementations on the same machine.

E

In 2000, a fast implementation of the Smith-Waterman algorithm using the SIMD technology available in Intel Pentium MMX processors and similar technology was described in a publication by Rognes and Seebergcite journal|author=Rognes T and Seeberg E |title=Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors |journal= [http://bioinformaticsARRAYxfordjournalsARRAYrg/ Bioinformatics] |volume=16|pages=699–706|year=2000 |url=http://bioinformatics.oxfordjournals.org/cgi/reprint/16/8/699.pdf] . In contrast to the Wozniak (1997) approach, the new implementation was based on vectors parallel with the query sequence, not diagonal vectors. The company [http://www.sencel.com/ Sencel Bioinformatics] has applied for a patent covering this approach. Sencel is developing the software further and provides executables for academic use free of charge.

A SSE2 vectorization of the algorithm (Farrar, 2007) is now available providing an 8-fold speedup on Intel/AMD processors with SSE2 extensions. When running on Intel processor using the new Intel Core microarchitecture the SSE2 implementation achieves a 20-fold increase.

Danish bioinformatics company CLC bio has achieved speed-ups of close to 200 over standard software implementations with SSE2 on a Intel 2.17 GHz Core 2 Duo CPU, according to a [http://www.clccell.com/download.html publicly available white paper] .

Accelerated version of the Smith-Waterman algorithm, on Intel and AMD based Linux servers, is supported by the [http://www.biocceleration.com/GenCore6-General.html GenCore 6] package, offered by [http://www.biocceleration.com Biocceleration] . Performance benchmarks of this software package show up to 10 fold speed acceleration relative to standard software implementation on the same processor.

Currently the only company in bioinformatics to offer both SSE and FPGA solutions accelerating Smith-Waterman, CLC bio has achieved speed-ups of more than 110 over standard software implementations with [http://www.clccube.com CLC Bioinformatics Cube] .

The [http://www.timelogic.com TimeLogic] DeCypher and CodeQuest systems also accelerate Smith-Waterman and Framesearch using FPGA technology.

Cell Broadband Engine

A papercite journal|author=Farrar M S |title=Optimizing Smith-Waterman for the Cell Broadband Engine |year=2008 |url=http://farrar.michael.googlepages.com/smith-watermanfortheibmcellbe] describing a port of the Striped Smith-Waterman to the Cell Broadband Engine reports speeds of 32 GCUPS on an IBM QS20 blade and 12 GCUPS on a Sony PlayStation 3.

References

External links

* [http://jaligner.sourceforge.net/ JAligner] — an open source Java implementation of the Smith-Waterman algorithm
* [http://baba.sourceforge.net/ B.A.B.A.] — an applet (with source) which visually explains the algorithm.
* [http://www.ebi.ac.uk/Tools/fasta FASTA/SSEARCH at the] EBI's FASTA/SSEARCH services page.

ee also

*BLAST
*FASTA
*Levenshtein distance
*Needleman-Wunsch algorithm


Wikimedia Foundation. 2010.

Игры ⚽ Нужен реферат?

Look at other dictionaries:

  • Waterman — may refer to:*Watermen, river workers who transferred passengers across and along the city centre rivers in Britain * Watermen, harvesters of seafood in the Chesapeake Bay region of the United States * The term waterman (or, more recently,… …   Wikipedia

  • Michael Waterman — Born June 28, 1942 …   Wikipedia

  • Temple F. Smith — is a university professor in BioMedical Engineering who helped to develop the Smith Waterman algorithm developed with Michael Waterman in 1981. The Smith Waterman algorithm serves as the basis for multi sequence comparisons, identifying the… …   Wikipedia

  • Needleman–Wunsch algorithm — The Needleman–Wunsch algorithm performs a global alignment on two sequences (called A and B here). It is commonly used in bioinformatics to align protein or nucleotide sequences. The algorithm was published in 1970 by Saul B. Needleman and… …   Wikipedia

  • Hirschberg's algorithm — Hirschberg s algorithm, named after its inventor, Dan Hirschberg, is a dynamic programming algorithm that finds the least cost sequence alignment between two strings, where cost is measured as Levenshtein edit distance, defined to be the sumof… …   Wikipedia

  • Needleman-Wunsch algorithm — The Needleman–Wunsch algorithm performs a global alignment on two sequences (called A and B here). It is commonly used in bioinformatics to align protein or nucleotide sequences. The algorithm was published in 1970 by Saul Needleman and Christian …   Wikipedia

  • BLAST — Infobox Software name=BLAST developer=Myers, E., Altschul S.F., Gish W., Miller E.W., Lipman D.J., NCBI latest release version=2.2.18 operating system=UNIX, Linux, Mac, MS Windows genre=Bioinformatics tool license=Public Domain website=… …   Wikipedia

  • FASTA — is a DNA and Protein sequence alignment software package first described (as FASTP) by David J. Lipman and William R. Pearson in 1985 in the article [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve db=pubmed dopt=Abstract list… …   Wikipedia

  • Levenshtein distance — In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences. The term edit distance is often used to refer specifically to Levenshtein distance. The… …   Wikipedia

  • List of mathematics articles (S) — NOTOC S S duality S matrix S plane S transform S unit S.O.S. Mathematics SA subgroup Saccheri quadrilateral Sacks spiral Sacred geometry Saddle node bifurcation Saddle point Saddle surface Sadleirian Professor of Pure Mathematics Safe prime Safe… …   Wikipedia

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”