- Sequence clustering
In bioinformatics, sequence clustering algorithms attempt to group sequences that are related in some way. The sequences can be of genomic, "transcriptomic" (ESTs) or protein origin. For proteins, homologous sequences are typically grouped into families. For EST data, clustering is important to group sequences originating from the same gene before the ESTs are assembled to reconstruct the original mRNA.

Generally, the clustering algorithms perform single-linkage clustering, constructing a transitive closure of sequences whose similarity exceeds a particular threshold. The similarity score is often based on sequence alignment. Sequence clustering is often used to make a non-redundant set of representative sequences.

Sequence clusters often correspond closely to protein families, although the two are not identical. Determining a representative tertiary structure for each sequence cluster is the aim of many structural genomics initiatives.

External links
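
As an illustration of the single-linkage approach described above, the following is a minimal Python sketch, not the method of any package listed here. It joins every pair of sequences whose score reaches a threshold and takes the transitive closure with a union-find structure. The function names (`similarity`, `single_linkage_clusters`, `representatives`) are hypothetical, the `difflib` ratio is only a stand-in for an alignment-based score, and choosing the longest member as the representative is an arbitrary assumption.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in score in [0, 1]; a real pipeline would use an alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def single_linkage_clusters(seqs, threshold, score=similarity):
    """Group sequences by the transitive closure of 'score >= threshold'."""
    parent = list(range(len(seqs)))  # union-find forest, one node per sequence

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All-against-all comparison: quadratic, fine for a small illustration.
    for i, j in combinations(range(len(seqs)), 2):
        if score(seqs[i], seqs[j]) >= threshold:
            parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i, seq in enumerate(seqs):
        clusters.setdefault(find(i), []).append(seq)
    return list(clusters.values())


def representatives(clusters):
    """Pick the longest member of each cluster (illustrative rule only)."""
    return [max(cluster, key=len) for cluster in clusters]
```

With a threshold of 0.75, the sequences `AAAA`, `AAAT` and `AATT` end up in one cluster even though `AAAA` and `AATT` score only 0.5 against each other, because `AAAT` links them. This chaining is characteristic of single-linkage clustering.
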
Sequence clustering packages
* [http://www.ebi.ac.uk/~holm/nrdb90 RDB90 and nrdb90.pl: a nonredundant sequence database]
* [http://www.ebi.ac.uk/research/cgg/tribe/ TribeMCL: a method for clustering proteins into related groups]
* [http://bio.informatics.indiana.edu/sunkim/BAG/ BAG: a graph theoretic sequence clustering algorithm]
* [http://bioinformatics.org/cd-hit/ CD-HIT: a fast heuristic method for making non-redundant databases]
* [http://skyrah.bio.cc/RSDB/ RSDB: Representative Sequences DataBase project]
* [http://ratest.eng.uiowa.edu/pubsoft/clustering/ UICluster: Parallel Clustering of EST (Gene) Sequences]
* [http://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html BLASTClust: single-linkage clustering with BLAST]
* [http://web.mit.edu/polz/clusterer Clusterer: extensible Java application for sequence grouping and cluster analyses]
* [http://blast.wustl.edu/blast/README.html#Manifest PATDB: a program for rapidly identifying perfect substrings]
* [http://blast.wustl.edu/pub/nrdb NRDB: a program for merging trivially redundant (identical) sequences]
* [http://www.ebi.ac.uk/clustr/ CluSTr: a single-linkage protein sequence clustering database from Smith-Waterman sequence similarities; covers over 7 million sequences including UniProt and IPI]
* [http://troy.bioc.uvic.ca/tools/VOCS Virus Orthologous Clusters: a viral protein sequence clustering database; contains all predicted genes from eleven virus families organized into ortholog groups by BLASTP similarity]
Non-redundant sequence databases
* [http://www.fccc.edu/research/labs/dunbrack/pisces/ PISCES: A Protein Sequence Culling Server]
* [http://www.ebi.ac.uk/~holm/nrdb90 RDB90 and nrdb90.pl: a nonredundant sequence database]
* [http://www.uniprot.org/database/DBDescription.shtml#uniref UniRef: A non-redundant UniProt sequence database]
Wikimedia Foundation. 2010.