Sequence clustering

In bioinformatics, sequence clustering algorithms attempt to group biological sequences that are related in some way. The sequences can be of genomic, "transcriptomic" (ESTs) or protein origin. For proteins, homologous sequences are typically grouped into families. For EST data, clustering is important for grouping sequences that originate from the same gene before the ESTs are assembled to reconstruct the original mRNA.

Generally, the clustering algorithms are single-linkage methods: they construct a transitive closure of sequences whose similarity exceeds a particular threshold. The similarity score is often based on sequence alignment. Sequence clustering is often used to make a non-redundant set of representative sequences.
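The single-linkage idea described above can be sketched in a few lines of Python. This is a minimal illustration and not the method of any particular package: the function names are hypothetical, and the similarity score is a toy stand-in (the Jaccard index of shared 3-mers) used here in place of an alignment-based score. Two sequences are linked when their score reaches the threshold, and clusters are the connected groups that result, computed with a union-find structure.

```python
from itertools import combinations


def kmer_set(seq, k=3):
    """Return the set of overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Toy score (assumption): Jaccard index of k-mer sets.

    A real pipeline would typically use an alignment-based score instead.
    """
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def single_linkage_clusters(seqs, threshold):
    """Group sequences by the transitive closure of 'similarity >= threshold'."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair; merge the two groups whenever a pair is similar enough.
    for i, j in combinations(range(len(seqs)), 2):
        if similarity(seqs[i], seqs[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, seq in enumerate(seqs):
        clusters.setdefault(find(i), []).append(seq)
    return list(clusters.values())


def representatives(clusters):
    """Pick one member per cluster (here: the longest) as its representative."""
    return [max(cluster, key=len) for cluster in clusters]


seqs = ["ACGTACGT", "ACGTACGA", "CGTACGAT", "TTTTGGGG"]
clusters = single_linkage_clusters(seqs, threshold=0.75)
nonredundant = representatives(clusters)
```

With these example sequences, the first and third score below the threshold against each other, but both are similar enough to the second, so all three end up in one cluster. That chaining is what the transitive closure means in practice. The all-pairs comparison is quadratic in the number of sequences, which is acceptable for a sketch but not for large databases.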

Sequence clusters are often treated as roughly equivalent to protein families, although the two are not identical. Determining a representative tertiary structure for each "sequence cluster" is the aim of many structural genomics initiatives.

External links

Sequence clustering packages

* [http://www.ebi.ac.uk/~holm/nrdb90 RDB90 and nrdb90.pl: a nonredundant sequence database]
* [http://www.ebi.ac.uk/research/cgg/tribe/ TribeMCL: a method for clustering proteins into related groups]
* [http://bio.informatics.indiana.edu/sunkim/BAG/ BAG: a graph theoretic sequence clustering algorithm]
* [http://bioinformatics.org/cd-hit/ CD-HIT: a fast heuristic method for making non-redundant databases]
* [http://skyrah.bio.cc/RSDB/ RSDB: Representative Sequences DataBase project]
* [http://ratest.eng.uiowa.edu/pubsoft/clustering/ UICluster: Parallel Clustering of EST (Gene) Sequences]
* [http://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html BLASTClust single-linkage clustering with BLAST]
* [http://web.mit.edu/polz/clusterer Clusterer: extendable java application for sequence grouping and cluster analyses]
* [http://blast.wustl.edu/blast/README.html#Manifest PATDB: a program for rapidly identifying perfect substrings]
* [http://blast.wustl.edu/pub/nrdb NRDB: a program for merging trivially redundant (identical) sequences]
* [http://www.ebi.ac.uk/clustr/ CluSTr: A single-linkage protein sequence clustering database built from Smith-Waterman sequence similarities; covers over 7 million sequences, including UniProt and IPI]
* [http://troy.bioc.uvic.ca/tools/VOCS Virus Orthologous Clusters: A viral protein sequence clustering database; contains all predicted genes from eleven virus families organized into ortholog groups by BLASTP similarity]

Non-redundant sequence databases

* [http://www.fccc.edu/research/labs/dunbrack/pisces/ PISCES: A Protein Sequence Culling Server]
* [http://www.ebi.ac.uk/~holm/nrdb90 RDB90 and nrdb90.pl: a nonredundant sequence database]
* [http://www.uniprot.org/database/DBDescription.shtml#uniref UniRef: A non-redundant UniProt sequence database]


Wikimedia Foundation. 2010.
