- Stockholm format
Stockholm format is a multiple sequence alignment format used by Pfam and Rfam to disseminate protein and RNA sequence alignments (Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res 33(Database issue):D121-4, 2005, PMID 15608160; Griffiths-Jones S, Bateman A, Marshall M, Khanna A, Eddy SR. Rfam: an RNA family database. Nucleic Acids Res 31(1):439-41, 2003, PMID 12520045; Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, Bateman A. The Pfam protein families database. Nucleic Acids Res 36(Database issue):D281-8, 2008, PMID 18039703). The alignment editors [http://personalpages.manchester.ac.uk/staff/sam.griffiths-jones/software/ralee/ Ralee] and [ftp://ftp.cgb.ki.se/pub/prog/belvu Belvu] support Stockholm format, as do the probabilistic database search tools [http://infernal.janelia.org/ Infernal] and HMMER. A simple example of an Rfam alignment (UPSK RNA) in Stockholm format is shown below:
 # STOCKHOLM 1.0
 AF035635.1/619-641   UGAGUUCUCGAUCUCUAAAAUCG
 M24804.1/82-104      UGAGUUCUCUAUCUCUAAAAUCG
 J04373.1/6212-6234   UAAGUUCUCGAUCUUUAAAAUCG
 M24803.1/1-23        UAAGUUCUCGAUCUCUAAAAUCG
 #=GC SS_cons         .AAA....<<<<aaa....>>>>
 //

A minimal well-formed Stockholm file should contain a header stating the format and version identifier, currently '# STOCKHOLM 1.0', followed by the sequences and their corresponding unique sequence names:

 <seqname> <aligned sequence>
 <seqname> <aligned sequence>
 <seqname> <aligned sequence>

'<seqname>' stands for "sequence name", typically in the form "name/start-end" or just "name". Finally, the "//" line indicates the end of the alignment. Sequence letters may include any characters except whitespace. Gaps may be indicated by "." or "-".

The alignment mark-up:
Mark-up lines may include any characters except whitespace. Use underscore ("_") instead of space.
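 #=GF <feature> <Generic per-File annotation, free text>
 #=GC <feature> <Generic per-Column annotation, exactly 1 char per column>
 #=GS <seqname> <feature> <Generic per-Sequence annotation, free text>
 #=GR <seqname> <feature> <Generic per-Residue annotation, exactly 1 char per residue>

Magic or recommended features: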
#=GF
(See [ftp://selab.janelia.org/pub/Pfam/userman.txt Pfam documentation], under "Description of fields".)
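The line types described so far (the header, sequence lines, the four kinds of mark-up line and the closing "//") can be told apart by their first whitespace-separated field. The following Python sketch shows one way to do this. It is not taken from Pfam, Rfam or any existing library: the function name `parse_stockholm`, the layout of the returned dictionary and the toy alignment in `EXAMPLE` are all made up for illustration, and the sketch assumes one line per sequence and one alignment per input.

```python
# Hypothetical toy alignment, invented for this example.
EXAMPLE = """\
# STOCKHOLM 1.0
#=GF ID toy_family
#=GS seqA/1-8 DE first_toy_sequence
seqA/1-8         GGCAAGCC
#=GR seqA/1-8 SS <<<..>>>
seqB/1-8         GGC.AGCC
#=GC SS_cons     <<<..>>>
//
"""


def parse_stockholm(text):
    """Split one Stockholm alignment into sequences and mark-up (sketch)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != "# STOCKHOLM 1.0":
        raise ValueError("missing '# STOCKHOLM 1.0' header")

    aln = {"seqs": {}, "GF": {}, "GC": {}, "GS": {}, "GR": {}}
    for ln in lines[1:]:
        if ln == "//":                      # end of the alignment
            return aln
        fields = ln.split()
        tag = fields[0]
        if tag == "#=GF":                   # per-file: feature, free text
            aln["GF"].setdefault(fields[1], []).append(" ".join(fields[2:]))
        elif tag == "#=GC":                 # per-column: feature, string
            aln["GC"][fields[1]] = fields[2]
        elif tag == "#=GS":                 # per-sequence: name, feature, text
            per_seq = aln["GS"].setdefault(fields[1], {})
            per_seq.setdefault(fields[2], []).append(" ".join(fields[3:]))
        elif tag == "#=GR":                 # per-residue: name, feature, string
            aln["GR"].setdefault(fields[1], {})[fields[2]] = fields[3]
        elif tag.startswith("#"):           # any other comment line is skipped
            continue
        else:                               # sequence line: name, residues
            name, residues = fields[0], fields[1]
            if name in aln["seqs"]:
                raise ValueError("duplicate sequence name: " + name)
            aln["seqs"][name] = residues
    raise ValueError("missing '//' terminator")


aln = parse_stockholm(EXAMPLE)
print(aln["seqs"])
print(aln["GC"]["SS_cons"])
```

#=GF and #=GS values are kept in lists because a feature may legitimately appear on several lines, whereas #=GC and #=GR values are stored as single strings.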
For embedding trees:
 #=GF NH <tree in New Hampshire eXtended format>
 #=GF TN <Unique identifier for the next tree>

Notes:
* A tree may be stored on multiple #=GF NH lines.
* If multiple trees are stored in the same file, each tree must be preceded by a #=GF TN line with a unique tree identifier. If only one tree is included, the #=GF TN line may be omitted.

#=GS
Rfam and Pfam use these features:

 Feature                  Description
 -------                  -----------
 AC <accession>           ACcession number
 DE <freetext>            DEscription
 DR <db>; <accession>;    Database Reference
 OS <organism>            OrganiSm (species)
 OC <clade>               Organism Classification (clade, etc.)
 LO <look>                Look (Color, etc.)

#=GR
 Feature   Description             Markup letters
 -------   -----------             --------------
 SS        Secondary Structure     For RNA [.,;<>(){}[]AaBb...], for protein [HGIEBTSCX]
 SA        Surface Accessibility   [0-9X] (0=0%-10%; ...; 9=90%-100%)
 TM        TransMembrane           [Mio]
 PP        Posterior Probability   [0-9*] (0=0.00-0.05; 1=0.05-0.15; *=0.95-1.00)
 LI        LIgand binding          [*]
 AS        Active Site             [*]
 IN        INtron (in or after)    [0-2]

#=GC
The same features as for #=GR with "_cons" appended, meaning "consensus". Example: "SS_cons".
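An RNA "SS_cons" string can be turned into explicit column pairs. The Python sketch below is one possible reading, not a reference implementation: it assumes that matching brackets mark nested base pairs, that an upper case letter pairs with the matching lower case letter in the same nested fashion (a pseudoknot), and that every other character is unpaired. The function name `wuss_pairs` and the 23-column input string are illustrative.

```python
def wuss_pairs(ss):
    """Return sorted 0-based (open, close) column pairs from an SS string."""
    closers = {">": "<", ")": "(", "]": "[", "}": "{"}
    openers = set(closers.values())
    stacks = {}   # opening symbol -> columns still waiting for a partner
    pairs = []
    for col, ch in enumerate(ss):
        if ch in openers or "A" <= ch <= "Z":
            stacks.setdefault(ch, []).append(col)
        elif ch in closers or "a" <= ch <= "z":
            # A bracket closes its own bracket type; a lower case letter
            # closes the same letter in upper case.
            key = closers.get(ch, ch.upper())
            if not stacks.get(key):
                raise ValueError(f"unmatched {ch!r} at column {col}")
            pairs.append((stacks[key].pop(), col))
        # any other character ('.', ',', ';', ...) is treated as unpaired
    unclosed = [key for key, cols in stacks.items() if cols]
    if unclosed:
        raise ValueError(f"unclosed symbols: {unclosed}")
    return sorted(pairs)


print(wuss_pairs(".AAA....<<<<aaa....>>>>"))
# [(1, 14), (2, 13), (3, 12), (8, 22), (9, 21), (10, 20), (11, 19)]
```

Keeping one stack per opening symbol lets the pseudoknot letters cross the bracketed helix without disturbing its pairing.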
Notes:
*Do not use multiple lines with the same #=GR label. Only one unique feature assignment can be made for each sequence.
*"X" in SA and SS means "residue with unknown structure".
*The protein SS letters are taken from DSSP: H=alpha-helix, G=3/10-helix, I=pi-helix, E=extended strand, B=residue in isolated beta-bridge, T=turn, S=bend, C=coil/loop.
*The RNA SS letters are taken from WUSS (Washington University Secondary Structure) notation. Matching nested bracket characters <>, (), [], or {} indicate a basepair. The symbols '.', ',' and ';' indicate unpaired regions, and matched upper and lower case characters from the English alphabet indicate pseudoknot interactions.

Recommended placements:
* #=GF Above the alignment
* #=GC Below the alignment
* #=GS Above the alignment or just below the corresponding sequence
* #=GR Just below the corresponding sequence

Size limits:
*No size limits on any field.
*However, a simple parser that uses fixed field sizes should work safely on Pfam and Rfam alignments with these limits:
** Line length: 10000.
** <seqname>: 255.
** <feature>: 255.

See also
*FASTA format
*Rfam
*Pfam

External links
* [http://sonnhammer.sbc.su.se/Stockholm.html Erik Sonnhammer's definition of Stockholm format]