FASTA format

In bioinformatics, FASTA format is a text-based format for representing either nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences.

The simplicity of FASTA format makes it easy to manipulate and parse sequences using text-processing tools and scripting languages like Python and Perl.

Format

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the ">" and the first letter of the identifier. It is recommended that all lines of text be shorter than 80 characters. The sequence ends if another line starting with a ">" appears; this indicates the start of another sequence. A simple example of one sequence in FASTA format:

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
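The layout described above (a ">" line followed by sequence lines, ending at the next ">" line or at the end of the input) can be read with a few lines of Python. The following is a minimal sketch rather than a reference parser; the function names `parse_fasta` and `_make_record` are chosen here for illustration and do not come from any library.

```python
from typing import Iterable, Iterator, List, Tuple


def _make_record(header: str, chunks: List[str]) -> Tuple[str, str, str]:
    # The first word after ">" is the identifier; the rest of the line is
    # the description. Either part may be empty.
    identifier, _, description = header.partition(" ")
    return identifier, description.strip(), "".join(chunks)


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each record."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new ">" line closes the previous record, if there was one.
            if header is not None:
                yield _make_record(header, chunks)
            header = line[1:]
            chunks = []
        elif header is not None and line.strip():
            # Sequence data may span several lines; join them later.
            chunks.append(line.strip())
    if header is not None:
        yield _make_record(header, chunks)


example = [
    ">gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]",
    "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV",
    "IENY",
]
for identifier, description, sequence in parse_fasta(example):
    print(identifier, description, len(sequence))
```

Because the function accepts any iterable of lines, an open file object can be passed directly, so large files are read one line at a time.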

Format converters

FASTA files can be batch converted to or from MultiFASTA format using tools, some of which are available as freeware. Tools are also available for batch conversion from chromatogram formats (ABI/SCF) to FASTA.

Header line

The header line, which begins with '>', gives a name and/or a unique identifier for the sequence, and often much additional information as well. Many sequence databases use standardized headers, which helps when extracting information from the header automatically. The header line may contain more than one header, separated by a ^A (Control-A) character (as in ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz).
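As an illustration of the Control-A convention, the sketch below splits one header line into its individual headers. Control-A is the character with code 1 (`"\x01"` in Python). The function name `split_headers` and the identifiers in the usage example are made up for this illustration and are not real database entries.

```python
from typing import List, Tuple


def split_headers(header_line: str) -> List[Tuple[str, str]]:
    """Return one (identifier, description) pair per merged header."""
    # Drop the leading ">" if present.
    text = header_line[1:] if header_line.startswith(">") else header_line
    headers = []
    for part in text.split("\x01"):  # "\x01" is Control-A
        identifier, _, description = part.strip().partition(" ")
        headers.append((identifier, description.strip()))
    return headers


# Hypothetical header line carrying two merged headers.
line = ">gi|1|ref|X1| protein A [Species one]\x01gi|2|ref|X2| protein A [Species two]"
for identifier, description in split_headers(line):
    print(identifier, "->", description)
```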

In the original Pearson FASTA format, one or more comments, each distinguished by a semicolon at the beginning of the line, may occur after the header. Most databases and bioinformatics applications do not recognize these comments and instead follow [http://www.ncbi.nlm.nih.gov/blast/fasta.shtml the NCBI FASTA specification]. An example of a multiple sequence FASTA file follows:

>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
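A multiple sequence file of this kind can also be produced programmatically. The sketch below writes each record as a header line followed by the sequence wrapped to a fixed width. The function name `format_fasta` and the default width of 60 are choices made for this illustration; any width below 80 stays within the recommended line length.

```python
from typing import Iterable, Tuple


def format_fasta(records: Iterable[Tuple[str, str, str]], width: int = 60) -> str:
    """Render (identifier, description, sequence) records as FASTA text."""
    lines = []
    for identifier, description, sequence in records:
        header = ">" + identifier
        if description:
            header += " " + description
        lines.append(header)
        # Wrap the sequence so that no line exceeds `width` characters.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "".join(line + "\n" for line in lines)


print(format_fasta([
    ("SEQUENCE_1", "", "MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEGLVSVKVSDDF"),
    ("SEQUENCE_2", "", "SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI"),
]), end="")
```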

Sequence representation

After the header line and comments, one or more lines may follow describing the sequence: each line of a sequence should have fewer than 80 characters. Sequences may be protein sequences or nucleic acid sequences, and they can contain gaps or alignment characters (see sequence alignment). Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap character; and in amino acid sequences, U and * are acceptable letters (see below). Numerical digits are not allowed but are used in some databases to indicate the position in the sequence.

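The nucleic acid codes supported are:

  A   adenine
  C   cytosine
  G   guanine
  T   thymine
  U   uracil
  R   A or G (purine)
  Y   C or T (pyrimidine)
  K   G or T (keto)
  M   A or C (amino)
  S   C or G (strong)
  W   A or T (weak)
  B   C, G or T (not A)
  D   A, G or T (not C)
  H   A, C or T (not G)
  V   A, C or G (not T)
  N   any base
  -   gap of indeterminate length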
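The representation rules can be checked mechanically. The sketch below maps lower-case letters to upper case, drops whitespace and position digits, and rejects any remaining character outside the chosen alphabet. The names `NUCLEIC_CODES`, `AMINO_CODES` and `normalize_sequence` are illustrative. The alphabets are an assumption based on the standard IUB/IUPAC single-letter codes plus the exceptions named above (the hyphen as gap, and U and * for amino acids). Silently dropping digits is a design choice for this sketch, and a stricter tool might reject them instead.

```python
from typing import Set

# Assumed alphabets: IUB/IUPAC single-letter codes plus the gap character.
NUCLEIC_CODES = set("ACGTURYKMSWBDHVN-")
# Amino acid codes, with U and * accepted as described in the text.
AMINO_CODES = set("ABCDEFGHIKLMNPQRSTVWXYZ") | set("U*-")


def normalize_sequence(raw: str, alphabet: Set[str]) -> str:
    """Upper-case the sequence, drop whitespace and digits, then validate."""
    cleaned = "".join(
        ch for ch in raw if not ch.isspace() and not ch.isdigit()
    ).upper()
    invalid = sorted(set(cleaned) - alphabet)
    if invalid:
        raise ValueError("unexpected characters: " + "".join(invalid))
    return cleaned


print(normalize_sequence("acgt 10\nnnry-", NUCLEIC_CODES))
```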

Example Header Block:


#Dbcomponent=1
#Name=UniProt_SwissProt
#PrimaryIdentifierType=sp_ac
#Version=52.3
#ReleaseDate=20070425
#NumberOfEntries=248942
#Sequence_type=Protein_sequence

#Dbcomponent=2
#Name=ENSEMBL
#PrimaryIdentifierType=sp_ac
#Version=12.45.3.2
#ReleaseDate=20070425
#NumberOfEntries=1234567
#Sequence_type=Protein_sequence
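A header block like the one above can be read into one dictionary per database component. The sketch below assumes that every line has the form `#Key=Value` and that each `#Dbcomponent=` line starts a new component. That reading is inferred from the example alone, and the function name `parse_header_block` is illustrative.

```python
from typing import Dict, Iterable, List


def parse_header_block(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Group '#Key=Value' lines into one dict per database component."""
    components: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.strip()
        # Ignore blank lines and anything that is not a '#Key=Value' pair.
        if not line.startswith("#") or "=" not in line:
            continue
        key, _, value = line[1:].partition("=")
        # Assumption: a Dbcomponent line opens a new component.
        if key == "Dbcomponent" or not components:
            components.append({})
        components[-1][key] = value
    return components


block = """#Dbcomponent=1
#Name=UniProt_SwissProt
#Version=52.3

#Dbcomponent=2
#Name=ENSEMBL
#Version=12.45.3.2
""".splitlines()

for component in parse_header_block(block):
    print(component["Name"], component["Version"])
```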

Sequence header line

Example Protein Entry:

>sp_ac|P02769_WOSIG0 ID=ALBU_BOVIN DE="Serum albumin precursor (Allergen Bos d 6) (BSA)" NCBITAXID=9913 MODRES=(1|Acetyl) VARIANT=(196|A|T) LENGTH=589
RGVFRRDTHKSEIAHRFKDLGEEHFKGLVLIAFSQYLQQCPFDEHVKLVNELTEFAKTCV
ADESHAGCEKSLHTLFGDELCKVASLRETYGDMADCCEKQEPERNECFLSHKDDSPDLPK
LKPDPNTLCDEFKADEKKFWGKYLYEIARRHPYFYAPELLYYANKYNGVFQECCQAEDKG
ACLLPKIETMREKVLASSARQRLRCASIQKFGERALKAWSVARLSQKFPKAEFVEVTKLV
TDLTKVHKECCHGDLLECADDRADLAKYICDNQDTISSKLKECCDKPLLEKSHCIAEVEK
DAIPENLPPLTADFAEDKDVCKNYQEAKDAFLGSFLYEYSRRHPEYAVSVLLRLAKEYEA
TLEECCAKDDPHACYSTVFDKLKHLVDEPQNLIKQNCDQFEKLGEYGFQNALIVRYTRKV
PQVSTPTLVEVSRSLGKVGTRCCTKPESERMPCTEDYLSLILNRLCVLHEKTPVSEKVTK
CCTESLVNRRPCFSALTPDETYVPKAFDEKLFTFHADICTLPDTEKQIKKQTALVELLKH
KPKATEEQLKTVMENFVAFVDKCCAADDKEACFAVEGPKLVVSTQTALA
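The header of this entry consists of an identifier of the form `type|accession` followed by `KEY=value` fields, where a value may be enclosed in double quotes. The sketch below extracts these parts with a regular expression. The field grammar is inferred from this single example rather than taken from a specification, and the function name `parse_annotated_header` is illustrative.

```python
import re
from typing import Dict, Tuple

# A field is KEY= followed by either a double-quoted string or a run of
# non-space characters (assumed grammar, based on the example entry).
_FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_annotated_header(header_line: str) -> Tuple[str, str, Dict[str, str]]:
    """Return (identifier type, accession, fields) for an annotated header."""
    text = header_line.lstrip(">")
    prefix, _, rest = text.partition(" ")
    id_type, _, accession = prefix.partition("|")
    fields = {key: value.strip('"') for key, value in _FIELD.findall(rest)}
    return id_type, accession, fields


header = (
    '>sp_ac|P02769_WOSIG0 ID=ALBU_BOVIN '
    'DE="Serum albumin precursor (Allergen Bos d 6) (BSA)" '
    'NCBITAXID=9913 MODRES=(1|Acetyl) VARIANT=(196|A|T) LENGTH=589'
)
id_type, accession, fields = parse_annotated_header(header)
print(id_type, accession, fields["ID"], fields["LENGTH"])
```

The values are kept as strings, so a caller that needs the sequence length as a number would convert `fields["LENGTH"]` itself.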

See also

* FASTA Search
* Stockholm format

External links

* [http://www.proteomecommons.org/data/fasta/hupo_standard.jsp HUPO-PSI Standard FASTA Format] describes another FASTA format as put forward by the Human Proteome Organisation's Proteomics Standards Initiative.
* [http://iubio.bio.indiana.edu/soft/molbio/readseq/ Readseq] for converting sequence formats to FASTA - Not updated since 1999. Needs Java.
* [http://iubio.bio.indiana.edu/cgi-bin/readseq.cgi Readseq online at IUBio] -- [http://searchlauncher.bcm.tmc.edu/seq-util/readseq.html Readseq online at BCM]
* [http://www.bugaco.com/bioinf/ Nexus to Fasta converter] - Needs Java
* [http://gp2fasta.ovh.org/ GenBank to Fasta converter] - Poorly documented.


Wikimedia Foundation. 2010.
