GeneMark

GeneMark, developed in 1993, was the first gene-finding method recognized as an efficient and accurate tool for genome projects. It was used to annotate the first completely sequenced bacterium, Haemophilus influenzae, and the first completely sequenced archaeon, Methanococcus jannaschii. The GeneMark algorithm uses species-specific inhomogeneous Markov chain models of protein-coding DNA sequence together with homogeneous Markov chain models of non-coding DNA. The model parameters are estimated from training sets of sequences of known type. The central step of the algorithm computes the a posteriori probability that a sequence fragment carries genetic code in one of six possible frames (three on each DNA strand) or is “non-coding”.
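
The posterior computation described above can be illustrated with a minimal Python sketch. It is not the GeneMark implementation: it assumes a zero-order, three-periodic coding model (one nucleotide distribution per codon position) instead of the higher-order Markov chains GeneMark uses, and all probabilities, the prior, and the function names are made up for illustration.

```python
import math

# Toy parameters (assumptions for illustration, not trained GeneMark values).
# Inhomogeneous coding model: one nucleotide distribution per codon position.
CODING = [
    {"A": 0.30, "C": 0.20, "G": 0.35, "T": 0.15},  # codon position 1
    {"A": 0.30, "C": 0.25, "G": 0.15, "T": 0.30},  # codon position 2
    {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},  # codon position 3
]
# Homogeneous non-coding model: the same distribution at every position.
NONCODING = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def log_lik_coding(seq, frame):
    # frame is the codon position (0, 1 or 2) of the fragment's first nucleotide.
    return sum(math.log(CODING[(i + frame) % 3][b]) for i, b in enumerate(seq))


def log_lik_noncoding(seq):
    return sum(math.log(NONCODING[b]) for b in seq)


def posteriors(seq, prior_coding=0.5):
    """Bayes posterior over six coding frames plus the non-coding hypothesis."""
    rc = reverse_complement(seq)
    logs = {}
    for f in range(3):
        # The coding prior is split evenly over the six frames (an assumption).
        logs[f"+{f + 1}"] = math.log(prior_coding / 6) + log_lik_coding(seq, f)
        logs[f"-{f + 1}"] = math.log(prior_coding / 6) + log_lik_coding(rc, f)
    logs["non-coding"] = math.log(1 - prior_coding) + log_lik_noncoding(seq)
    # Normalize in log space to avoid underflow on long fragments.
    top = max(logs.values())
    weights = {k: math.exp(v - top) for k, v in logs.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


if __name__ == "__main__":
    post = posteriors("GAG" * 20)
    for hypothesis, p in sorted(post.items(), key=lambda kv: -kv[1]):
        print(f"{hypothesis:>10}  {p:.4f}")
```

The seven hypotheses compete directly, so the reported frame is simply the one with the largest posterior for the fragment.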

GeneMark.hmm (prokaryotic)

The GeneMark.hmm algorithm was designed to improve gene prediction quality, in particular to improve on GeneMark in finding exact gene starts. The idea was to integrate the GeneMark models into a hidden Markov model framework in which gene boundaries are modeled as transitions between hidden states. In addition, a ribosome binding site model is used to make gene start predictions more accurate. Evaluations by different groups showed that GeneMark.hmm is significantly more accurate than GeneMark in exact gene prediction. Since 1998, GeneMark.hmm and its self-training version, GeneMarkS, have been standard tools for gene identification in new prokaryotic genomic sequences, including metagenomes.
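
The idea of reading gene boundaries off transitions between hidden states can be sketched with a two-state Viterbi decoder in Python. This is a deliberate simplification and an assumption of this example: GeneMark.hmm uses a richer state set, the GeneMark coding models, and a ribosome binding site model, none of which are reproduced here. All state names and probabilities are invented for illustration.

```python
import math

# Toy two-state HMM (all values are illustrative assumptions).
STATES = ("noncoding", "coding")
START = {"noncoding": 0.9, "coding": 0.1}
TRANS = {
    "noncoding": {"noncoding": 0.95, "coding": 0.05},
    "coding": {"noncoding": 0.05, "coding": 0.95},
}
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the most probable hidden-state path for a non-empty sequence."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        pointers, new_score = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def boundaries(path):
    """Positions where the hidden state changes, i.e. predicted boundaries."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]


if __name__ == "__main__":
    seq = "AT" * 10 + "GC" * 10 + "AT" * 10
    print(boundaries(viterbi(seq)))
```

Because the whole path is decoded at once, a boundary is placed where the switch is globally most probable rather than where a local window score happens to cross a threshold.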

GeneMark.hmm (eukaryotic)

The next step after developing prokaryotic GeneMark.hmm was to extend the approach to eukaryotic genomes, where accurate prediction of protein-coding exon boundaries is the major challenge. The HMM architecture of eukaryotic GeneMark.hmm consists of hidden states for initial, internal and terminal exons, introns, intergenic regions and single-exon genes located on both DNA strands. It also includes hidden states for the initiation site, the termination site, and the donor and acceptor splice sites. GeneMark.hmm has been frequently used for annotation of plant and animal genomes.
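
The hidden states listed above can be arranged into a transition table, sketched here in Python for a single strand. The state names and the allowed transitions are assumptions based on ordinary gene structure (exons alternate with introns through donor and acceptor sites); they are not taken from the GeneMark.hmm source and may differ from its actual topology.

```python
# Hypothetical segment-level transition table for one DNA strand.
# Each key is a hidden state; the value is the set of states that may follow it.
ALLOWED = {
    "intergenic": {"start_site"},
    "start_site": {"initial_exon", "single_exon_gene"},
    "initial_exon": {"donor_site"},
    "donor_site": {"intron"},
    "intron": {"acceptor_site"},
    "acceptor_site": {"internal_exon", "terminal_exon"},
    "internal_exon": {"donor_site"},
    "terminal_exon": {"stop_site"},
    "single_exon_gene": {"stop_site"},
    "stop_site": {"intergenic"},
}


def is_valid_parse(states):
    """Check that a sequence of segment labels follows the table above."""
    if not states or states[0] != "intergenic" or states[-1] != "intergenic":
        return False
    return all(b in ALLOWED.get(a, ()) for a, b in zip(states, states[1:]))


if __name__ == "__main__":
    two_exon_gene = [
        "intergenic", "start_site", "initial_exon", "donor_site", "intron",
        "acceptor_site", "terminal_exon", "stop_site", "intergenic",
    ]
    print(is_valid_parse(two_exon_gene))
```

A full model would duplicate these states for the complementary strand and attach length distributions and emission models to each segment.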

Heuristic Models

Computational methods for accurate gene finding in DNA sequences require models of protein-coding and non-coding regions derived either from experimentally validated training sets or from large amounts of anonymous DNA sequence. A heuristic method for deriving the parameters of inhomogeneous Markov models of protein-coding regions was proposed in 1999. The heuristic method relies on the observation that the parameters of the Markov models used in GeneMark can be approximated by functions of the sequence G+C content. Therefore, a short DNA sequence sufficient for estimating the genome G+C content (a fragment longer than 400 nt) is also sufficient for deriving the parameters of the Markov models used in GeneMark and GeneMark.hmm. Models built by the heuristic approach can be used to find genes in small fragments of anonymous prokaryotic genomes, such as metagenomic sequences, as well as in genomes of organelles, viruses, phages and plasmids. The method can also be used for highly inhomogeneous genomes, where the Markov models need to be adjusted to local DNA composition. The heuristic method provides evidence that the mutational pressure that shapes G+C content is the driving force in the evolution of codon usage patterns.
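
The first step of the heuristic approach, estimating G+C content from a short fragment and turning it into model parameters, can be sketched in Python as follows. Only the simplest case is shown: a zero-order non-coding background model under an assumed strand symmetry (P(G) = P(C), P(A) = P(T)). The functions that map G+C content to the coding-model parameters in the published method are not reproduced here, and the function names are hypothetical.

```python
MIN_LENGTH = 400  # the text states that a fragment longer than 400 nt suffices


def gc_content(seq):
    """Fraction of G and C among the unambiguous nucleotides of seq."""
    counted = [b for b in seq.upper() if b in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no A, C, G or T")
    return sum(b in "GC" for b in counted) / len(counted)


def background_model(seq):
    """Zero-order non-coding model derived from G+C content alone.

    Illustrative assumption: G and C are equally frequent, as are A and T.
    """
    if len(seq) <= MIN_LENGTH:
        raise ValueError(f"need a fragment longer than {MIN_LENGTH} nt")
    gc = gc_content(seq)
    return {"A": (1 - gc) / 2, "C": gc / 2, "G": gc / 2, "T": (1 - gc) / 2}


if __name__ == "__main__":
    fragment = "GGCCAT" * 100  # 600 nt toy fragment
    print(background_model(fragment))
```

The same pattern, a single measured quantity fed through fixed functions, is what lets the approach work on fragments far too short for conventional training.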

Family of gene prediction programs

Bacteria, Archaea and metagenomes
* GeneMark-P
* GeneMark.hmm-P
* GeneMarkS

Eukaryotes
* GeneMark-E
* GeneMark.hmm-E
* GeneMark.hmm-ES

Viruses, phages and plasmids
* Heuristic approach

EST and cDNA
* GeneMark-E

External links

* [http://opal.biology.gatech.edu/GeneMark/ GeneMark homepage]
* [http://opal.biology.gatech.edu/GeneMark/background.html Background information]
* [http://opal.biology.gatech.edu/GeneMark/references.html References]

References

GeneMark: Borodovsky M. and McIninch J. "GeneMark: parallel gene recognition for both DNA strands." Computers & Chemistry, 1993, Vol. 17, No. 19, pp. 123-133.

GeneMark.hmm: Lukashin A. and Borodovsky M. "GeneMark.hmm: new solutions for gene finding." Nucleic Acids Research, 1998, Vol. 26, No. 4, pp. 1107-1115.

Heuristic Models: Besemer J. and Borodovsky M. "Heuristic approach to deriving models for gene finding." Nucleic Acids Research, 1999, Vol. 27, No. 19, pp. 3911-3920.

GeneMarkS: Besemer J., Lomsadze A. and Borodovsky M. "GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions." Nucleic Acids Research, 2001, Vol. 29, No. 12, pp. 2607-2618.

VIOLIN: Mills R., Rozanov M., Lomsadze A., Tatusova T. and Borodovsky M. "Improving gene annotation in complete viral genomes." Nucleic Acids Research, 2003, Vol. 31, No. 23, pp. 7041-7055.

GeneMark Web server: Besemer J. and Borodovsky M. "GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses." Nucleic Acids Research, 2005, Vol. 33, Web Server Issue, pp. W451-454.

GeneMark.hmm-ES: Lomsadze A., Ter-Hovhannisyan V., Chernoff Y. and Borodovsky M. "Gene identification in novel eukaryotic genomes by self-training algorithm." Nucleic Acids Research, 2005, Vol. 33, No. 20, pp. 6494-6506.


Wikimedia Foundation. 2010.