GeneMark
GeneMark, developed in 1993, was the first gene-finding method recognized as an efficient and accurate tool for genome projects. GeneMark was used to annotate the first completely sequenced bacterium, Haemophilus influenzae, and the first completely sequenced archaeon, Methanococcus jannaschii. The GeneMark algorithm uses species-specific inhomogeneous Markov chain models of protein-coding DNA sequence as well as homogeneous Markov chain models of non-coding DNA. Parameters of the models are estimated from training sets of sequences of known type. The major step of the algorithm computes the a posteriori probability that a sequence fragment carries genetic code in one of six possible frames (including three frames in the complementary DNA strand) or is “non-coding”.
GeneMark.hmm (prokaryotic)
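As a reference point for the HMM formulation, the posterior computation at the core of the original GeneMark can be sketched in a few lines of Python. This is a minimal illustration under strong assumptions, not the published implementation: the coding model is simplified to order zero (one nucleotide distribution per codon position), and every number in `CODING`, `NONCODING` and `prior_coding` is a hypothetical value invented for the example rather than a trained parameter.

```python
import math

# Hypothetical toy parameters, invented for illustration only.
# CODING[k][base]: probability of `base` at codon position k (0, 1, 2).
# This is a zero-order, 3-periodic (inhomogeneous) model.
CODING = [
    {"A": 0.28, "C": 0.20, "G": 0.36, "T": 0.16},
    {"A": 0.32, "C": 0.22, "G": 0.16, "T": 0.30},
    {"A": 0.14, "C": 0.34, "G": 0.34, "T": 0.18},
]
# Zero-order homogeneous model for non-coding DNA (also hypothetical).
NONCODING = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def log_lik_coding(seq, frame):
    """Log-likelihood of seq if seq[0] sits at codon position `frame`."""
    return sum(math.log(CODING[(i + frame) % 3][b]) for i, b in enumerate(seq))


def log_lik_noncoding(seq):
    """Log-likelihood of seq under the homogeneous non-coding model."""
    return sum(math.log(NONCODING[b]) for b in seq)


def frame_posteriors(seq, prior_coding=0.5):
    """Posterior probability of each of the seven hypotheses via Bayes' rule.

    Only the letters A, C, G, T are supported; the coding prior is split
    evenly over the six frames (an assumption of this sketch).
    """
    seq = seq.upper()
    rc = reverse_complement(seq)
    log_joint = {}
    for f in range(3):
        log_joint[f"+{f + 1}"] = math.log(prior_coding / 6) + log_lik_coding(seq, f)
        log_joint[f"-{f + 1}"] = math.log(prior_coding / 6) + log_lik_coding(rc, f)
    log_joint["non-coding"] = math.log(1 - prior_coding) + log_lik_noncoding(seq)

    # Normalize in log space to avoid underflow on long fragments.
    top = max(log_joint.values())
    weights = {h: math.exp(v - top) for h, v in log_joint.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}
```

Calling `frame_posteriors(fragment)` returns a dictionary with keys `+1`, `+2`, `+3`, `-1`, `-2`, `-3` and `non-coding` whose values sum to one. A realistic model would use higher-order Markov chains trained on sequences of known type, as described above.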
The GeneMark.hmm algorithm was designed to improve gene prediction quality, particularly the accuracy of GeneMark in finding exact gene starts. The idea was to integrate the GeneMark models into a naturally designed hidden Markov model framework, with gene boundaries modeled as transitions between hidden states. Additionally, a ribosome binding site model is used to make gene start predictions more accurate. Evaluations by different groups showed that GeneMark.hmm is significantly more accurate than GeneMark in exact gene prediction. Since 1998, GeneMark.hmm and its self-training version, GeneMarkS, have been the standard tools for gene identification in new prokaryotic genomic sequences, including metagenomes.
GeneMark.hmm (eukaryotic)
The next step after developing prokaryotic GeneMark.hmm was to extend the approach to eukaryotic genomes, where accurate prediction of protein-coding exon boundaries presents the major challenge. The HMM architecture of eukaryotic GeneMark.hmm consists of hidden states for initial, internal and terminal exons, introns, intergenic regions and single-exon genes located on both DNA strands. It also includes hidden states for the initiation site, the termination site, and the donor and acceptor splice sites. GeneMark.hmm has been frequently used for annotation of plant and animal genomes.
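The idea common to both versions of GeneMark.hmm, reading gene boundaries off the transitions in the most likely sequence of hidden states, can be illustrated with a generic Viterbi decoder. The sketch below is deliberately reduced to two states (`intergenic` and `coding`) with single-nucleotide emissions. The state set, the probabilities in `START`, `TRANS` and `EMIT`, and the helper `boundaries` are all hypothetical choices made for this example. The real architecture has many more states, as listed above, and uses the GeneMark Markov models rather than per-base emissions.

```python
import math

STATES = ("intergenic", "coding")

# Hypothetical toy parameters, not taken from GeneMark.hmm.
START = {"intergenic": 0.9, "coding": 0.1}
TRANS = {
    "intergenic": {"intergenic": 0.95, "coding": 0.05},
    "coding": {"intergenic": 0.05, "coding": 0.95},
}
EMIT = {
    "intergenic": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the most likely hidden-state path for an A/C/G/T string."""
    seq = seq.upper()
    if not seq:
        return []
    # Log-score of the best path ending in each state at the first base.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
            pointers[s] = prev
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def boundaries(path):
    """Positions where the decoded state changes, i.e. predicted boundaries."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]
```

Running `boundaries(viterbi(sequence))` on a sequence with an AT-rich flank, a GC-rich core and another AT-rich flank yields the positions where the decoded path switches state. In this toy setting these positions play the role that predicted gene starts and ends play in the full model.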
Heuristic Models
Computational methods for accurate gene finding in DNA sequences require models of protein-coding and non-coding regions derived either from experimentally validated training sets or from large amounts of anonymous DNA sequence. A heuristic method for deriving the parameters of inhomogeneous Markov models of protein-coding regions was proposed in 1999. The heuristic method utilizes the observation that the parameters of the Markov models used in GeneMark can be approximated by functions of the sequence G+C content. Therefore, a short DNA sequence sufficient for estimating the genome G+C content (a fragment longer than 400 nt) is also sufficient for deriving the parameters of the Markov models used in GeneMark and GeneMark.hmm. Models built by the heuristic approach can be used to find genes in small fragments of anonymous prokaryotic genomes, such as metagenomic sequences, as well as in genomes of organelles, viruses, phages and plasmids. This method can also be used for highly inhomogeneous genomes where adjustment of the Markov models to local DNA composition is needed. The heuristic method provides evidence that the mutational pressure that shapes G+C content is the driving force of the evolution of codon usage patterns.
Family of gene prediction programs
Bacteria, Archaea and metagenomes
* GeneMark-P
* GeneMark.hmm-P
* GeneMarkS
Eukaryotes
* GeneMark-E
* GeneMark.hmm-E
* GeneMark.hmm-ES
Viruses, phages and plasmids
* Heuristic approach
EST and cDNA
* GeneMark-E
External links
* [http://opal.biology.gatech.edu/GeneMark/ GeneMark homepage]
* [http://opal.biology.gatech.edu/GeneMark/background.html Background information]
* [http://opal.biology.gatech.edu/GeneMark/references.html References]
References
* GeneMark: Borodovsky M. and McIninch J. "GeneMark: parallel gene recognition for both DNA strands." Computers & Chemistry, 1993, Vol. 17, No. 2, pp. 123-133
* GeneMark.hmm: Lukashin A. and Borodovsky M. "GeneMark.hmm: new solutions for gene finding." Nucleic Acids Research, 1998, Vol. 26, No. 4, pp. 1107-1115
* Heuristic Models: Besemer J. and Borodovsky M. "Heuristic approach to deriving models for gene finding." Nucleic Acids Research, 1999, Vol. 27, No. 19, pp. 3911-3920
* GeneMarkS: Besemer J., Lomsadze A. and Borodovsky M. "GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions." Nucleic Acids Research, 2001, Vol. 29, No. 12, pp. 2607-2618
* VIOLIN: Mills R., Rozanov M., Lomsadze A., Tatusova T. and Borodovsky M. "Improving gene annotation in complete viral genomes." Nucleic Acids Research, 2003, Vol. 31, No. 23, pp. 7041-7055
* GeneMark Web server: Besemer J. and Borodovsky M. "GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses." Nucleic Acids Research, 2005, Vol. 33, Web Server Issue, pp. W451-454
* GeneMark.hmm-ES: Lomsadze A., Ter-Hovhannisyan V., Chernoff Y. and Borodovsky M. "Gene identification in novel eukaryotic genomes by self-training algorithm." Nucleic Acids Research, 2005, Vol. 33, No. 20, pp. 6494-6506