- Gene prediction
Gene finding typically refers to the area of computational biology concerned with algorithmically identifying stretches of sequence, usually genomic DNA, that are biologically functional. This especially includes protein-coding genes, but may also include other functional elements such as RNA genes and regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.

In its earliest days, "gene finding" was based on painstaking experimentation on living cells and organisms. Statistical analysis of the rates of homologous recombination of several different genes could determine their order on a given chromosome, and information from many such experiments could be combined to create a genetic map specifying the rough location of known genes relative to each other. Today, with comprehensive genome sequence and powerful computational resources at the disposal of the research community, gene finding has been redefined as a largely computational problem.

Determining that a sequence "is functional" should be distinguished from determining "the function" of the gene or its product. The latter still demands "in vivo" experimentation through gene knockout and other assays, although frontiers of bioinformatics research are making it increasingly possible to predict the function of a gene based on its sequence alone.

Extrinsic Approaches
In extrinsic (or evidence-based) gene finding systems, the target genome is searched for sequences that are similar to extrinsic evidence in the form of the known sequence of a messenger RNA (mRNA) or protein product. Given an mRNA sequence, it is trivial to derive the DNA sequence from which it must have been transcribed; in eukaryotes this sequence corresponds to the spliced exons rather than to one contiguous stretch of the genome. Given a protein sequence, a family of possible coding DNA sequences can be derived by reverse translation of the genetic code. Once candidate DNA sequences have been determined, it is a relatively straightforward algorithmic problem to efficiently search a target genome for matches, complete or partial, and exact or inexact. BLAST is a widely used system designed for this purpose.

A high degree of similarity to a known messenger RNA or protein product is strong evidence that a region of a target genome is a protein-coding gene. However, applying this approach systematically requires extensive sequencing of mRNA and protein products. Not only is this expensive, but in complex organisms only a subset of the genes in the genome is expressed at any given time, meaning that extrinsic evidence for many genes is not readily accessible in any single cell culture. Thus, in order to collect extrinsic evidence for most or all of the genes in a complex organism, many hundreds or thousands of different cell types must be studied, which itself presents further difficulties. For example, some human genes may be expressed only during development as an embryo or fetus, which might be difficult to study for ethical reasons.
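
The reverse-translation step mentioned above, in which one protein sequence yields a family of possible coding DNA sequences, can be illustrated with a minimal Python sketch. It assumes the standard genetic code and one-letter amino acid codes; the names `CODON_TABLE`, `count_coding_sequences` and `reverse_translate` are hypothetical and are not taken from BLAST or any other tool named in this article.

```python
from itertools import product
from math import prod

# Standard genetic code, laid out in the conventional T, C, A, G base order.
# '*' marks the three stop codons (TAA, TAG, TGA).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Invert the table: amino acid -> list of codons that encode it.
CODONS_FOR = {}
for codon, amino_acid in CODON_TABLE.items():
    CODONS_FOR.setdefault(amino_acid, []).append(codon)


def translate(dna):
    """Translate a DNA string whose length is a multiple of three."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def count_coding_sequences(protein):
    """Number of distinct DNA sequences that encode the given protein.

    Raises KeyError for letters that are not standard amino acid codes.
    """
    return prod(len(CODONS_FOR[aa]) for aa in protein)


def reverse_translate(protein):
    """Yield every DNA sequence that translates to the given protein."""
    for codons in product(*(CODONS_FOR[aa] for aa in protein)):
        yield "".join(codons)


if __name__ == "__main__":
    peptide = "MLS"
    print(count_coding_sequences(peptide), "candidate coding sequences")
    for dna in list(reverse_translate(peptide))[:3]:
        print(dna, "->", translate(dna))
```

Because the number of candidates grows multiplicatively with protein length, enumerating them is only practical for short peptides. Search tools generally compare at the protein level or use degenerate patterns instead, so the enumeration here is purely illustrative.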
Despite these difficulties, extensive transcript and protein sequence databases have been generated for humans as well as for other important model organisms in biology, such as mice and yeast. For example, the RefSeq database contains transcript and protein sequences from many different species, and the Ensembl system comprehensively maps this evidence to human and several other genomes. It is, however, likely that these databases are both incomplete and contain small but significant amounts of erroneous data.

"Ab Initio" Approaches
Because of the inherent expense and difficulty in obtaining extrinsic evidence for many genes, it is also necessary to resort to "ab initio" gene finding, in which genomic DNA sequence alone is systematically searched for certain tell-tale signs of protein-coding genes. These signs can be broadly categorized as either "signals", specific sequences that indicate the presence of a gene nearby, or "content", statistical properties of protein-coding sequence itself. "Ab initio" gene finding might be more accurately characterized as gene "prediction", since extrinsic evidence is generally required to conclusively establish that a putative gene is functional.

In the genomes of prokaryotes, genes have specific and relatively well-understood promoter sequences (signals), such as the Pribnow box and transcription factor binding sites, which are easy to systematically identify. Also, the sequence coding for a protein occurs as one contiguous open reading frame (ORF), which is typically many hundreds or thousands of base pairs long. The statistics of stop codons are such that even finding an open reading frame of this length is a fairly informative sign. (Since 3 of the 64 possible codons in the genetic code are stop codons, one would expect a stop codon approximately every 20-25 codons, or 60-75 base pairs, in a random sequence; with equal base frequencies the expectation is 64/3, about 21 codons.) Furthermore, protein-coding DNA has certain periodicities and other statistical properties that are easy to detect in sequence of this length. These characteristics make prokaryotic gene finding relatively straightforward, and well-designed systems are able to achieve high levels of accuracy.

"Ab initio" gene finding in eukaryotes, especially complex organisms like humans, is considerably more challenging for several reasons. First, the promoter and other regulatory signals in these genomes are more complex and less well-understood than in prokaryotes, making them more difficult to reliably recognize. Two classic examples of signals identified by eukaryotic gene finders are CpG islands and binding sites for a poly(A) tail.

Second, splicing mechanisms employed by eukaryotic cells mean that a particular protein-coding sequence in the genome is divided into several parts (exons), separated by non-coding sequences (introns). (Splice sites are themselves another signal that eukaryotic gene finders are often designed to identify.) A typical protein-coding gene in humans might be divided into a dozen exons, each less than two hundred base pairs in length, and some as short as twenty to thirty. It is therefore much more difficult to detect periodicities and other known content properties of protein-coding DNA in eukaryotes.

Advanced gene finders for both prokaryotic and eukaryotic genomes typically use complex probabilistic models, such as hidden Markov models, in order to combine information from a variety of different signal and content measurements. The GLIMMER system is a widely used and highly accurate gene finder for prokaryotes. GeneMark is another popular approach. Eukaryotic "ab initio" gene finders, by comparison, have achieved only limited success; notable examples are the GENSCAN and geneid programs. A few programs, such as CONTRAST, also use machine learning approaches like support vector machines for successful gene prediction.

Other Signals
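
Of the sub-sequence statistics covered in this section, k-mer counts are the simplest to illustrate. The following is a minimal Python sketch that tabulates overlapping k-mers in a DNA string and normalizes them to frequencies. The function names `kmer_counts` and `kmer_frequencies` are hypothetical, and the sketch does not reproduce the feature extraction of any gene finder named in this article.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping substring of length k in the sequence."""
    sequence = sequence.upper()
    # A sequence shorter than k yields an empty range, hence an empty Counter.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_frequencies(sequence, k):
    """Relative frequency of each k-mer; an empty dict if none fit."""
    counts = kmer_counts(sequence, k)
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


if __name__ == "__main__":
    example = "ATGATGCCGTAA"  # made-up sequence, for illustration only
    for kmer, freq in sorted(kmer_frequencies(example, 3).items()):
        print(kmer, round(freq, 3))
```

A content measure would typically compare such frequency tables between known coding and non-coding sequence; that comparison step is left out of the sketch.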
Among the derived signals used for prediction are sub-sequence statistics such as k-mer statistics, the Fourier transform of a pseudo-number-coded DNA sequence, Z-curve parameters and certain run features (Saeys Y, Rouzé P, Van de Peer Y. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/4/414 In search of the small ones: improved prediction of short exons in vertebrates, plants, fungi and protists]. Bioinformatics. 2007;23(4):414–420. doi:10.1093/bioinformatics/btl639. PMID 17204465).

It has been suggested that signals other than those directly detectable in sequences may improve gene prediction. For example, the role of secondary structure in the identification of regulatory motifs has been reported (Hiller M, Pudimat R, Busch A, Backofen R. Using RNA secondary structures to guide sequence motif finding towards single-stranded regions. Nucleic Acids Res. 2006;34(17):e117. doi:10.1093/nar/gkl544. PMID 16987907). In addition, it has been suggested that RNA secondary structure prediction helps splice site prediction (Patterson DJ, Yasuhara K, Ruzzo WL. Pre-mRNA secondary structure prediction aids splice site prediction. Pac Symp Biocomput. 2002:223–234. PMID 11928478; Marashi SA, Goodarzi H, Sadeghi M, Eslahchi C, Pezeshk H. Importance of RNA secondary structure information for yeast donor and acceptor splice site predictions by neural networks. Comput Biol Chem. 2006;30(1):50–57. doi:10.1016/j.compbiolchem.2005.10.009. PMID 16386465; Marashi SA, Eslahchi C, Pezeshk H, Sadeghi M. Impact of RNA structure on the prediction of donor and acceptor splice sites. BMC Bioinformatics. 2006;7:297. doi:10.1186/1471-2105-7-297. PMID 16772025; Rogic S. [http://www.cs.ubc.ca/grads/resources/thesis/Nov06/Rogic_Sanja.pdf The role of pre-mRNA secondary structure in gene splicing in Saccharomyces cerevisiae]. PhD dissertation, University of British Columbia. 2006).

Comparative Genomics Approaches
As the entire genomes of many different species are sequenced, a promising direction in current research on gene finding is a comparative genomics approach. This is based on the principle that the forces of natural selection cause genes and other functional elements to undergo mutation at a slower rate than the rest of the genome, since mutations in functional elements are more likely to negatively impact the organism than mutations elsewhere. Genes can thus be detected by comparing the genomes of related species to detect this evolutionary pressure for conservation. This approach was first applied to the mouse and human genomes, using programs such as SLAM, SGP and Twinscan/N-SCAN.

Comparative gene finding can also be used to project high quality annotations from one genome to another. Notable examples include Projector, GeneWise and GeneMapper. Such techniques now play a central role in the annotation of all genomes.
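
The idea of detecting conservation between related genomes can be illustrated with a deliberately simple Python sketch. It scores fixed-size windows of an already aligned pair of sequences by the fraction of identical positions and reports windows above a threshold. It assumes the alignment has been computed beforehand, with "-" marking gaps. The function names, window size and threshold are hypothetical, and this is not the method used by SLAM, SGP, Twinscan/N-SCAN or the other programs named above, which rely on more elaborate probabilistic models.

```python
def window_identity(aligned_a, aligned_b, window=6, step=1):
    """Return (start, identity) pairs for windows over a pairwise alignment.

    Identity is the fraction of window columns where both sequences carry
    the same base; columns containing a gap ('-') never count as matches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for start in range(0, len(aligned_a) - window + 1, step):
        columns = zip(aligned_a[start:start + window],
                      aligned_b[start:start + window])
        matches = sum(1 for x, y in columns if x == y and x != "-")
        scores.append((start, matches / window))
    return scores


def conserved_windows(aligned_a, aligned_b, window=6, step=1, threshold=0.8):
    """Start positions of windows whose identity meets the threshold."""
    return [start
            for start, identity in window_identity(aligned_a, aligned_b,
                                                   window, step)
            if identity >= threshold]


if __name__ == "__main__":
    # Made-up aligned sequences, for illustration only.
    species_1 = "ATGGCCAAATTT"
    species_2 = "ATGGCCTTTAAA"
    print(window_identity(species_1, species_2, window=6, step=6))
    print(conserved_windows(species_1, species_2, window=6, step=6))
```

Raw identity ignores the expected background mutation rate and codon structure, so in practice a high-identity window is at most a hint that a region may be under selective constraint.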
External links
* http://www.genefinding.org
* http://www.binf.ku.dk/users/krogh/genefinding.html
* http://www.swbic.org/links/1.4.3.2.php
* http://bio.math.berkeley.edu/slam/
* [http://www.nslij-genetics.org/gene/ Bibliography on computational gene recognition by Wentian Li]
* [http://genome.imim.es/software/geneid/ geneid]
* [http://genome.imim.es/software/sgp2/ SGP2]
* http://cbcb.umd.edu/software/glimmer
* http://cbcb.umd.edu/software/GlimmerHMM
* http://bio.math.berkeley.edu/genemapper/
* http://www.genomethreader.org
* [http://genes.mit.edu/GENSCAN.html GENSCAN]
* [http://mblab.wustl.edu/software/twinscan/ Twinscan/N-SCAN]
* [http://www.scfbio-iitd.res.in/research/genepredictor.htm CHEMGENOME]
* [http://opal.biology.gatech.edu/GeneMark/ GeneMark]
* [http://www.cebitec.uni-bielefeld.de/groups/brf/software/gismo/ Gismo]
References
Wikimedia Foundation. 2010.