Gene prediction

Gene finding typically refers to the area of computational biology that is concerned with algorithmically identifying stretches of sequence, usually genomic DNA, that are biologically functional. This especially includes protein-coding genes, but may also include other functional elements such as RNA genes and regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.

In its earliest days, "gene finding" was based on painstaking experimentation on living cells and organisms. Statistical analysis of the rates of homologous recombination of several different genes could determine their order on a certain chromosome, and information from many such experiments could be combined to create a genetic map specifying the rough location of known genes relative to each other. Today, with comprehensive genome sequence and powerful computational resources at the disposal of the research community, gene finding has been redefined as a largely computational problem.

Determining that a sequence "is functional" should be distinguished from determining "the function" of the gene or its product. The latter still demands "in vivo" experimentation through gene knockout and other assays, although frontiers of bioinformatics research are making it increasingly possible to predict the function of a gene based on its sequence alone.

Extrinsic Approaches

In extrinsic (or evidence-based) gene finding systems, the target genome is searched for sequences that are similar to extrinsic evidence in the form of the known sequence of a messenger RNA (mRNA) or protein product. Given an mRNA sequence, it is trivial to derive a unique genomic DNA sequence from which it had to have been transcribed. Given a protein sequence, a family of possible coding DNA sequences can be derived by reverse translation of the genetic code. Once candidate DNA sequences have been determined, it is a relatively straightforward algorithmic problem to efficiently search a target genome for matches, complete or partial, and exact or inexact. BLAST is a widely used system designed for this purpose.
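The reverse-translation step can be illustrated with a minimal sketch. It assumes the standard genetic code and uses hypothetical helper names (`count_coding_sequences`, `reverse_translate`) that do not come from any particular tool. It shows why a protein yields a family of candidate DNA sequences rather than a single one: most amino acids are encoded by several codons.

```python
from itertools import product
from math import prod

# Standard genetic code, codons enumerated in T, C, A, G order
# (assumption: no organism-specific code variants are considered).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Invert the table: amino acid -> list of codons that encode it.
REVERSE_TABLE = {}
for codon, aa in CODON_TABLE.items():
    REVERSE_TABLE.setdefault(aa, []).append(codon)


def count_coding_sequences(protein):
    """Number of distinct DNA sequences that translate to `protein`."""
    return prod(len(REVERSE_TABLE[aa]) for aa in protein)


def reverse_translate(protein):
    """Lazily yield every DNA sequence encoding `protein` (short peptides only)."""
    choices = (REVERSE_TABLE[aa] for aa in protein)
    for codons in product(*choices):
        yield "".join(codons)


if __name__ == "__main__":
    print(count_coding_sequences("MKL"))
    print(list(reverse_translate("MW")))
```

Because the number of candidates grows multiplicatively with protein length, practical search tools compare sequences at the protein level or in translated form rather than enumerating every candidate.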

A high degree of similarity to a known messenger RNA or protein product is strong evidence that a region of a target genome is a protein-coding gene. However, applying this approach systematically requires extensive sequencing of mRNA and protein products. Not only is this expensive, but in complex organisms only a subset of the genes in the genome is expressed at any given time, so extrinsic evidence for many genes is not readily accessible in any single cell culture. To collect extrinsic evidence for most or all of the genes in a complex organism, many hundreds or thousands of different cell types must therefore be studied, which presents further difficulties of its own. For example, some human genes may be expressed only during development as an embryo or fetus, which might be difficult to study for ethical reasons.

Despite these difficulties, extensive transcript and protein sequence databases have been generated for humans as well as for other important model organisms in biology, such as mice and yeast. For example, the RefSeq database contains transcript and protein sequences from many different species, and the Ensembl system comprehensively maps this evidence to the human genome and several others. It is, however, likely that these databases are incomplete and also contain small but significant amounts of erroneous data.

"Ab Initio" Approaches

Because of the inherent expense and difficulty in obtaining extrinsic evidence for many genes, it is also necessary to resort to "ab initio" gene finding, in which genomic DNA sequence alone is systematically searched for certain tell-tale signs of protein-coding genes. These signs can be broadly categorized as either "signals", specific sequences that indicate the presence of a gene nearby, or "content", statistical properties of protein-coding sequence itself. "Ab initio" gene finding might be more accurately characterized as gene "prediction", since extrinsic evidence is generally required to conclusively establish that a putative gene is functional.

In the genomes of prokaryotes, genes have specific and relatively well-understood promoter sequences (signals), such as the Pribnow box and transcription factor binding sites, which are easy to identify systematically. Also, the sequence coding for a protein occurs as one contiguous open reading frame (ORF), which is typically many hundreds or thousands of base pairs long. The statistics of stop codons are such that even finding an open reading frame of this length is a fairly informative sign. (Since 3 of the 64 possible codons in the genetic code are stop codons, one would expect a stop codon approximately every 20-25 codons, or 60-75 base pairs, in a random sequence.) Furthermore, protein-coding DNA has certain periodicities and other statistical properties that are easy to detect in sequences of this length. These characteristics make prokaryotic gene finding relatively straightforward, and well-designed systems are able to achieve high levels of accuracy.
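The stop-codon argument can be made concrete with a small sketch. It is an illustration only, under stated assumptions: uniformly random sequence, the forward strand only, ATG as the only start codon, and a hypothetical `find_orfs` helper that is not taken from any real gene finder.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def expected_stop_spacing():
    """Mean number of codons between stops if all 64 codons are equally likely."""
    return 64 / len(STOP_CODONS)


def find_orfs(seq, min_codons=100):
    """Return (start, end) pairs of ATG...stop ORFs in the three forward frames.

    `end` is exclusive and includes the stop codon. The length test counts
    the codons from the ATG up to, but not including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon seen since the last stop
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # resume the search after this stop
    return orfs


if __name__ == "__main__":
    print(round(expected_stop_spacing(), 1))
    toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
    print(find_orfs(toy, min_codons=5))
```

An ORF of several hundred codons is many times longer than the roughly 21-codon spacing expected by chance, which is why length alone is informative in prokaryotes. A fuller scan would also cover the reverse complement and alternative start codons.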

"Ab initio" gene finding in eukaryotes, especially complex organisms like humans, is considerably more challenging for several reasons. First, the promoter and other regulatory signals in these genomes are more complex and less well-understood than in prokaryotes, making them more difficult to reliably recognize. Two classic examples of signals identified by eukaryotic gene finders are CpG islands and binding sites for a poly(A) tail.

Second, splicing mechanisms employed by eukaryotic cells mean that a particular protein-coding sequence in the genome is divided into several parts (exons), separated by non-coding sequences (introns). (Splice sites are themselves another signal that eukaryotic gene finders are often designed to identify.) A typical protein-coding gene in humans might be divided into a dozen exons, each less than two hundred base pairs in length, and some as short as twenty to thirty. It is therefore much more difficult to detect periodicities and other known content properties of protein-coding DNA in eukaryotes.

Advanced gene finders for both prokaryotic and eukaryotic genomes typically use complex probabilistic models, such as hidden Markov models, to combine information from a variety of different signal and content measurements. The GLIMMER system is a widely used and highly accurate gene finder for prokaryotes. GeneMark is another popular approach. Eukaryotic "ab initio" gene finders, by comparison, have achieved only limited success; notable examples are the GENSCAN and geneid programs. Some programs, such as CONTRAST, also use machine learning techniques such as support vector machines for gene prediction.
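To give a feel for how a hidden Markov model labels sequence, here is a deliberately tiny two-state sketch decoded with the Viterbi algorithm. The states, transition probabilities, and emission probabilities are all invented for illustration. This is not the model used by any of the programs named above, which rely on far richer state structures and trained parameters.

```python
import math

# Toy parameters (assumptions): "coding" is slightly GC-rich, "noncoding" AT-rich.
STATES = ("noncoding", "coding")
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(seq):
    """Return the most probable state path for `seq` (log-space Viterbi)."""
    if not seq:
        return []
    # scores[t][s]: best log-probability of any path ending in state s at position t
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    back = []  # back[t][s]: best predecessor of state s at position t + 1
    for base in seq[1:]:
        prev = scores[-1]
        row, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            row[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][base])
            ptr[s] = best
        scores.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi("ATATATGCGCGCGCGCATAT"))
```

The same dynamic-programming idea carries over to larger models. The signal and content measurements described earlier would enter through the emission and transition terms.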

Other Signals

Derived signals used for prediction include sub-sequence statistics such as k-mer frequencies, the Fourier transform of pseudo-number-coded DNA, Z-curve parameters, and certain run features (Saeys Y, Rouzé P, Van de Peer Y. In search of the small ones: improved prediction of short exons in vertebrates, plants, fungi and protists. Bioinformatics. 2007;23(4):414–420. doi:10.1093/bioinformatics/btl639. PMID 17204465).
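Two of these derived signals are simple enough to sketch. The example below assumes a plain indicator encoding of the four bases (one of several possible numeric codings) and uses hypothetical function names. It computes k-mer counts and the Fourier power at period 3, the periodicity associated with the codon structure of coding sequence.

```python
import cmath
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in `seq`."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def period3_power(seq):
    """Fourier power at period 3, summed over the four base indicator sequences.

    For each base, the indicator sequence is 1 where that base occurs and 0
    elsewhere. The transform is evaluated only at the frequency 1/3.
    """
    seq = seq.upper()
    total = 0.0
    for base in "ACGT":
        coeff = sum(
            cmath.exp(-2j * cmath.pi * n / 3)
            for n, b in enumerate(seq)
            if b == base
        )
        total += abs(coeff) ** 2
    return total


if __name__ == "__main__":
    print(dict(kmer_counts("ATATGC", 2)))
    print(period3_power("ACG" * 10), period3_power("A" * 30))
```

A sequence with a strong three-base repeat scores high, while a homopolymer of the same length scores close to zero. Real coding regions fall between these extremes, so in practice the score is compared against a background estimated from noncoding sequence.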

It has been suggested that signals other than those directly detectable in sequences may improve gene prediction. For example, the role of secondary structure in the identification of regulatory motifs has been reported (Hiller M, Pudimat R, Busch A, Backofen R. Using RNA secondary structures to guide sequence motif finding towards single-stranded regions. Nucleic Acids Res. 2006;34(17):e117. doi:10.1093/nar/gkl544. PMID 16987907). In addition, it has been suggested that RNA secondary structure prediction helps splice site prediction (Patterson DJ, Yasuhara K, Ruzzo WL. Pre-mRNA secondary structure prediction aids splice site prediction. Pac Symp Biocomput. 2002:223–234. PMID 11928478; Marashi SA, Goodarzi H, Sadeghi M, Eslahchi C, Pezeshk H. Importance of RNA secondary structure information for yeast donor and acceptor splice site predictions by neural networks. Comput Biol Chem. 2006;30(1):50–57. doi:10.1016/j.compbiolchem.2005.10.009. PMID 16386465; Marashi SA, Eslahchi C, Pezeshk H, Sadeghi M. Impact of RNA structure on the prediction of donor and acceptor splice sites. BMC Bioinformatics. 2006;7:297. doi:10.1186/1471-2105-7-297. PMID 16772025; Rogic S. The role of pre-mRNA secondary structure in gene splicing in Saccharomyces cerevisiae. PhD dissertation, University of British Columbia; 2006).

Comparative Genomics Approaches

As the entire genomes of many different species are sequenced, a promising direction in current research on gene finding is a comparative genomics approach. This is based on the principle that the forces of natural selection cause genes and other functional elements to undergo mutation at a slower rate than the rest of the genome, since mutations in functional elements are more likely to negatively impact the organism than mutations elsewhere. Genes can thus be detected by comparing the genomes of related species to detect this evolutionary pressure for conservation. This approach was first applied to the mouse and human genomes, using programs such as SLAM, SGP and Twinscan/N-SCAN.
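The conservation principle can be sketched very roughly as a sliding-window identity score over two sequences that have already been aligned. The function names, the gap character `-`, and the default window, step, and threshold values are all assumptions made for illustration. The programs named above use much more elaborate probabilistic models than this.

```python
def window_identity(aligned_a, aligned_b, window=30, step=10):
    """Fraction of identical, non-gap columns in each window of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(aligned_a) - window + 1, step):
        columns = zip(aligned_a[start:start + window],
                      aligned_b[start:start + window])
        matches = sum(1 for x, y in columns if x == y and x != "-")
        results.append((start, matches / window))
    return results


def conserved_windows(aligned_a, aligned_b, window=30, step=10, threshold=0.8):
    """Return (start, end) spans whose identity meets the threshold."""
    return [
        (start, start + window)
        for start, identity in window_identity(aligned_a, aligned_b, window, step)
        if identity >= threshold
    ]


if __name__ == "__main__":
    a = "ACGTACGTAC" + "AAAAAAAAAA"
    b = "ACGTACGTAC" + "CCCCCCCCCC"
    print(window_identity(a, b, window=10, step=10))
    print(conserved_windows(a, b, window=10, step=10))
```

Windows that stay highly similar between two related genomes are candidates for functional elements. Identity alone cannot distinguish coding from other conserved sequence, so such a score would be only one input among several.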

Comparative gene finding can also be used to project high quality annotations from one genome to another. Notable examples include Projector, GeneWise and GeneMapper. Such techniques now play a central role in the annotation of all genomes.

External links

* Bibliography on computational gene recognition by Wentian Li
* geneid
* SGP2
* Twinscan/N-SCAN
* GeneMark
* Gismo


Wikimedia Foundation. 2010.
