Substitution model

A substitution model describes the process by which a fixed-length sequence of characters from some alphabet changes into another sequence over the same alphabet. For example, in cladistics, each position in the sequence might correspond to a property of a species that is either present or absent. The alphabet could then consist of "0" for absence and "1" for presence. The sequence 00110 could then mean, for example, that a species does not have feathers or lay eggs, does have fur, is warm-blooded, and cannot breathe underwater. Another sequence, 11010, would mean that a species has feathers, lays eggs, does not have fur, is warm-blooded, and cannot breathe underwater. In phylogenetics, sequences are often obtained by first constructing a nucleotide or protein sequence alignment and then taking the bases or amino acids at corresponding positions in the alignment as the characters. Sequences obtained this way might look like AGCGGAGCTTA and GCCGTAGACGC.
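As a small illustration of treating aligned sequences as strings of characters, the sketch below compares the example sequences above site by site. The function name `differing_sites` is an illustrative choice and not part of any library.

```python
def differing_sites(seq_a, seq_b):
    """Return the 0-based positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]

# The two nucleotide sequences from the text.
diffs = differing_sites("AGCGGAGCTTA", "GCCGTAGACGC")
print(len(diffs), "of 11 sites differ")
```

The observed number of differences can underestimate the number of substitutions that actually occurred, because a site may have changed more than once; correcting for this is one of the things a substitution model is used for.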

Substitution models are used for a number of things:
# Constructing evolutionary trees in phylogenetics or cladistics.
# Simulating sequences to test other methods and algorithms.

Neutral, independent, finite sites models

Most substitution models used to date are neutral, independent, finite sites models.
; Neutral : Selection does not operate on the substitutions, and so they are unconstrained.
; Independent : Changes in one site do not affect the probability of changes in another site.
; Finite sites : There are finitely many sites, and so over evolution, a single site can be changed multiple times. This means that, for example, if a character has value 0 at time 0 and at time $t$, it could be that no changes occurred, or that it changed to a 1 and back to a 0, or that it changed to a 1 and back to a 0 and then to a 1 and then back to a 0, and so on.
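The multiple-change argument can be made concrete with a minimal sketch. It assumes a symmetric two-state model in which changes at a site arrive as a Poisson process with a single rate (an assumption made here for illustration, not something stated above). The site shows its starting state at time t exactly when an even number of changes has occurred, so the probability is the sum of the Poisson probabilities of 0, 2, 4, ... changes.

```python
import math

def prob_same_state(rate, t, max_changes=200):
    """P(site shows its starting state at time t) in a symmetric two-state model.

    Sums the Poisson probabilities of 0, 2, 4, ... changes; `rate` is the
    assumed rate of change per unit time.
    """
    lam = rate * t
    term = math.exp(-lam)          # probability of exactly 0 changes
    total = 0.0
    for k in range(max_changes + 1):
        if k % 2 == 0:
            total += term
        term *= lam / (k + 1)      # Poisson probability of k + 1 changes
    return total

for t in (0.1, 1.0, 10.0):
    closed_form = 0.5 + 0.5 * math.exp(-2.0 * t)   # known closed form for this model
    print(t, prob_same_state(1.0, t), closed_form)
```

For long times the probability approaches 1/2: the current state then carries almost no information about the starting state.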

The molecular clock and the units of time

Different substitution models deal with time differently.
*It is very common to measure time in substitutions. For example, when constructing a phylogenetic tree with a substitution model, one can simply measure the distance along the branches of the tree in substitutions. This is convenient because it avoids any question of whether the rate of substitution per unit time has changed (by definition, the number of substitutions per substitution is one), and it does not need any information about timescales that could be called into question.
*The molecular clock assumption is also very common, namely that the rate of substitution with respect to time is constant. This differs from measuring time in substitutions only by a multiplying factor (usually called $\mu$, the number of substitutions per unit time). To carry out this type of analysis, one needs to estimate $\mu$ first, which requires knowing at least one branch length in time units ahead of time; this is often a difficult task, and the result can easily be disputed by others.
*The assumption of a molecular clock is often unrealistic, especially across long periods of evolution. For example, even though rodents are genetically very similar to primates, they have undergone a much higher number of substitutions in the estimated time since divergence in some regions of the genome [X Gu and W Li (1992): Higher rates of amino acid substitution in rodents than in man. Molecular Phylogenetics and Evolution, 1:211-214. [http://dx.doi.org/10.1016/1055-7903(92)90017-B DOI] ] . This could be due to their shorter generation time [WH Li, J Ellsworth, BH Krushkal, J Chang, and D Hewett-Emmett (1996): Rates of nucleotide substitution in primates and rodents and the generation-time effect hypothesis. Molecular Phylogenetics and Evolution, 5:182-187. [http://dx.doi.org/10.1006/mpev.1996.0012 DOI] ] , higher metabolic rate, increased population structuring, increased rate of speciation, or smaller body size [AP Martin and SR Palumbi (1993): Body size, metabolic rate, generation time, and the molecular clock. Proceedings of the National Academy of Sciences, USA, 90:4087-4091. [http://www.pnas.org/cgi/content/abstract/90/9/4087 PNAS] ] [Z Yang and R Nielsen (1998): Synonymous and nonsynonymous rate variation in nuclear genes of mammals. Journal of Molecular Evolution, 46:409-418. [http://dx.doi.org/10.1007/PL00006320 DOI] ] . When studying events like the Cambrian explosion under a molecular clock assumption, poor concurrence between cladistic and phylogenetic data is often observed. There has been some work on models allowing a variable rate of evolution (see for example [H Kishino, JL Thorne and WJ Bruno (2001): Performance of a divergence estimation method under a probabilistic model of rate evolution. Molecular Biology and Evolution, 18:352-361. [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=11230536&itool=iconfft&query_hl=18&itool=pubmed_docsum PubMed] ] and [JL Thorne, H Kishino and IS Painter (1998): Estimating the rate of evolution of the rate of molecular evolution. Molecular Biology and Evolution, 15:1647-1657. [http://mbe.oxfordjournals.org/cgi/content/abstract/15/12/1647 MBE] ] ).
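Under a strict molecular clock, a branch length in substitutions per site is the clock rate multiplied by the elapsed time. The sketch below shows the resulting arithmetic; the function names and all numbers are hypothetical and serve only as an illustration.

```python
def estimate_mu(calibration_subs_per_site, calibration_time):
    """Clock rate (substitutions per site per unit time) from one branch
    whose duration is assumed to be known, e.g. from a dated fossil."""
    return calibration_subs_per_site / calibration_time

def branch_time(subs_per_site, mu):
    """Convert a branch length in substitutions per site into time."""
    return subs_per_site / mu

# Hypothetical calibration: 0.05 substitutions per site over 10 time units.
mu = estimate_mu(0.05, 10.0)
# Duration implied for another branch of 0.02 substitutions per site.
print(branch_time(0.02, mu))
```

Every date obtained this way inherits the uncertainty of the single calibration, which is why the estimate of the clock rate is so easily disputed.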

Time-reversible models

Most useful substitution models are time-reversible. In terms of substitution models, this means that the process looks statistically the same whether it is followed forward or backward in time; in particular, the relative frequencies of each character do not change over time.

For a time-reversible model, there is no assumption that substitutions preferentially proceed in certain directions over time. For example, A -> C -> G is treated the same as G -> C -> A.

Time-reversibility is useful because, when real biological data are analysed, there is generally no access to the sequences of ancestral species, only to those of species present today. When a model is time-reversible, which species was the ancestral species is irrelevant. Instead, the phylogenetic tree can be rooted along the branch leading to any arbitrary extant species, re-rooted later based on new knowledge, or left unrooted.

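A time-reversible model satisfies the following property for every pair of characters $i$ and $j$: $\pi_i Q_{ij} = \pi_j Q_{ji}$, for example $\pi_1 Q_{12} = \pi_2 Q_{21}$. Here $\pi_i$ is the equilibrium frequency of character $i$ and $Q_{ij}$ is the rate of change from $i$ to $j$; both are defined in the next section.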
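The property can be checked numerically for any candidate frequency vector and rate matrix. The following is a minimal sketch; the function name and the two three-state matrices are illustrative assumptions. The cyclic example keeps uniform frequencies unchanged over time and still fails the check, which shows that time-reversibility is a stronger requirement than constant character frequencies.

```python
def is_time_reversible(pi, Q, tol=1e-12):
    """Check pi[i] * Q[i][j] == pi[j] * Q[j][i] for every pair i != j."""
    n = len(pi)
    return all(
        abs(pi[i] * Q[i][j] - pi[j] * Q[j][i]) <= tol
        for i in range(n) for j in range(i + 1, n)
    )

pi = [1 / 3, 1 / 3, 1 / 3]

# Symmetric rates with uniform frequencies: reversible.
Q_symmetric = [[-2.0, 1.0, 1.0],
               [1.0, -2.0, 1.0],
               [1.0, 1.0, -2.0]]

# Changes only go 0 -> 1 -> 2 -> 0: frequencies stay uniform, but the
# process has a preferred direction, so it is not reversible.
Q_cyclic = [[-1.0, 1.0, 0.0],
            [0.0, -1.0, 1.0],
            [1.0, 0.0, -1.0]]

print(is_time_reversible(pi, Q_symmetric))
print(is_time_reversible(pi, Q_cyclic))
```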

The mathematics of substitution models

Neutral, independent, finite sites models (assuming a constant rate of evolution) have two sets of parameters: $\Pi$, a vector of base (or character) frequencies at time zero, and the rate matrix $Q$, which describes the rate at which bases of one type change into bases of another type. For a time-reversible model, $\Pi$ is usually referred to as the equilibrium base frequencies and applies at all times. The entry $Q_{ij}$, for $i \ne j$, is the rate at which base $i$ changes to base $j$. For convenience, the diagonal entries of $Q$ are chosen so that each row sums to zero:

: $Q_{ii} = -\sum_{j \ne i} Q_{ij}$
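A minimal sketch of this convention: given the off-diagonal rates, the diagonal is filled in so that every row sums to zero. The helper name `fill_diagonal` and the example rates are illustrative assumptions.

```python
def fill_diagonal(off_diagonal_rates):
    """Return a rate matrix whose diagonal makes every row sum to zero.

    off_diagonal_rates[i][j] is the rate from state i to state j; any
    values supplied on the diagonal are ignored.
    """
    n = len(off_diagonal_rates)
    Q = [list(row) for row in off_diagonal_rates]
    for i in range(n):
        Q[i][i] = -sum(Q[i][j] for j in range(n) if j != i)
    return Q

Q = fill_diagonal([[0, 1, 2, 1],
                   [1, 0, 1, 2],
                   [2, 1, 0, 1],
                   [1, 2, 1, 0]])
print(Q[0])   # first row, with its diagonal entry filled in
```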

The transition matrix function maps a branch length (in some unit of time, possibly substitutions) to a matrix of conditional probabilities. It is denoted $P(t)$. The entry in the $i$-th row and $j$-th column, $P_{ij}(t)$, is the probability that there is a base $j$ at a given position after time $t$, conditional on there being a base $i$ at that position at time 0. When the model is time-reversible, this calculation can be performed between any two sequences, even if neither is the ancestor of the other, provided the total branch length between them is known.

The asymptotic properties of $P_{ij}(t)$ are such that $\lim_{t \to 0} P_{ij}(t) = \delta_{ij}$, where $\delta_{ij}$ is 1 if $i = j$ and 0 otherwise, i.e. there is no change in base composition between a sequence and itself, and $\lim_{t \to \infty} P_{ij}(t) = \pi_j$, or in other words, as time goes to infinity, the probability of finding base $j$ at a position given that there was a base $i$ at that position originally goes to the equilibrium probability of base $j$ at that position (regardless of the original base).

The transition matrix can be computed from the rate matrix by $P(t) = e^{Qt}$. Since $Q$ is a matrix, this is a matrix exponential, defined by the Taylor series $P(t) = \sum_{n=0}^{\infty} \frac{Q^n t^n}{n!}$, which in practice is evaluated numerically, for example by truncating the series.
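The following sketch evaluates the series by plain truncation in pure Python and compares one entry with the known closed form for a rate matrix in which all off-diagonal rates are equal. The function names, the rate value `alpha` and the number of terms are assumptions made for illustration.

```python
import math

def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(Q, t, terms=60):
    """Approximate P(t) = exp(Q t) by truncating its Taylor series."""
    n = len(Q)
    Qt = [[q * t for q in row] for row in Q]
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # n = 0 term
    term = [row[:] for row in P]
    for k in range(1, terms):
        term = mat_mul(term, Qt)
        term = [[x / k for x in row] for row in term]   # now (Q t)^k / k!
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return P

# Every off-diagonal rate equals alpha; rows sum to zero.
alpha = 0.1
Q = [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]
P = transition_matrix(Q, 2.0)

# For this Q the diagonal entries have the closed form 1/4 + 3/4 exp(-4 alpha t).
print(P[0][0], 0.25 + 0.75 * math.exp(-4 * alpha * 2.0))
```

Plain truncation is adequate when the entries of $Qt$ are small. For long branches the terms grow large before they shrink and rounding errors accumulate, so numerical libraries generally rely on scaling-and-squaring or on an eigendecomposition of $Q$ instead.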

Time reversibility implies the stationarity constraint $\Pi Q = 0$: summing $\pi_i Q_{ij} = \pi_j Q_{ji}$ over $i$ gives $\pi_j \sum_i Q_{ji}$, which is zero because the rows were defined to sum to zero. The overall base frequencies therefore do not systematically change from $\Pi$. This is equivalent to saying $\Pi P(t) = \Pi$ for all $t$.
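Both forms of the stationarity condition can be verified on a small example. The sketch below uses an illustrative three-state model in which the rate from $i$ to $j$ equals the frequency of $j$; for this particular rate matrix the matrix exponential has a simple closed form, used here to avoid a general-purpose routine. The frequencies and the helper names are assumptions.

```python
import math

def vec_mat(v, M):
    """Multiply a row vector by a matrix."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

pi = [0.5, 0.3, 0.2]
n = len(pi)

# Illustrative reversible model: Q[i][j] = pi[j] off the diagonal, and the
# diagonal pi[i] - 1 makes each row sum to zero.
Q = [[pi[j] - (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]

def transition_matrix(t):
    """Closed form of exp(Q t), valid for this particular Q only."""
    decay = math.exp(-t)
    return [[decay * (1.0 if i == j else 0.0) + (1.0 - decay) * pi[j]
             for j in range(n)] for i in range(n)]

print(vec_mat(pi, Q))                        # every entry close to 0
print(vec_mat(pi, transition_matrix(0.7)))   # close to pi
```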

GTR: Generalised time reversible

GTR is the most general neutral, independent, finite-sites, time-reversible model possible. It was first described in a general form by Simon Tavaré in 1986 [Tavaré S (1986): Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences. American Mathematical Society: Lectures on Mathematics in the Life Sciences, 17:57–86. [http://www-hto.usc.edu/people/stavare/STpapers-pdf/T86.pdf PDF] ] .

The GTR parameters consist of an equilibrium base frequency vector, $\Pi = (\pi_1, \pi_2, \pi_3, \pi_4)$, giving the frequency at which each base occurs at each site, and the rate matrix

: $Q = \begin{pmatrix} -(x_1+x_2+x_3) & x_1 & x_2 & x_3 \\ \frac{\pi_1 x_1}{\pi_2} & -\left(\frac{\pi_1 x_1}{\pi_2} + x_4 + x_5\right) & x_4 & x_5 \\ \frac{\pi_1 x_2}{\pi_3} & \frac{\pi_2 x_4}{\pi_3} & -\left(\frac{\pi_1 x_2}{\pi_3} + \frac{\pi_2 x_4}{\pi_3} + x_6\right) & x_6 \\ \frac{\pi_1 x_3}{\pi_4} & \frac{\pi_2 x_5}{\pi_4} & \frac{\pi_3 x_6}{\pi_4} & -\left(\frac{\pi_1 x_3}{\pi_4} + \frac{\pi_2 x_5}{\pi_4} + \frac{\pi_3 x_6}{\pi_4}\right) \end{pmatrix}$

Therefore, GTR (for four characters, as is often the case in phylogenetics) requires 6 substitution rate parameters, as well as 4 equilibrium base frequency parameters. Since the 4 frequencies have to sum to 1, only 3 of them are free. The 6 rate parameters are usually re-expressed as 5 relative rates plus $\mu$, the overall number of substitutions per unit time, which gives 8 parameters plus $\mu$. When measuring time in substitutions ($\mu = 1$), only 8 free parameters remain.
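The construction of the matrix above can be sketched in a few lines: the six rates fill the entries above the diagonal row by row, each mirrored entry is scaled by the ratio of the two frequencies, and the diagonal is set so that rows sum to zero. The function name and the example values are illustrative assumptions.

```python
def gtr_rate_matrix(pi, x):
    """Build the GTR rate matrix in the parameterisation shown above.

    pi: four equilibrium frequencies (summing to 1).
    x:  six rates x_1..x_6 for the entries above the diagonal, row by row.
    """
    n = 4
    Q = [[0.0] * n for _ in range(n)]
    upper = iter(x)
    for i in range(n):
        for j in range(i + 1, n):
            rate = next(upper)
            Q[i][j] = rate                    # entry above the diagonal
            Q[j][i] = pi[i] * rate / pi[j]    # keeps pi_i Q_ij = pi_j Q_ji
    for i in range(n):
        Q[i][i] = -sum(Q[i][j] for j in range(n) if j != i)
    return Q

pi = [0.1, 0.2, 0.3, 0.4]
Q = gtr_rate_matrix(pi, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
for row in Q:
    print(row)
```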

In general, to compute the number of parameters, count the number of entries above the diagonal in the matrix, i.e. $\frac{n^2-n}{2}$ for $n$ trait values per site, then add $n-1$ for the equilibrium base frequencies, and subtract 1 because $\mu$ is fixed. This gives

: $\frac{n^2-n}{2} + (n - 1) - 1 = \frac{1}{2}n^2 + \frac{1}{2}n - 2.$

For example, for an amino acid sequence (there are 20 "standard" amino acids that make up proteins), there are 208 parameters. However, when studying coding regions of the genome, it is more common to work with a codon substitution model (a codon is three bases and codes for one amino acid in a protein). There are $4^3 = 64$ codons, resulting in 2078 free parameters, but when the rates for transitions between codons which differ by more than one base are assumed to be zero, there are only $\frac{20 \times 19 \times 3}{2} + 63 - 1 = 632$ parameters.
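The general count can be reproduced directly from the formula. This is a minimal sketch; the function name is an illustrative choice.

```python
def gtr_free_parameters(n):
    """Free parameters of a general time-reversible model with n states,
    with time measured in substitutions (mu fixed)."""
    rates = (n * n - n) // 2      # entries above the diagonal
    frequencies = n - 1           # frequencies sum to 1
    return rates + frequencies - 1

for n in (4, 20, 64):
    print(n, gtr_free_parameters(n))
# prints:
# 4 8
# 20 208
# 64 2078
```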

Mechanistic vs. empirical models

A main difference between evolutionary models is how many parameters are estimated anew for each data set under consideration and how many are estimated once on a large data set. Mechanistic models describe all substitutions as a function of a number of parameters which are estimated for every data set analyzed, preferably using maximum likelihood. This has the advantage that the model can be adjusted to the particularities of a specific data set (e.g. different composition biases in DNA). Problems can arise when too many parameters are used, particularly if they can compensate for each other. It is then often the case that the data set is too small to yield enough information to estimate all parameters accurately.

Empirical models are created by estimating many parameters (typically all entries of the rate matrix and the character frequencies, see the GTR model above) from a large data set. These parameters are then fixed and reused for every data set. This has the advantage that those parameters can be estimated more accurately. Normally, it is not possible to estimate all entries of the substitution matrix from the current data set alone. On the downside, the estimated parameters might be too generic and might not fit a particular data set well enough.

With large-scale genome sequencing still producing very large amounts of DNA and protein sequences, there is enough data available to create empirical models with any number of parameters. Because of the problems mentioned above, the two approaches are often combined by estimating most of the parameters once on large-scale data, while a few remaining parameters are adjusted to the data set under consideration. The following sections give an overview of the different approaches taken for DNA, protein or codon-based models.

Models of DNA substitution

See main article: Models of DNA evolution for more formal descriptions of the DNA models.

Models of DNA evolution were first proposed in 1969 by Jukes and Cantor [Jukes, TH and Cantor, CR. 1969. Evolution of protein molecules. Pp. 21-123 in H. N. Munro, ed. "Mammalian protein metabolism". Academic Press, New York.] , assuming equal substitution rates as well as equal equilibrium frequencies for all bases. In 1980, Kimura [Kimura, M. 1980. A simple method for estimating evolutionary rate of base substitution through comparative studies of nucleotide sequences. "Journal of Molecular Evolution" 16:111-120. [http://dx.doi.org/10.1007/BF01731581 DOI] ] introduced a model with two parameters, one for the transition rate and one for the transversion rate, and in 1981 Felsenstein [Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. "Journal of Molecular Evolution" 17:368-376. [http://dx.doi.org/10.1007/BF01734359 DOI] ] proposed a model in which the substitution rate corresponds to the equilibrium frequency of the target nucleotide. Hasegawa, Kishino and Yano (HKY) [Hasegawa, M., Kishino, H and Yano, T. 1985. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. "Journal of Molecular Evolution" 22:160-174. [http://dx.doi.org/10.1007/BF02101694 DOI] ] unified the last two models into a six-parameter model. In the 1990s, models similar to HKY were developed and refined by several researchers (e.g. [Tamura, K. 1992. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C content biases. "Molecular Biology and Evolution" 9:678-687. [http://mbe.oxfordjournals.org/cgi/content/abstract/9/4/678 MBE] ] and [Tamura, K., and M. Nei. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. "Molecular Biology and Evolution" 10:512-526. [http://mbe.oxfordjournals.org/cgi/content/abstract/10/3/512 MBE] ] ).
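The relationship between these models can be sketched with one rate-matrix builder. It assumes a commonly used way of writing an HKY-style matrix that is not spelled out in the text above: the rate from base $i$ to base $j$ is the frequency of $j$, multiplied by a transition/transversion ratio `kappa` when the change is a transition (A<->G or C<->T). Overall rate scaling is omitted, and the names `hky_rate_matrix`, `kappa` and the example values are illustrative.

```python
BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def hky_rate_matrix(pi, kappa):
    """Unscaled HKY-style rate matrix with bases ordered A, C, G, T."""
    Q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(BASES):
        for j, b in enumerate(BASES):
            if i != j:
                weight = kappa if (a, b) in TRANSITIONS else 1.0
                Q[i][j] = weight * pi[j]
        Q[i][i] = -sum(Q[i])   # diagonal is still 0.0 here, so rows sum to zero
    return Q

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.1, 0.2, 0.3, 0.4]

jukes_cantor = hky_rate_matrix(uniform, 1.0)   # equal rates, equal frequencies
kimura = hky_rate_matrix(uniform, 2.0)         # transitions differ from transversions
felsenstein = hky_rate_matrix(skewed, 1.0)     # rate follows target frequency
hky = hky_rate_matrix(skewed, 2.0)             # both features combined
print(hky[0])
```

Setting `kappa` to 1 or the frequencies to 1/4 recovers the simpler models, which is the sense in which the later model unifies the earlier ones.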

For DNA substitution models, mainly mechanistic models (as described above) are employed. The small number of parameters to estimate makes this feasible. In addition, DNA is often highly optimized for specific purposes (e.g. fast expression or stability) depending on the organism and the type of gene, making it necessary to adjust the model to these circumstances.

Models of amino acid substitutions

For many analyses, particularly for longer evolutionary distances, evolution is modeled at the amino acid level. Since not all DNA substitutions alter the encoded amino acid, information is lost when looking at amino acids instead of nucleotide bases. However, several advantages speak in favor of using the amino acid information: DNA is much more inclined to show compositional bias than amino acids; not all positions in the DNA evolve at the same speed (non-synonymous mutations are less likely to become fixed in the population than synonymous ones); and, probably most important, because of those fast-evolving positions and the limited alphabet size (only four possible states), DNA suffers much more from back substitutions, making it difficult to accurately estimate longer distances.

Unlike the DNA models, amino acid models traditionally are empirical models. They were pioneered in the 1970s by Dayhoff and co-workers [Dayhoff MO, Schwartz RM, Orcutt BC (1978): A model for evolutionary change in proteins. Atlas of Protein Sequence and Structure, 5:345–352.] , by estimating replacement rates from protein alignments with at least 85% identity. This minimized the chances of observing multiple substitutions at a site. From the estimated rate matrix, a series of replacement probability matrices were derived, known under names such as PAM250. The Dayhoff model was used to assess the significance of homology search results, but also for phylogenetic analyses. The Dayhoff PAM matrices were based on relatively few alignments (since no more were available at that time), but in the 1990s, new matrices were estimated using almost the same methodology, based on the large protein databases available then ([Gonnet GH, Cohen MA, Benner SA (1992): Exhaustive matching of the entire protein sequence database. Science, 256:1443–1445. doi:10.1126/science.1604319, PMID 1604319.] [Jones DT, Taylor WR, Thornton JM (1992): The rapid generation of mutation data matrices from protein sequences. Comput Applic Biosci, 8:275–282.] , the latter being known as "JTT" matrices).
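In the Dayhoff scheme, the matrix for a larger evolutionary distance is obtained by multiplying the one-step (PAM1) replacement probability matrix by itself, so that PAM250 corresponds to its 250th power. The sketch below illustrates that step on a made-up three-letter alphabet; real PAM matrices are 20 x 20 and estimated from alignments, and the numbers here are purely hypothetical.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(M, exponent):
    """M raised to a non-negative integer power by repeated squaring."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in M]
    while exponent > 0:
        if exponent & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        exponent >>= 1
    return result

# Hypothetical one-step replacement probabilities; each row sums to 1.
step = [[0.990, 0.006, 0.004],
        [0.003, 0.990, 0.007],
        [0.002, 0.008, 0.990]]

p250 = mat_power(step, 250)
for row in p250:
    print([round(p, 3) for p in row])
```

After many steps the rows become more alike, reflecting the loss of information about the original character over long distances.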

Wikimedia Foundation. 2010.