- Tajima's D
Tajima's D is a statistical test created by and named after the Japanese researcher Fumio Tajima. The purpose of the test is to distinguish between a DNA sequence evolving randomly ("neutrally") and one evolving under a non-random process, including directional selection or balancing selection, demographic expansion or contraction, genetic hitchhiking, or introgression. A randomly evolving DNA sequence contains mutations with no effect on the fitness and survival of an organism. The randomly evolving mutations are called "neutral", while mutations under selection are "non-neutral". For example, you would expect a mutation that causes prenatal death or severe disease to be under selection.

According to Motoo Kimura's neutral theory of molecular evolution, the majority of mutations in the human genome are neutral, i.e. have no effect on fitness and survival. When looking at the human population as a whole, we say that the population frequency of a neutral mutation fluctuates randomly through genetic drift (i.e. the percentage of people in the population with the mutation changes from one generation to the next, and this percentage is equally likely to go up or down).

The strength of genetic drift depends on the population size. If a population is at a constant size with a constant mutation rate, the population will reach an equilibrium of gene frequencies. This equilibrium has important properties, including the number of segregating sites $S$ and the number of nucleotide differences between sampled pairs of sequences (these are called pairwise differences). To standardize the pairwise differences, the mean or 'average' number of pairwise differences is used. This is simply the sum of the pairwise differences divided by the number of pairs, and is signified by $\pi$.

The purpose of Tajima's test is to identify sequences which do not fit the neutral theory model at equilibrium between mutation and genetic drift. In order to perform the test on a DNA sequence or gene, you need to sequence homologous DNA for at least 3 individuals. Tajima's test compares a standardized measure of the total number of segregating sites (these are DNA sites that are polymorphic) in the sampled DNA and the average number of mutations between pairs in the sample. If these two numbers are the same (or close enough that the difference between them is within about two standard deviations of zero, its expected value under neutrality), then the null hypothesis of neutrality cannot be rejected. Otherwise, the null hypothesis of neutrality is rejected.

Hypothetical example
Each chromosome in the human genome can be represented as a long DNA sequence. Genes represent about 2% of this sequence, while much of the other 98% is "junk DNA" not coding a functional protein. Genes are under selection, while the junk evolves randomly.

Let's say that you are a genetic researcher who finds two mutations: a mutation in a gene which causes prenatal death and a mutation in junk DNA which has no effect on human health or survival. You publish your findings in a scientific journal, identifying the first mutation as "under negative selection" and the second as "neutral". The neutral mutation gets passed on from one generation to the next, while the mutation under negative selection disappears, since anyone with the mutation cannot reproduce and pass it on to the next generation.

In order to back your discovery with more scientific evidence, you gather DNA samples from 100 people and determine the exact DNA sequence of both the gene and the junk DNA region in each of them. Using all 100 DNA samples as input, you calculate Tajima's D for both the gene and the junk DNA. If your hypothesis was correct, then Tajima's test will output "neutral" for the junk DNA and "non-neutral" for the gene.
Scientific explanation
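Under the neutral theory model, for a population at constant size at equilibrium:

$$E[\pi] = \theta = E\left[\frac{S}{\sum_{i=1}^{n-1} \frac{1}{i}}\right] = 4N\mu$$

for diploid DNA, and

$$E[\pi] = \theta = E\left[\frac{S}{\sum_{i=1}^{n-1} \frac{1}{i}}\right] = 2N\mu$$

for haploid, where $n$ is the number of sequences sampled, $N$ is the effective population size and $\mu$ is the mutation rate. But selection, demographic fluctuations and other violations of the neutral model (including rate heterogeneity and introgression) will change the expected values of $S$ and $\pi$, so that the two estimates of $\theta$ are no longer expected to be equal. The difference in the expectations for these two variables (which can be positive or negative) is the crux of Tajima's "D" test statistic.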
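As a rough illustration of why $S$ has to be scaled before it can be compared with $\pi$, the sketch below computes the harmonic-sum factor $\sum_{i=1}^{n-1} 1/i$ and the resulting estimate of $\theta$ (commonly known as Watterson's estimator). The function names are hypothetical and chosen only for this example.

```python
def harmonic_a1(n):
    """Sum of 1/i for i = 1 .. n-1: the scaling factor for S in a sample of n."""
    return sum(1 / i for i in range(1, n))


def theta_from_segregating_sites(S, n):
    """Estimate theta from S segregating sites observed in n sequences."""
    return S / harmonic_a1(n)


# The same S implies a smaller theta as the sample grows, because a larger
# sample is expected to expose more segregating sites.
for n in (5, 10, 50):
    print(n, round(theta_from_segregating_sites(10, n), 3))
```

The mean pairwise difference $\pi$ needs no such correction, since it is already an average per pair of sequences.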
$D$ is calculated by taking the difference between the two estimates of the population genetics parameter $\theta$. This difference is called $d$, and $D$ is calculated by dividing $d$ by the square root of its variance, $\sqrt{\hat{V}(d)}$ (its standard deviation, by definition):

$$D = \frac{d}{\sqrt{\hat{V}(d)}}$$

Fumio Tajima demonstrated by computer simulation that the $D$ statistic described above could be modeled using a beta distribution. If the $D$ value for a sample of sequences is outside the confidence interval, then one can reject the null hypothesis of neutral mutation for the sequence in question.

Statistical test
When performing a statistical test such as Tajima's D, you have a 'null hypothesis', an 'alternative hypothesis', a distribution and a confidence interval. The uppercase D calculated in the example below represents "how many standard deviations from the mean your lower-case d is". In the example below the answer was 2 standard deviations. To keep matters simple, let's just say that this is "really far" from the mean, so the chance of randomly getting a D that far from the mean or farther is very small (let's say 1%). The value of D obtained is therefore outside the 99% confidence interval for the null hypothesis, and the conclusion of the experiment is to "reject the null hypothesis".

In Tajima's test, the null hypothesis is neutral evolution. Under this hypothesis, the 4 polymorphisms in the example have no effect on evolutionary fitness, hence their frequency in the population fluctuates randomly from one generation to the next. In the example below, we reject the null hypothesis, which suggests that some non-random process, such as evolutionary pressure, is acting on the genetic sequence in the example. In other words, our unknown genetic sequence may have some (still unknown) function, which causes its population frequency to deviate from what we'd expect for a neutral sequence.

Mathematical details
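$$D = \frac{d}{\sqrt{\hat{V}(d)}} = \frac{\hat{k} - \frac{S}{a_1}}{\sqrt{e_1 S + e_2 S(S-1)}}$$

where $\hat{k}$ and $\frac{S}{a_1}$ are two estimates of the expected number of single nucleotide polymorphisms (SNPs) between two DNA sequences under the neutral mutation model in a sample of size $n$ from an effective population size $N$. Here $a_1 = \sum_{i=1}^{n-1} \frac{1}{i}$, and $e_1$ and $e_2$ are constants that depend only on $n$.

The first estimate is the average number of SNPs found in $\binom{n}{2}$ (n choose 2) pairwise comparisons of sequences $(i, j)$ in the sample, that is, the mean pairwise difference $\pi$:

$$\hat{k} = \frac{\sum\sum_{i<j} k_{ij}}{\binom{n}{2}}$$

The second estimate is derived from the expected value of $S$, the total number of polymorphisms in the sample:

$$E(S) = a_1 M$$

Tajima defines $M = 4N\mu$, whereas Hartl & Clark use a different symbol to define the same parameter, $\theta = 4N\mu$.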
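The formula can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for the tools listed at the end of the article. The function name `tajimas_d` and the input format (equal-length aligned strings) are assumptions made for this example, and the constants follow the definitions in Tajima's 1989 paper. The sketch requires at least four sequences, because for $n = 3$ both variance constants are zero.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Return Tajima's D for a list of equal-length aligned sequences."""
    n = len(seqs)
    if n < 4:
        # Choice made for this sketch: with n = 3 the variance is zero.
        raise ValueError("need at least 4 sequences")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must have equal length")

    # S: number of segregating (polymorphic) sites.
    S = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    if S == 0:
        raise ValueError("no segregating sites; D is undefined")

    # k_hat: mean number of differences over all n-choose-2 pairs.
    pairs = list(combinations(seqs, 2))
    k_hat = sum(sum(x != y for x, y in zip(s, t)) for s, t in pairs) / len(pairs)

    # Constants that depend only on the sample size n.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    d = k_hat - S / a1
    return d / sqrt(e1 * S + e2 * S * (S - 1))


# The five sequences from the worked example later in the article.
example = [
    "00000000000000000000",
    "00100000000010000010",
    "00000000000010000010",
    "00000010000000000010",
    "00000010000010000010",
]
print(round(tajimas_d(example), 2))
```

For these five sequences the full formula gives $d = 2 - 4/a_1 \approx 0.08$ and $D \approx 0.27$. The worked example in the article reaches −2 because it simplifies the arithmetic: it subtracts $S$ directly and assumes a standard deviation of 1.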
Historical example
The genetic mutation which causes sickle-cell anemia is non-neutral because it affects survival and fitness. People homozygous for the mutation have the sickle-cell disease, while those without the mutation (homozygous for the wild-type allele) do not have the disease. People with one copy of the mutated allele (heterozygous) do not have the disease, but instead are resistant to malaria. Thus in Africa, where there is a prevalence of the malaria parasite "Plasmodium falciparum", which is transmitted by Anopheles mosquitoes, there is a selective advantage for heterozygous individuals. Meanwhile, in countries such as the USA where the risk of malaria infection is low, the population frequency of the mutation is lower.

Example
Suppose you are a geneticist studying an unknown gene. As part of your research you get DNA samples from four random people (plus yourself). For simplicity, you label your sequence as a string of zeroes, and for the other four people you put a zero when their DNA is the same as yours and a one when it is different. (For this example, the specific type of difference is not important.)
Position 12345 67890 12345 67890
Person Y 00000 00000 00000 00000
Person A 00100 00000 00100 00010
Person B 00000 00000 00100 00010
Person C 00000 01000 00000 00010
Person D 00000 01000 00100 00010

Person Y is you!

Notice the four polymorphic sites (positions where someone differs from you, at 3, 7, 13 and 19 above). Now compare each pair of sequences and get the average number of polymorphisms between two sequences. There are "five choose two" (ten) comparisons that need to be done.

You vs A
Person Y 00000 00000 00000 00000
Person A 00100 00000 00100 00010
3 polymorphisms
You vs B
Person Y 00000 00000 00000 00000
Person B 00000 00000 00100 00010
2 polymorphisms
You vs C
Person Y 00000 00000 00000 00000
Person C 00000 01000 00000 00010
2 polymorphisms
You vs D
Person Y 00000 00000 00000 00000
Person D 00000 01000 00100 00010
3 polymorphisms
A vs B
Person A 00100 00000 00100 00010
Person B 00000 00000 00100 00010
1 polymorphism
A vs C
Person A 00100 00000 00100 00010
Person C 00000 01000 00000 00010
3 polymorphisms
A vs D
Person A 00100 00000 00100 00010
Person D 00000 01000 00100 00010
2 polymorphisms
B vs C
Person B 00000 00000 00100 00010
Person C 00000 01000 00000 00010
2 polymorphisms
B vs D
Person B 00000 00000 00100 00010
Person D 00000 01000 00100 00010
1 polymorphism
C vs D
Person C 00000 01000 00000 00010
Person D 00000 01000 00100 00010
1 polymorphism
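The average number of polymorphisms is $\frac{3+2+2+3+1+3+2+2+1+1}{10} = 2$.

The lower-case "d" described above is the difference between these two numbers: the average number of polymorphisms found in pairwise comparison (2) and the total number of polymorphic sites (4). Thus $d = 2 - 4 = -2$. This is a simplification to keep the arithmetic easy: the full statistic first divides the number of polymorphic sites by $a_1 = 1 + \frac{1}{2} + \dots + \frac{1}{n-1}$ before taking the difference.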
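The ten comparisons and their average can be reproduced with a short script. This is only a sketch: the variable and function names are made up for the example, and the sequences are copied from the table above.

```python
from itertools import combinations

# Sequences from the table above; spaces are only visual grouping.
samples = {
    "Y": "00000 00000 00000 00000",
    "A": "00100 00000 00100 00010",
    "B": "00000 00000 00100 00010",
    "C": "00000 01000 00000 00010",
    "D": "00000 01000 00100 00010",
}
seqs = {name: s.replace(" ", "") for name, s in samples.items()}


def differences(s, t):
    """Count the positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(s, t))


# One entry per pair of people: "five choose two" = ten comparisons.
pair_counts = {
    (a, b): differences(seqs[a], seqs[b]) for a, b in combinations(seqs, 2)
}
mean_pairwise = sum(pair_counts.values()) / len(pair_counts)

# 1-based positions at which at least two people differ.
polymorphic_sites = [
    pos
    for pos, column in enumerate(zip(*seqs.values()), start=1)
    if len(set(column)) > 1
]

for pair, count in pair_counts.items():
    print(pair, count)
print("mean pairwise differences:", mean_pairwise)
print("polymorphic sites:", polymorphic_sites)
```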
Since this is a statistical test, you need to divide lower-case "d" by its standard deviation to get upper-case "D". Let's say that someone else already did this calculation for you, and obtained a standard deviation of 1. Thus Tajima's "D" for your set of sequences is −2.

Significance
A negative Tajima's D signifies an excess of low-frequency polymorphisms, consistent with population size expansion and/or purifying (negative) selection. A positive Tajima's D signifies a deficit of both low- and high-frequency polymorphisms (that is, an excess at intermediate frequencies), consistent with a decrease in population size and/or balancing selection.
References
[http://www.genetics.org/cgi/content/abstract/123/3/585] Statistical Method for Testing the Neutral Mutation Hypothesis by DNA Polymorphism. Fumio Tajima. Genetics, 123: 585-595. 1989.
[http://scholar.google.com/scholar?sourceid=Mozilla-search&q=principles+of+population+genetics+hartl+clark] Principles of Population Genetics, 4th ed. Daniel L. Hartl & Andrew G. Clark. Sinauer Associates, Inc. 2007.

Computational tools for Tajima's D test
* [http://www.ub.es/dnasp/] DNAsp (Windows)
* [http://www.ub.es/softevol/variscan/] Variscan (Mac OS X, Linux, Windows)
* [http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doLoadUrl=submit&hgS_loadUrlName=http://www.soe.ucsc.edu/~daryl/tajDsession.txt] Online view of Tajima's D values in human genome
* MEGA4
Wikimedia Foundation. 2010.