Population stratification
Population stratification is the presence of a systematic difference in allele frequencies between subpopulations in a population, possibly due to different ancestry. Population stratification is also referred to as population structure.
Causes of population stratification
The most obvious cause of population stratification is migration, where individuals from one population migrate into another population. Over generations the population stratification will diminish due to admixture. Another form of population stratification is spurious relatedness, where non-random mating causes the members of a certain subpopulation to be more closely related to each other than to the rest of the population.
Population stratification and association studies
Population stratification can be a problem for association studies, such as case-control studies, where an association that is found could be due to the underlying structure of the population and not to a disease-associated locus. Also, the real disease-causing locus might not be found in the study if the locus is less prevalent in the population from which the case subjects are chosen. It is therefore often preferable to use family-based data, where the effect of population stratification can easily be controlled for. If the structure is known, or a putative structure is found, there are a number of possible ways to incorporate this structure into the association study and thus compensate for any population bias. Another possibility is to use unlinked markers to control the possible inflation of the number of false positives. This is known as genomic control.
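As a rough illustration of how structure alone can produce an association, the following sketch pools two subpopulations in which the allele is unrelated to disease status. All counts, the two-subpopulation setup and the helper name `allelic_chi2` are hypothetical and chosen only for this example.

```python
def allelic_chi2(case_a, case_total, ctrl_a, ctrl_total):
    """Pearson chi-square (1 df) for a 2x2 table of allele counts."""
    a, b = case_a, case_total - case_a   # cases: allele A, other allele
    c, d = ctrl_a, ctrl_total - ctrl_a   # controls: allele A, other allele
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical allele counts: (case A, case total, control A, control total).
subpop_1 = (640, 800, 160, 200)   # allele A frequency 0.8, disease common
subpop_2 = (40, 200, 160, 800)    # allele A frequency 0.2, disease rare

# Within each subpopulation, cases and controls carry A at the same frequency.
within_1 = allelic_chi2(*subpop_1)
within_2 = allelic_chi2(*subpop_2)

# Ignoring the structure and pooling the counts creates an association.
pooled = allelic_chi2(*(x + y for x, y in zip(subpop_1, subpop_2)))

print(within_1, within_2, pooled)
```

With these made-up counts both within-subpopulation statistics are zero, while the pooled statistic is far above 3.84, the 5% critical value of a chi-square distribution with one degree of freedom.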
Genomic Control
The assumption of population homogeneity in association studies, especially case-control studies, can easily be violated and can lead to both type I and type II errors. It is therefore important for the models used in the study to compensate for the population structure. The problem in case-control studies is that, if there is a genetic involvement in the disease, the individuals in the case population are more likely to be related than the individuals in the control population. This means that the assumption of independence of observations is violated. Often this will lead to an overestimation of the significance of an association, but it depends on the way the sample was chosen. As long as there is a higher allele frequency in a subpopulation, you will find association with any trait more prevalent in the case population [Lander, E. S. and Schork, N. J. (1994). Genetic dissection of complex traits, Science 265(5181): 2037–2048.]. This kind of spurious association increases as the sample population grows, so the problem should be of special concern in large-scale association studies when loci only cause relatively small effects on the trait.
A method that in some cases can compensate for the problems described above has been developed by Devlin and Roeder (1999) [Devlin, B. and Roeder, K. (1999). Genomic control for association studies, Biometrics 55(4): 997–1004.]. It uses both a frequentist and a Bayesian approach, the latter being appropriate when dealing with a large number of candidate genes. What follows is a short description of how the frequentist way of correcting for population stratification works. It works by using markers that are not linked with the trait in question to correct for any inflation of the statistic caused by population stratification. The method was first developed for binary traits but has since been generalized for quantitative ones [Bacanu, S.-A., Devlin, B. and Roeder, K. (2002). Association studies for quantitative traits in structured populations, Genet Epidemiol 22(1): 78–93.]. For the binary case, which applies to finding genetic differences between the case and control populations, Devlin and Roeder (1999) use Armitage's trend test and the test for allelic frequencies.
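To make the two tests concrete, the sketch below computes Armitage's trend statistic (with genotype scores 0, 1, 2) and the allelic chi-square from genotype counts. The notation (r for case counts, s for control counts, indexed by the number of copies of the counted allele), the function names and the example counts are assumptions made for illustration and are not taken from the text.

```python
def armitage_trend(cases, controls):
    """Armitage trend statistic for genotype counts scored 0, 1, 2."""
    r0, r1, r2 = cases
    s0, s1, s2 = controls
    R, S = sum(cases), sum(controls)   # number of cases and controls
    N = R + S
    n1, n2 = r1 + s1, r2 + s2          # pooled heterozygotes and homozygotes
    num = N * (N * (r1 + 2 * r2) - R * (n1 + 2 * n2)) ** 2
    den = R * S * (N * (n1 + 4 * n2) - (n1 + 2 * n2) ** 2)
    return num / den

def allelic_test(cases, controls):
    """Pearson chi-square on the 2x2 table of allele counts."""
    r0, r1, r2 = cases
    s0, s1, s2 = controls
    a, b = r1 + 2 * r2, 2 * r0 + r1    # case alleles: counted, other
    c, d = s1 + 2 * s2, 2 * s0 + s1    # control alleles: counted, other
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical genotype counts (0, 1, 2 copies of the allele).
cases = (30, 50, 20)
controls = (50, 40, 10)

y2 = armitage_trend(cases, controls)
xa2 = allelic_test(cases, controls)
print(y2, xa2)
```

For these counts, whose pooled genotype frequencies are close to Hardy-Weinberg proportions, the two statistics come out near each other (about 9.2 and 9.6).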
If the population is in Hardy-Weinberg equilibrium, the two statistics are approximately equal. Under the null hypothesis of no population stratification, the trend test statistic, denoted $Y^2$ here, asymptotically follows a $\chi^2$ distribution with one degree of freedom. The idea is that the statistic is inflated by a factor $\lambda$, so that $Y^2 \sim \lambda \chi^2_1$, where $\lambda$ depends on the effect of stratification. The method rests upon the assumption that the inflation factor $\lambda$ is constant, which means that the loci should have roughly equal mutation rates, should not be under different selection in the two populations, and that the amount of Hardy-Weinberg disequilibrium, measured by Wright's coefficient of inbreeding $F$, should not differ between the loci. The last of these is of greatest concern. If the effect of the stratification is similar across the different loci, $\lambda$ can be estimated from the unlinked markers as
$$\hat{\lambda} = \frac{\operatorname{median}(Y_1^2, Y_2^2, \ldots, Y_L^2)}{0.456}$$
where $L$ is the number of unlinked markers. The denominator is derived from the gamma distribution as a robust estimator of $\lambda$. Other estimators have been suggested; for example, Reich and Goldstein [Reich, D. E. and Goldstein, D. B. (2001). Detecting association in a case-control study while correcting for population stratification, Genet Epidemiol 20(1): 4–16.] suggested using the mean of the statistics instead. The median-based estimate is thus not the only way to estimate $\lambda$, but according to Bacanu et al. [Bacanu, S. A., Devlin, B. and Roeder, K. (2000). The power of genomic control, Am J Hum Genet 66(6): 1933–1944.] it is an appropriate estimate even if some of the unlinked markers are actually in disequilibrium with a disease-causing locus or are themselves associated with the disease. Under the null hypothesis, and when correcting for stratification using $L$ unlinked genes, $Y^2/\hat{\lambda}$ is approximately $\chi^2_1$ distributed. With this correction the overall type I error rate should be approximately equal to $\alpha$ even when the population is stratified.
Devlin and Roeder (1999) [Devlin, B. and Roeder, K. (1999). Genomic control for association studies, Biometrics 55(4): 997–1004.] mostly considered the situation where $\alpha = 0.05$ gives a 95% confidence level, and not smaller p-values. Marchini et al. (2004) [Marchini, J., Cardon, L. R., Phillips, M. S. and Donnelly, P. (2004). The effects of human population structure on large genetic association studies, Nat Genet 36(5): 512–517.] demonstrate by simulation that genomic control can lead to an anti-conservative p-value if this value is very small and the two populations (case and control) are extremely distinct. This was especially a problem if the number of unlinked markers was on the order of 50 to 100. This can result in false positives (at that significance level).
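The frequentist correction can be sketched in a few lines. The example assumes the median-based estimator, with the constant 0.456 standing in for the median of a chi-square distribution with one degree of freedom (about 0.455). The function names and the list of null-marker statistics are hypothetical, and a real study would use many more unlinked markers than the five shown.

```python
import math
from statistics import median

# Assumed constant: approximate median of a chi-square variable with 1 df.
CHI2_1_MEDIAN = 0.456

def estimate_lambda(null_stats):
    """Inflation factor estimated from statistics at unlinked markers."""
    return median(null_stats) / CHI2_1_MEDIAN

def chi2_1_sf(x):
    """Upper tail probability of a chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

def genomic_control(stat, null_stats):
    """Return (lambda estimate, corrected statistic, corrected p-value)."""
    lam = estimate_lambda(null_stats)
    corrected = stat / lam
    return lam, corrected, chi2_1_sf(corrected)

# Hypothetical trend statistics at L = 5 unlinked markers.
null_stats = [0.10, 0.35, 0.684, 1.20, 2.90]

# Hypothetical statistic at the candidate locus.
raw_stat = 9.0
lam, corrected, p_corrected = genomic_control(raw_stat, null_stats)
p_raw = chi2_1_sf(raw_stat)

print(lam, corrected, p_raw, p_corrected)
```

Here the estimated inflation factor is 1.5, so the candidate statistic shrinks from 9.0 to 6.0 and its p-value grows accordingly. With so few null markers the estimate itself is noisy, which relates to the concern raised by Marchini et al. about small marker sets.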
Notes & references
Wikimedia Foundation. 2010.