Spurious relationship

In statistics, a spurious relationship (or, sometimes, spurious correlation or spurious regression) is a mathematical relationship in which two events or variables have no direct causal connection, yet it may be wrongly inferred that they do, due to either coincidence or the presence of a certain third, unseen factor (referred to as a "confounding factor" or "lurking variable"). Suppose there is found to be a correlation between A and B. Aside from coincidence, there are three possible relationships:

A causes B,
B causes A, or
C causes both A and B.

In the last case there is a spurious correlation between A and B. In a regression model where A is regressed on B but C is actually the true causal factor for A, this misleading choice of independent variable (B instead of C) is called a specification error.

Because correlation can arise from the presence of a lurking variable rather than from direct causation, it is often said that "Correlation does not imply causation".

General example

An example of a spurious relationship can be seen by examining a city's ice cream sales. These sales are highest when the rate of drownings in city swimming pools is highest. To allege that ice cream sales cause drowning, or vice versa, would be to read causation into a spurious relationship between the two. In reality, a heat wave may have caused both. The heat wave is an example of a hidden or unseen variable, also known as a confounding variable.
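The ice cream example can be reproduced with a small simulation. The sketch below uses invented numbers throughout (the temperature distribution, the coefficients, and the noise levels are all assumptions, and the drowning series is in arbitrary units). Temperature drives both series and neither affects the other. It compares the raw correlation with the partial correlation that holds temperature fixed; `pearson` and `partial_corr` are illustrative helper names.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    rxy, rxz, ryz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (rxy - rxz * ryz) / ((1 - rxz ** 2) * (1 - ryz ** 2)) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical daily data: temperature is the common cause of both series.
temperature = [rng.gauss(25, 5) for _ in range(n)]
ice_cream = [10 * t + rng.gauss(0, 20) for t in temperature]
drownings = [0.2 * t + rng.gauss(0, 1) for t in temperature]

raw = pearson(ice_cream, drownings)
controlled = partial_corr(ice_cream, drownings, temperature)
print(f"raw correlation:             {raw:.2f}")
print(f"controlling for temperature: {controlled:.2f}")
```

Under these assumptions the raw correlation is clearly positive, while the partial correlation is close to zero, which is what one expects when the whole association runs through the common cause.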

Another popular example is a series of Dutch statistics showing a positive correlation between the number of storks nesting in a series of springs and the number of human babies born at that time. Of course there was no causal connection; they were correlated with each other only because they were correlated with the weather nine months before the observations.[1]

Detecting spurious relationships

The term "spurious relationship" is commonly used in statistics and in particular in experimental research techniques, both of which attempt to understand and predict direct causal relationships (X → Y). A non-causal correlation can be spuriously created by an antecedent which causes both (W → X and W → Y). Intervening variables (X → W → Y), if undetected, may make indirect causation look direct. Because of this, experimentally identified correlations do not represent causal relationships unless spurious relationships can be ruled out.

Experiments

In experiments, spurious relationships can often be identified by controlling for other factors, including those that have been theoretically identified as possible confounding factors. For example, consider a researcher trying to determine whether a new drug kills bacteria; when the researcher applies the drug to a bacterial culture, the bacteria die. But to help in ruling out the presence of a confounding variable, another culture is subjected to conditions that are as nearly identical as possible to those facing the first-mentioned culture, but the second culture is not subjected to the drug. If there is an unseen confounding factor in those conditions, this control culture will die as well, so that no conclusion of efficacy of the drug can be drawn from the results of the first culture. On the other hand, if the control culture does not die, then the researcher cannot reject the hypothesis that the drug is efficacious.

Non-experimental statistical analyses

Primarily non-experimental disciplines such as economics usually employ pre-existing data rather than experimental data to establish causal relationships and to determine that they are not spurious. The body of statistical techniques that are used in economics is referred to as econometrics, and involves substantial use of multivariate regression analysis. Typically a linear relationship such as

y = a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_k x_k + e

is postulated, in which y is the dependent variable (hypothesized to be the caused variable), x_j for j = 1, ..., k is the jth independent variable (hypothesized to be a causative variable), and e is the error term (containing the combined effects of all other causative variables, which must be uncorrelated with the included independent variables). If there is reason to believe that none of the x_j is caused by y, then estimates of the coefficients a_j are obtained. If the null hypothesis that a_j = 0 is rejected, then the alternative hypothesis that a_j \ne 0, and equivalently that x_j causes y, cannot be rejected. On the other hand, if the null hypothesis that a_j = 0 cannot be rejected, then equivalently the hypothesis of no causal effect of x_j on y cannot be rejected. Here the notion of causality is one of contributory causality: if the true value a_j \ne 0, then a change in x_j will result in a change in y unless some other causative variable(s), either included in the regression or implicit in the error term, change in such a way as to exactly offset its effect; thus a change in x_j is not sufficient to change y. Likewise, a change in x_j is not necessary to change y, because a change in y could be caused by something implicit in the error term (or by some other causative explanatory variable included in the model).
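The test of the null hypothesis that a coefficient is zero can be sketched for the simplest case of a single regressor. The code below is an illustration on simulated data, not a prescribed procedure: the function name `slope_test`, the sample size, and the coefficients are assumptions. The p-value uses a normal approximation to the t distribution, which is reasonable only for large samples.

```python
import random
from statistics import NormalDist

def slope_test(xs, ys):
    """Fit y = a0 + a1*x by least squares; return (a1, std_error, p_value).

    The two-sided p-value for the null hypothesis a1 = 0 uses a normal
    approximation, so it should be trusted only for large samples.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a0 = my - a1 * mx
    rss = sum((y - a0 - a1 * x) ** 2 for x, y in zip(xs, ys))
    se = (rss / (n - 2) / sxx) ** 0.5  # standard error of the slope
    z = a1 / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return a1, se, p

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]
y_effect = [1.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]  # true a1 = 0.5
y_none = [1.0 + rng.gauss(0, 1) for _ in x]                # true a1 = 0

for label, ys in (("true a1 = 0.5", y_effect), ("true a1 = 0", y_none)):
    a1, se, p = slope_test(x, ys)
    print(f"{label}: estimate {a1:.3f}, std. error {se:.3f}, p-value {p:.3g}")
```

A small p-value justifies rejecting a_1 = 0, but it supports a causal reading only under the conditions stated above: x is not caused by y, and the error term is uncorrelated with x.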

Regression analysis controls for other relevant variables by including them as regressors (explanatory variables). This helps to avoid false inferences of causality due to the presence of a third, underlying, variable that influences both the potentially causative variable and the potentially caused variable: its effect on the potentially caused variable is captured by directly including it in the regression, so that effect will not be picked up as a spurious effect of the potentially causative variable of interest. In addition, the use of multivariate regression helps to avoid wrongly inferring that an indirect effect of, say, x_1 (e.g., x_1 → x_2 → y) is a direct effect (x_1 → y).
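The effect of including the underlying variable as a regressor can be seen in a simulation. In the sketch below, which uses invented data and an illustrative helper named `ols`, a variable x2 causes both x1 and y, while x1 has no effect on y at all. Regressing y on x1 alone yields a sizeable coefficient; adding x2 as a regressor drives the coefficient on x1 towards zero.

```python
import random

def ols(columns, y):
    """Least-squares coefficients [a0, a1, ..., ak] for y on the given columns.

    Solves the normal equations (X'X) a = X'y by Gauss-Jordan elimination
    with partial pivoting. An intercept column is added automatically.
    """
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # pivot row
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

rng = random.Random(2)  # fixed seed so the run is repeatable
n = 3000
x2 = [rng.gauss(0, 1) for _ in range(n)]        # underlying third variable
x1 = [v + rng.gauss(0, 1) for v in x2]          # caused by x2
y = [2.0 * v + rng.gauss(0, 1) for v in x2]     # caused by x2 only, not by x1

naive = ols([x1], y)      # y regressed on x1 alone
full = ols([x1, x2], y)   # y regressed on x1 and x2

print(f"x1 alone:    a1 = {naive[1]:.2f}")
print(f"x1 with x2:  a1 = {full[1]:.2f}, a2 = {full[2]:.2f}")
```

The normal equations are solved directly here only to keep the example free of dependencies; a numerical library would normally be used for this step.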

Just as an experimenter must be careful to control for every confounding factor, by holding such factors constant throughout the experiment, so also must the user of multiple regression be careful to control for every confounding factor by including each one as an x_j variable in the regression. If a confounding factor is omitted from the regression, it exists by default in the error term, and if the latter is correlated with one (or more) of the included explanators then the regression results may be spurious.
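The consequence of omitting a confounder can be made explicit in the two-regressor case. The following is the standard omitted-variable bias calculation, given as a sketch under the assumptions that the true model has exactly two regressors and that e is uncorrelated with both. The symbols b_0, b_1 and u are introduced here for the misspecified model, whose error term u absorbs a_2 x_2 + e.

```latex
\begin{aligned}
\text{True model:}\quad
  & y = a_0 + a_1 x_1 + a_2 x_2 + e,
  \qquad \operatorname{Cov}(x_1, e) = \operatorname{Cov}(x_2, e) = 0 \\
\text{Estimated model:}\quad
  & y = b_0 + b_1 x_1 + u \\
\text{Large-sample slope:}\quad
  & \hat{b}_1 \;\to\; \frac{\operatorname{Cov}(x_1, y)}{\operatorname{Var}(x_1)}
    = a_1 + a_2 \, \frac{\operatorname{Cov}(x_1, x_2)}{\operatorname{Var}(x_1)}
\end{aligned}
```

Even when a_1 = 0, the estimated slope is nonzero whenever a_2 \ne 0 and x_1 is correlated with x_2, which is exactly the spurious result described above.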

Footnotes

  1. Roger Sapsford, Victor Jupp, eds. (2006). Data Collection and Analysis. Sage. ISBN 0-7619-4362-5.

References

  • Pearl, Judea. Causality: Models, Reasoning and Inference, Cambridge University Press, 2000.
  • Yule, G. U. "Why do we sometimes get nonsense correlations between time series? A study in sampling and the nature of time series", Journal of the Royal Statistical Society 89, 1926, 1–64.

Wikimedia Foundation. 2010.
