Data-snooping bias


In statistics, data-snooping bias is a form of statistical bias generated by the misuse of data mining techniques, which can lead to bogus results in scientific research. Although data-snooping bias can occur in any field that uses data mining, it is a particular concern in finance and medical research, both of which make heavy use of such techniques.

In the process of data mining, huge numbers of hypotheses about a single data set can be tested in a very short time, by exhaustively searching for combinations of variables that might show a correlation.

Because conventional tests of statistical significance are based on the probability that an observation arose by chance, it is reasonable to expect that, among hypotheses for which no real effect exists, about 5% will turn out to be significant at the 5% level, about 0.1% will turn out to be significant at the 0.1% level, and so on, simply by chance.

Thus, given enough hypotheses tested, it is virtually certain that some of them will appear to be highly statistically significant, even on a data set with no real correlations at all. Researchers who are using data mining techniques can be easily misled by these apparently significant results, even though they are merely chance artifacts.
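The effect can be illustrated with a minimal sketch. It assumes that the tests are independent and that every null hypothesis is true, in which case each p-value is uniformly distributed between 0 and 1. The function names are hypothetical and chosen only for this illustration.

```python
import random


def family_wise_error_rate(num_tests: int, alpha: float) -> float:
    """Chance of at least one false positive among independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** num_tests


def count_false_positives(num_tests: int, alpha: float, rng: random.Random) -> int:
    """Simulate tests where no real effect exists.

    Under a true null hypothesis the p-value is uniform on [0, 1),
    so each test is 'significant' with probability alpha.
    """
    return sum(1 for _ in range(num_tests) if rng.random() < alpha)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    for m in (1, 20, 100, 1000):
        print(m, family_wise_error_rate(m, 0.05), count_false_positives(m, 0.05, rng))
```

Under these assumptions the chance of at least one spurious finding is 1 - (1 - alpha)^m, which already exceeds 99% for 100 tests at the 5% level.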

Data-snooping bias most commonly occurs when researchers have not formed a hypothesis in advance, and are therefore open to any hypothesis the data happen to suggest; or when researchers narrow the data used in order to reduce the probability of the sample refuting a specific hypothesis.

Examples

Example 1: Hypothesis suggested by data

In a list of 367 people, at least two will have the same day and month of birth, because there are only 366 possible birthdays (counting February 29). Suppose Mary and John both celebrate birthdays on August 7.

Data snooping would, by design, try to find additional similarities between Mary and John, such as:

* Are they the youngest and the oldest persons in the list?
* Have they met in person once? Twice? Three times?
* Do their fathers have the same first name, or mothers have the same maiden name?

By going through hundreds or thousands of potential similarities between John and Mary, each having a low probability of being true, we may eventually find apparent support for virtually any hypothesis.

Perhaps John and Mary are the only two persons in the list who switched minors three times in college, a fact we found out by exhaustively comparing their life histories. Our data-snooped hypothesis can then become, "People born on August 7 have a much higher chance of switching minors more than twice in college."

The data itself very strongly supports that correlation, since no one with a different birthday had switched minors three times in college.

However, when we turn to the larger sample of the general population and attempt to reproduce the results, we find that there is no statistical correlation between August 7 birthdays and changing college minors more than twice. The "fact" exists only for a very small, specific sample, not for the public as a whole.
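The example can be imitated with a small simulation. This is a sketch under stated assumptions, not a model of any real data set: each person receives a random birthday and 2,000 hypothetical yes/no traits, each true for 5% of people and independent of birthday. For simplicity the search looks for traits that the birthday pair share, rather than traits that only they have. All names and parameter values are invented for the illustration.

```python
import random

DAYS = 366          # possible birthdays, counting February 29
NUM_TRAITS = 2000   # hypothetical yes/no facts recorded per person
TRAIT_RATE = 0.05   # assumed: each trait holds for 5% of people, whatever their birthday


def make_person(rng):
    """Return (birthday, set of trait numbers the person has)."""
    birthday = rng.randrange(DAYS)
    traits = {t for t in range(NUM_TRAITS) if rng.random() < TRAIT_RATE}
    return birthday, traits


def first_shared_birthday(people):
    """Return the indices of the first two people found with the same birthday."""
    seen = {}
    for index, (birthday, _) in enumerate(people):
        if birthday in seen:
            return seen[birthday], index
        seen[birthday] = index
    raise ValueError("no two people share a birthday")


def snoop(people):
    """Search every trait for ones that the birthday pair have in common."""
    a, b = first_shared_birthday(people)
    shared = people[a][1] & people[b][1]
    return people[a][0], sorted(shared)


def replication_rate(birthday, population_size, rng):
    """Rate of a single trait among people with this birthday in a fresh sample.

    Traits are independent of birthday, so any one trait behaves the same way.
    """
    with_birthday = 0
    with_trait = 0
    for _ in range(population_size):
        if rng.randrange(DAYS) == birthday:
            with_birthday += 1
            if rng.random() < TRAIT_RATE:
                with_trait += 1
    return with_trait / with_birthday if with_birthday else 0.0


if __name__ == "__main__":
    rng = random.Random(7)
    people = [make_person(rng) for _ in range(367)]
    birthday, shared = snoop(people)
    print("traits shared by the birthday pair:", len(shared))
    print("rate in a fresh population:", replication_rate(birthday, 200_000, rng))
```

Any trait found this way holds for both members of the pair in the original list, yet in the fresh population its rate among people with that birthday should fall back toward the assumed 5% base rate.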

Example 2: Narrow sample to match hypothesis

Suppose medical researchers examine a pool of data representing 10,000 lung cancer patients. They want to find information that suggests non-smokers who develop lung cancer have a better chance of survival than smokers with lung cancer.

The researchers notice that 90 percent of the patients (9,000) smoked cigarettes. Of these smokers, 4 percent (360 people) went into remission with no chemotherapy.

Of the 10 percent (1,000) of patients who were not smokers, 40 people -- 4 percent -- also went into remission with no chemotherapy.

The data, as it stands, suggests that smokers are as likely as non-smokers to go into remission without chemotherapy. But the result is not what the researchers desire, so they reduce the sample size to 1,000 patients, to see if that produces different results.

The new data retains the 90 percent smoker rate (900). Of these smokers, 36 people -- 4 percent -- go into remission without chemotherapy.

However, the new sample of non-smoking patients (100) retains 16 of the 40 people from the original sample who went into remission without chemotherapy. That is 16 percent of the new sample size.

The researchers therefore claim that non-smokers with lung cancer are four times more likely to go into remission without chemotherapy than smokers are.

By reducing the sample size without regard to statistical significance, after the original sample suggested there is no difference in untreated remission rates, the researchers have produced numbers that seem to bear out the desired result.
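The arithmetic of this example can be checked with a short sketch. The counts are taken from the text above. The final calculation is an added assumption for illustration: it asks how likely it would be for a purely random draw of 100 of the 1,000 non-smokers to include at least 16 of the 40 who went into remission. The function names are hypothetical.

```python
from math import comb


def remission_rate(remissions: int, patients: int) -> float:
    """Fraction of patients who went into remission without chemotherapy."""
    return remissions / patients


def prob_at_least(k_min: int, population: int, successes: int, draws: int) -> float:
    """Hypergeometric tail: chance that a random draw without replacement
    contains at least k_min of the 'success' cases."""
    total = comb(population, draws)
    upper = min(successes, draws)
    return sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(k_min, upper + 1)
    ) / total


# Original sample of 10,000 patients
full_smokers = remission_rate(360, 9000)
full_non_smokers = remission_rate(40, 1000)

# Reduced sample of 1,000 patients
sub_smokers = remission_rate(36, 900)
sub_non_smokers = remission_rate(16, 100)

if __name__ == "__main__":
    print("full sample:", full_smokers, full_non_smokers)
    print("reduced sample:", sub_smokers, sub_non_smokers)
    print("claimed ratio:", sub_non_smokers / sub_smokers)
    # Assumes the 100 non-smokers were drawn at random, which the example implies they were not.
    print("chance under random draw:", prob_at_least(16, 1000, 40, 100))
```

A random draw of 100 non-smokers would be expected to contain about 4 of the 40 remission cases, so a subsample containing 16 is far more consistent with deliberate selection than with chance.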

External links

* [http://data-snooping.martinsewell.com/ A bibliography on data-snooping bias]


Wikimedia Foundation. 2010.
