Missing data

In statistics, missing data, or missing values, occur when no data value is stored for a variable in an observation. Missing data are common and can have a significant effect on the conclusions that can be drawn from the data.

Contents

1 Types of missing data

2 Techniques of dealing with missing data

2.1 Imputation

2.2 Partial imputation

2.3 Partial deletion

2.4 Full analysis

3 See also

4 References

5 Further reading

6 External links

6.1 Background

6.2 Software

Types of missing data

Missing data can occur because of nonresponse: either no information is provided for several items (item nonresponse), or no information is provided for a whole unit (unit nonresponse). Some items are more prone to nonresponse than others, for example items about private subjects such as income.

Dropout is a type of missingness that occurs mostly in studies of development over time. In this type of study the measurement is repeated after a certain period of time. Missingness occurs when participants drop out before the study ends, so that one or more measurements are missing.

Sometimes missing values are caused by the researchers themselves, for example when data collection was not done properly or mistakes were made during data entry (Adèr and Mellenbergh 2008).

Missing data are also common in cross-national research in economics, sociology, and political science, because governments choose not to, or fail to, report critical statistics for one or more years (Messner 1992).

It is important to ask why the data are missing, because the answer can help in finding a solution to the problem. If the values are missing at random, there is still information about each variable in each unit, but if the values are missing systematically the problem is more severe, because the sample may no longer be representative of the population. For example, suppose a study examines the relation between IQ and income. If participants with an above-average IQ do not answer the question ‘What is your salary?’, the results may show no association between IQ and salary when in fact there is one. Because of these problems, methodologists routinely advise researchers to design research so as to minimize the incidence of missing values (Adèr and Mellenbergh 2008).
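The IQ and salary example can be illustrated with a small simulation. This is only a sketch under invented assumptions: the population parameters, the noise level, and the rule that every respondent with an IQ above 100 withholds their salary are all hypothetical and are not taken from any study. In this particular setup the association is weakened rather than removed entirely, and the mean salary of the respondents is biased downward.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: salary rises with IQ, plus unrelated noise.
iq = [rng.gauss(100, 15) for _ in range(2000)]
salary = [30000 + 400 * (q - 100) + rng.gauss(0, 8000) for q in iq]

# Systematic missingness (assumed rule): respondents with an
# above-average IQ do not report their salary.
observed = [(q, s) for q, s in zip(iq, salary) if q <= 100]
obs_iq = [q for q, _ in observed]
obs_salary = [s for _, s in observed]

r_full = pearson(iq, salary)
r_observed = pearson(obs_iq, obs_salary)
mean_full = sum(salary) / len(salary)
mean_observed = sum(obs_salary) / len(obs_salary)

print(f"correlation, complete data:    {r_full:.2f}")
print(f"correlation, respondents only: {r_observed:.2f}")
print(f"mean salary, complete data:    {mean_full:.0f}")
print(f"mean salary, respondents only: {mean_observed:.0f}")
```

Changing the missingness rule to a random draw that ignores IQ leaves both statistics close to their complete-data values, which is the contrast the paragraph above describes.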

Techniques of dealing with missing data

Missing data reduce the representativeness of the sample and can therefore distort inferences about the population. Where possible, think about how to prevent missing data before the actual data gathering takes place. For example, computer questionnaires often do not allow a question to be skipped: a question has to be answered before the participant can continue to the next one. This type of questionnaire therefore eliminates missing values caused by participants skipping items. In survey research, it is common to make multiple efforts to contact each individual in the sample, often sending letters to attempt to persuade those who have decided not to participate to change their minds (Stoop et al. 2010: 161-187). However, such techniques can either help or hurt in terms of reducing the negative inferential effects of missing data, because the kind of people who are willing to be persuaded to participate after initially refusing or not being home are likely to be significantly different from the kinds of people who will still refuse or remain unreachable after additional effort (Stoop et al. 2010: 188-198).

In situations where missing data are likely to occur, the researcher is often advised to plan to use methods of data analysis that are robust to missingness. An analysis is robust when we are confident that mild to moderate violations of the technique's key assumptions will produce little or no bias or distortion in the conclusions drawn about the population.

Imputation

Main article: Imputation (statistics)

If it is known that the data analysis technique to be used is not robust to missing data, it is good to consider imputing the missing values. This can be done in several ways; the recommended approach is multiple imputation. Rubin argued that even a small number, m, of repeated imputations (m of 5 or fewer) improves the quality of estimation enormously (in: Adèr and Mellenbergh 2008). For most practical purposes 2 or 3 imputations capture most of the relative efficiency that could be captured with a larger number of imputations. However, low values of m can lead to a substantial loss of statistical power, and some scholars now recommend that m be set to values from 20 to 100 or more (Graham, Olchowski, and Gilreath 2007). Any multiply imputed data analysis has to be repeated for each of the m imputed data sets and, in some cases, the relevant statistics have to be combined in a relatively complicated way (Adèr and Mellenbergh 2008).
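The two quantitative points in this paragraph can be sketched in a few lines. The first function uses the commonly quoted relative-efficiency formula 1 / (1 + gamma / m), where gamma is the fraction of missing information; the second combines m estimates with the pooling rules usually attributed to Rubin. The function names and all numbers below are illustrative assumptions, not values from the cited sources.

```python
import statistics


def relative_efficiency(gamma, m):
    """Efficiency of m imputations relative to infinitely many.

    gamma is the fraction of missing information (between 0 and 1).
    """
    return 1.0 / (1.0 + gamma / m)


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors.

    Returns the pooled estimate and its total variance, which adds the
    average within-imputation variance to the inflated
    between-imputation variance.
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)
    between = statistics.variance(estimates)  # denominator m - 1
    total = within + (1 + 1 / m) * between
    return pooled, total


# Assumed fraction of missing information: 30 percent.
for m in (3, 5, 20, 100):
    print(m, round(relative_efficiency(0.3, m), 3))

# Hypothetical results from m = 3 imputed data sets.
estimate, total_variance = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(estimate, total_variance)
```

The loop shows why a handful of imputations already looks efficient on this measure; the recommendation of larger m in the text rests on statistical power, which this formula does not capture.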

Partial imputation

The expectation-maximization algorithm is an approach in which the statistics that would be computed if a complete dataset were available are estimated (imputed), taking into account the pattern of missing data. In this approach, values for individual missing data items are not usually imputed.
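A minimal sketch of this idea follows, assuming a bivariate normal model in which x is always observed and y is missing for some units, with missingness depending only on x. The expectation step fills in the expected sums involving y rather than individual values, and the maximization step recomputes the mean, variance, and covariance from those sums. The function name, the simulated data, and the missingness rule are all hypothetical.

```python
import random


def em_bivariate(xs, ys, iters=500):
    """EM estimates for (x, y) pairs where some y values are None."""
    n = len(xs)
    y_obs = [y for y in ys if y is not None]

    # The x statistics never change because x is fully observed.
    mx = sum(xs) / n
    vx = sum((x - mx) ** 2 for x in xs) / n

    # Starting values: observed-case mean and variance, zero covariance.
    my = sum(y_obs) / len(y_obs)
    vy = sum((y - my) ** 2 for y in y_obs) / len(y_obs)
    cxy = 0.0

    for _ in range(iters):
        # Regression of y on x implied by the current estimates.
        slope = cxy / vx
        intercept = my - slope * mx
        resid_var = vy - slope * cxy

        # E-step: expected sums of y, y^2 and x*y over all units.
        sy = syy = sxy = 0.0
        for x, y in zip(xs, ys):
            if y is None:
                ey = intercept + slope * x
                eyy = ey * ey + resid_var
            else:
                ey, eyy = y, y * y
            sy += ey
            syy += eyy
            sxy += x * ey

        # M-step: update the moments from the expected sums.
        my = sy / n
        vy = syy / n - my * my
        cxy = sxy / n - mx * my

    return {"mean_x": mx, "mean_y": my, "var_x": vx, "var_y": vy, "cov_xy": cxy}


rng = random.Random(1)  # fixed seed so the sketch is reproducible
xs = [rng.gauss(0, 1) for _ in range(400)]
ys_full = [2 + 0.5 * x + rng.gauss(0, 1) for x in xs]
ys = [y if x <= 0.3 else None for x, y in zip(xs, ys_full)]  # assumed rule

result = em_bivariate(xs, ys)
y_obs = [y for y in ys if y is not None]
print("observed-case mean of y:", sum(y_obs) / len(y_obs))
print("EM estimate of mean of y:", result["mean_y"])
```

Because y is missing for the units with larger x, the observed-case mean of y is pulled downward; the EM estimate corrects for this by using the relationship between x and y in the complete cases.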

Partial deletion

Methods which involve reducing the data available to a dataset having no missing values include:

Listwise deletion/casewise deletion (albeit a naive solution)

Pairwise deletion (albeit a naive solution)
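The difference between the two deletion methods can be sketched as follows. The records and variable names are invented for illustration, with `None` marking a missing value. Listwise deletion keeps only the units with no missing value at all, while pairwise deletion keeps, for each pair of variables, every unit in which both members of that pair are observed.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "iq": 110},
    {"age": 41, "income": None, "iq": 125},
    {"age": None, "income": 38000, "iq": 98},
    {"age": 29, "income": 45000, "iq": None},
    {"age": 50, "income": 61000, "iq": 105},
]


def listwise(rows):
    """Keep only the rows in which every variable is observed."""
    return [r for r in rows if all(v is not None for v in r.values())]


def pairwise(rows, a, b):
    """Keep the (a, b) value pairs in which both variables are observed."""
    return [(r[a], r[b]) for r in rows if r[a] is not None and r[b] is not None]


print("listwise rows:", len(listwise(rows)))
for a, b in [("age", "income"), ("age", "iq"), ("income", "iq")]:
    print(f"pairwise {a}/{b}:", len(pairwise(rows, a, b)))
```

Pairwise deletion retains more data for each statistic, but different statistics are then based on different subsets of units, which is one reason both methods are described above as naive.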

Full analysis

Methods which take full account of all information available, without the distortion resulting from using imputed values as if they were actually observed:

The expectation-maximization algorithm

Full information maximum likelihood estimation

See also

Censoring (statistics)

Indicator variable

References

Adèr, H.J. (2008). "Chapter 13: Missing data". In Adèr, H.J., & Mellenbergh, G.J. (Eds.) (with contributions by Hand, D.J.), Advising on Research Methods: A consultant's companion (pp. 305-332). Huizen, The Netherlands: Johannes van Kessel Publishing. ISBN 9079418013

Graham, J.W., Olchowski, A.E., and Gilreath, T.D. (2007) "How Many Imputations Are Really Needed? Some Practical Clarifications of Multiple Imputation Theory". Prevention Science 8 (3), 206-213. doi:10.1007/s11121-007-0070-9

Messner, S.F. (1992) "Exploring the Consequences of Erratic Data Reporting for Cross-National Research on Homicide". Journal of Quantitative Criminology 8 (2), pp. 155-173.

Stoop, I., Billiet, J., Koch, A., and Fitzgerald, R. (2010) Improving Survey Response: Lessons Learned from the European Social Survey. Wiley. ISBN 0470516690

Zarate, L.E., Nogueira, B.M., Santos, T.R.A., and Song, M.A.J. (2006) "Techniques for Missing Value Recovering in Imbalanced Databases: Application in a Marketing Database with Massive Missing Data". IEEE International Conference on Systems, Man and Cybernetics, 2006 (SMC '06), vol. 3, pp. 2658–64. doi:10.1109/ICSMC.2006.385265. http://ieeexplore.ieee.org/xpls/abs_all.jsp?tp=&arnumber=4274271&isnumber=4274116.

Further reading

Little, Roderick J. A.; Rubin, Donald B. (2002). Statistical Analysis with Missing Data (2nd ed.). New York: Wiley. ISBN 0-471-18386-5.

Enders, Craig K. (2010). Applied Missing Data Analysis (1st ed.). New York: Guilford Press. ISBN 978-1-60623-639-0.

Allison, Paul D. (2001). Missing Data (1st ed.). Thousand Oaks: Sage Publications, Inc. ISBN 978-0761916727.

Acock AC (2005). "Working With Missing Values". Journal of Marriage and Family 67 (4): 1012–28. doi:10.1111/j.1741-3737.2005.00191.x. http://www3.interscience.wiley.com/journal/118686888/abstract.

Van den Broeck J, Cunningham SA, Eeckels R, Herbst K (October 2005). "Data cleaning: detecting, diagnosing, and editing data abnormalities". PLoS Med. 2 (10): e267. doi:10.1371/journal.pmed.0020267. PMC 1198040. PMID 16138788. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=1198040.

Schafer, J. L.; Graham, J. W. (2002). "Missing data: Our view of the state of the art". Psychological Methods 7 (2): 147–177. doi:10.1037/1082-989X.7.2.147. PMID 12090408.

Graham, John W. (2009). "Missing Data Analysis: Making It Work in the Real World". Annual Review of Psychology 60: 549–576.

Rubin DB (1976). "Inference and missing data". Biometrika 63 (3): 581–92. doi:10.1093/biomet/63.3.581. http://biomet.oxfordjournals.org/content/63/3/581.short.

External links

Background

Missing values-envision

psychwiki.com: Missing Values, Identifying Missing Values, and Dealing with Missing Values

missingdata.org.uk, Medical Statistics Unit, London School of Hygiene & Tropical Medicine

Software

Mplus

PROC MI and PROC MIANALYZE - SAS

SPSS

