Exploratory data analysis

Exploratory data analysis (EDA) is an approach to analyzing data for the purpose of formulating hypotheses worth testing, complementing the tools of conventional statistics for testing hypotheses. ["And roughly the only mechanism for suggesting questions is exploratory. And once they’re suggested, the only appropriate question would be how strongly supported are they and particularly how strongly supported are they by new data. And that’s confirmatory.", A conversation with John W. Tukey and Elizabeth Tukey, Luisa T. Fernholz and Stephan Morgenthaler, Statistical Science Volume 15, Number 1 (2000), 79-94.] It was so named by John Tukey.

EDA development

Tukey held that too much emphasis in statistics was placed on statistical hypothesis testing (confirmatory data analysis); more emphasis needed to be placed on using data to suggest hypotheses to test. In particular, he held that confusing the two types of analyses and employing them on the same set of data can lead to systematic bias owing to the issues inherent in testing hypotheses suggested by the data.
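One common way to keep the two kinds of analysis apart is to set aside part of the data before exploring, so that any hypothesis suggested by the exploratory part can later be tested on observations that did not suggest it. The sketch below is only an illustration of that idea, not a procedure taken from Tukey; the function name `split_explore_confirm`, the 50/50 split and the fixed seed are all assumptions made for the example.

```python
import random

def split_explore_confirm(rows, explore_fraction=0.5, seed=0):
    """Randomly partition rows into an exploratory part and a held-out part.

    Hypothetical helper: the name, default fraction and seed are illustrative.
    """
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(rows)          # copy, so the caller's data is not reordered
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * explore_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in data: 100 observation identifiers.
observations = list(range(100))
explore, confirm = split_explore_confirm(observations)

# Explore freely on `explore`; test any resulting hypothesis only on `confirm`.
print(len(explore), len(confirm))
```

The held-out part plays the role of the "new data" in the confirmatory step; collecting genuinely new data serves the same purpose.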

The objectives of EDA are to:
*Suggest hypotheses about the causes of observed phenomena
*Assess assumptions on which statistical inference will be based
*Support the selection of appropriate statistical tools and techniques
*Provide a basis for further data collection through surveys or experiments

Tukey's books were notoriously opaque, and so several attempts were made to popularise his EDA ideas. Prominent among these was the Statistics in Society (MDST242) course of The Open University.

Many EDA techniques have been adopted into data mining and are being taught to young students as a way to introduce them to statistical thinking. [Konold, C. (1999). Statistics goes to school. "Contemporary Psychology", "44(1)", 81-82.]

Techniques

There are a number of tools that are useful for EDA, but EDA is characterized more by the attitude taken than by particular techniques. ["Exploratory data analysis is an attitude, a flexibility, and a reliance on display, NOT a bundle of techniques, and should be so taught.", John W. Tukey, We need both exploratory and confirmatory, "The American Statistician", "34(1)", (Feb., 1980), pp. 23-25.]

The principal graphical techniques used in EDA are:

*Box plot
*Histogram
*MultiVari chart
*Run chart
*Pareto chart
*Scatter plot
*Stem-and-leaf plot
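As an illustration of how simple some of these displays are, a stem-and-leaf plot can be produced with a few lines of code. The sketch below assumes non-negative integers, uses the tens as stems and the units as leaves, and the function name `stem_and_leaf` is hypothetical; real implementations also handle leaf units, split stems and negative values.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a basic stem-and-leaf display.

    Assumes non-negative integers; stem = value // 10, leaf = value % 10.
    """
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    lines = []
    # Walk every stem between the smallest and largest so gaps stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(d) for d in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}".rstrip())
    return lines

sample = [12, 15, 21, 23, 23, 30, 47, 48]
for line in stem_and_leaf(sample):
    print(line)
```

Unlike a histogram, the display keeps every digit of the data, so the individual values can still be read off the plot.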

The principal quantitative techniques are:

*Median polish
*Trimean
*Letter values
*Resistant line
*Resistant smooth
*Rootogram
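To give a flavour of these resistant summaries, the trimean can be computed as a weighted average of the median and the two hinges, (lower hinge + 2 × median + upper hinge) / 4. The sketch below is a minimal version under one stated assumption: the hinges are taken as the medians of the lower and upper halves of the sorted data, with the overall median included in both halves when the number of values is odd. Other quartile conventions give slightly different results, and the function names are illustrative.

```python
from statistics import mean, median

def hinges(values):
    """Lower and upper hinges: medians of the lower and upper halves.

    Assumption: when n is odd, the overall median belongs to both halves.
    """
    xs = sorted(values)
    n = len(xs)
    half = (n + 1) // 2
    return median(xs[:half]), median(xs[n - half:])

def trimean(values):
    """Trimean: (lower hinge + 2 * median + upper hinge) / 4."""
    lower, upper = hinges(values)
    return (lower + 2 * median(values) + upper) / 4

data = [1, 2, 3, 4, 5, 6, 7, 8, 100]   # one wild value
print(trimean(data))                    # barely affected by the outlier
print(mean(data))                       # pulled strongly toward the outlier
```

Comparing the two printed values shows why such summaries are called resistant: a single extreme observation shifts the mean substantially but leaves the trimean where the bulk of the data lies.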

Techniques that are both graphical and quantitative include:

*Multidimensional scaling
*Ordination

History

Many EDA ideas can be traced back to earlier authors, for example:
* Francis Galton emphasized order statistics and quantiles.
* Arthur Bowley used precursors of the stemplot and five-number summary. Bowley actually used a "seven-figure summary" comprising the extremes, deciles and quartiles along with the median; in his "Elementary Manual of Statistics" (3rd edn., 1920), p. 62, he defines "the maximum and minimum, median, quartiles and two deciles" as the "seven positions".
* Andrew Ehrenberg articulated a philosophy of data reduction (see his book of the same name).

The Open University course "Statistics in Society" (MDST242) took the above ideas and merged them with Gottfried Noether's work, which introduced statistical inference via coin-tossing and the median test.

For details of the above, see John Bibby's book "HOTS: History of Teaching Statistics".

Software

* CMU-DAP (Carnegie-Mellon University Data Analysis Package, FORTRAN source for EDA tools with English-style command syntax, 1977).
* Data Desk, an EDA package from [http://www.datadesk.com/ Data Description] of Ithaca, New York.
* Fathom (for high-school and intro college courses).
* JMP, an EDA package from SAS Institute.
* LiveGraph (free real-time data series plotter).
* TinkerPlots (for upper elementary and middle school students).
* SOCR provides a large number of free Internet-accessible [http://socr.ucla.edu/htmls/SOCR_Charts.html tools for EDA].

See also

*Anscombe's quartet, on importance of exploration
*Predictive analytics
*Structured data analysis (statistics)

Bibliography

*Hoaglin, D C; Mosteller, F & Tukey, John Wilder (Eds) (1985) "Exploring Data Tables, Trends and Shapes" ISBN 0-471-09776-4
*Hoaglin, D C; Mosteller, F & Tukey, John Wilder (Eds) (1983) "Understanding Robust and Exploratory Data Analysis" ISBN 0-471-09777-2
*Tukey, John Wilder (1977) "Exploratory Data Analysis" Addison-Wesley ISBN 0-201-07616-0
*Velleman, P F & Hoaglin, D C (1981) "Applications, Basics and Computing of Exploratory Data Analysis" ISBN 0-87150-409-X

References

*Leinhardt, G., Leinhardt, S., "Exploratory Data Analysis: New Tools for the Analysis of Empirical Data", Review of Research in Education, Vol. 8, 1980 (1980), pp. 85-157.

External links

* [http://visalix.xrce.xerox.com Visalix] (free interactive web application for EDA)
* [http://www.datadesk.com DataDesk] (free-to-try commercial EDA software for Mac and PC)
* [http://www.ggobi.org/ GGobi] (free interactive multivariate visualization software linked to R)
* [http://stats.math.uni-augsburg.de/Manet/ MANET] (free Mac-only interactive EDA software)
* [http://www.miner3D.com Miner3D] (EDA and visualization software)
* [http://www.rosuda.org/Mondrian/ Mondrian] (free interactive software for EDA)
* [http://www.ailab.si/Orange/ Orange] (free component-based software for interactive EDA and machine learning)
* [http://www.visualstats.org ViSta] (free interactive software based on Xlisp-Stat for EDA)
* [http://www.VisuMap.net/ VisuMap] (EDA software for high dimensional non-linear data)
* [http://www.inf.ethz.ch/personal/hinterbe/Visulab/ Visulab] (free interactive software for high dimensional non-spatial / non-temporal data with interactive EDA and visualization)
* [http://www.cs.uiowa.edu/~luke/xls/xlsinfo/xlsinfo.html XLisp-Stat] (free software and Lisp based EDA development framework for Mac, PC and X Window)
* [http://www.wolfram.com/products/applications/eda/ Experimental Data Analyst] (Mathematica application package for EDA)
* [http://factominer.free.fr/ FactoMineR] (free exploratory multivariate data analysis software linked to R)


Wikimedia Foundation. 2010.
