Descriptive statistics
Descriptive statistics quantitatively describe the main features of a collection of data.[1] Descriptive statistics are distinguished from inferential statistics (or inductive statistics) in that descriptive statistics aim to summarize a data set, rather than use the data to learn about the population that the data are thought to represent. This generally means that descriptive statistics, unlike inferential statistics, are not developed on the basis of probability theory.[2] Even when a data analysis draws its main conclusions using inferential statistics, descriptive statistics are generally also presented. For example, in a paper reporting on a study involving human subjects, there typically appears a table giving the overall sample size, sample sizes in important subgroups (e.g., for each treatment or exposure group), and demographic or clinical characteristics such as the average age, the proportion of subjects of each sex, and the proportion of subjects with related comorbidities.
Use in statistical analyses
Descriptive statistics provide simple summaries about the sample and the measures. Together with simple graphical analysis, they form the basis of quantitative analysis of data.[citation needed]
Descriptive statistics summarize data. For example, the shooting percentage in basketball is a descriptive statistic that summarizes the performance of a player or a team. This number is the number of shots made divided by the number of shots taken. A player who shoots 33% is making approximately one shot in every three. One making 25% is hitting once in four. The percentage summarizes or describes multiple discrete events. Or, consider the scourge of many students, the grade point average. This single number describes the general performance of a student across the range of their course experiences. [3]
Describing a large set of observations with a single indicator risks distorting the original data or losing important detail. For example, the shooting percentage doesn't tell you whether the shots are three-pointers or lay-ups, and GPA doesn't tell you whether the student was in difficult or easy courses. Despite these limitations, descriptive statistics provide a powerful summary that may enable comparisons across people or other units. [3]
Univariate analysis
Univariate analysis involves the examination across cases of a single variable, focusing on three characteristics: the distribution; the central tendency; and the dispersion. It is common to compute all three for each study variable.
Distribution
The distribution is a summary of the frequency of individual values or ranges of values for a variable. The simplest distribution lists every value of a variable and the number of cases that had that value. For instance, computing the distribution of gender in the study population means computing the percentages that are male and female. The gender variable here has only two values, making it possible and meaningful to list each one. However, this does not work for a variable such as income that has many possible values. Typically, specific values are not particularly meaningful (an income of 50,000 is typically not meaningfully different from 51,000). Grouping the raw scores using ranges of values reduces the number of categories to something more meaningful. For instance, we might group incomes into ranges of 0-10,000, 10,001-30,000, etc.
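The two kinds of frequency distribution described above can be sketched in a few lines of Python. This is only an illustration: the data values are hypothetical, and the open-ended third income range (30,001 and up) is an assumption, since the text gives only the first two ranges.

```python
from collections import Counter

# Hypothetical categorical data (not taken from any study).
genders = ["female", "male", "female", "female", "male"]
gender_counts = Counter(genders)
# Convert raw counts to percentages of all cases.
gender_percent = {value: 100 * count / len(genders)
                  for value, count in gender_counts.items()}

# Hypothetical whole-number incomes, grouped into ranges.
# The first two ranges follow the text; the third is an assumption.
incomes = [4_500, 12_000, 28_000, 9_800, 51_000, 30_000]
bins = [(0, 10_000), (10_001, 30_000), (30_001, float("inf"))]


def bin_label(income):
    """Return the label of the range that contains the given income."""
    for low, high in bins:
        if low <= income <= high:
            return f"{low}-{high}"
    raise ValueError(f"income not covered by any range: {income}")


income_counts = Counter(bin_label(x) for x in incomes)
print(gender_percent)
print(income_counts)
```

Listing every value works for the two-valued variable, while the income variable only becomes readable once its values are grouped.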
Frequency distributions are depicted as a table or as a graph. Table 1 shows an age frequency distribution with five categories of age ranges defined. The same frequency distribution can be depicted in a graph as shown in Figure 2. This type of graph is often referred to as a histogram or bar chart.
Central tendency
The central tendency of a distribution locates the "center" of a distribution of values. The three major types of estimates of central tendency are the mean, the median, and the mode.
The mean is the most commonly used method of describing central tendency. To compute the mean, take the sum of the values and divide by the count. For example, the mean quiz score is determined by summing all the scores and dividing by the number of students taking the quiz. Consider the following test score values:
15, 20, 21, 36, 15, 25, 15
The sum of these 7 values is 147, so the mean is 147 / 7 = 21.
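As a quick check of this arithmetic, here is a minimal Python sketch using the seven scores from the text (the variable names are illustrative):

```python
scores = [15, 20, 21, 36, 15, 25, 15]

total = sum(scores)          # sum of the values
mean = total / len(scores)   # divide by the count
print(total, mean)           # 147 21.0
```

The standard library function `statistics.mean` gives the same value.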
The median is the score found at the middle of the set of values, i.e., the value that has as many cases above it as below it. One way to compute the median is to sort the values in numerical order, and then locate the value in the middle of the list. For example, if there are 500 values, the median is the average of the two values in the 250th and 251st positions. If there are 499 values, the value in the 250th position is the median. Sorting the 7 scores above produces:
15, 15, 15, 20, 21, 25, 36
There are 7 scores and score #4 represents the halfway point. The median is 20. If there are an even number of observations, then the median is the mean of the two middle scores. In the example, if there were an 8th observation, with a value of 25, the median becomes the average of the 4th and 5th scores, in this case 20.5.
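The sort-and-pick-the-middle procedure can be sketched as follows. The helper name `median` is illustrative; it mirrors what the standard library's `statistics.median` does.

```python
def median(values):
    """Return the middle value, or the mean of the two middle values."""
    if not values:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists.
        return ordered[middle]
    # Even count: average the two values around the midpoint.
    return (ordered[middle - 1] + ordered[middle]) / 2


scores = [15, 20, 21, 36, 15, 25, 15]
median_of_seven = median(scores)          # 20
median_of_eight = median(scores + [25])   # 20.5
print(median_of_seven, median_of_eight)
```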
The mode is the most frequently occurring value in the set. To determine the mode, compute the distribution as above. The mode is the value with the greatest frequency. In the example, the modal value, 15, occurs three times. In some distributions there is a "tie" for the highest frequency, i.e., there are multiple modal values. These are called multi-modal distributions.
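A small sketch of finding the mode from a frequency count, including the tie case. The function name `modes` is illustrative, and the second data set is hypothetical, chosen only to produce a tie.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


scores = [15, 20, 21, 36, 15, 25, 15]
print(modes(scores))            # [15]
print(modes([1, 1, 2, 2, 3]))   # [1, 2], a multi-modal case
```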
Notice that the three measures typically produce different results: in the example, the mean is 21, the median is 20, and the mode is 15. The term "average" obscures the difference between them and is better avoided. The three values are equal if the distribution is perfectly "normal" (i.e., bell-shaped).
Dispersion
Dispersion is the spread of values around the central tendency. There are two common measures of dispersion, the range and the standard deviation. The range is simply the highest value minus the lowest value. In our example distribution, the high value is 36 and the low is 15, so the range is 36 − 15 = 21.
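The range computation is a one-liner in Python, shown here with the example scores:

```python
scores = [15, 20, 21, 36, 15, 25, 15]

# Range: highest value minus lowest value.
value_range = max(scores) - min(scores)
print(value_range)   # 21
```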
The standard deviation is a more accurate and detailed estimate of dispersion because a single extreme value can greatly exaggerate the range (as is true in this example, where the value of 36 stands apart from the rest of the values). The standard deviation shows the relation that a set of scores has to the mean of the sample. Again, let's take the set of scores:
15, 20, 21, 36, 15, 25, 15
To compute the standard deviation, we first find the distance between each value and the mean. We know from above that the mean is 21. So, the differences from the mean are:
- 15 − 21 = −6
- 20 − 21 = −1
- 21 − 21 = 0
- 36 − 21 = 15
- 15 − 21 = −6
- 25 − 21 = +4
- 15 − 21 = −6
Notice that values that are below the mean have negative differences and values above it have positive ones. Next, we square each difference:
- (−6)² = 36
- (−1)² = 1
- (0)² = 0
- (15)² = 225
- (−6)² = 36
- (+4)² = 16
- (−6)² = 36
Now, we take these "squares" and sum them to get the sum of squares (SS) value. Here, the sum is 350. Next, we divide this sum by the number of scores minus 1. Here, the result is 350 / 6 ≈ 58.33. This value is known as the variance. To get the standard deviation, we take the square root of the variance (remember that we squared the deviations earlier). This gives √58.33 ≈ 7.64.
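The steps above can be followed one at a time in Python. This is a sketch of the sample standard deviation (dividing by the number of scores minus 1) as described in the text; the variable names are illustrative.

```python
import math

scores = [15, 20, 21, 36, 15, 25, 15]

mean = sum(scores) / len(scores)                # 21.0
deviations = [x - mean for x in scores]         # distance of each score from the mean
squared = [d ** 2 for d in deviations]          # squaring removes the signs
sum_of_squares = sum(squared)                   # SS = 350.0
variance = sum_of_squares / (len(scores) - 1)   # divide by n - 1
std_dev = math.sqrt(variance)                   # undo the squaring

print(sum_of_squares, variance, std_dev)
```

The result agrees with `statistics.stdev(scores)` from the standard library, which uses the same n − 1 divisor.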
Although this computation may seem convoluted, it's actually quite simple. In English, we can describe the standard deviation as:
"the square root of the sum of the squared deviations from the mean divided by the number of scores minus one"
The standard deviation allows us to reach some conclusions about specific scores in our distribution. Assuming that the distribution of scores is close to "normal", the following conclusions can be reached:
- approximately 68% of the scores in the sample fall within one standard deviation of the mean
- approximately 95% of the scores in the sample fall within two standard deviations of the mean
- approximately 99.7% of the scores in the sample fall within three standard deviations of the mean
For instance, since the mean in our example is 21 and the standard deviation is 7.64, we can estimate from the above that approximately 95% of the scores will fall in the range of 21 − (2 × 7.64) to 21 + (2 × 7.64), or between 5.72 and 36.28. Values beyond two standard deviations from the mean can be considered "outliers". The value 36 comes closest: it lies about 1.96 standard deviations above the mean, just inside this range, so by this rule no score in our distribution qualifies. Flagging outliers helps identify observations that merit further analysis or that may reflect problems in the data. Expressing values in units of standard deviations from the mean (standard scores) also allows measures on very different scales, such as height and weight, to be compared.
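A minimal sketch of the two-standard-deviation range and of standard scores for the example data. It uses the sample standard deviation from the standard library; the variable names are illustrative.

```python
import statistics

scores = [15, 20, 21, 36, 15, 25, 15]

mean = statistics.mean(scores)
std_dev = statistics.stdev(scores)   # sample standard deviation (n - 1 divisor)

# Range expected to hold roughly 95% of scores if the data are close to normal.
low, high = mean - 2 * std_dev, mean + 2 * std_dev

# Standard score: how many standard deviations a value lies from the mean.
z_scores = [(x - mean) / std_dev for x in scores]

# Scores falling outside the two-standard-deviation range, if any.
outside = [x for x in scores if not low <= x <= high]

print(round(low, 2), round(high, 2))
print([round(z, 2) for z in z_scores])
print(outside)
```

With only seven scores, the normal-distribution percentages are a rough guide at best. Here the standard score of 36 comes out at roughly 1.96.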
Other statistics
In research involving comparisons between groups, emphasis is often placed on the significance level for the hypothesis that the groups being compared differ to a degree greater than would be expected by chance. This significance level is often represented as a p-value, or sometimes as the standard score of a test statistic. In contrast, an effect size conveys the estimated magnitude and direction of the difference between groups, without regard to whether the difference is statistically significant. Reporting significance levels without effect sizes is problematic, since for large sample sizes even small effects of little practical importance can be statistically significant.
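As one illustration of an effect size, the sketch below computes a standardized mean difference (commonly called Cohen's d) using a pooled standard deviation. The choice of this particular measure is an assumption, since the text does not name one, and the two groups are hypothetical data invented for the example.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Standardized mean difference between two groups (pooled SD)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)   # sample variance (n - 1 divisor)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Hypothetical scores for two groups (not from any real study).
treatment = [24, 27, 30, 26, 28]
control = [22, 25, 24, 23, 26]

d = cohens_d(treatment, control)
print(round(d, 2))
```

The sign of the result gives the direction of the difference and its size gives the magnitude in standard deviation units, independently of any significance test.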
Examples of descriptive statistics
Most statistics can be used either as a descriptive statistic, or in an inductive analysis. For example, we can report the average reading test score for the students in each classroom in a school, to give a descriptive sense of the typical scores and their variation. If we perform a formal hypothesis test on the scores, we are doing inductive rather than descriptive analysis.
Some statistical summaries are especially common in descriptive analyses. Some examples follow.
- Measures of central tendency
- Measures of dispersion
- Measures of association
- Cross-tabulation, contingency table
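A cross-tabulation counts how often each combination of two categorical variables occurs. The following sketch builds one with the standard library; the records and category names are hypothetical.

```python
from collections import Counter

# Hypothetical records: (group, outcome) for each subject.
records = [
    ("treated", "improved"),
    ("treated", "improved"),
    ("treated", "unchanged"),
    ("control", "improved"),
    ("control", "unchanged"),
    ("control", "unchanged"),
]

# Count each (group, outcome) combination.
table = Counter(records)
groups = sorted({group for group, _ in records})
outcomes = sorted({outcome for _, outcome in records})

# Print one row per group, one column per outcome.
print("group", outcomes)
for group in groups:
    print(group, [table[(group, outcome)] for outcome in outcomes])
```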
Notes
- [1] (1995) Introductory Statistics, 2nd Edition, Wiley. ISBN 0-471-31009-3
- [2] Dodge, Y. (2003) The Oxford Dictionary of Statistical Terms, OUP. ISBN 0-19-850994-4
- [3] Trochim, William M. K. (2006). "Descriptive statistics". Research Methods Knowledge Base. http://www.socialresearchmethods.net/kb/statdesc.php. Retrieved 14 March 2011.
External links
- Descriptive Statistics Lecture: University of Pittsburgh Supercourse: http://www.pitt.edu/~super1/lecture/lec0421/index.htm
Wikimedia Foundation. 2010.