CO-4: Distinguish among different measurement scales, choose the appropriate descriptive and inferential statistical methods based on these distinctions, and interpret the results.

LO 4.4: Using appropriate graphical displays and/or numerical measures, describe the distribution of a quantitative variable in context: a) describe the overall pattern, b) describe striking deviations from the pattern.

LO 4.7: Define and describe the features of the distribution of one quantitative variable (shape, center, spread, outliers).

So far we have learned about different ways to quantify the center of a distribution. A measure of center by itself is not enough, though, to describe a distribution. Consider the following two distributions of exam scores. Both distributions are centered at 70 (the median of both distributions is approximately 70), but the distributions are quite different. The first distribution has much larger variability in scores than the second one.

In order to describe the distribution, we therefore need to supplement the graphical display not only with a measure of center, but also with a measure of the variability (or spread) of the distribution. In this section, we will discuss the three most commonly used measures of spread: the range, the inter-quartile range (IQR), and the standard deviation.
Although the measures of center approach the question differently, they all attempt to measure the same point in the distribution and are therefore comparable. The three measures of spread, however, provide very different ways to quantify the variability of the distribution and do not try to estimate the same quantity. In fact, they provide information about three different aspects of the spread which, together, give a more complete picture of the spread of the distribution.

Range

LO 4.11: Define and calculate the range of one quantitative variable.

The range covered by the data is the most intuitive measure of variability. The range is exactly the distance between the smallest data point (min) and the largest one (Max).
Note: When we first looked at the histogram and tried to get a first feel for the spread of the data, we were actually approximating the range rather than calculating the exact range.

EXAMPLE: Best Actress Oscar Winners

Here we have the Best Actress Oscar winners’ data:

34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33

In this example, the min is 21 and the Max is 80, so the range covered by all the data is 80 – 21 = 59 years.

Inter-Quartile Range (IQR)

LO 4.12: Define and calculate Q1, Q3, and the IQR for one quantitative variable.

While the range quantifies the variability by looking at the range covered by ALL the data, the IQR looks at the range covered by the middle 50% of the data.

The following picture illustrates this idea: (Think about the horizontal line as the data ranging from the min to the Max.)

IMPORTANT NOTE: The “lines” in the following illustrations are not to scale. The equal distances indicate equal amounts of data, NOT equal distances between the numeric values.

Although we will use software to calculate the quartiles and IQR, we will illustrate the basic process to help you fully understand. To calculate the IQR:

1. Arrange the data in increasing order and find the median.
2. Find Q1, the median of the bottom half of the data.
3. Find Q3, the median of the top half of the data.
4. Calculate the IQR as Q3 – Q1.
Comments:
Note that when n is odd (as in n = 7 above), the median is not included in either the bottom or top half of the data; when n is even (as in n = 8 above), the data are naturally divided into two halves.

EXAMPLE: Best Actress Oscar Winners

To find the IQR of the Best Actress Oscar winners’ distribution, it will be convenient to use the stemplot.

Q1 is the median of the bottom half of the data. Since there are 16 observations in that half, Q1 is the mean of the 8th and 9th ranked observations in that half:

Q1 = (31 + 33) / 2 = 32

Similarly, Q3 is the median of the top half of the data, and since there are 16 observations in that half, Q3 is the mean of the 8th and 9th ranked observations in that half:

Q3 = (41 + 42) / 2 = 41.5

IQR = 41.5 – 32 = 9.5

Note that in this example, the range covered by all the ages is 59 years, while the range covered by the middle 50% of the ages is only 9.5 years. While the whole dataset is spread over a range of 59 years, the middle 50% of the data is packed into only 9.5 years. Looking again at the histogram will illustrate this:

Comment:
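The hand method used in this example can be sketched in a few lines of Python. This is only an illustration of the median-of-halves approach described in this section; the function name quartiles_by_halves is made up for the sketch, and statistical packages generally follow other conventions.

```python
from statistics import median

# Ages of the Best Actress Oscar winners, copied from the example above.
ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
        21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]


def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the bottom and top halves.

    Hypothetical helper for this sketch. When n is odd, the overall
    median is left out of both halves, as in the hand method above.
    """
    if len(values) < 2:
        raise ValueError("need at least two observations")
    ordered = sorted(values)
    half = len(ordered) // 2      # size of each half
    bottom = ordered[:half]       # smallest `half` observations
    top = ordered[-half:]         # largest `half` observations
    return median(bottom), median(top)


data_range = max(ages) - min(ages)   # distance from min to Max
q1, q3 = quartiles_by_halves(ages)
iqr = q3 - q1                        # range covered by the middle 50%

print("range:", data_range)
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```

Library routines such as statistics.quantiles in the Python standard library use interpolation rules of their own, so they may report slightly different quartiles for the same data.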
[Output: R, Minitab, Excel]

Q1 and Q3 as reported by the various software packages differ from each other and are also slightly different from the ones we found here. This should not worry you. There are different acceptable ways to find the median and the quartiles. These can occasionally give different results, especially for datasets where n (the number of observations) is fairly small. As long as you know what the numbers mean, and how to interpret them in context, it doesn’t really matter much what method you use to find them, since the differences are negligible.

Standard Deviation

LO 4.13: Define and calculate the standard deviation and variance of one quantitative variable.

So far, we have introduced two measures of spread: the range (covered by all the data) and the inter-quartile range (IQR), which looks at the range covered by the middle 50% of the distribution. We also noted that the IQR should be paired as a measure of spread with the median as a measure of center. We now move on to another measure of spread, the standard deviation, which quantifies the spread of a distribution in a completely different way.

Idea

The idea behind the standard deviation is to quantify the spread of a distribution by measuring how far the observations are from their mean. The standard deviation gives the average (or typical) distance between a data point and the mean.

Notation

There are many notations for the standard deviation: SD, s, Sd, StDev. Here, we’ll use SD as an abbreviation for standard deviation, and use s as the symbol.

Formula

The sample standard deviation formula is:

$$ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} $$

where s = sample standard deviation, n = number of scores in sample, $\sum$ = sum of…, and $\bar{x}$ = sample mean.

Calculation

In order to get a better understanding of the standard deviation, it would be useful to see an example of how it is calculated. In practice, we will use a computer to do the calculation.
EXAMPLE: Video Store Customers

The following are the number of customers who entered a video store in 8 consecutive hours:

7, 9, 5, 13, 3, 11, 15, 9

To find the standard deviation of the number of hourly customers:

1. Find the mean of the data:
(7 + 9 + 5 + 13 + 3 + 11 + 15 + 9)/8 = 9

2. Find the deviations from the mean:
(7 – 9), (9 – 9), (5 – 9), (13 – 9), (3 – 9), (11 – 9), (15 – 9), (9 – 9)
-2, 0, -4, 4, -6, 2, 6, 0

3. Square each deviation:
(-2)², (0)², (-4)², (4)², (-6)², (2)², (6)², (0)²
4, 0, 16, 16, 36, 4, 36, 0

4. Add up the squared deviations and divide by n – 1:
(4 + 0 + 16 + 16 + 36 + 4 + 36 + 0)/(8 – 1)
(112)/(7) = 16

5. Take the square root of the result:
s = 4
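The same calculation can be checked with a short Python sketch. It follows the hand calculation above one quantity at a time; the variable names are chosen for this illustration only, and the final comparison assumes the standard library's statistics.stdev, which computes the sample standard deviation (dividing by n – 1).

```python
from math import sqrt
from statistics import stdev

# Hourly customer counts from the video store example.
customers = [7, 9, 5, 13, 3, 11, 15, 9]

n = len(customers)
mean = sum(customers) / n                        # mean of the data
deviations = [x - mean for x in customers]       # deviations from the mean
squared = [d ** 2 for d in deviations]           # squared deviations
variance = sum(squared) / (n - 1)                # sum divided by n - 1
sd = sqrt(variance)                              # square root of the variance

print("mean:", mean)
print("variance:", variance)
print("standard deviation:", sd)

# Cross-check against the library routine for the sample standard deviation.
print("statistics.stdev:", stdev(customers))
```

Working through the intermediate lists by hand once makes it easier to trust the single library call used in practice.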
Recall that the average number of customers who enter the store in an hour is 9. The interpretation of the standard deviation is that, on average, the actual number of customers who enter the store each hour is 4 away from 9.

Comment: The importance of the numerical figure that we found in step 4 above, called the variance (= 16 in our example), will be discussed much later in the course when we get to the inference part.

Properties of the Standard Deviation
The last comment leads to the following very important conclusion:

Choosing Numerical Measures

LO 4.10: Choose the appropriate measures for a quantitative variable based upon the shape of the distribution.
Let’s Summarize
What is meant by the spread of a distribution? The spread is the amount of variation in the data. It tells us the range of values that we would expect to see. The shape, by contrast, shows how that variation is distributed about the center.
What are measures of spread? Measures of spread describe how similar or varied the set of observed values is for a particular variable (data item). Measures of spread include the range, the quartiles and the interquartile range, the variance, and the standard deviation.
Which type of plot shows the median and the data spread about the median? A box and whisker plot (also called a box plot) displays the five-number summary of a set of data. The five-number summary is the minimum, first quartile, median, third quartile, and maximum. In a box plot, we draw a box from the first quartile to the third quartile. A vertical line goes through the box at the median.
Which of the following is a measure of variability in a distribution of scores? The standard deviation is the most commonly used measure of variability, although the appropriate measure depends on the shape of the distribution.