AHSS Numerical summaries and box plots

Section 2.2 Numerical summaries and box plots

What are the different ways to measure the center of a distribution, and why is there more than one way to measure the center? How do you know if a value is “far” from the center? What does it mean to be an outlier? We will continue with the email50 data set and investigate multiple quantitative summaries for numerical data.

Subsection 2.2.1 Learning objectives

Calculate, interpret, and compare the two measures of center (mean and median) and the three measures of spread (standard deviation, interquartile range, and range).
Understand how the shape of a distribution affects the relationship between the mean and the median.
Identify and apply the two rules of thumb for identifying outliers (one involving standard deviation and mean and the other involving $Q_1$ and $Q_3$).
Describe the distribution of a numerical variable with respect to center, spread, and shape, noting the presence of outliers.
Find the 5 number summary and IQR, and draw a box plot with outliers shown.
Understand the effect changing units has on each of the summary quantities.
Use the empirical rule to summarize approximately normal distributions.
Use quartiles, percentiles, and Z-scores to measure the relative position of a data point within the data set.
Compare the distribution of a numerical variable using dot plots / histograms with the same scale, back-to-back stem-and-leaf plots, or parallel box plots. Compare the distributions with respect to center, spread, shape, and outliers.

Subsection 2.2.2 Measures of center

In the previous section, we saw that modes can occur anywhere in a data set. Therefore, the mode is not a measure of center. We understand the term center intuitively, but quantifying the center can be a little more challenging because there are different definitions of center. Here we will focus on the two most common: the mean and the median.

The mean, sometimes called the average, is a common way to measure the center of a distribution of data. To find the mean number of characters in the 50 emails, we add up all the character counts and divide by the number of emails. For computational convenience, the number of characters is listed in the thousands and rounded to the first decimal.

\begin{equation} \bar{x} = \frac{21.7 + 7.0 + \cdots + 15.8}{50} = 11.6\label{sampleMeanEquation}\tag{2.2.1} \end{equation}

The sample mean is often labeled $\bar{x}\text{.}$ The letter $x$ is a generic placeholder for the variable of interest, num_char, and the bar over the $x$ indicates an average. Here, the average number of characters in the 50 emails was 11,600.

Mean.

The sample mean of a numerical variable is computed as the sum of all of the observations divided by the number of observations:

\begin{equation} \bar{x} = \frac{1}{n}\sum{x_{i}} = \frac{\sum{x_i}}{n}=\frac{x_1+x_2+\cdots+x_n}{n}\label{meanEquation}\tag{2.2.2} \end{equation}

where $\sum$ is the capital Greek letter sigma and $\sum{x_{i}}$ means take the sum of all the individual $x$ values. $x_1, x_2, \dots, x_n$ represent the $n$ observed values.
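The formula in (2.2.2) can be sketched in a few lines of Python. The function name `sample_mean` and the five values below are assumptions made for illustration: three of the values appear in (2.2.1), and the other two are made up, so the result is not the email50 mean.

```python
def sample_mean(values):
    """Sum of all observations divided by the number of observations."""
    n = len(values)
    return sum(values) / n

# Illustrative character counts (in thousands); NOT the full email50 data.
num_char = [21.7, 7.0, 0.6, 9.8, 15.8]

x_bar = sample_mean(num_char)
print(f"n = {len(num_char)}, mean = {x_bar:.2f} thousand characters")
```

Each entry of the list plays the role of one $x_i\text{,}$ and `len(values)` plays the role of $n\text{.}$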

Checkpoint 2.2.1.

Examine (2.2.1) and (2.2.2) above. What does $x_1$ correspond to? And $x_2\text{?}$ What does $x_i$ represent?

$x_{1}$ corresponds to the number of characters in the first email in the sample (21.7, in thousands), $x_{2}$ to the number of characters in the second email (7.0, in thousands), and $x_{i}$ corresponds to the number of characters in the $i^{\text{th}}$ email in the data set.

Checkpoint 2.2.2.

What was $n$ in this sample of emails?

The sample size was $n=50\text{.}$

The email50 data set represents a sample from a larger population of emails that were received in January and March. We could compute a mean for this population in the same way as the sample mean; however, the population mean has a special label: $\mu\text{.}$ The symbol $\mu$ is the Greek letter mu and represents the average of all observations in the population. Sometimes a subscript, such as $_x\text{,}$ is used to indicate which variable the population mean refers to, e.g. $\mu_x\text{.}$
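To make the distinction between $\bar{x}$ and $\mu$ concrete, the following is a minimal simulation sketch. The population here is entirely hypothetical (1,000 randomly generated values standing in for character counts in thousands); it is not the email data set, and the names `population`, `mu`, and `x_bar` are chosen only for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 1,000 made-up character counts (in thousands).
population = [rng.expovariate(1 / 10) for _ in range(1000)]
mu = statistics.mean(population)      # population mean

sample = rng.sample(population, 50)   # simple random sample, n = 50
x_bar = statistics.mean(sample)       # sample mean, an estimate of mu

print(f"mu = {mu:.2f}, x_bar = {x_bar:.2f}")
```

The two numbers will usually be close but not equal: $\bar{x}$ depends on which 50 observations happened to be sampled, while $\mu$ is fixed for the population.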

Example 2.2.3.

The average number of characters across all emails can be estimated using the sample data. Based on the sample of 50 emails, what would be a reasonable estimate of $\mu_x\text{,}$ the mean number of characters in all emails in the email data set? (Recall that email50 is a sample from email.)

Symbol	Meaning	Symbol	Meaning
\(\bar{x}\)	Mean	`n`	Sample size or # of data points
\(\Sigma x\)	Sum of all the data values	`minX`	Minimum
\(\Sigma x^2\)	Sum of all the squared data values	\(Q_{1}\)	First quartile
\(S_{x}\)	Sample standard deviation	`Med`	Median
\(\sigma_{x}\)	Population standard deviation	`maxX`	Maximum
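The quantities in the table above can also be computed directly, which may help clarify what each label means. The sketch below uses Python's standard `statistics` module. The function name `one_var_stats`, the dictionary keys, and the data are assumptions for illustration. The first quartile is taken as the median of the lower half of the sorted data, which is one common convention; other software may use a different rule and give a slightly different $Q_1\text{.}$

```python
import statistics

def one_var_stats(data):
    """Return the summary quantities listed in the table for a list of numbers."""
    xs = sorted(data)
    n = len(xs)
    lower_half = xs[: n // 2]  # middle value is excluded when n is odd
    return {
        "mean": statistics.mean(xs),
        "sum_x": sum(xs),
        "sum_x2": sum(x * x for x in xs),
        "Sx": statistics.stdev(xs),        # sample SD, divides by n - 1
        "sigma_x": statistics.pstdev(xs),  # population SD, divides by n
        "n": n,
        "minX": xs[0],
        "Q1": statistics.median(lower_half),
        "Med": statistics.median(xs),
        "maxX": xs[-1],
    }

# Hypothetical data, chosen only to keep the arithmetic easy to check.
summary = one_var_stats([2, 4, 4, 4, 5, 5, 7, 9])
for name, value in summary.items():
    print(name, value)
```

Note that `Sx` and `sigma_x` differ only in the divisor, so `Sx` is always at least as large as `sigma_x` for the same data.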


	robust		not robust
scenario	median	IQR	\(\bar{x}\)	\(s\)

original `num_char` data	6,890	12,875	11,600	13,130
drop 64,401 observation	6,768	11,702	10,521	10,798
move 64,401 to 150,000	6,890	12,875	13,310	22,434
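The pattern in the table above, where the median and IQR stay put while the mean and standard deviation shift, can be reproduced on a small scale. The sketch below uses made-up values rather than the `num_char` data, and it computes the IQR from the medians of the lower and upper halves, which is one common convention among several. The helper names `iqr` and `summarize` are hypothetical.

```python
import statistics

def iqr(data):
    """Q3 - Q1, with quartiles taken as medians of the lower and upper halves."""
    xs = sorted(data)
    n = len(xs)
    lower = xs[: n // 2]
    upper = xs[(n + 1) // 2 :]
    return statistics.median(upper) - statistics.median(lower)

def summarize(data):
    return {
        "median": statistics.median(data),
        "IQR": iqr(data),
        "mean": statistics.mean(data),
        "s": statistics.stdev(data),
    }

# Hypothetical values with one large observation (not the num_char data).
original = [1, 3, 5, 7, 9, 11, 64]
moved = [1, 3, 5, 7, 9, 11, 150]  # largest value moved farther out

before = summarize(original)
after = summarize(moved)
print("before:", before)
print("after: ", after)
```

Moving the largest value changes neither the middle observation nor the quartiles, so the median and IQR are unchanged, while the mean and standard deviation both increase.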

Median Income for 150 Counties, in $1000s

Population Gain						No Population Gain
38.2	43.6	42.2	61.5	51.1	45.7	48.3	60.3	50.7
44.6	51.8	40.7	48.1	56.4	41.9	39.3	40.4	40.3
40.6	63.3	52.1	60.3	49.8	51.7	57	47.2	45.9
51.1	34.1	45.5	52.8	49.1	51	42.3	41.5	46.1
80.8	46.3	82.2	43.6	39.7	49.4	44.9	51.7	46.4
75.2	40.6	46.3	62.4	44.1	51.3	29.1	51.8	50.5
51.9	34.7	54	42.9	52.2	45.1	27	30.9	34.9
61	51.4	56.5	62	46	46.4	40.7	51.8	61.1
53.8	57.6	69.2	48.4	40.5	48.6	43.4	34.7	45.7
53.1	54.6	55	46.4	39.9	56.7	33.1	21	37
63	49.1	57.2	44.1	50	38.9	52	31.9	45.7
46.6	46.5	38.9	50.9	56	34.6	56.3	38.7	45.7
74.2	63	49.6	53.7	77.5	60	56.2	43	21.7
63.2	47.6	55.9	39.1	57.8	42.6	44.5	34.5	48.9
50.4	49	45.6	39	38.8	37.1	50.9	42.1	43.2
57.2	44.7	71.7	35.3	100.2		35.4	41.3	33.6
42.6	55.5	38.6	52.7	63		43.4	56.5


gender	age	maritalStatus	grossIncome	smoke	amtWeekends	amtWeekdays

Female	51	Married	£2,600 to £5,200	Yes	20 cig/day	20 cig/day
Male	24	Single	£10,400 to £15,600	Yes	20 cig/day	15 cig/day
Female	33	Married	£10,400 to £15,600	Yes	20 cig/day	10 cig/day
Female	17	Single	£5,200 to £10,400	Yes	20 cig/day	15 cig/day
Female	76	Widowed	£5,200 to £10,400	Yes	20 cig/day	20 cig/day
