Numerical summaries and box plots

Section 2.2 Numerical summaries and box plots

Subsection 2.2.1 Measures of center


Figure 2.2.1 Examining Numerical Data video

In the previous section, we saw that modes can occur anywhere in a data set, so the mode is not a measure of center. We understand the term center intuitively, but quantifying the center can be more challenging because there are different definitions of center. Here we will focus on the two most common: the mean and the median.

The mean, sometimes called the average, is a common way to measure the center of a distribution of data. To find the mean number of characters in the 50 emails, we add up all the character counts and divide by the number of emails. For computational convenience, the number of characters is listed in the thousands and rounded to the first decimal.

\begin{gather} \bar{x} = \frac{21.7 + 7.0 + \cdots + 15.8}{50} = 11.6\label{sampleMeanEquation}\tag{2.2.1} \end{gather}

The sample mean is often labeled $\bar{x}\text{.}$ The letter $x$ is a generic placeholder for the variable of interest, `num_char`, and the bar over the $x$ indicates that we are taking its average. Here, the average number of characters in the 50 emails was 11,600.

Mean

The sample mean of a numerical variable is computed as the sum of all of the observations divided by the number of observations:

\begin{gather} \bar{x} = \frac{1}{n}\sum{x_{i}} = \frac{x_1+x_2+\cdots+x_n}{n}\label{meanEquation}\tag{2.2.2} \end{gather}

where $\sum$ is the capital Greek letter sigma and $\sum{x_{i}}$ means take the sum of all the individual $x$ values. $x_1, x_2, \dots, x_n$ represent the $n$ observed values.
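As a minimal sketch of Equation (2.2.2), the Python function below sums the observations and divides by their count. The function name `sample_mean` and the five values in `chars` are illustrative assumptions; they are not taken from the email50 data set.

```python
def sample_mean(values):
    """Return the sum of the observations divided by the number of observations."""
    n = len(values)
    if n == 0:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / n


# Hypothetical character counts (in thousands) for five emails.
chars = [2.0, 4.0, 6.0, 8.0, 10.0]
x_bar = sample_mean(chars)
print(x_bar)
```

Here `chars[0]` plays the role of $x_1$, `chars[1]` the role of $x_2$, and `len(chars)` the role of $n$.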

Guided Practice 2.2.2

Examine Equation (2.2.1) and Equation (2.2.2) above. What does $x_1$ correspond to? And $x_2\text{?}$ What does $x_i$ represent?¹

¹ $x_1$ corresponds to the number of characters in the first email in the sample (21.7, in thousands), $x_2$ to the number of characters in the second email (7.0, in thousands), and $x_i$ corresponds to the number of characters in the $i^{th}$ email in the data set.

Guided Practice 2.2.3

What was $n$ in this sample of emails?²

² The sample size was $n=50\text{.}$

The email50 data set represents a sample from a larger population of emails that were received in January and March. We could compute a mean for this population in the same way as the sample mean; however, the population mean has a special label: $\mu\text{.}$ The symbol $\mu$ is the Greek letter mu and represents the average of all observations in the population. Sometimes a subscript, such as $_x\text{,}$ is used to indicate which variable the population mean refers to, e.g. $\mu_x\text{.}$

Example 2.2.4

The average number of characters across all emails can be estimated using the sample data. Based on the sample of 50 emails, what would be a reasonable estimate of $\mu_x\text{,}$ the mean number of characters in all emails in the email data set? (Recall that email50 is a sample from email.)
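To illustrate how a sample mean can serve as an estimate of a population mean, the sketch below computes both for a small made-up population. The lists `population` and `sample` are hypothetical values chosen for easy arithmetic; they are not the email or email50 data.

```python
from statistics import mean

# Hypothetical population of 8 character counts (in thousands); not the email data.
population = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0]

# Hypothetical sample: four observations taken from that population.
sample = [3.0, 5.0, 11.0, 15.0]

mu = mean(population)  # population mean
x_bar = mean(sample)   # sample mean, used as an estimate of mu

print(f"population mean = {mu}, sample mean = {x_bar}")
```

In this made-up case the sample mean (8.5) is close to, but not exactly equal to, the population mean (8.0), which is typical: an estimate from a sample will rarely match the population value exactly.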

Symbol	Meaning
$\bar{x}$	Mean
$\Sigma x$	Sum of all the data values
$\Sigma x^2$	Sum of all the squared data values
$\sigma x$	Population standard deviation
`n`	Sample size or number of data points
`minX`	Minimum
$Q_1$	First quartile
`Med`	Median
`maxX`	Maximum
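The quantities listed in the table above can also be computed in software. The following is a rough sketch using Python's standard `statistics` module on a hypothetical list named `data`. Note that `statistics.quantiles` uses its own default quartile convention, which is an assumption here and may not match the value a particular calculator reports for $Q_1$ on other data sets.

```python
from statistics import mean, median, pstdev, quantiles

# Hypothetical data values; any list of numbers would do.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# First of the three quartile cut points (default "exclusive" method).
q1, _, _ = quantiles(data, n=4)

summary = {
    "mean": mean(data),
    "sum": sum(data),
    "sum of squares": sum(x * x for x in data),
    "population sd": pstdev(data),
    "n": len(data),
    "min": min(data),
    "Q1": q1,
    "median": median(data),
    "max": max(data),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```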


Scenario	Median (robust)	IQR (robust)	$\bar{x}$ (not robust)	$s$ (not robust)
original `num_char` data	6,890	12,875	11,600	13,130
drop the 64,401 observation	6,768	11,702	10,521	10,798
move 64,401 to 150,000	6,890	12,875	13,310	22,434
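The pattern in the table above, where the median and IQR barely change while $\bar{x}$ and $s$ shift noticeably, can be reproduced on a small scale. The sketch below uses made-up data (not the `num_char` values) and a hypothetical helper named `summarize`; the quartiles come from the default convention of `statistics.quantiles`, so the IQR may differ slightly from other methods.

```python
from statistics import mean, median, quantiles, stdev


def summarize(values):
    """Return (median, IQR, mean, sample standard deviation) for a list of numbers."""
    q1, _, q3 = quantiles(values, n=4)  # default "exclusive" quartile method
    return median(values), q3 - q1, mean(values), stdev(values)


# Hypothetical data with one large observation (not the num_char data).
original = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 60]
# Same data with the largest observation moved much farther out.
moved = original[:-1] + [150]

for label, data in [("original", original), ("moved", moved)]:
    med, iqr, x_bar, s = summarize(data)
    print(f"{label:9s} median={med} IQR={iqr} mean={x_bar:.2f} s={s:.2f}")
```

Moving the largest value leaves the median and IQR of this made-up data unchanged, while the mean and standard deviation both increase, mirroring the last row of the table.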

population gain
41.2	33.1	30.4	37.3	79.1	34.5
22.9	39.9	31.4	45.1	50.6	59.4
47.9	36.4	42.2	43.2	31.8	36.9
50.1	27.3	37.5	53.5	26.1	57.2
57.4	42.6	40.6	48.8	28.1	29.4
43.8	26	33.8	35.7	38.5	42.3
41.3	40.5	68.3	31	46.7	30.5
68.3	48.3	38.7	62	37.6	32.2
42.6	53.6	50.7	35.1	30.6	56.8
66.4	41.4	34.3	38.9	37.3	41.7
51.9	83.3	46.3	48.4	40.8	42.6
44.5	34	48.7	45.2	34.7	32.2
39.4	38.6	40	57.3	45.2	33.1
43.8	71.7	45.1	32.2	63.3	54.7
71.3	36.3	36.4	41	37	66.7
50.2	45.8	45.7	60.2	53.1
35.8	40.4	51.5	66.4	36.1

no gain
40.3	33.5	34.8
29.5	31.8	41.3
28	39.1	42.8
38.1	39.5	22.3
43.3	37.5	47.1
43.7	36.7	36
35.8	38.7	39.8
46	42.3	48.2
38.6	31.9	31.1
37.6	29.3	30.1
57.5	32.6	31.1
46.2	26.5	40.1
38.4	46.7	25.9
36.4	41.5	45.7
39.7	37	37.7
21.4	29.3	50.1
43.6	39.8
