## Section4.2Sampling distribution of a sample mean

If bags of chips are produced with an average weight of 15 oz and a standard deviation of 0.1 oz, what is the probability that the average weight of 30 bags will be within 0.1 oz of the mean? The answer is not 68%! To answer this question we must visualize and understand what is called the sampling distribution of a sample mean.

### Subsection4.2.1Learning objectives

1. Understand the concept of a sampling distribution.

2. Describe the center, spread, and shape of the sampling distribution of a sample mean.

3. Distinguish between the standard deviation of a population and the standard deviation of a sampling distribution.

4. Explain the content and importance of the Central Limit Theorem.

5. Identify and explain the conditions for using normal approximation involving a sample mean.

6. Verify that the conditions for normal approximation are met and carry out normal approximation involving a sample mean or sample sum.

### Subsection4.2.2The mean and standard deviation of $\bar{x}$

In this section we consider a data set called run17, which represents all 19,961 runners who finished the 2017 Cherry Blossom 10 mile run in Washington, DC. 1  Part of this data set is shown in Table 4.2.1, and the variables are described in Table 4.2.2.

These data are special because they include the results for the entire population of runners who finished the 2017 Cherry Blossom Run. We took a simple random sample of this population, which is represented in Table 4.2.3. A histogram summarizing the time variable in the run17samp data set is shown in Figure 4.2.4.

From the random sample represented in run17samp, we guessed the average time it takes to run 10 miles is 95.61 minutes. Suppose we take another random sample of 100 individuals and take its mean: 95.30 minutes. Suppose we took another (93.43 minutes) and another (94.16 minutes), and so on. If we do this many many times — which we can do only because we have the entire population data set — we can build up a sampling distribution for the sample mean when the sample size is 100, shown in Figure 4.2.5.

###### Sampling distribution.

The sampling distribution represents the distribution of the point estimates based on samples of a fixed size from a certain population. It is useful to think of a point estimate as being drawn from such a distribution. Understanding the concept of a sampling distribution is central to understanding statistical inference.

The sampling distribution shown in Figure 4.2.5 is unimodal and approximately symmetric. It is also centered exactly at the true population mean: $\mu=94.52\text{.}$ Intuitively, this makes sense. The sample mean should be an unbiased estimator of the population mean. Because we are considering the distribution of the sample mean, we will use $\mu_{\bar{x}} = 94.52$ to describe the true mean of this distribution.

We can see that the sample mean has some variability around the population mean, which can be quantified using the standard deviation of this distribution of sample means. The standard deviation of the sample mean tells us how far the typical estimate is away from the actual population mean, 94.52 minutes. It also describes the typical error of a single estimate, and is denoted by the symbol $\sigma_{\bar{x}}\text{.}$

###### Standard deviation of an estimate.

The standard deviation associated with an estimate describes the typical error or uncertainty associated with the estimate.

Looking at Figure 4.2.4 and Figure 4.2.5, we see that the standard deviation of the sample mean with $n=100$ is much smaller than the standard deviation of a single sample. Interpret this statement and explain why it is true.

Solution

The variation from one sample mean to another sample mean is much smaller than the variation from one individual to another individual. This makes sense because when we average over 100 values, the large and small values tend to cancel each other out. While many individuals have a time under 90 minutes, it would be unlikely for the average of 100 runners to be less than 90 minutes.

(a) Would you rather use a small sample or a large sample when estimating a parameter? Why? (b) Using your reasoning from (a), would you expect a point estimate based on a small sample to have smaller or larger standard deviation than a point estimate based on a larger sample?  2

(a) Consider two random samples: one of size 10 and one of size 1000. Individual observations in the small sample are highly influential on the estimate while in larger samples these individual observations would more often average each other out. The larger sample would tend to provide a more accurate estimate. (b) If we think an estimate is better, we probably mean it typically has less error. Based on (a), our intuition suggests that a larger sample size corresponds to a smaller standard deviation.

When considering how to calculate the standard deviation of a sample mean, there is one problem: there is no obvious way to estimate this from a single sample. However, statistical theory provides a helpful tool to address this issue.

In the sample of 100 runners, the standard deviation of the sample mean is equal to one-tenth of the population standard deviation: $15.93/10 = 1.59\text{.}$ In other words, the standard deviation of the sample mean based on 100 observations is equal to

\begin{gather*} SD_{\bar{x}} = \sigma_{\bar{x}} = \frac{\sigma_{x}}{\sqrt{n}} = \frac{15.93}{\sqrt{100}} = 1.59 \end{gather*}

where $\sigma_{x}$ is the standard deviation of the individual observations. This is no coincidence. We can show mathematically that this equation is correct when the observations are independent using the probability tools of Section 3.5.

###### Computing SD for the sample mean.

Given $n$ independent observations from a population with standard deviation $\sigma\text{,}$ the standard deviation of the sample mean is equal to

$$SD_{\bar{x}} = \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\label{sd_sample_mean}\tag{4.2.1}$$

A reliable method to ensure sample observations are independent is to conduct a simple random sample consisting of less than 10% of the population.

The average of the runners' ages is 35.05 years with a standard deviation of $\sigma = 8.97\text{.}$ A simple random sample of 100 runners is taken. (a) What is the standard deviation of the sample mean? (b) Would you be surprised to get a sample of size 100 with an average of 36 years?  3

Use (4.2.1) with the population standard deviation to compute the standard deviation of the sample mean: $SD_{y}=8.97 \sqrt{100}=0.90$ years. (b) It would not be surprising. 36 years is about 1 standard deviation from the true mean of 35.05. Based on the 68, 95 rule, we would get a sample mean at least this far away from the true mean approximately 100%-68% = 32% of the time.

(a) Would you be more trusting of a sample that has 100 observations or 400 observations? (b) We want to show mathematically that our estimate tends to be better when the sample size is larger. If the standard deviation of the individual observations is 10, what is our estimate of the standard deviation of the mean when the sample size is 100? What about when it is 400? (c) Explain how your answer to (b) mathematically justifies your intuition in part (a).  4

(a) Extra observations are usually helpful in understanding the population, so a point estimate with 400 observations seems more trustworthy. (b) The standard deviation of the mean when the sample size is 100 is given by $SD_{100}=10/ \sqrt{100} = 1\text{.}$ For $SD_{400}=10 / \sqrt{400}=0.5\text{.}$ The larger sample has a smaller standard deviation ofthe mean. ( c) The standard deviation of the mean of the sample with 400 observations is lower than that of the sample with 100 observations. The standard deviation of $\bar{x}$ describes the typical error, and since it is lower for the larger sample, this mathematically shows the estimate from the larger sample tends to be better - though it does not guarantee that every large sample will provide a better estimate than a particular small sample.

### Subsection4.2.3Examining the Central Limit Theorem

In Figure 4.2.5, the sampling distribution of the sample mean looks approximately normally distributed. Will the sampling distribution of a mean always be nearly normal? To address this question, we will investigate three cases to see roughly when the approximation is reasonable.

We consider three data sets: one from a uniform distribution, one from an exponential distribution, and the other from a normal distribution. These distributions are shown in the top panels of Figure 4.2.10. The uniform distribution is symmetric, and the exponential distribution may be considered as having moderate skew since its right tail is relatively short (few outliers).

The left panel in the $n=2$ row represents the sampling distribution of $\bar{x}$ if it is the sample mean of two observations from the uniform distribution shown. The dashed line represents the closest approximation of the normal distribution. Similarly, the center and right panels of the $n=2$ row represent the respective distributions of $\bar{x}$ for data from exponential and log-normal distributions.

Examine the distributions in each row of Figure 4.2.10. What do you notice about the sampling distribution of the mean as the sample size, $n\text{,}$ becomes larger?  5

The normal approximation becomes better as larger samples are used. However, in the case when the population is normally distributed, the normal distribution of the sample mean is normal for all sample sizes.

In general, would normal approximation for a sample mean be appropriate when the sample size is at least 30?

Solution

Yes, the sampling distributions when $n = 30$ all look very much like the normal distribution.

However, the more non-normal a population distribution, the larger a sample size is necessary for the sampling distribution to look nearly normal.

###### Determining if the sample mean is normally distributed.

If the population is normal, the sampling distribution of $\bar{x}$ will be normal for any sample size.

The less normal the population, the larger $n$ needs to be for the sampling distribution of $\bar{x}$ to be nearly normal. However, a good rule of thumb is that for almost all populations, the sampling distribution of $\bar{x}$ will be approximately normal if $n \ge 30\text{.}$

This brings us to the Central Limit Theorem, the most fundamental theorem in Statistics.

###### Central Limit Theorem.

When taking a random sample of independent observations from a population with a fixed mean and standard deviation, the distribution of $\bar{x}$ approaches the normal distribution as $n$ increases.

Sometimes we do not know what the population distribution looks like. We have to infer it based on the distribution of a single sample. Figure 4.2.14 shows a histogram of 20 observations. These represent winnings and losses from 20 consecutive days of a professional poker player. Based on this sample data, can the normal approximation be applied to the distribution of the sample mean?

Solution

We should consider each of the required conditions.

1. These are referred to as time series data, because the data arrived in a particular sequence. If the player wins on one day, it may influence how she plays the next. To make the assumption of independence we should perform careful checks on such data.

2. The sample size is 20, which is smaller than 30.

3. There are two outliers in the data, both quite extreme, which suggests the population may not be normal and instead may be very strongly skewed or have distant outliers. Outliers can play an important role and affect the distribution of the sample mean and the estimate of the standard deviation of the sample mean.

Since we should be skeptical of the independence of observations and the extreme upper outliers pose a challenge, we should not use the normal model for the sample mean of these 20 observations. If we can obtain a much larger sample, then the concerns about skew and outliers would no longer apply.

###### Examine data structure when considering independence.

Some data sets are collected in such a way that they have a natural underlying structure between observations, e.g. when observations occur consecutively. Be especially cautious about independence assumptions regarding such data sets.

###### Watch out for strong skew and outliers.

Strong skew in the population is often identified by the presence of clear outliers in the data. If a data set has prominent outliers, then a larger sample size will be needed for the sampling distribution of $\bar{x}$ to be normal. There are no simple guidelines for what sample size is big enough for each situation. However, we can use the rule of thumb that, in general, an $n$ of at least 30 is sufficient for most cases.

### Subsection4.2.4Normal approximation for the sampling distribution of $\bar{x}$

At the beginning of this chapter, we used normal approximation for populations or for data that had an approximately normal distribution. When appropriate conditions are met, we can also use the normal approximation to estimate probabilities about a sample average. We must remember to verify that the conditions are met and use the mean $\mu_{\bar{x}}$ and standard deviation $\sigma_{\bar{x}}$ for the sampling distribution of the sample average.

Three important facts about the distribution of a sample mean $\bar{x}$ Consider taking a simple random sample from a large population.

1. The mean of a sample mean is denoted by $\mu_{\bar{x}}\text{,}$ and it is equal to $\mu\text{.}$

2. The SD of a sample mean is denoted by $\sigma_{\bar{x}}\text{,}$ and it is equal to $\frac{\sigma}{\sqrt{n}}\text{.}$

3. When the population is normal or when $n\ge 30\text{,}$ the sample mean closely follows a normal distribution.

In the 2012 Cherry Blossom 10 mile run, the average time for all of the runners is 94.52 minutes with a standard deviation of 8.97 minutes. The distribution of run times is approximately normal. Find the probabiliy that a randomly selected runner completes the run in less than 90 minutes.

Solution

Because the distribution of run times is approximately normal, we can use normal approximation.

\begin{align*} \amp Z = \frac{x - \mu}{\sigma}=\frac{90-94.52}{8.97}=-0.504\\ \amp P(Z \lt -0.504) = 0.3072 \end{align*}

There is a 30.72% probability that a randomly selected runner will complete the run in less than 90 minutes.

Find the probabiliy that the average of 20 runners is less than 90 minutes.

Solution

Here, $n=20\lt 30\text{,}$ but the distribution of the population, that is, the distribution of run times is stated to be approximately normal. Because of this, the sampling distribution will be normal for any sample size.

\begin{align*} \amp \sigma_{\bar{x}}=\frac{\sigma}{\sqrt{n}}=\frac{8.97}{\sqrt{20}}=2.01\\ \amp Z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}}=\frac{90-94.52}{2.01}=-2.25\\ \amp P(Z \lt -2.25) = 0.0123 \end{align*}

There is a 1.23% probability that the average run time of 20 randomly selected runners will be less than 90 minutes.

The average of all the runners' ages is 35.05 years with a standard deviation of $\sigma = 8.97\text{.}$ The distribution of age is somewhat skewed. What is the probability that a randomly selected runner is older than 37 years?

Solution

Because the distribution of age is skewed and is not normal, we cannot use normal approximation for this problem. In order to answer this question, we would need to look at all of the data.

What is the probability that the average of 50 randomly selected runners is greater than 37 years?  6

Because $n = 50 \ge 30\text{,}$ the sampling distribution of the mean is approximately normal, so we can use normal approximation for this problem. The mean is given as 35.05 years. $\sigma_{x} = \frac{\sigma}{\sqrt{n}} = \frac{8.97}{\sqrt{50}} = 1.27\text{,}$ $z = \frac{\bar{x} - \mu_{x}}{\sigma_{x}} = \frac{37-35.05}{1.27} = 1.535\text{,}$ $P(Z>1.535)=0.062$ There is a 6.2% chance that the average age of 50 runners will be greater than 37.
###### Remember to divide by $\sqrt{{n}}$.

When finding the probability that an average or mean is greater or less than a particular value, remember to divide the standard deviation of the population by $\sqrt{n}$ to calculate the correct SD.

### Subsection4.2.5Section summary

• The symbol $\bar{x}$ denotes the sample average. $\bar{x}$ for any particular sample is a number. However, $\bar{x}$ can vary from sample to sample. The distribution of all possible values of $\bar{x}$ for repeated samples of a fixed size from a certain population is called the sampling distribution of $\bar{x}\text{.}$

• The standard deviation of $\bar{x}$ describes the typical error or distance of the sample mean from the population mean. It also tells us how much the sample mean is likely to vary from one random sample to another.

• The standard deviation of $\bar{x}$ will be smaller than the standard deviation of the population by a factor of $\sqrt{n}\text{.}$ The larger the sample, the better the estimate tends to be.

• Consider taking a simple random sample from a population with a fixed mean and standard deviation. The Central Limit Theorem ensures that regardless of the shape of the original population, as the sample size increases, the distribution of the sample average $\bar{x}$ becomes more normal.

• Three important facts about the sampling distribution of the sample average $\bar{x}\text{:}$

• The mean of a sample mean is denoted by $\mu_{\bar{x}}\text{,}$ and it is equal to $\mu\text{.}$ (center)

• The SD of a sample mean is denoted by $\sigma_{\bar{x}}\text{,}$ and it is equal to $\frac{\sigma}{\sqrt{n}}\text{.}$ (spread)

• When the population is normal or when $n\ge 30\text{,}$ the sample mean closely follows a normal distribution. (shape)

• These facts are used when solving the following two types of normal approximation problems involving a sample mean or a sample sum.

1. Find the probability that a sample average will be greater/less than a certain value.

1. Verify that the population is approximately normal or that $n \ge 30\text{.}$

2. Calculate the Z-score. Use $\mu_{\bar{x}}=\mu$ and $\sigma_{\bar{x}}=\frac{\sigma}{\sqrt{n}}$ to standardize the sample average.

3. Find the appropriate area under the normal curve.

2. Find the probability that a sample sum/total will be greater/less than a certain value.

1. Convert the sample sum into a sample average, using $\bar{x} = \frac{sum}{n}\text{.}$

2. Do steps 1-3 from Part A above.

### Exercises4.2.6Exercises

###### 1.Ages of pennies, Part I.

The histogram below shows the distribution of ages of pennies at a bank.

1. Describe the distribution.

2. Sampling distributions for means from simple random samples of 5, 30, and 100 pennies is shown in the histograms below. Describe the shapes of these distributions and comment on whether they look like what you would expect to see based on the Central Limit Theorem.

Solution

(a) The distribution is unimodal and strongly right skewed with a median between 5 and 10 years old. Ages range from 0 to slightly over 50 years old, and the middle 50% of the distribution is roughly between 5 and 15 years old. There are potential outliers on the higher end.

(b) When the sample size is small, the sampling distribution is right skewed, just like the population distribution. As the sample size increases, the sampling distribution gets more unimodal, symmetric, and approaches normality. The variability also decreases. This is consistent with the Central Limit Theorem.

###### 2.Ages of pennies, Part II.

The mean age of the pennies from Exercise 4.2.6.1 is 10.44 years with a standard deviation of 9.2 years. Using the Central Limit Theorem, calculate the means and standard deviations of the distribution of the mean from random samples of size 5, 30, and 100. Comment on whether the sampling distributions shown in Exercise 4.2.6.1 agree with the values you compute.

###### 3.Housing prices, Part I.

A housing survey was conducted to determine the price of a typical home in Topanga, CA. The mean price of a house was roughly $1.3 million with a standard deviation of$300,000. There were no houses listed below $600,000 but a few houses above$3 million.

1. Is the distribution of housing prices in Topanga symmetric, right skewed, or left skewed? Hint: Sketch the distribution.

2. Would you expect most houses in Topanga to cost more or less than $1.3 million? 3. Can we estimate the probability that a randomly chosen house in Topanga costs more than$1.4 million using the normal distribution?