Sampling distribution of a sample proportion

Section 4.5 Sampling distribution of a sample proportion

The binomial distribution shows us the distribution of the number of successes in \(n\) trials. Often, we are interested in the proportion of successes rather than the number of successes. We would like to answer questions such as the following:

Approximately 20% of the US population smokes cigarettes. A random sample of size 400 from a particular county found that 15% of the sample smoked. If the smoking rate in this county really is 20%, what is the probability that the sample would contain 15% or fewer smokers?
Given a population that is 50% male, what is the probability that a sample of 200 people would consist of more than 55% males?

Subsection 4.5.1 The mean and standard deviation of \(\hat{p}\)

To answer these questions, we investigate the distribution of the sample proportion \(\hat{p}\text{.}\) In the last section we saw that the number of smokers in a sample of size 400 follows a binomial distribution with \(p=0.2\) and \(n=400\) that is centered on 80 and has standard deviation 8. What does the distribution of the proportion of smokers in a sample of size 400 look like? To convert from a count to a proportion, we divide the count (i.e. number of yeses) by the sample size, \(n = 400\text{.}\) For example, 60 becomes \(60/400 = 0.15\) as a proportion and 61 becomes \(61/400 = 0.1525\text{.}\)

We can find general formulas for the mean (expected value) and standard deviation of a sample proportion \(\hat{p}\) using the tools we've learned so far. To get the mean of \(\hat{p}\text{,}\) we divide the binomial mean \(\mu_{binomial} = np\) by \(n\text{:}\)

\begin{gather*} \mu_{\hat{p}} = \frac{\mu_{binomial}}{n} = \frac{np}{n} = p \end{gather*}

As one might expect, the sample proportion \(\hat{p}\) is centered on the true proportion \(p\text{.}\) Likewise, the standard deviation of \(\hat{p}\) is equal to the standard deviation of the binomial distribution divided by \(n\text{:}\)

\begin{gather*} \sigma_{\hat{p}} = \frac{\sigma_{binomial}}{n} = \frac{\sqrt{np(1-p)}}{n} = \sqrt{\frac{p(1-p)}{n}} \end{gather*}

Mean and standard deviation of a sample proportion

The mean and standard deviation of the sample proportion describe the center and spread of the distribution of all possible sample proportions \(\hat{p}\) from a random sample of size \(n\) with true population proportion \(p\text{.}\)

\begin{gather*} \mu_{\hat{p}} = p\\ \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \end{gather*}

In analyses, we think of the formula for the standard deviation of a sample proportion, \(\sigma_{\hat{p}}\text{,}\) as describing the uncertainty associated with the estimate \(\hat{p}\text{.}\) That is, \(\sigma_{\hat{p}}\) can be thought of as a way to quantify the typical error in our sample estimate \(\hat{p}\) of the true proportion \(p\text{.}\) Understanding the variability of statistics such as \(\hat{p}\) is a central component in the study of statistics.
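The formulas above can be checked empirically. The following is a minimal simulation sketch, not part of the original text: the function name `simulate_sample_proportions`, the seed, and the choice of 5,000 repetitions are illustrative assumptions. It draws many samples of size 400 from a population with \(p = 0.2\) and summarizes the resulting sample proportions.

```python
import random
import statistics


def simulate_sample_proportions(p, n, num_samples, seed=0):
    """Draw num_samples samples of size n; return each sample's proportion of successes."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the run is repeatable
    proportions = []
    for _ in range(num_samples):
        # Each trial is a success with probability p.
        successes = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(successes / n)
    return proportions


p_hats = simulate_sample_proportions(p=0.2, n=400, num_samples=5000)
sim_mean = statistics.mean(p_hats)
sim_sd = statistics.pstdev(p_hats)

print(f"simulated mean of p-hat: {sim_mean:.4f}")
print(f"simulated SD of p-hat:   {sim_sd:.4f}")
```

If the formulas hold, the simulated mean should land near 0.20 and the simulated standard deviation near 0.02, up to simulation noise.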

Example 4.5.1

If the rate of smoking in the county is really 20%, find and interpret the mean and standard deviation of the sample proportion for a sample of size 400.

Solution

The mean of the sample proportion is the population proportion: 0.20. That is, if we took many, many samples and calculated \(\hat{p}\text{,}\) these values would average out to \(p = 0.20\text{.}\)

The standard deviation of \(\hat{p}\) is given by the formula for the standard deviation of a sample proportion:

\begin{gather*} \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.2 \times 0.8}{400}} = 0.02 \end{gather*}

The sample proportion will typically be about 0.02 or 2% away from the true proportion of \(p = 0.20\text{.}\) We'll become more rigorous about quantifying how close \(\hat{p}\) will tend to be to \(p\) in Chapter 5.
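The arithmetic in this example can be reproduced in a few lines. The sketch below uses a hypothetical helper, `sd_of_sample_proportion`, and also evaluates the formula at two other sample sizes (100 and 1600, chosen only for illustration) to show how the standard deviation depends on \(n\text{.}\)

```python
import math


def sd_of_sample_proportion(p, n):
    """Standard deviation of p-hat: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)


p = 0.20
# Sample sizes other than 400 are illustrative assumptions, not from the example.
sds = {n: sd_of_sample_proportion(p, n) for n in (100, 400, 1600)}

for n, sd in sds.items():
    print(f"n = {n:4d}: mean of p-hat = {p:.2f}, SD of p-hat = {sd:.4f}")
```

Because \(n\) sits under a square root, quadrupling the sample size halves the standard deviation.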

Subsection 4.5.2 The Central Limit Theorem revisited

In Section 4.2, we saw the Central Limit Theorem, which states that for large enough \(n\text{,}\) the sample mean \(\bar{x}\) is approximately normally distributed.

A natural question is, what does this have to do with sample proportions? In fact, a lot! A sample proportion can be written as a sample mean. For example, suppose we have 3 successes in 10 trials. If we label each of the 3 successes as a 1 and each of the 7 failures as a 0, then the sample proportion is the same as the sample mean:

\begin{gather*} \hat{p} = \frac{1 + 0 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 0}{10} = \frac{3}{10} = 0.3 \end{gather*}

Because a sample proportion is a sample mean of 0s and 1s, its distribution is governed by the Central Limit Theorem, which ties together much of the statistical theory we will see.
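The 0/1 labeling can be illustrated directly. This short sketch encodes the 10 trials from the equation above as a list (the variable names are assumptions for illustration) and computes the proportion two ways: as a count divided by \(n\text{,}\) and as an ordinary mean.

```python
import statistics

# 1 = success, 0 = failure, in the same order as the equation above.
outcomes = [1, 0, 0, 1, 1, 0, 0, 0, 0, 0]

p_hat_from_count = sum(outcomes) / len(outcomes)  # successes divided by n
p_hat_from_mean = statistics.mean(outcomes)       # ordinary sample mean

print(p_hat_from_count, p_hat_from_mean)
```

Both calculations give the same value, which is why results about sample means carry over to sample proportions.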

Three important facts about the distribution of a sample proportion \(\hat{p}\)

Consider taking a simple random sample from a large population.

The mean of a sample proportion is \(p\text{.}\)
The SD of a sample proportion is \(\sqrt{\frac{p(1-p)}{n}}\text{.}\)
When \(np \geq 10\) and \(n(1-p) \geq 10\text{,}\) the sample proportion closely follows a normal distribution.
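The condition in the third fact is simple to check in code. The helper below, `normal_approx_ok`, is a hypothetical name introduced for illustration; the threshold of 10 comes from the fact as stated above, and the third test case (\(p = 0.02\text{,}\) \(n = 100\)) is an invented example of a case that fails the check.

```python
def normal_approx_ok(p, n, threshold=10):
    """Return True when both np and n(1-p) are at least the threshold."""
    expected_successes = n * p
    expected_failures = n * (1 - p)
    return expected_successes >= threshold and expected_failures >= threshold


smokers_ok = normal_approx_ok(p=0.20, n=400)  # np = 80, n(1-p) = 320
males_ok = normal_approx_ok(p=0.50, n=200)    # np = 100, n(1-p) = 100
rare_ok = normal_approx_ok(p=0.02, n=100)     # np = 2, too few expected successes

print(smokers_ok, males_ok, rare_ok)
```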

Using these facts, we can now answer the two questions posed at the beginning of this section.

Subsection 4.5.3 Normal approximation for the distribution of \(\hat{p}\)

Example 4.5.2

Find the probability that less than 15% of the sample of 400 people will be smokers if the true proportion is 20%.

Solution

In the previous section we verified that \(np\) and \(n(1-p)\) are at least 10. The mean of the sample proportion is 0.20 and the standard deviation for the sample proportion is given by \(\sqrt{\frac{0.2(1-0.2)}{400}}=0.02\text{.}\) We can find a Z-score and use our calculator to find the probability:

\begin{gather*} Z = \frac{\hat{p} - \mu_{\hat{p}}}{\sigma_{\hat{p}}} = \frac{0.15 - 0.20}{0.02} = -2.5\\ P( Z \lt -2.5) = 0.0062 \end{gather*}

We leave it to the reader to construct a figure for this example.
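For readers working in software rather than on a calculator, the same calculation can be sketched with Python's standard library. `statistics.NormalDist` provides the standard normal CDF; the variable names are assumptions for illustration.

```python
import math
from statistics import NormalDist

p = 0.20          # assumed true proportion
n = 400           # sample size
p_hat = 0.15      # observed cutoff for the sample proportion

sd_p_hat = math.sqrt(p * (1 - p) / n)   # standard deviation of p-hat
z = (p_hat - p) / sd_p_hat              # Z-score of the cutoff
lower_tail = NormalDist().cdf(z)        # P(Z < z) under the standard normal

print(f"Z = {z:.2f}, P(Z < {z:.2f}) = {lower_tail:.4f}")
```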

Example 4.5.3

The probability 0.0062 matches the one we calculated for getting 60 or fewer smokers out of 400! Why is this?

Solution

Notice that \(60/400=0.15\text{.}\) Using the binomial distribution to find the probability of 60 or fewer smokers in the sample is equivalent to finding the probability that \(\hat{p}\) will be less than or equal to 15%.
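The equivalence can be seen numerically: dividing both the count and its mean and standard deviation by \(n\) leaves the Z-score unchanged. The sketch below computes the Z-score on both scales. As an addition that is not part of the original example, it also sums the exact binomial probabilities for 0 through 60 smokers, so the normal approximation can be compared against the exact value; the two are expected to be close but not identical.

```python
import math
from statistics import NormalDist

n, p = 400, 0.20

# Count scale: number of smokers, with mean np and SD sqrt(np(1-p)).
mu_count = n * p
sd_count = math.sqrt(n * p * (1 - p))
z_count = (60 - mu_count) / sd_count

# Proportion scale: every quantity above divided by n.
mu_prop = p
sd_prop = math.sqrt(p * (1 - p) / n)
z_prop = (60 / n - mu_prop) / sd_prop

normal_prob = NormalDist().cdf(z_count)  # normal approximation, same on either scale

# Exact binomial P(X <= 60), added here only for comparison.
exact_prob = sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(61))

print(f"Z on count scale:      {z_count:.3f}")
print(f"Z on proportion scale: {z_prop:.3f}")
print(f"normal approximation:  {normal_prob:.4f}")
print(f"exact binomial:        {exact_prob:.4f}")
```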

Guided Practice 4.5.4

Given a population that is 50% male, what is the probability that a sample of size 200 would have greater than 55% males? Remember to verify that conditions for normal approximation are met.¹

¹ First, verify the conditions: \(np = 200 \times 0.5 = 100 \ge 10\) and \(n(1-p) = 200 \times 0.5 = 100 \ge 10\text{,}\) so the normal approximation is reasonable. Next we find the mean and standard deviation of \(\hat{p}\text{:}\)

\begin{gather*} \mu_{\hat{p}} = p = 0.50\\ \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.5 \times 0.5}{200}} = 0.0354 \end{gather*}

Then we find a Z-score and find the upper tail of the normal distribution:

\begin{gather*} Z = \frac{\hat{p} - \mu_{\hat{p}}}{\sigma_{\hat{p}}} = \frac{0.55 - 0.5}{0.0354} = 1.412 \rightarrow P(Z \gt 1.412) = 0.079 \end{gather*}

The probability of getting a sample proportion greater than 55% is about 0.079.
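The steps in this solution (check the conditions, find the mean and standard deviation, convert to a Z-score, read off a tail area) can be wrapped in one small function. The sketch below is an assumption-laden illustration rather than a standard routine: `upper_tail_probability` is a hypothetical name, and the function raises an error when the \(np \ge 10\) and \(n(1-p) \ge 10\) conditions fail instead of returning an unreliable answer.

```python
import math
from statistics import NormalDist


def upper_tail_probability(p, n, cutoff):
    """Normal-approximation estimate of P(p-hat > cutoff)."""
    if n * p < 10 or n * (1 - p) < 10:
        raise ValueError("np and n(1-p) must both be at least 10")
    sd = math.sqrt(p * (1 - p) / n)      # SD of p-hat, kept unrounded
    z = (cutoff - p) / sd                # Z-score of the cutoff
    return 1 - NormalDist().cdf(z)       # area to the right of z


upper_tail = upper_tail_probability(p=0.50, n=200, cutoff=0.55)
print(f"P(p-hat > 0.55) is roughly {upper_tail:.3f}")
```

Because the code keeps the standard deviation unrounded, its result may differ slightly in the third decimal place from a hand calculation that rounds intermediate values.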