Subsection 4.5.1 The mean and standard deviation of \(\hat{p}\)
To answer these questions, we investigate the distribution of the sample proportion \(\hat{p}\text{.}\) In the last section we saw that the number of smokers in a sample of size 400 follows a binomial distribution with \(p=0.2\) and \(n=400\) that is centered on 80 and has standard deviation 8. What does the distribution of the proportion of smokers in a sample of size 400 look like? To convert from a count to a proportion, we divide the count (i.e., the number of yeses) by the sample size, \(n = 400\text{.}\) For example, 60 becomes \(60/400 = 0.15\) as a proportion and 61 becomes \(61/400 = 0.1525\text{.}\)
We can find the general formula for the mean (expected value) and standard deviation of a sample proportion \(\hat{p}\) using the tools we've learned so far. To get the mean of \(\hat{p}\text{,}\) we divide the binomial mean \(\mu_{binomial} = np\) by \(n\text{:}\)
\begin{gather*}
\mu_{\hat{p}} = \frac{\mu_{binomial}}{n} = \frac{np}{n} = p
\end{gather*}
As one might expect, the sample proportion \(\hat{p}\) is centered on the true proportion \(p\text{.}\) Likewise, because dividing a random variable by a constant divides its standard deviation by that same constant, the standard deviation of \(\hat{p}\) is equal to the standard deviation of the binomial distribution divided by \(n\text{:}\)
\begin{gather*}
\sigma_{\hat{p}}
= \frac{\sigma_{binomial}}{n}
= \frac{\sqrt{np(1-p)}}{n}
= \sqrt{\frac{p(1-p)}{n}}
\end{gather*}
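As a numerical check of the two formulas above, the following minimal Python sketch computes the mean and standard deviation of \(\hat{p}\) directly from the binomial probabilities and compares the result with \(\sqrt{p(1-p)/n}\text{.}\) The function names `phat_sd_formula` and `phat_moments_exact` are illustrative choices, not notation from the text, and the direct summation is only intended for moderate \(n\text{.}\)

```python
import math


def phat_sd_formula(p, n):
    """Standard deviation of a sample proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)


def phat_moments_exact(p, n):
    """Mean and SD of p-hat = X/n, summing over the binomial distribution of X."""
    values = [k / n for k in range(n + 1)]
    probs = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum(v * pr for v, pr in zip(values, probs))
    var = sum((v - mean) ** 2 * pr for v, pr in zip(values, probs))
    return mean, math.sqrt(var)


# Smoking example from the text: p = 0.2, n = 400.
mean, sd = phat_moments_exact(0.2, 400)
print("mean of p-hat:", mean)
print("SD of p-hat:  ", sd)
print("formula SD:   ", phat_sd_formula(0.2, 400))
```

Up to floating-point rounding, the summed values should agree with \(p\) and with the formula for \(\sigma_{\hat{p}}\text{.}\)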
Mean and standard deviation of a sample proportion
The mean and standard deviation of the sample proportion describe the center and spread of the distribution of all possible sample proportions \(\hat{p}\) from a random sample of size \(n\) with true population proportion \(p\text{.}\)
\begin{gather*}
\mu_{\hat{p}} = p \qquad \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}
\end{gather*}
In analyses, we think of the formula for the standard deviation of a sample proportion, \(\sigma_{\hat{p}}\text{,}\) as describing the uncertainty associated with the estimate \(\hat{p}\text{.}\) That is, \(\sigma_{\hat{p}}\) can be thought of as a way to quantify the typical error in our sample estimate \(\hat{p}\) of the true proportion \(p\text{.}\) Understanding the variability of statistics such as \(\hat{p}\) is a central component in the study of statistics.
Example 4.5.1
If the rate of smoking in the county is really 20%, find and interpret the mean and standard deviation of the sample proportion for a sample of size 400.
Solution
The mean of the sample proportion is the population proportion: 0.20. That is, if we took many, many samples and calculated \(\hat{p}\text{,}\) these values would average out to \(p = 0.20\text{.}\)
The standard deviation of \(\hat{p}\) is given by the formula for the standard deviation of a sample proportion:
\begin{gather*}
\sigma_{\hat{p}}
= \sqrt{\frac{p(1-p)}{n}}
= \sqrt{\frac{0.2 \times 0.8}{400}}
= 0.02
\end{gather*}
The sample proportion will typically be about 0.02, or 2 percentage points, away from the true proportion of \(p = 0.20\text{.}\) We'll become more rigorous about quantifying how close \(\hat{p}\) will tend to be to \(p\) in Chapter 5.
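The phrase "if we took many, many samples" can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: each of the 400 people is treated as an independent draw with a 20% chance of being a smoker, 2000 samples are taken, and the helper name `simulate_sample_proportions` and the fixed seed are hypothetical choices made here for repeatability.

```python
import random
import statistics


def simulate_sample_proportions(p, n, num_samples, seed=0):
    """Draw num_samples samples of size n and return each sample's proportion."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    p_hats = []
    for _ in range(num_samples):
        # Count the "yes" responses among n independent draws.
        successes = sum(1 for _j in range(n) if rng.random() < p)
        p_hats.append(successes / n)
    return p_hats


p_hats = simulate_sample_proportions(p=0.2, n=400, num_samples=2000)
print("average of the p-hat values:", statistics.mean(p_hats))
print("SD of the p-hat values:     ", statistics.pstdev(p_hats))
```

The simulated average and standard deviation should land close to, but not exactly on, 0.20 and 0.02, since they are themselves estimates based on a finite number of samples.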
Subsection 4.5.2 The Central Limit Theorem revisited
In Section 4.2, we saw the Central Limit Theorem, which states that for large enough \(n\text{,}\) the sample mean \(\bar{x}\) is approximately normally distributed.
A natural question is, what does this have to do with sample proportions? In fact, a lot! A sample proportion can be written as a sample mean. For example, suppose we have 3 successes in 10 trials. If we label each of the 3 successes as a 1 and each of the 7 failures as a 0, then the sample proportion is the same as the sample mean:
\begin{gather*}
\hat{p}
= \frac{1 + 0 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 0}{10}
= \frac{3}{10}
= 0.3
\end{gather*}
Because a sample proportion is a sample mean of 0s and 1s, its distribution is governed by the Central Limit Theorem, which ties together much of the statistical theory we will see.
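The 0/1 labeling from the 10-trial example can be checked in a few lines of Python. This is a minimal sketch; the variable names are illustrative, and the order of the 1s and 0s simply mirrors the sum written above.

```python
import statistics

# 1 = success, 0 = failure: the 10 trials from the example above.
outcomes = [1, 0, 0, 1, 1, 0, 0, 0, 0, 0]

p_hat = sum(outcomes) / len(outcomes)  # proportion of successes
x_bar = statistics.mean(outcomes)      # ordinary sample mean of the 0/1 values

print(p_hat, x_bar)
```

Both quantities are computed from the same sum divided by the same \(n\text{,}\) which is why any result about sample means also applies to sample proportions.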
Three important facts about the distribution of a sample proportion \(\hat{p}\)
Consider taking a simple random sample from a large population.
The mean of a sample proportion is \(p\text{.}\)
The SD of a sample proportion is \(\sqrt{\frac{p(1-p)}{n}}\text{.}\)
When \(np \geq 10\) and \(n(1-p) \geq 10\text{,}\) the sample proportion closely follows a normal distribution.
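The condition in the third fact is easy to check mechanically. The sketch below assumes a hypothetical helper named `normal_approx_reasonable`; the default threshold of 10 comes from the condition stated above.

```python
def normal_approx_reasonable(n, p, threshold=10):
    """Success-failure check: both np and n(1-p) must be at least threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold


# Smoking example: np = 80 and n(1-p) = 320, so the condition holds.
print(normal_approx_reasonable(400, 0.2))

# A much smaller sample: np = 8 falls below 10, so the condition fails.
print(normal_approx_reasonable(40, 0.2))
```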
Using these facts, we can now answer the two questions posed at the beginning of this section.