## Section5.3Introducing hypothesis testing

In an experiment, one treatment reduces cholesterol by 10% while another treatment reduces it by 17%. Is this strong enough evidence that the second treatment is more effective? In this section, we will set up a framework for answering questions such as this and will look at the different types of decision errors that researcher can make when drawing conclusions based on data.

### Subsection5.3.1Learning objectives

1. Explain the logic of hypothesis testing, including setting up hypotheses and drawing a conclusion based on the set significance level and the calculated p-value.

2. Set up the null and alternative hypothesis in words and in terms of population parameters.

3. Interpret a p-value in context and recognize how the calculation of the p-value depends upon the direction of the alternative hypothesis.

4. Define and interpret the concept statistically significant.

5. Interpret Type I, Type II Error, and power in the context of hypothesis testing.

### Subsection5.3.2Case study: medical consultant

People providing an organ for donation sometimes seek the help of a special medical consultant. These consultants assist the patient in all aspects of the surgery, with the goal of reducing the possibility of complications during the medical procedure and recovery. Patients might choose a consultant based in part on the historical complication rate of the consultant's clients.

One consultant tried to attract patients by noting the overall complication rate for liver donor surgeries in the US is about 10%, but her clients have had only 9 complications in the 142 liver donor surgeries she has facilitated. She claims this is strong evidence that her work meaningfully contributes to reducing complications (and therefore she should be hired!).

We will let $p$ represent the true complication rate for liver donors working with this consultant. Calculate the best estimate for $p$ using the data. Label the point estimate as $\hat{p}\text{.}$

Solution

The sample proportion for the complication rate is 9 complications divided by the 142 surgeries the consultant has worked on: $\hat{p} = 9 / 142 = 0.063\text{.}$

Is it possible to prove that the consultant's work reduces complications?

Solution

No. The claim implies that there is a causal connection, but the data are observational. For example, maybe patients who can afford a medical consultant can afford better medical care, which can also lead to a lower complication rate.

While it is not possible to assess the causal claim, it is still possible to ask whether the low complication rate of $\hat{p} = 0.063$ provides evidence that the consultant's true complication rate is different than the US complication rate. Why might we be tempted to immediately conclude that the consultant's true complication rate is different than the US complication rate? Can we draw this conclusion?

Solution

Her sample complication rate is $\hat{p} = 0.063\text{,}$ which is 0.037 lower than the US complication rate of 10%. However, we cannot yet be sure if the observed difference represents a real difference or is just the result of random variation. We wouldn't expect the sample proportion to be exactly 0.10, even if the truth was that her real complication rate was 0.10.

### Subsection5.3.3Setting up the null and alternate hypothesis

We can set up two competing hypotheses about the consultant's true complication rate. The first is call the null hypothesis and represents either a skeptical perspective or a perspective of no difference. The second is called the alternative hypothesis (or alternate hypothesis) and represents a new perspective such as the possibility that there has been a change or that there is a treatment effect in an experiment.

###### Null and alternative hypotheses.

The null hypothesis is abbreviated $H_0\text{.}$ It represents a skeptical perspective and is often a claim of no change or no difference.

The alternative hypothesis is abbreviated $H_A\text{.}$ It is the claim researchers hope to prove or find evidence for, and it often asserts that there has been a change or an effect.

Our job as data scientists is to play the skeptic: before we buy into the alternative hypothesis, we need to see strong supporting evidence.

Identify the null and alternative claim regarding the consultant's complication rate.

Solution
• $H_{0}\text{:}$ The true complication rate for the consultant's clients is the same as the US complication rate of 10%.

• $H_{A}\text{:}$ The true complication rate for the consultant's clients is different than 10%.

Often it is convenient to write the null and alternative hypothesis in mathematical or numerical terms. To do so, we must first identify the quantity of interest. This quantity of interest is known as the parameter for a hypothesis test.

###### Parameters and point estimates.

A parameter for a hypothesis test is the “true” value of the population of interest. When the parameter is a proportion, we call it $p\text{.}$

A point estimate is calculated from a sample. When the point estimate is a proportion, we call it $\hat{p}\text{.}$

The observed or sample proportion of 0.063 is a point estimate for the true proportion. The parameter in this problem is the true proportion of complications for this consultant's clients. The parameter is unknown, but the null hypothesis is that it equals the overall proportion of complications: $p = 0.10\text{.}$ This hypothesized value is called the null value.

###### Null value of a hypothesis test.

The null value is the value hypothesized for the parameter in $H_0\text{,}$ and it is sometimes represented with a subscript 0, e.g. $p_0$ (just like $H_0$).

In the medical consultant case study, the parameter is $p$ and the null value is $p_0 = 0.10\text{.}$ We can write the null and alternative hypothesis as numerical statements as follows.

• $H_0\text{:}$ $p=0.10$ (The complication rate for the consultant's clients is equal to the US complication rate of 10%.)

• $H_A\text{:}$ $p \neq 0.10$ (The complication rate for the consultant's clients is not equal to the US complication rate of 10%.)

###### Hypothesis testing.

These hypotheses are part of what is called a hypothesis test. A hypothesis test is a statistical technique used to evaluate competing claims using data. Often times, the null hypothesis takes a stance of no difference or no effect. If the null hypothesis and the data notably disagree, then we will reject the null hypothesis in favor of the alternative hypothesis.

Don't worry if you aren't a master of hypothesis testing at the end of this section. We'll discuss these ideas and details many times in this chapter and the two chapters that follow.

The null claim is always framed as an equality: it tells us what quantity we should use for the parameter when carrying out calculations for the hypothesis test. There are three choices for the alternative hypothesis, depending upon whether the researcher is trying to prove that the value of the parameter is greater than, less than, or not equal to the null value.

###### Always write the null hypothesis as an equality.

We will find it most useful if we always list the null hypothesis as an equality (e.g. $p = 7$) while the alternative always uses an inequality (e.g. $p \neq 0.7\text{,}$ $p>0.7\text{,}$ or $p\lt 0.7$).

According to the 2010 US Census, 7.6% of residents in the state of Alaska were under 5 years old. A researcher plans to take a random sample of residents from Alaska to test whether or not this is still the case. Write out the hypotheses that the researcher should test in both plain and statistical language.  1

$H_{0}: p=0.076\text{;}$ The proportion of residents under 5 years old in Alaska is unchanged from 2010. $H_{A}: p \ne 0.076\text{;}$ The proportion of residents under 5 years old in Alaska has changed from 2010. Note that it could have increased or decreased. https://factfinder.census.gov/faces/nav/jsf/pages/community_facts.xhtml

When the alternative claim uses a $\neq\text{,}$ we call the test a two-sided test, because either extreme provides evidence against $H_0\text{.}$ When the alternative claim uses a $\lt$ or a $>\text{,}$ we call it a one-sided test.

###### One-sided and two-sided tests.

If the researchers are only interested in showing an increase or a decrease, but not both, use a one-sided test. If the researchers would be interested in any difference from the null value — an increase or decrease — then the test should be two-sided.

For the example of the consultant's complication rate, we knew that her sample complication rate was 0.063, which was lower than the US complication rate of 0.10. Why did we conduct a two-sided hypothesis test for this setting?

Solution

The setting was framed in the context of the consultant being helpful, but what if the consultant actually performed worse than the US complication rate? Would we care? More than ever! Since we care about a finding in either direction, we should run a two-sided test.

###### One-sided hypotheses are allowed only before seeing data.

After observing data, it is tempting to turn a two-sided test into a one-sided test. Avoid this temptation. Hypotheses must be set up before observing the data. If they are not, the test must be two-sided.

### Subsection5.3.4Evaluating the hypotheses with a p-value

There were 142 patients in the consultant's sample. If the null claim is true, how many would we expect to have had a complication?

Solution

If the null claim is true, we would expect about 10% of the patients, or about 14.2 to have a complication.

The consultant's complication rate for her 142 clients was 0.063 ($0.063 \times 142 \approx 9$). What is the probability that a sample would produce a number of complications this far from the expected value of 14.2, if her true complication rate were 0.10, that is, if $H_0$ were true? The probability, which is estimated in Figure 5.3.8, is about 0.1754. We call this quantity the p-value. Figure 5.3.8. The shaded area represents the p-value. We observed $\hat{p} = 0.063\text{,}$ so any observations smaller than this are at least as extreme relative to the null value, $p_0 = 0.1\text{,}$ and so the lower tail is shaded. However, since this is a two-sided test, values above 0.137 are also at least as extreme as 0.063 (relative to 0.1), and so they also contribute to the p-value. The tail areas together total of about 0.1754 when calculated using a simulation technique in Subsection 5.3.5. Figure 5.3.9. When the alternative hypothesis takes the form $p \lt$ null value, the p-value is represented by the lower tail. When it takes the form $p >$ null value, the p-value is represented by the upper tail. When using $p \neq$ null value, then the p-value is represented by both tails.
###### Finding and interpreting the p-value.

We find and interpret the p-value according to the nature of the alternative hypothesis.

• $H_A\text{:}$ parameter $>$ null value. The p-value corresponds to the area in the upper tail and is probability of getting a test statistic larger than the observed test statistic if the null hypothesis is true and the probability model is accurate.

• $H_A\text{:}$ parameter $\lt$ null value. The p-value corresponds to the area in the lower tail and is the probability of observing a test statistic smaller than the observed test statistic if the null hypothesis is true and the probability model is accurate.

• $H_A\text{:}$ parameter $\ne$ null value. The p-value corresponds to the area in both tails and is the probability of observing a test statistic larger in magnitude than the observed test statistic if the null hypothesis is true and the probability model is accurate.

More generally, we can say that the p-value is the probability of getting a test statistic as extreme or more extreme than the observed test statistic in the direction of $H_A$ if the null hypothesis is true and the probability model is accurate.

When working with proportions, we can also say that the p-value is the probability of getting a sample proportion as far from or farther from the null proportion in the direction of $H_A$ if the null hypothesis is true and the normal model holds.

When the p-value is small, i.e. less than a previously set threshold, we say the results are statistically significant. This means the data provide such strong evidence against $H_0$ that we reject the null hypothesis in favor of the alternative hypothesis. The threshold is called the significance level and is represented by $\alpha$ (the Greek letter alpha ). The significance level is typically set to $\alpha = 0.05\text{,}$ but can vary depending on the field or the application.

###### Statistical significance.

If the p-value is less than the significance level $\alpha$ (usually 0.05), we say that the result is statistically significant. We reject $H_0\text{,}$ and we have strong evidence favoring $H_A\text{.}$

If the p-value is greater than the significance level $\alpha\text{,}$ we say that the result is not statistically significant. We do not reject $H_0\text{,}$ and we do not have strong evidence for $H_A\text{.}$

Recall that the null claim is the claim of no difference. If we reject $H_0\text{,}$ we are asserting that there is a real difference. If we do not reject $H_0\text{,}$ we are saying that the null claim is reasonable, but we are not saying that the null claim has been proven.

Because the p-value is 0.1754, which is larger than the significance level 0.05, we do not reject the null hypothesis. Explain what this means in the context of the problem using plain language.  2

The data do not provide evidence that the consultant's complication rate is significantly lower or higher than the US complication rate of 10%.

In the previous exercise, we did not reject $H_0\text{.}$ This means that we did not disprove the null claim. Is this equivalent to proving the null claim is true?

Solution

No. We did not prove that the consultant's complication rate is exactly equal to 10%. Recall that the test of hypothesis starts by assuming the null claim is true. That is, the test proceeds as an argument by contradiction. If the null claim is true, there is a 0.1754 chance of seeing sample data as divergent from 10% as we saw in our sample. Because 0.1754 is large, it is within the realm of chance error, and we cannot say the null hypothesis is unreasonable.  3

The p-value is a conditional probability. It is $P(\text{getting data at least as divergent from the null value as we observed } | H_{0} \text{ is true})\text{.}$ It is NOT $P(H_{0} \text{ is true } | \text { we got data this divergent from the null value})\text{.}$
###### Double negatives can sometimes be used in statistics.

In many statistical explanations, we use double negatives. For instance, we might say that the null hypothesis is not implausible or we failed to reject the null hypothesis. Double negatives are used to communicate that while we are not rejecting a position, we are also not saying that we know it to be true.

Does the conclusion in Checkpoint 5.3.10 ensure that there is no real association between the surgical consultant's work and the risk of complications? Explain.

Solution

No. It is possible that the consultant's work is associated with a lower or higher risk of complications. If this was the case, the sample may have been too small to reliable detect this effect.

An experiment was conducted where study participants were randomly divided into two groups. Both were given the opportunity to purchase a DVD, but one half was reminded that the money, if not spent on the DVD, could be used for other purchases in the future, while the other half was not. The half that was reminded that the money could be used on other purchases was 20% less likely to continue with a DVD purchase. We determined that such a large difference would only occur about 1-in-150 times if the reminder actually had no influence on student decision-making. What is the p-value in this study? Was the result statistically significant?

Solution

The p-value was 0.006 (about 1/150). Since the p-value is less than 0.05, the data provide statistically significant evidence that US college students were actually influenced by the reminder.

###### What's so special about 0.05?

We often use a threshold of 0.05 to determine whether a result is statistically significant. But why 0.05? Maybe we should use a bigger number, or maybe a smaller number. If you're a little puzzled, that probably means you're reading with a critical eye — good job! We've made a video to help clarify why 0.05:

www.openintro.org/why05

Sometimes it's a good idea to deviate from the standard. We'll discuss when to choose a threshold different than 0.05 in Subsection 5.3.8.

Statistical inference is the practice of making decisions and conclusions from data in the context of uncertainty. Just as a confidence interval may occasionally fail to capture the true value of the parameter, a test of hypothesis may occasionally lead us to an incorrect conclusion. While a given data set may not always lead us to a correct conclusion, statistical inference gives us tools to control and evaluate how often these errors occur.

### Subsection5.3.5Calculating the p-value by simulation (special topic)

When conditions for the applying a normal model are met, we use a normal model to find the p-value of a test of hypothesis. In the complication rate example, the distribution is not normal. It is, however, binomial, because we are interested in how many out of 142 patients will have complications.

We could calculate the p-value of this test using binomial probabilities. A more general approach, though, for calculating p-values when a normal model does not apply is to use what is known as simulation. While performing this procedure is outside of the scope of the course, we provide an example here in order to better understand the concept of a p-value.

We simulate 142 new patients to see what result might happen if the complication rate really is 0.10. To do this, we could use a deck of cards. Take one red card, nine black cards, and mix them up. If the cards are well-shuffled, drawing the top card is one way of simulating the chance a patient has a complication if the true rate is 0.10: if the card is red, we say the patient had a complication, and if it is black then we say they did not have a complication. If we repeat this process 142 times and compute the proportion of simulated patients with complications, $\hat{p}_{sim}\text{,}$ then this simulated proportion is exactly a draw from the null distribution.

There were 12 simulated cases with a complication and 130 simulated cases without a complication: $\hat{p}_{sim} = 12 / 142 = 0.085\text{.}$

One simulation isn't enough to get a sense of the null distribution, so we repeated the simulation 10,000 times using a computer. Figure 5.3.14 shows the null distribution from these 10,000 simulations. The simulated proportions that are less than or equal to $\hat{p}=0.063$ are shaded. There were 0.0877 simulated sample proportions with $\hat{p}_{sim} \leq 0.063\text{,}$ which represents a fraction 0.0877 of our simulations:

\begin{gather*} \text{ left tail } = \frac{\text{ Number of observed simulations with } \hat{p}_{sim}\leq\text{ 0.063 } }{10000} = \frac{877}{10000} = 0.0877 \end{gather*}

However, this is not our p-value! Remember that we are conducting a two-sided test, so we should double the one-tail area to get the p-value: 4

\begin{gather*} \text{ p-value } = 2 \times \text{ left tail } = 2 \times 0.0877 = 0.1754 \end{gather*}
This doubling approach is preferred even when the distribution isn't symmetric, as in this case. Figure 5.3.14. The null distribution for $\hat{p}\text{,}$ created from 10,000 simulated studies. The left tail contains 8.77% of the simulations. For a two-sided test, we double the tail area to get the p-value. This doubling accounts for the observations we might have observed in the upper tail, which are also at least as extreme (relative to 0.10) as what we observed, $\hat{p} = 0.063\text{.}$

### Subsection5.3.6Hypothesis testing: a five step process

Use a hypothesis test to test $H_0$ versus $H_A$ at a particular signficance level, $\alpha\text{.}$

###### (AP exam tip) When carrying out a hypothesis test procedure, follow these five steps:.
• Identify: Identify the hypotheses and the significance level.

• Choose: Choose the appropriate test procedure and identify it by name.

• Check: Check that the conditions for the test procedure are met.

• Calculate: Calculate the test statistic and the p-value.

\begin{gather*} \text{ test statistic } = \frac{\text{ point estimate } - \text{ null value } }{SE\ \text{ of estimate } } \end{gather*}
• Conclude: Compare the p-value to the significance level to determine whether to reject $H_0$ or not reject $H_0\text{.}$ Draw a conclusion in the context of $H_A\text{.}$

### Subsection5.3.7Decision errors

The hypothesis testing framework is a very general tool, and we often use it without a second thought. If a person makes a somewhat unbelievable claim, we are initially skeptical. However, if there is sufficient evidence that supports the claim, we set aside our skepticism. The hallmarks of hypothesis testing are also found in the US court system.

A US court considers two possible claims about a defendant: she is either innocent or guilty. If we set these claims up in a hypothesis framework, which would be the null hypothesis and which the alternative?

Solution

The jury considers whether the evidence is so convincing (strong) that there is evidence beyond a reasonable doubt of the person's guilt. That is, the starting assumption (null hypothesis) is that the person is innocent until evidence is presented that convinces the jury that the person is guilty (alternative hypothesis). In statistics, our evidence comes in the form of data, and we use the significance level to decide what is beyond a reasonable doubt.

Jurors examine the evidence to see whether it convincingly shows a defendant is guilty. Notice that a jury finds a defendant either guilty or not guilty. They either reject the null claim or they do not reject the null claim. They never prove the null claim, that is, they never find the defendant innocent. If a jury finds a defendant not guilty, this does not necessarily mean the jury is confident in the person's innocence. They are simply not convinced of the alternative that the person is guilty.

This is also the case with hypothesis testing: even if we fail to reject the null hypothesis, we typically do not accept the null hypothesis as truth. Failing to find strong evidence for the alternative hypothesis is not equivalent to providing evidence that the null hypothesis is true.

Hypothesis tests are not flawless. Just think of the court system: innocent people are sometimes wrongly convicted and the guilty sometimes walk free. Similarly, data can point to the wrong conclusion. However, what distinguishes statistical hypothesis tests from a court system is that our framework allows us to quantify and control how often the data lead us to the incorrect conclusion.

There are two competing hypotheses: the null and the alternative. In a hypothesis test, we make a statement about which one might be true, but we might choose incorrectly. There are four possible scenarios in a hypothesis test, which are summarized in Table 5.3.16.

###### Type I and Type II Errors.

A Type I Error is rejecting $H_0$ when $H_0$ is actually true. When we reject the null hypothesis, it is possible that we make a Type I Error.

A Type II Error is failing to reject $H_0$ when $H_A$ is actually true. When we did not reject the null hypothesis, it is possible that we make a Type II Error.

In a US court, the defendant is either innocent ($H_0$) or guilty ($H_A$). What does a Type I Error represent in this context? What does a Type II Error represent? Table 5.3.16 may be useful.

Solution

If the court makes a Type I Error, this means the defendant is innocent ($H_0$ true) but wrongly convicted. A Type II Error means the court failed to reject $H_0$ (i.e. failed to convict the person) when they were in fact guilty ($H_A$ true).

How could we reduce the Type I Error rate in US courts? What influence would this have on the Type II Error rate?

Solution

To lower the Type I Error rate, we might raise our standard for conviction from “beyond a reasonable doubt” to “beyond a conceivable doubt” so fewer people would be wrongly convicted. However, this would also make it more difficult to convict the people who are actually guilty, so we would make more Type II Errors.

How could we reduce the Type II Error rate in US courts? What influence would this have on the Type I Error rate?  5

To lower the Type II Error rate, we want to convict more guilty people. We could lower the standards for conviction from “beyond a reasonable doubt” to “beyond a little doubt”. Lowering the bar for guilt will also result in more wrongful convictions, raising the Type I Error rate.

A group of women bring a class action lawsuit that claims discrimination in promotion rates. What would a Type I Error represent in this context?  6

We must first identify which is the null hypothesis and which is the alternative. The alternative hypothesis is the one that bears the burden of proof, so the null hypothesis is that there was no discrimination and the alternative hypothesis is that there was discrimination. Making a Type I Error in this context would mean that in fact there was no discrimination, even though we concluded that women were discriminated against. Notice that this does not necessarily mean something was wrong with the data or that we made a computational mistake. Sometimes data simply point us to the wrong conclusion, which is why scientific studies are often repeated to check initial findings.

These examples provide an important lesson: if we reduce how often we make one type of error, we generally make more of the other type.

### Subsection5.3.8Choosing a significance level

If $H_0$ is true, what is the probability that we will incorrectly reject it? In hypothesis testing, we perform calculations under the premise that $H_0$ is true, and we reject $H_0$ if the p-value is smaller than the significance level $\alpha\text{.}$ That is, $\alpha$ is the probability of making a Type I Error. The choice of what to make $\alpha$ is not arbitrary. It depends on the gravity of the consequences of a Type I Error.

###### Relationship between Type I and Type II Errors.

The probability of a Type I Error is called $\alpha$ and corresponds to the significance level of a test. The probability of a Type II Error is called $\beta\text{.}$ As we make $\alpha$ smaller, $\beta$ typically gets larger, and vice versa.

If making a Type I Error is especially dangerous or especially costly, should we choose a smaller significance level or a higher significance level?

Solution

Under this scenario, we want to be very cautious about rejecting the null hypothesis, so we demand very strong evidence before we are willing to reject the null hypothesis. Therefore, we want a smaller significance level, maybe $\alpha = 0.01\text{.}$

If making a Type II Error is especially dangerous or especially costly, should we choose a smaller significance level or a higher significance level?

Solution

We should choose a higher significance level (e.g. 0.10). Here we want to be cautious about failing to reject $H_0$ when the null is actually false.

###### Significance levels should reflect consequences of errors.

The significance level selected for a test should reflect the real-world consequences associated with making a Type I or Type II Error. If a Type I Error is very dangerous, make $\alpha$ smaller.

### Subsection5.3.9Statistical power of a hypothesis test

When the alternative hypothesis is true, the probability of not making a Type II Error is called power. It is common for researchers to perform a power analysis to ensure their study collects enough data to detect the effects they anticipate finding. As you might imagine, if the effect they care about is small or subtle, then if the effect is real, the researchers will need to collect a large sample size in order to have a good chance of detecting the effect. However, if they are interested in large effect, they need not collect as much data.

The Type II Error rate $\beta$ and the magnitude of the error for a point estimate are controlled by the sample size. As the sample size $n$ goes up, the Type II Error rate goes down, and power goes up. Real differences from the null value, even large ones, may be difficult to detect with small samples. However, if we take a very large sample, we might find a statistically significant difference but the size of the difference might be so small that it is of no practical value.

### Subsection5.3.10Section summary

• A hypothesis test is a statistical technique used to evaluate competing claims based on data.

• The competing claims are called hypotheses and are often about population parameters (e.g. $\mu$ and $p$); they are never about sample statistics.

• The null hypothesis is abbreviated $H_0\text{.}$ It represents a skeptical perspective or a perspective of no difference or no change.

• The alternative hypothesis is abbreviated $H_A\text{.}$ It represents a new perspective or a perspective of a real difference or change. Because the alternative hypothesis is the stronger claim, it bears the burden of proof.

• The logic of a hypothesis test: In a hypothesis test, we begin by assuming that the null hypothesis is true. Then, we calculate how unlikely it would be to get a sample value as extreme as we actually got in our sample, assuming that the null value is correct. If this likelihood is too small, it casts doubt on the null hypothesis and provides evidence for the alternative hypothesis.

• We set a significance level, denoted $\alpha\text{,}$ which represents the threshold below which we will reject the null hypothesis. The most common significance level is $\alpha = 0.05\text{.}$ If we require more evidence to reject the null hypothesis, we use a smaller $\alpha\text{.}$

• After verifying that the relevant conditions are met, we can calculate the test statistic. The test statistic tells us how many standard errors the point estimate (sample value) is from the null value (i.e. the value hypothesized for the parameter in the null hypothesis). When investigating a single mean or proportion or a difference of means or proportions, the test statistic is calculated as: $\frac{\text{ point estimate } -\text{ null value } }{SE \text{ of estimate } }\text{.}$

• After the test statistic, we calculate the p-value. We find and interpret the p-value according to the nature of the alternative hypothesis. The three possibilities are:

• $H_A\text{:}$ parameter $>$ null value. The p-value corresponds to the area in the upper tail.

• $H_A\text{:}$ parameter $\lt$ null value. The p-value corresponds to the area in the lower tail.

• $H_A\text{:}$ parameter $\ne$ null value. The p-value corresponds to the area in both tails.

The p-value is the probability of getting a test statistic as extreme or more extreme than the observed test statistic in the direction of $H_A$ if the null hypothesis is true and the probability model is accurate.

• The conclusion or decision of a hypothesis test is based on whether the p-value is smaller or larger than the preset significance level $\alpha\text{.}$

• When the p-value $\lt \alpha\text{,}$ we say the results are statistically significant at the $\alpha$ level and we have evidence of a real difference or change. The observed difference is beyond what would have been expected from chance variation alone. This leads us to reject $H_0$ and gives us evidence for $H_A\text{.}$

• When the p-value $> \alpha\text{,}$ we say the results are not statistically significant at the $\alpha$ level and we do not have evidence of a real difference or change. The observed difference was within the realm of expected chance variation. This leads us to not reject $H_0$ and does not give us evidence for $H_A\text{.}$

• AP exam tip: A full hypothesis test includes the following steps.

1. Identify: Identify the hypotheses and the significance level.

2. Choose: Choose the appropriate test procedure and identify it by name.

3. Check: Check that the conditions for the test procedure are met.

4. Calculate: Calculate the test statistic and the p-value.

\begin{gather*} \text{ test statistic } = \frac{\text{ point estimate } - \text{ null value } }{SE\ \text{ of estimate } } \end{gather*}
5. Conclude: Compare the p-value to the significance level to determine whether to reject $H_0$ or not reject $H_0\text{.}$ Draw a conclusion in the context of $H_A\text{.}$

• Decision errors. In a hypothesis test, there are two types of decision errors that could be made. These are called Type I and Type II Errors.

• A Type I Error is rejecting $H_0\text{,}$ when $H_0$ is actually true. We commit a Type I Error if we call a result significant when there is no real difference or effect. P(Type I error) = $\alpha\text{.}$

• A Type II Error is not rejecting $H_0\text{,}$ when $H_A$ is actually true. We commit a Type II Error if we call a result not significant when there is a real difference or effect. P(Type II error) = $\beta\text{.}$

• The probability of a Type I Error ($\alpha$) and a Type II Error ($\beta$) are inversely related. Decreasing $\alpha$ makes $\beta$ larger; increasing $\alpha$ makes $\beta$ smaller.

• Once a decision is made, only one of the two types of errors is possible. If the test rejects $H_0\text{,}$ for example, only a Type I Error is possible.

• The power of a test.

• When a particular $H_A$ is true, the probability of not making a Type II Error is called power. Power $= 1 - \beta\text{.}$

• The power of a test is the probability of detecting an effect of a particular size when it is present.

• Increasing the significance level decreases the probability of a Type II Error and increases power. $\alpha \uparrow, \beta \downarrow, \text{ power } \uparrow\text{.}$

• For a fixed $\alpha\text{,}$ increasing the sample size $n$ makes it easier to detect an effect and therefore decreases the probability of a Type II Error and increases power. $n \uparrow, \beta \downarrow, \text{ power } \uparrow\text{.}$

### Exercises5.3.11Exercises

###### 1.Identify hypotheses, Part I.

Write the null and alternative hypotheses in words and then symbols for each of the following situations.

1. A tutoring company would like to understand if most students tend to improve their grades (or not) after they use their services. They sample 200 of the students who used their service in the past year and ask them if their grades have improved or declined from the previous year.

2. Employers at a firm are worried about the effect of March Madness, a basketball championship held each spring in the US, on employee productivity. They estimate that on a regular business day employees spend on average 15 minutes of company time checking personal email, making personal phone calls, etc. They also collect data on how much company time employees spend on such non-business activities during March Madness. They want to determine if these data provide convincing evidence that employee productivity changed during March Madness.

Solution

(a) $H_{0} : p = 0.5$ (Neither a majority nor minority of students' grades improved) $H_{A} : p \ne 0.5$ (Either a majority or a minority of students' grades improved)

(b) $H_{0} : \mu = 15$ (The average amount of company time each employee spends not working is 15 minutes for March Madness.) $H_{A} : p \ne = 15$ (The average amount of company time each employee spends not working is different than 15 minutes for March Madness.)

###### 2.Identify hypotheses, Part II.

Write the null and alternative hypotheses in words and using symbols for each of the following situations.

1. Since 2008, chain restaurants in California have been required to display calorie counts of each menu item. Prior to menus displaying calorie counts, the average calorie intake of diners at a restaurant was 1100 calories. After calorie counts started to be displayed on menus, a nutritionist collected data on the number of calories consumed at this restaurant from a random sample of diners. Do these data provide convincing evidence of a difference in the average calorie intake of a diners at this restaurant?

2. The state of Wisconsin would like to understand the fraction of its adult residents that consumed alcohol in the last year, specifically if the rate is different from the national rate of 70%. To help them answer this question, they conduct a random sample of 852 residents and ask them about their alcohol consumption.

###### 3.Online communication.

A study suggests that 60% of college student spend 10 or more hours per week communicating with others online. You believe that this is incorrect and decide to collect your own sample for a hypothesis test. You randomly sample 160 students from your dorm and find that 70% spent 10 or more hours a week communicating with others online. A friend of yours, who offers to help you with the hypothesis test, comes up with the following set of hypotheses. Indicate any errors you see.

\begin{align*} H_0\amp : \hat{p} \lt 0.6\\ H_A\amp : \hat{p} > 0.7 \end{align*}
Solution

(1) The hypotheses should be about the population proportion (p), not the sample proportion. (2) The null hypothesis should have an equal sign. (3) The alternative hypothesis should have a not equals sign, and (4) it should reference the null value, $p_{0} = 0.6\text{,}$ not the observed sample proportion. The correct way to set up these hypotheses is: $H_{0} : p = 0.6$ and $H_{A} : p \ne 0.6\text{.}$

###### 4.Married at 25.

A study suggests that the 25% of 25 year olds have gotten married. You believe that this is incorrect and decide to collect your own sample for a hypothesis test. From a random sample of 25 year olds in census data with size 776, you find that 24% of them are married. A friend of yours offers to help you with setting up the hypothesis test and comes up with the following hypotheses. Indicate any errors you see.

\begin{align*} H_0\amp : \hat{p} = 0.24\\ H_A\amp : \hat{p} \neq 0.24 \end{align*}
###### 5.Cyberbullying rates.

Teens were surveyed about cyberbullying, and 54% to 64% reported experiencing cyberbullying (95% confidence interval). 7  Answer the following questions based on this interval.

1. A newspaper claims that a majority of teens have experienced cyberbullying. Is this claim supported by the confidence interval? Explain your reasoning.

2. A researcher conjectured that 70% of teens have experienced cyberbullying. Is this claim supported by the confidence interval? Explain your reasoning.

3. Without actually calculating the interval, determine if the claim of the researcher from part Item 5.3.11.5.b would be supported based on a 90% confidence interval?

Pew Research Center, A Majority of Teens Have Experienced Some Form of Cyberbullying. September 27, 2018.
Solution

(a) This claim is reasonable, since the entire interval lies above 50%.

(b) The value of 70% lies outside of the interval, so we have convincing evidence that the researcher's conjecture is wrong.

(c) A 90% confidence interval will be narrower than a 95% confidence interval. Even without calculating the interval, we can tell that 70% would not fall in the interval, and we would reject the researcher's conjecture based on a 90% confidence level as well.

###### 6.Waiting at an ER, Part II.

Exercise 5.2.9.5 provides a 95% confidence interval for the mean waiting time at an emergency room (ER) of (128 minutes, 147 minutes). Answer the following questions based on this interval.

1. A local newspaper claims that the average waiting time at this ER exceeds 3 hours. Is this claim supported by the confidence interval? Explain your reasoning.

2. The Dean of Medicine at this hospital claims the average wait time is 2.2 hours. Is this claim supported by the confidence interval? Explain your reasoning.

3. Without actually calculating the interval, determine if the claim of the Dean from part Item 5.3.11.6.b would be supported based on a 99% confidence interval?

###### 7.Minimum wage, Part 1.

Do a majority of US adults believe raising the minimum wage will help the economy, or is there a majority who do not believe this? A Rasmussen Reports survey of 1,000 US adults found that 42% believe it will help the economy. 8  Conduct an appropriate hypothesis test to help answer the research question.

Rasmussen Reports survey, Most Favor Minimum Wage of \$10.50 Or Higher, April 16, 2019
Solution

(i) Set up hypotheses. $H_{0}: p = 0.5\text{,}$ $H_{A}: p \ne 0.5\text{.}$ We will use a significance level of $\alpha = 0.05\text{.}$

(ii) Check conditions: simple random sample gets us independence, and the success-failure conditions is satisfied since $0.42 \times 1000 = 420$ and $(1-0.42) \times 1000 = 580$ are both at least 10.

(iii) Next, we calculate: $SE = \sqrt{0.5(1 - 0.5)/1000} = 0.016\text{.}$ $Z = \frac{0.42 - 0.5}{0.016} = -5\text{,}$ which has a one-tail area of about 0.0000003, so the p-value is twice this one-tail area at 0.0000006.

(iv) Make a conclusion: Because the p-value is less than $\alpha = 0:05\text{,}$ we reject the null hypothesis and conclude that the fraction of US adults who believe raising the minimum wage will help the economy is not 50%. Because the observed value is less than 50% and we have rejected the null hypothesis, we can conclude that this belief is held by fewer than 50% of US adults. (For reference, the survey also explores support for changing the minimum wage, which is a different question than if it will help the economy.)

###### 8.Getting enough sleep.

400 students were randomly sampled from a large university, and 289 said they did not get enough sleep. Conduct a hypothesis test to check whether this represents a statistically significant difference from 50%, and use a significance level of 0.01.

###### 9.Working backwards, Part I.

You are given the following hypotheses:

\begin{align*} H_0\amp : p = 0.3\\ H_A\amp : p \ne 0.3 \end{align*}

We know the sample size is 90. For what sample proportion would the p-value be equal to 0.05? Assume that all conditions necessary for inference are satisfied.

Solution

If the p-value is 0.05, this means the test statistic would be either $Z = -1.96$ or $Z = 1.96\text{.}$ We'll show the calculations for $Z = 1.96\text{.}$ Standard error: $SE = \sqrt{0.3(1-0.3)/90} = 0.048\text{.}$ Finally, set up the test statistic formula and solve for $\hat{p}: 1.96 = \frac{\hat{p} - 0.3}{0.048} \rightarrow \hat{p} = 0.394$ Alternatively, if $Z = -1.96$ was used: $\hat{p} = 0.206\text{.}$

###### 10.Working backwards, Part II.

You are given the following hypotheses:

\begin{align*} H_0\amp : p = 0.9\\ H_A\amp : p \ne 0.9 \end{align*}

We know that the sample size is 1,429. For what sample proportion would the p-value be equal to 0.01? Assume that all conditions necessary for inference are satisfied.

###### 11.Testing for Fibromyalgia.

A patient named Diana was diagnosed with Fibromyalgia, a long-term syndrome of body pain, and was prescribed anti-depressants. Being the skeptic that she is, Diana didn't initially believe that anti-depressants would help her symptoms. However after a couple months of being on the medication she decides that the anti-depressants are working, because she feels like her symptoms are in fact getting better.

1. Write the hypotheses in words for Diana's skeptical position when she started taking the anti-depressants.

2. What is a Type 1 Error in this context?

3. What is a Type 2 Error in this context?

Solution

(a) $H_{0}\text{:}$ Anti-depressants do not affect the symptoms of Fibromyalgia. $HA\text{:}$ Anti-depressants do affect the symptoms of Fibromyalgia (either helping or harming).

(b) Concluding that anti-depressants either help or worsen Fibromyalgia symptoms when they actually do neither.

(c) Concluding that anti-depressants do not affect Fibromyalgia symptoms when they actually do.

###### 12.Which is higher?

In each part below, there is a value of interest and two scenarios (I and II). For each part, report if the value of interest is larger under scenario I, scenario II, or whether the value is equal under the scenarios.

1. The standard error of $\hat{p}$ when (I) $n = 125$ or (II) $n = 500\text{.}$

2. The margin of error of a confidence interval when the confidence level is (I) 90% or (II) 80%.

3. The p-value for a Z-statistic of 2.5 calculated based on a (I) sample with $n = 500$ or based on a (II) sample with $n = 1000\text{.}$

4. The probability of making a Type 2 Error when the alternative hypothesis is true and the significance level is (I) 0.05 or (II) 0.10.