Section 5.4 Does it make sense?
It is the responsibility of the data scientist to know when the use of these inference procedures is appropriate and to correctly interpret the results. In this section, we look at considerations around the misuse or misinterpretation of these procedures.
Subsection 5.4.1 Learning objectives
Understand the two general conditions for when the confidence interval and hypothesis testing procedures apply. Explain why these conditions are necessary.
Distinguish between statistically significant and practically significant. What role does sample size play here?
Recognize that not all statistically significant results correspond to real differences, due to Type I Errors. What role does the significance level \(\alpha\) play here?
Subsection 5.4.2 When to retreat
Statistical tools rely on conditions. When the conditions are not met, these tools are unreliable and drawing conclusions from them is treacherous. The conditions for these tools typically come in two forms.
The individual observations must be independent. A random sample from less than 10% of the population helps ensure the observations are independent. In experiments, we generally require that subjects are randomized into groups. If independence fails, then advanced techniques must be used, and in some such cases, inference may not be possible.
Other conditions focus on sample size and skew. For example, if the sample size is too small, the skew too strong, or extreme outliers are present, then a normal model for the sample mean will fail.
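As a concrete illustration, here is a minimal simulation sketch in Python (the exponential population, sample sizes, and replication count are illustrative choices, not from the text). With a strongly skewed population and a small sample, the normal model understates how often the sample mean lands far out in the right tail; a larger sample brings the simulation in line with the model.

    import numpy as np

    rng = np.random.default_rng(0)

    # A strongly right-skewed population: exponential with mean 1 and SD 1.
    pop_mean, pop_sd = 1.0, 1.0

    for n in (10, 500):  # small versus large sample size
        # Simulate the sampling distribution of the sample mean (10,000 replications).
        means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
        # Under a normal model, about 2.5% of sample means should exceed
        # pop_mean + 1.96 * pop_sd / sqrt(n). Compare that to the simulation.
        cutoff = pop_mean + 1.96 * pop_sd / np.sqrt(n)
        print(n, (means > cutoff).mean())  # roughly 0.04 at n = 10, 0.025 at n = 500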
Verification of conditions for statistical tools is always necessary. Whenever conditions are not satisfied for a statistical technique, there are three options. The first is to learn new methods that are appropriate for the data. The second route is to consult a data scientist. The third route is to ignore the failure of conditions. This last option effectively invalidates any analysis and may discredit novel and interesting findings.
Finally, we caution that inference tools may be of no help when the data include unknown biases, as in convenience samples. For this reason, there are books, courses, and researchers devoted to the techniques of sampling and experimental design. See Section 1.3, Section 1.4 and Section 1.5 for basic principles of data collection.
Subsection 5.4.3 Statistical significance versus practical significance
When the sample size becomes larger, point estimates become more precise and any real difference between the mean and the null value becomes easier to detect and recognize. Even a very small difference would likely be detected if we took a large enough sample. Sometimes researchers will take such large samples that even the slightest difference is detected. While we still say that difference is statistically significant, it might not be practically significant.
Statistically significant differences are sometimes so minor that they are not practically relevant. This is especially important to research: if we conduct a study, we want to focus on finding a meaningful result. We don't want to spend lots of money finding results that hold no practical value.
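To make this concrete, here is a minimal simulation sketch in Python (the effect size, standard deviation, and sample sizes are hypothetical). A difference of only 0.01 between the true mean and the null value goes undetected at n = 100, but yields an extremely small p-value at n = 1,000,000, despite being far too small to matter in most applications.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # The true mean differs from the null value (0) by a trivial amount.
    null_value, true_mean, sd = 0.0, 0.01, 1.0

    for n in (100, 1_000_000):
        sample = rng.normal(true_mean, sd, size=n)
        pvalue = stats.ttest_1samp(sample, popmean=null_value).pvalue
        print(n, pvalue)
    # At n = 100 the tiny difference goes undetected; at n = 1,000,000 the result
    # is statistically significant, yet a difference of 0.01 may hold no
    # practical value.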
The role of a data scientist in conducting a study often includes planning the size of the study. The data scientist might first consult experts or scientific literature to learn what would be the smallest meaningful difference from the null value. She would also obtain some reasonable estimate for the standard deviation. With these important pieces of information, she would choose a sufficiently large sample size so that the power for the meaningful difference is perhaps 80% or 90%. While larger sample sizes may still be used, she might advise against using them in some cases, especially in sensitive areas of research.
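A minimal sketch of this planning step, assuming a two-sided test and a normal approximation (the smallest meaningful difference and the standard deviation below are hypothetical inputs of the kind the data scientist would gather):

    import math
    from scipy import stats

    def sample_size(delta, sigma, alpha=0.05, power=0.80):
        """Approximate sample size so that a two-sided test at level alpha
        detects a true difference of delta from the null value with the
        desired power."""
        z_alpha = stats.norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
        z_power = stats.norm.ppf(power)          # 0.84 for 80% power
        return math.ceil(((z_alpha + z_power) * sigma / delta) ** 2)

    # Hypothetical inputs: smallest meaningful difference 0.5, SD estimate 2.0.
    print(sample_size(delta=0.5, sigma=2.0))              # about 126 for 80% power
    print(sample_size(delta=0.5, sigma=2.0, power=0.90))  # about 169 for 90% power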
Subsection 5.4.4 Statistical significance versus a real difference
When a result is statistically significant at the \(\alpha=0.05\) level, we have evidence that the result is real. However, when there is no difference or effect, we can expect that 5% of the time the test conclusion will lead to a Type I Error and incorrectly reject the null hypothesis. Therefore we must beware of what is called p-hacking, in which researchers may test many, many hypotheses and then publish the ones that come out statistically significant. As we noted, we can expect 5% of the results to be significant when the null hypothesis is true and there really is no difference or effect.
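The arithmetic behind this caution is easy to simulate. In the sketch below (the sample size and number of tests are illustrative), every data set is generated with the null hypothesis true, yet close to 5% of the tests come out significant at the \(\alpha=0.05\) level.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    alpha, n_tests = 0.05, 1000

    # Run many tests on data where the null hypothesis is TRUE (mean really is 0).
    significant = 0
    for _ in range(n_tests):
        sample = rng.normal(0, 1, size=30)
        if stats.ttest_1samp(sample, popmean=0).pvalue < alpha:
            significant += 1

    print(significant / n_tests)  # close to 0.05: Type I Errors, not real effects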
Subsection 5.4.5 Section summary
The inference procedures in this book require two conditions to be met.
The first is that some type of random sampling or random assignment must be involved. If this is not the case, the point statistic may be biased and may not follow the intended distribution. Moreover, without a random sample or random assignment, there is no way to accurately measure the standard error. (When sampling without replacement, the sample size should be less than 10% of the population size in order for the standard error formula to apply. In sample surveys, this condition is generally met.)
The second condition focuses on sample size and skew to determine whether the point estimate follows the intended distribution.
It is important to understand what the term statistically significant does and does not mean:
A small percent of the time (\(\alpha\)), a significant result will not be a real result. If many tests are run, a small percent of them will produce significant results due to chance alone.
With a very large sample, a significant result may point to a result that is real but unimportant. With a larger sample, the power of a test increases and it becomes easier to detect a small difference. If an extremely large sample is used, the result may be statistically significant, but not be practically significant. That is, the difference detected may be so small as to be unimportant or meaningless.
Subsection 5.4.6 Chapter Highlights
Statistical inference is the practice of making decisions from data in the context of uncertainty. In this chapter, we introduced two frameworks for inference: confidence intervals and hypothesis tests.
Confidence intervals are used for estimating unknown population parameters by providing an interval of reasonable values for the unknown parameter with a certain level of confidence.
Hypothesis tests are used to assess how reasonable a particular value is for an unknown population parameter by providing degrees of evidence against that value.
The results of confidence intervals and hypothesis tests are, generally speaking, consistent. That is:
Values that fall inside a 95% confidence interval (implying they are reasonable) will not be rejected by a test at the 5% significance level (implying they are reasonable), and vice-versa.
Values that fall outside a 95% confidence interval (implying they are not reasonable) will be rejected by a test at the 5% significance level (implying they are not reasonable), and vice-versa.
When the confidence level and the significance level add up to 100%, the conclusions of the two procedures are consistent.
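This duality is easy to verify numerically. The sketch below (the sample and null values are illustrative) builds a 95% t confidence interval for a mean, then runs the corresponding two-sided test at the 5% level against one null value just inside the interval and one just outside it.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    sample = rng.normal(10, 2, size=50)

    # 95% confidence interval for the mean, using the t distribution.
    n, xbar, s = len(sample), sample.mean(), sample.std(ddof=1)
    t_star = stats.t.ppf(0.975, df=n - 1)
    ci = (xbar - t_star * s / np.sqrt(n), xbar + t_star * s / np.sqrt(n))

    # A null value inside the interval is not rejected at the 5% level;
    # a null value outside the interval is rejected.
    for null_value in (ci[0] + 0.01, ci[1] + 0.01):  # just inside vs. just outside
        p = stats.ttest_1samp(sample, popmean=null_value).pvalue
        print(f"null={null_value:.2f}  inside CI: {ci[0] < null_value < ci[1]}  p={p:.3f}")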
Many values fall inside of a confidence interval and will not be rejected by a hypothesis test. “Not rejecting \(H_0\)” is NOT equivalent to accepting \(H_0\text{.}\) When we “do not reject \(H_0\)”, we are asserting that the null value is reasonable, not that the parameter is exactly equal to the null value.
For a 95% confidence interval, 95% is not the probability that the true value lies inside the confidence interval (it either does or it doesn't). Likewise, for a hypothesis test, \(\alpha\) is not the probability that \(H_0\) is true (it either is or it isn't). In both frameworks, the probability is about what would happen in a random sample, not about what is true of the population.
The confidence interval procedures and hypothesis tests described in this book should not be applied unless particular conditions (described in more detail in the following chapters) are met. If these procedures are applied when the conditions are not met, the results may be unreliable and misleading.
While a given data set may not always lead us to a correct conclusion, statistical inference gives us tools to control and evaluate how often errors occur.