Section A.1 Data sets within the text
Each data set within the text is described in this appendix, and there is a corresponding page for each of these data sets at openintro.org/data. This page also includes additional data sets that can be used for honing your skills. Each data set has its own page with the following information:
Description of each data set.
Detailed overview of each data set's variables.
R object file download.
Over time we will also expand the information available below.
Subsection A.1.1 Chapter 1: Data Collection
In Section 1.1:
stent365 \(\rightarrow\) The stent data is split across two data sets, one for the 0-30 day and one for the 0-365 day results. Chimowitz MI, Lynn MJ, Derdeyn CP, et al. 2011. Stenting versus Aggressive Medical Therapy for Intracranial Arterial Stenosis. New England Journal of Medicine 365:993-1003. www.nejm.org/doi/full/10.1056/NEJMoa1105335. NY Times article: www.nytimes.com/2011/09/08/health/research/08stent.html.
In Section 1.2:
loan_full_schema \(\rightarrow\) This data comes from Lending Club (lendingclub.com), which provides a large set of data on the people who received loans through their platform. The data used in the textbook comes from a sample of the loans made in Q1 (Jan, Feb, March) 2018.
In Section 1.2:
county_complete \(\rightarrow\) These data come from several government sources. For those variables included in the county data set, only the most recent data is reported, as of what was available in late 2018. Data prior to 2011 is all from census.gov, where the specific Quick Facts page providing the data is no longer available. The more recent data comes from USDA (ers.usda.gov), Bureau of Labor Statistics (bls.gov/lau), SAIPE (census.gov/did/www/saipe), and American Community Survey (census.gov/programs-surveys/acs).
In Section 1.5: The study we had in mind when discussing the simple randomization (no blocking) study was Anturane Reinfarction Trial Research Group. 1980. Sulfinpyrazone in the prevention of sudden death after myocardial infarction. New England Journal of Medicine 302(5):250-256.
Subsection A.1.2 Chapter 2: Summarizing Data
In Section 2.4:
malaria \(\rightarrow\) Lyke et al. 2017. PfSPZ vaccine induces strain-transcending T cells and durable protection against heterologous controlled human malaria infection. PNAS 114(10):2711-2716. www.pnas.org/content/114/10/2711
Subsection A.1.3 Chapter 3: Probability
In Section 3.2: Mammogram screening, probabilities. \(\rightarrow\) The probabilities reported were obtained using studies reported at www.breastcancer.org and www.ncbi.nlm.nih.gov/pmc/articles/PMC1173421.
Subsection A.1.4 Chapter 4: Distributions of random variables
In Section 4.1: SAT and ACT score distributions \(\rightarrow\) The SAT score data comes from the 2018 distribution, which is provided at reports.collegeboard.org/pdf/2018-total-group-sat-suite-assessments-annual-report.pdf. The ACT score data is available at act.org/content/dam/act/unsecured/documents/cccr2018/P 99 999999 N S N00 ACT-GCPR National.pdf. We acknowledge that the actual ACT score distribution is not nearly normal. However, since the topic is very accessible, we decided to keep the context and examples.
In Section 4.1:
possum \(\rightarrow\) The distribution parameters are based on a sample of possums from Australia and New Guinea. The original source of this data is as follows. Lindenmayer DB, et al. 1995. Morphological variation among populations of the mountain brushtail possum, Trichosurus caninus Ogilby (Phalangeridae: Marsupialia). Australian Journal of Zoology 43: 449-458.
In Section 4.1:
poker \(\rightarrow\) Poker winnings (and losses) for 50 days by a professional poker player, which represents their first 50 days trying to play for a living. Anonymity has been requested by the player.
Subsection A.1.5 Chapter 5: Foundations for inference
In Section 5.1:
pew_energy_2018 \(\rightarrow\) The actual data has more observations than were referenced in this chapter. That is, we used a subsample since it helped smooth some of the examples to have a bit more variability. The pew_energy_2018 data set represents the full data set for each of the different energy source questions, which cover solar, wind, offshore drilling, hydraulic fracturing, and nuclear energy. The statistics used to construct the data are from the following page: www.pewinternet.org/2018/05/14/majorities-see-government-efforts-to-protect-the-environment-as-insufficient/
In Section 5.2:
ebola_survey \(\rightarrow\) In New York City on October 23rd, 2014, a doctor who had recently been treating Ebola patients in Guinea went to the hospital with a slight fever and was subsequently diagnosed with Ebola. Soon thereafter, an NBC 4 New York/The Wall Street Journal/Marist Poll found that 82% of New Yorkers favored a “mandatory 21-day quarantine for anyone who has come in contact with an Ebola patient”. This poll included responses of 1,042 New York adults between Oct 26th and 28th, 2014. Poll ID NY141026 on maristpoll.marist.edu.
Subsection A.1.6 Chapter 6: Inference for categorical data
In Section 6.1: Nuclear energy \(\rightarrow\) A Gallup poll of 1,019 adults in the US, conducted in March of 2016, found that 54% of respondents oppose nuclear energy. This was the first time since Gallup first asked the question in 1994 that a majority of respondents said they oppose nuclear energy. https://news.gallup.com/poll/190064/first-time-majority-oppose-nuclear-energy.aspx
In Section 6.1: Supreme Court \(\rightarrow\) The Gallup organization began measuring the public's view of the Supreme Court's job performance in 2000, and has measured it every year since then with the question: “Do you approve or disapprove of the way the Supreme Court is handling its job?”. In 2018, the Gallup poll randomly sampled 1,033 adults in the U.S. and found that 53% of them approved. https://news.gallup.com/poll/237269/supreme-court-approval-highest-2009.aspx
In Section 6.1: Life on other planets \(\rightarrow\) A February 2018 Marist Poll reported: “Many Americans (68%) think there is intelligent life on other planets”. The results were based on a random sample of 1,033 adults in the U.S. http://maristpoll.marist.edu/212-are-americans-poised-for-an-alien-invasion
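As a rough illustration of how the three poll summaries above feed into the single-proportion methods of Section 6.1, the following Python sketch computes normal-approximation 95% confidence intervals from the reported percentages and sample sizes. The function name `proportion_ci`, the dictionary labels, and the choice of z = 1.96 are assumptions made for this example. The percentages are the rounded values quoted in the poll descriptions, so the intervals are approximate.

```python
import math


def proportion_ci(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a single proportion.

    p_hat: sample proportion (between 0 and 1)
    n:     sample size
    z:     critical value (1.96 corresponds to roughly 95% confidence)
    """
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of p_hat
    margin = z * se                          # margin of error
    return p_hat - margin, p_hat + margin


# Reported (rounded) proportions and sample sizes from the poll descriptions.
polls = {
    "Nuclear energy (Gallup, 2016)": (0.54, 1019),
    "Supreme Court (Gallup, 2018)": (0.53, 1033),
    "Life on other planets (Marist, 2018)": (0.68, 1033),
}

for name, (p_hat, n) in polls.items():
    lower, upper = proportion_ci(p_hat, n)
    print(f"{name}: {lower:.3f} to {upper:.3f}")
```

With samples of about 1,000 respondents, the margin of error in each case is close to 3 percentage points.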
In Section 6.2:
cpr \(\rightarrow\) Böttiger et al. Efficacy and safety of thrombolytic therapy after initially unsuccessful cardiopulmonary resuscitation: a prospective clinical trial. The Lancet, 2001.
In Section 6.2:
mammogram \(\rightarrow\) Miller AB. 2014. Twenty five year follow-up for breast cancer incidence and mortality of the Canadian National Breast Screening Study: randomised screening trial. BMJ 2014;348:g366.
In Section 6.3: M&Ms \(\rightarrow\) Starting at the end of 2016, Rick Wicklin, a statistician working at the statistical software company SAS, collected a sample of 712 candies, or about 1.5 pounds, and counted how many there were of each color. https://qz.com/918008/the-color-distribution-of-mms-as-determined-by-a-phd-in-statistics
In Section 6.3:
ask \(\rightarrow\) Minson JA, Ruedy NE, Schweitzer ME. There is such a thing as a stupid question: Question disclosure in strategic communication. opim.wharton.upenn.edu/DPlab/papers/workingPapers/Minson working Ask%20(the%20Right%20Way)%20and%20You%20Shall%20Receive.pdf
Subsection A.1.7 Chapter 7: Inference for numerical data
In Section 7.1: Risso's dolphins \(\rightarrow\) Endo T and Haraguchi K. 2009. High mercury levels in hair samples from residents of Taiji, a Japanese whaling town. Marine Pollution Bulletin 60(5):743-747.
Taiji was featured in the movie The Cove, and it is a significant source of dolphin and whale meat in Japan. Thousands of dolphins pass through the Taiji area annually, and we will assume these 19 dolphins represent a simple random sample from those dolphins.
In Section 7.1: Croaker white fish \(\rightarrow\) www.fda.gov/food/foodborneillnesscontaminants/metals/ucm115644.htm
In Section 7.2:
ucla_textbooks_f18 \(\rightarrow\) Data were collected by OpenIntro staff in 2010 and again in 2018. For the 2018 sample, we sampled 201 UCLA courses. Of those, 68 required books that could be found on Amazon. The websites where information was retrieved: sa.ucla.edu/ro/public/soc, ucla.verbacompare.com and amazon.com.
In Section 7.3: Jennifer-John \(\rightarrow\) Moss-Racusin CA, et al. 2012. Science faculty's subtle gender biases favor male students. PNAS 109(41):16474-16479. https://www.pnas.org/content/109/41/16474
In Section 7.3:
resume \(\rightarrow\) Bertrand M, Mullainathan S. 2004. Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination. The American Economic Review 94:4 (991-1013). www.nber.org/papers/w9873
In Section 7.3:
stem_cells \(\rightarrow\) Menard C, et al. 2005. Transplantation of cardiac-committed mouse embryonic stem cells to infarcted sheep myocardium: a preclinical study. The Lancet: 366:9490, p1005-1012. https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(05)67380-1/fulltext
Subsection A.1.8 Chapter 8: Introduction to linear regression
In Section 8.1:
simulated_scatter \(\rightarrow\) Fake data used for the first three plots. The perfect linear plot uses group 4 data, where group is a variable in the data set (Figure 8.1.1). The group of 3 imperfect linear plots uses groups 1-3 (Figure 8.1.2). The sinusoidal curve uses group 5 data (Figure 8.1.3). The group of 3 scatterplots with residual plots uses groups 6-8 (Figure 8.1.13). The correlation plots use groups 9-19 (Figure 8.1.14 and Figure 8.1.16).
In Section 8.2:
elmhurst \(\rightarrow\) These data were sampled from a table of data for all freshmen from the 2011 class at Elmhurst College. The table accompanied an article titled What Students Really Pay to Go to College, published online by The Chronicle of Higher Education: chronicle.com/article/What-Students-Really-Pay-to-Go/131435.