Section A.1 Data sets within the text

Each data set within the text is described in this appendix, and there is a corresponding page for each of these data sets at This page also includes additional data sets that can be used for honing your skills. Each data set has its own page with the following information:

  • Description of each data set.

  • Detailed overview of each data set's variables.

  • CSV download.

  • R object file download.

Over time we will also expand the information available below.

Subsection A.1.1 Chapter 1: Data Collection

In Section 1.1: stent30, stent365 \(\rightarrow\) The stent data is split across two data sets, one for the 0-30 day and one for the 0-365 day results. Chimowitz MI, Lynn MJ, Derdeyn CP, et al. 2011. Stenting versus Aggressive Medical Therapy for Intracranial Arterial Stenosis. New England Journal of Medicine 365:993-1003. NY Times article:

In Section 1.2: loan50, loan_full_schema \(\rightarrow\) This data comes from Lending Club, which provides a large set of data on the people who received loans through their platform. The data used in the textbook comes from a sample of the loans made in Q1 (Jan, Feb, March) 2018.

In Section 1.2: county, county_complete \(\rightarrow\) These data come from several government sources. For those variables included in the county data set, only the most recent data is reported, as of what was available in late 2018. Data prior to 2011 is all from, where the specific Quick Facts page providing the data is no longer available. The more recent data comes from USDA, Bureau of Labor Statistics, SAIPE, and American Community Survey.

In Section 1.4: The Nurses' Health Study was mentioned. For more information on this data set, see

In Section 1.5: The study we had in mind when discussing the simple randomization (no blocking) study was Anturane Reinfarction Trial Research Group. 1980. Sulfinpyrazone in the prevention of sudden death after myocardial infarction. New England Journal of Medicine 302(5):250-256

Subsection A.1.2 Chapter 2: Summarizing Data

In Section 2.1: loan50, county \(\rightarrow\) These data sets are described in the data for Chapter 1.

In Section 2.3: loan50, county \(\rightarrow\) These data sets are described in the data for Chapter 1.

In Section 2.4: malaria \(\rightarrow\) Lyke et al. 2017. PfSPZ vaccine induces strain-transcending T cells and durable protection against heterologous controlled human malaria infection. PNAS 114(10):2711-2716.

Subsection A.1.3 Chapter 3: Probability

In Section 3.1: loan50, county \(\rightarrow\) These data sets are described in the data for Chapter 1.

In Section 3.1: playing_cards \(\rightarrow\) A table of the 52 cards in a standard deck.
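
As a rough illustration of what such a table contains, the following Python sketch builds one row per card. The column names (`rank`, `suit`, `face_card`) are assumptions made for this example and may not match the columns of the actual playing_cards data set.

```python
from itertools import product

# Ranks and suits of a standard 52-card deck.
RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
FACE_RANKS = {"J", "Q", "K"}


def build_deck():
    """Return a list of 52 dicts, one per card (hypothetical column names)."""
    return [
        {"rank": rank, "suit": suit, "face_card": rank in FACE_RANKS}
        for suit, rank in product(SUITS, RANKS)
    ]


deck = build_deck()
hearts = [card for card in deck if card["suit"] == "hearts"]
face_cards = [card for card in deck if card["face_card"]]
print(len(deck), len(hearts), len(face_cards))
```

Counting rows that match a condition, as done for `hearts` and `face_cards`, is the same operation used when computing probabilities from the table.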

In Section 3.2: family_college \(\rightarrow\) A simulated data set based on real population summaries at

In Section 3.2: smallpox \(\rightarrow\) Fenner F. 1988. Smallpox and Its Eradication (History of International Public Health, No. 6). Geneva: World Health Organization. ISBN 92-4-156110-6.

In Section 3.2: Mammogram screening, probabilities. \(\rightarrow\) The probabilities reported were obtained using studies reported at and

In Section 3.5: stocks_18 \(\rightarrow\) Monthly returns for Caterpillar, Exxon Mobil Corp, and Google for November 2015 to October 2018.

In Section 3.6: fcid \(\rightarrow\) This sample can be considered a simple random sample from the US population. It relies on the USDA Food Commodity Intake Database.

Subsection A.1.4 Chapter 4: Distributions of random variables

In Section 4.1: SAT and ACT score distributions \(\rightarrow\) The SAT score data comes from the 2018 distribution, which is provided at The ACT score data is available at 99 999999 N S N00 ACT-GCPR National.pdf. We also acknowledge that the actual ACT score distribution is not nearly normal. However, since the topic is very accessible, we decided to keep the context and examples.

In Section 4.1: possum \(\rightarrow\) The distribution parameters are based on a sample of possums from Australia and New Guinea. The original source of this data is as follows. Lindenmayer DB, et al. 1995. Morphological variation among populations of the mountain brushtail possum, Trichosurus caninus Ogilby (Phalangeridae: Marsupialia). Australian Journal of Zoology 43: 449-458.

In Section 4.1: male_heights_fcid \(\rightarrow\) This sample can be considered a simple random sample from the US population. It relies on the USDA Food Commodity Intake Database.

In Section 4.1: nba_players_19 \(\rightarrow\) Summary information from the NBA players for the 2018-2019 season. Data were retrieved from

In Section 4.1: poker \(\rightarrow\) Poker winnings (and losses) for 50 days by a professional poker player, which represents their first 50 days trying to play for a living. Anonymity has been requested by the player.

In Section 4.2: run17, run17samp \(\rightarrow\)

Subsection A.1.5 Chapter 5: Foundations for inference

In Section 5.1: pew_energy_2018 \(\rightarrow\) The actual data has more observations than were referenced in this chapter. That is, we used a subsample since it helped smooth some of the examples to have a bit more variability. The pew_energy_2018 data set represents the full data set for each of the different energy source questions, which covers solar, wind, offshore drilling, hydraulic fracturing, and nuclear energy. The statistics used to construct the data are from the following page:

In Section 5.2: pew_energy_2018 \(\rightarrow\) See the details for this data set above in Section 5.1.

In Section 5.2: ebola_survey \(\rightarrow\) In New York City on October 23rd, 2014, a doctor who had recently been treating Ebola patients in Guinea went to the hospital with a slight fever and was subsequently diagnosed with Ebola. Soon thereafter, an NBC 4 New York/The Wall Street Journal/Marist Poll found that 82% of New Yorkers favored a “mandatory 21-day quarantine for anyone who has come in contact with an Ebola patient”. This poll included responses of 1,042 New York adults between Oct 26th and 28th, 2014. Poll ID NY141026 on

In Section 5.3: pew_energy_2018 \(\rightarrow\) See the details for this data set above in Section 5.1.

Subsection A.1.6 Chapter 6: Inference for categorical data

In Section 6.1: Nuclear energy \(\rightarrow\) A Gallup poll of 1,019 adults in the US, conducted in March of 2016, found that 54% of respondents oppose nuclear energy. This was the first time since Gallup first asked the question in 1994 that a majority of respondents said they oppose nuclear energy.

In Section 6.1: Supreme Court \(\rightarrow\) The Gallup organization began measuring the public's view of the Supreme Court's job performance in 2000, and has measured it every year since then with the question: “Do you approve or disapprove of the way the Supreme Court is handling its job?”. In 2018, the Gallup poll randomly sampled 1,033 adults in the U.S. and found that 53% of them approved.

In Section 6.1: Life on other planets \(\rightarrow\) A February 2018 Marist Poll reported: “Many Americans (68%) think there is intelligent life on other planets”. The results were based on a random sample of 1,033 adults in the U.S.
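
Each of the three polls above reports a sample proportion and a sample size, which is enough to approximate a standard error and a 95% margin of error. The sketch below uses the usual normal approximation, \(SE = \sqrt{\hat{p}(1-\hat{p})/n}\), with a multiplier of 1.96, and it assumes each poll can be treated as a simple random sample. The function names and dictionary labels are illustrative only.

```python
import math


def standard_error(p_hat, n):
    """Standard error of a sample proportion under the normal approximation."""
    return math.sqrt(p_hat * (1 - p_hat) / n)


def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error; z=1.96 corresponds to 95% confidence."""
    return z * standard_error(p_hat, n)


# (sample proportion, sample size) as reported in the poll descriptions above.
polls = {
    "nuclear energy (oppose)": (0.54, 1019),
    "Supreme Court (approve)": (0.53, 1033),
    "life on other planets (think so)": (0.68, 1033),
}

for label, (p_hat, n) in polls.items():
    me = margin_of_error(p_hat, n)
    print(f"{label}: {p_hat:.2f} +/- {me:.3f}")
```

With sample sizes near 1,000, each margin of error works out to roughly three percentage points.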

In Section 6.2: cpr \(\rightarrow\) Böttiger et al. Efficacy and safety of thrombolytic therapy after initially unsuccessful cardiopulmonary resuscitation: a prospective clinical trial. The Lancet, 2001.

In Section 6.2: fish_oil_18 \(\rightarrow\) Manson JE, et al. 2018. Marine n-3 Fatty Acids and Prevention of Cardiovascular Disease and Cancer. NEJMoa1811403.

In Section 6.2: mammogram \(\rightarrow\) Miller AB. 2014. Twenty five year follow-up for breast cancer incidence and mortality of the Canadian National Breast Screening Study: randomised screening trial. BMJ 2014;348:g366.

In Section 6.2: drone_blades \(\rightarrow\) The quality control data set for quadcopter drone blades is a made-up data set for an example. We provide the simulated data in the drone_blades data set.

In Section 6.3: M&Ms \(\rightarrow\) Starting at the end of 2016, Rick Wicklin, a statistician working at the statistical software company SAS, collected a sample of 712 candies, or about 1.5 pounds, and counted how many there were of each color.

In Section 6.3: ask \(\rightarrow\) Minson JA, Ruedy NE, Schweitzer ME. There is such a thing as a stupid question: Question disclosure in strategic communication. working Ask%20(the%20Right%20Way)%20and%20You%20Shall%20Receive.pdf

In Section 6.4: diabetes2 \(\rightarrow\) Zeitler P, et al. 2012. A Clinical Trial to Maintain Glycemic Control in Youth with Type 2 Diabetes. N Engl J Med.

Subsection A.1.7 Chapter 7: Inference for numerical data

In Section 7.1: Risso's dolphins \(\rightarrow\) Endo T and Haraguchi K. 2009. High mercury levels in hair samples from residents of Taiji, a Japanese whaling town. Marine Pollution Bulletin 60(5):743-747.

Taiji was featured in the movie The Cove, and it is a significant source of dolphin and whale meat in Japan. Thousands of dolphins pass through the Taiji area annually, and we will assume these 19 dolphins represent a simple random sample from those dolphins.

In Section 7.1: Croaker white fish \(\rightarrow\)

In Section 7.1: run17, run17samp \(\rightarrow\)

In Section 7.2: textbooks, ucla_textbooks_f18 \(\rightarrow\) Data were collected by OpenIntro staff in 2010 and again in 2018. For the 2018 sample, we sampled 201 UCLA courses. Of those, 68 required books that could be found on Amazon. The websites where information was retrieved:, and

In Section 7.3: Jennifer-John \(\rightarrow\) Moss-Racusin CA, et al. 2012. Science faculty's subtle gender biases favor male students. PNAS 109(41):16474-16479.

In Section 7.3: resume \(\rightarrow\) Bertrand M, Mullainathan S. 2004. Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination. The American Economic Review 94:4 (991-1013).

In Section 7.3: stem_cells \(\rightarrow\) Menard C, et al. 2005. Transplantation of cardiac-committed mouse embryonic stem cells to infarcted sheep myocardium: a preclinical study. The Lancet: 366:9490, p1005-1012.

Subsection A.1.8 Chapter 8: Introduction to linear regression

In Section 8.1: simulated_scatter \(\rightarrow\) Fake data used for the first three plots. The perfect linear plot uses group 4 data, where group is a variable in the data set (Figure 8.1.1). The group of 3 imperfect linear plots uses groups 1-3 (Figure 8.1.2). The sinusoidal curve uses group 5 data (Figure 8.1.3). The group of 3 scatterplots with residual plots uses groups 6-8 (Figure 8.1.13). The correlation plots use groups 9-19 (Figure 8.1.14 and Figure 8.1.16).
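
The group-to-figure mapping for Section 8.1 can be written down as a small lookup table, which is convenient when filtering the data by group. This is only a sketch: the dictionary keys are descriptive labels invented for this example, and the column name `group` is an assumption based on the description above.

```python
# Groups of simulated_scatter used by each Section 8.1 figure, as listed above.
FIGURE_GROUPS = {
    "perfect linear (Figure 8.1.1)": [4],
    "imperfect linear (Figure 8.1.2)": [1, 2, 3],
    "sinusoidal (Figure 8.1.3)": [5],
    "scatterplots with residual plots (Figure 8.1.13)": [6, 7, 8],
    "correlation plots (Figures 8.1.14 and 8.1.16)": list(range(9, 20)),
}


def rows_for(rows, label):
    """Keep only the rows whose (assumed) 'group' value belongs to the figure."""
    wanted = set(FIGURE_GROUPS[label])
    return [row for row in rows if row["group"] in wanted]


# Tiny stand-in rows, used only to show the call; they are not real data.
example_rows = [{"group": 4, "x": 0.0, "y": 0.0}, {"group": 5, "x": 1.0, "y": 1.0}]
print(rows_for(example_rows, "perfect linear (Figure 8.1.1)"))
```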

In Section 8.1: possum \(\rightarrow\) The data is described in the data for Chapter 4.

In Section 8.2: elmhurst \(\rightarrow\) These data were sampled from a table of data for all freshmen from the 2011 class at Elmhurst College that accompanied an article titled What Students Really Pay to Go to College published online by The Chronicle of Higher Education:

In Section 8.2: simulated_scatter \(\rightarrow\) The plots for things that can go wrong use groups 20-23 from Figure 8.3.1.

In Section 8.2: mariokart \(\rightarrow\) Auction data from eBay for the game Mario Kart for the Nintendo Wii. This data set was collected in early October, 2009.

In Section 8.2: simulated_scatter \(\rightarrow\) The plots for types of outliers use groups 24-29 from Example 8.2.23.

In Section 8.3: midterms_house \(\rightarrow\) Data was retrieved from Wikipedia.

In Section 8.4: county, county_complete \(\rightarrow\) The data is described in the data for Chapter 1.