## Section 2.5 Exercises

### Subsection Exercises

###### 1 ACS, Part I

Each year, the US Census Bureau surveys about 3.5 million households with the American Community Survey (ACS). Data collected from the ACS have been crucial in government and policy decisions, helping to determine the allocation of federal and state funds each year. Some of the questions asked on the survey are about respondents' income, age (in years), and gender. The table below contains this information for a random sample of 20 respondents to the 2012 ACS.  1 United States Census Bureau. Summary File. 2012 American Community Survey. U.S. Census Bureau's American Community Survey Office, 2013. Web.

| | Income | Age | Gender |
|---|---|---|---|
| 1 | 53,000 | 28 | male |
| 2 | 1,600 | 18 | female |
| 3 | 70,000 | 54 | male |
| 4 | 12,800 | 22 | male |
| 5 | 1,200 | 18 | female |
| 6 | 30,000 | 34 | male |
| 7 | 4,500 | 21 | male |
| 8 | 20,000 | 28 | female |
| 9 | 25,000 | 29 | female |
| 10 | 42,000 | 33 | male |
| 11 | 670 | 34 | female |
| 12 | 29,000 | 55 | female |
| 13 | 44,000 | 33 | female |
| 14 | 48,000 | 41 | male |
| 15 | 30,000 | 47 | female |
| 16 | 60,000 | 30 | male |
| 17 | 108,000 | 61 | male |
| 18 | 5,800 | 50 | female |
| 19 | 50,000 | 24 | female |
| 20 | 11,000 | 19 | male |

1. Create a scatterplot of income vs. age, and describe the relationship between these two variables. Answer

There is a weak, positive relationship between age and income. With so few points it is difficult to tell the form of the relationship (linear or not); however, the relationship does look somewhat curved.

2. Now create two scatterplots: one for income vs. age for males and another for females. Answer
3. How, if at all, do the relationships between income and age differ for males and females? Answer

For males, income increases as age increases; however, this pattern is not apparent for females.
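As a numeric complement to the scatterplots, the sketch below computes a sample correlation coefficient between age and income for all respondents and then separately by gender. The correlation coefficient is not part of this exercise, and the helper names `pearson_r` and `r_for` are made up for this illustration; the data are copied from the table above.

```python
from math import sqrt

# (income in dollars, age in years, gender) for the 20 sampled respondents
respondents = [
    (53000, 28, "male"), (1600, 18, "female"), (70000, 54, "male"),
    (12800, 22, "male"), (1200, 18, "female"), (30000, 34, "male"),
    (4500, 21, "male"), (20000, 28, "female"), (25000, 29, "female"),
    (42000, 33, "male"), (670, 34, "female"), (29000, 55, "female"),
    (44000, 33, "female"), (48000, 41, "male"), (30000, 47, "female"),
    (60000, 30, "male"), (108000, 61, "male"), (5800, 50, "female"),
    (50000, 24, "female"), (11000, 19, "male"),
]


def pearson_r(xs, ys):
    """Sample correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def r_for(gender=None):
    """Correlation of age and income, optionally restricted to one gender."""
    rows = [r for r in respondents if gender is None or r[2] == gender]
    ages = [age for _, age, _ in rows]
    incomes = [income for income, _, _ in rows]
    return pearson_r(ages, incomes)


r_all = r_for()
r_male = r_for("male")
r_female = r_for("female")
print(f"all: {r_all:.2f}, males: {r_male:.2f}, females: {r_female:.2f}")
```

A correlation coefficient only summarizes linear association, so it does not replace looking at the plots, especially with so few points and a possibly curved relationship.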

###### 2 MLB stats

A baseball team's success in a season is usually measured by its number of wins. In order to win, the team has to have scored more points (runs) than its opponent in any given game. As such, number of runs is often a good proxy for the success of the team. The table below shows number of runs, home runs, and batting averages for a random sample of 10 teams in the 2014 Major League Baseball season.

| | Team | Runs | Home runs | Batting avg. |
|---|---|---|---|---|
| 1 | Baltimore | 705 | 211 | 0.256 |
| 2 | Boston | 634 | 123 | 0.244 |
| 3 | Cincinnati | 595 | 131 | 0.238 |
| 4 | Cleveland | 669 | 142 | 0.253 |
| 5 | Detroit | 757 | 155 | 0.277 |
| 6 | Houston | 629 | 163 | 0.242 |
| 7 | Minnesota | 715 | 128 | 0.254 |
| 8 | NY Yankees | 633 | 147 | 0.245 |
| 9 | Pittsburgh | 682 | 156 | 0.259 |
| 10 | San Francisco | 665 | 132 | 0.255 |

1. Draw a scatterplot of runs vs. home runs.

2. Draw a scatterplot of runs vs. batting averages.

3. Are home runs or batting averages more strongly associated with number of runs? Explain your reasoning.

###### 3 Fiber in your cereal

The Cereal FACTS report provides information on nutrition content of cereals as well as the audience they are targeted at (adults, children, families). We have selected a random sample of 20 cereals from the data provided in this report. Shown below are the fiber contents (percentage of fiber per gram of cereal) for these cereals.  3 JL Harris et al. "Cereal FACTS 2012: Limited progress in the nutrition quality and marketing of children's cereals". In: Rudd Center for Food Policy & Obesity. 12 (2012).

| | Brand | Fiber % |
|---|---|---|
| 1 | Pebbles Fruity | 0.0% |
| 2 | Rice Krispies Treats | 0.0% |
| 3 | Pebbles Cocoa | 0.0% |
| 4 | Pebbles Marshmallow | 0.0% |
| 5 | Frosted Rice Krispies | 0.0% |
| 6 | Rice Krispies | 3.0% |
| 7 | Trix | 3.1% |
| 8 | Honey Comb | 3.1% |
| 9 | Rice Krispies Gluten Free | 3.3% |
| 10 | Frosted Flakes | 3.3% |
| 11 | Cinnamon Toast Crunch | 3.3% |
| 12 | Reese's Puffs | 3.4% |
| 13 | Cheerios Honey Nut | 7.1% |
| 14 | Lucky Charms | 7.4% |
| 15 | Pebbles Boulders Chocolate PB | 7.4% |
| 16 | Corn Pops | 9.4% |
| 17 | Frosted Flakes Reduced Sugar | 10.0% |
| 18 | Clifford Crunch | 10.0% |
| 19 | Apple Jacks | 10.7% |
| 20 | Dora the Explorer | 11.1% |

1. Create a stem and leaf plot of the distribution of the fiber content of these cereals. Answer

    0 | 000003333333
    0 | 7779
    1 | 0011

Legend: 1 | 0 = 10%

2. Create a dot plot of the fiber content of these cereals. Answer
3. Create a histogram and a relative frequency histogram of the fiber content of these cereals. Answer
4. What percent of cereals contain more than 7% fiber? Answer

40%
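The stem and leaf plot and the 40% figure can be reproduced with a short sketch. It assumes each value is rounded to a whole percent before plotting and that each stem is split into leaves 0 to 4 and 5 to 9, which is what the printed plot appears to do; the function name `stem_and_leaf` is made up here. The percentage counts cereals whose listed fiber content exceeds 7%, the threshold that reproduces the 40% answer.

```python
from collections import defaultdict

# Fiber content (%) of the 20 sampled cereals, copied from the table
fiber = [0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 3.1, 3.1, 3.3, 3.3,
         3.3, 3.4, 7.1, 7.4, 7.4, 9.4, 10.0, 10.0, 10.7, 11.1]


def stem_and_leaf(values):
    """Return split-stem rows: tens digit as stem, units digit as leaf."""
    rows = defaultdict(str)
    # Round each value to a whole percent (half rounds up), then sort
    for v in sorted(int(x + 0.5) for x in values):
        stem, leaf = divmod(v, 10)
        # Leaves 0-4 and 5-9 of the same stem go on separate rows
        rows[(stem, leaf >= 5)] += str(leaf)
    return [f"{stem} | {leaves}" for (stem, _), leaves in sorted(rows.items())]


for line in stem_and_leaf(fiber):
    print(line)

# Share of cereals with more than 7% fiber
share_above_7 = 100 * sum(x > 7 for x in fiber) / len(fiber)
print(f"{share_above_7:.0f}% of cereals contain more than 7% fiber")
```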

###### 4 Sugar in your cereal

The Cereal FACTS report from Exercise 2.5.3 also provides information on sugar content of cereals. We have selected a random sample of 20 cereals from the data provided in this report. Shown below are the sugar contents (percentage of sugar per gram of cereal) for these cereals.

| | Brand | Sugar % |
|---|---|---|
| 1 | Rice Krispies Gluten Free | 3% |
| 2 | Rice Krispies | 12% |
| 3 | Dora the Explorer | 22% |
| 4 | Frosted Flakes Red. Sugar | 27% |
| 5 | Clifford Crunch | 27% |
| 6 | Rice Krispies Treats | 30% |
| 7 | Pebbles Boulders Choc. PB | 30% |
| 8 | Cinnamon Toast Crunch | 30% |
| 9 | Trix | 31% |
| 10 | Honey Comb | 31% |
| 11 | Corn Pops | 31% |
| 12 | Cheerios Honey Nut | 32% |
| 13 | Reese's Puffs | 34% |
| 14 | Pebbles Fruity | 37% |
| 15 | Pebbles Cocoa | 37% |
| 16 | Lucky Charms | 37% |
| 17 | Frosted Flakes | 37% |
| 18 | Pebbles Marshmallow | 37% |
| 19 | Frosted Rice Krispies | 40% |
| 20 | Apple Jacks | 43% |

1. Create a stem and leaf plot of the distribution of the sugar content of these cereals.

2. Create a dot plot of the sugar content of these cereals.

3. Create a histogram and a relative frequency histogram of the sugar content of these cereals.

4. What percent of cereals contain more than 30% sugar?

###### 5 Mammal life spans

Data were collected on life spans (in years) and gestation lengths (in days) for 62 mammals. A scatterplot of life span versus length of gestation is shown below.  4 T. Allison and D.V. Cicchetti. "Sleep in mammals: ecological and constitutional correlates". In: Arch. Hydrobiol 75 (1975), p. 442.

1. What type of an association is apparent between life span and length of gestation? Answer

Positive association: mammals with longer gestation periods tend to live longer as well.

2. What type of an association would you expect to see if the axes of the plot were reversed, i.e. if we plotted length of gestation versus life span? Answer

The association would still be positive.

3. Are life span and length of gestation independent? Explain your reasoning. Answer

No, they are not independent. See part 1.

###### 6 Associations, Part I

Indicate which of the plots show a

1. positive association

2. negative association

3. no association

Also determine if the positive and negative associations are linear or nonlinear. Each part may refer to more than one plot.

###### 7 Office productivity

Office productivity is relatively low when the employees feel no stress about their work or job security. However, high levels of stress can also lead to reduced employee productivity. Sketch a plot to represent the relationship between stress and productivity.

###### 8 Reproducing bacteria

Suppose that there is only sufficient space and nutrients to support one million bacterial cells in a petri dish. You place a few bacterial cells in this petri dish, allow them to reproduce freely, and record the number of bacterial cells in the dish over time. Sketch a plot representing the relationship between number of bacterial cells and time.

###### 9 Sleeping in college

A recent article in a college newspaper stated that college students get an average of 5.5 hours of sleep each night. A student who was skeptical about this value decided to conduct a survey by randomly sampling 25 students. On average, the sampled students slept 6.25 hours per night. Identify which value represents the sample mean and which value represents the claimed population mean.

Population mean = 5.5. Sample mean = 6.25.

###### 10 Parameters and statistics

Identify which value represents the sample mean and which value represents the claimed population mean.

1. American households spent an average of about $52 in 2007 on Halloween merchandise such as costumes, decorations and candy. To see if this number had changed, researchers conducted a new survey in 2008 before industry numbers were reported. The survey included 1,500 households and found that average Halloween spending was $58 per household.

2. The average GPA of students in 2001 at a private university was 3.37. A survey on a sample of 203 students from this university yielded an average GPA of 3.59 in Spring semester of 2012.

###### 11 Make-up exam

In a class of 25 students, 24 of them took an exam in class and 1 student took a make-up exam the following day. The professor graded the first batch of 24 exams and found an average score of 74 points with a standard deviation of 8.9 points. The student who took the make-up the following day scored 64 points on the exam.

1. Does the new student's score increase or decrease the average score? Answer

Decrease: the new score is smaller than the mean of the 24 previous scores.

2. What is the new average? Answer

Calculate a weighted mean. Use a weight of 24 for the old mean and 1 for the new score: $(24\times 74 + 1\times64)/(24+1) = 73.6\text{.}$ There are other ways to solve this exercise that do not use a weighted mean.

3. Does the new student's score increase or decrease the standard deviation of the scores? Answer

Increase: the new score is more than 1 standard deviation away from the previous mean.
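The three answers can be checked from the summary statistics alone. The sketch below assumes that 8.9 is a sample standard deviation computed with an $n-1$ denominator (the exercise does not say which convention was used) and updates the sum of squared deviations when the 25th score is added; all variable names are illustrative.

```python
from math import sqrt

n_old, mean_old, sd_old = 24, 74.0, 8.9   # summary of the first 24 exams
new_score = 64.0                          # the make-up exam

# New mean: weighted mean of the old mean (weight 24) and the new score (weight 1)
n_new = n_old + 1
mean_new = (n_old * mean_old + new_score) / n_new

# Sum of squared deviations of the old scores about their own mean
# (assumes sd_old was computed with an n - 1 denominator)
ss_old = (n_old - 1) * sd_old ** 2

# Re-center the old deviations on the new mean, then add the new score's deviation
ss_new = ss_old + n_old * (mean_old - mean_new) ** 2 + (new_score - mean_new) ** 2
sd_new = sqrt(ss_new / (n_new - 1))

print(f"new mean = {mean_new:.1f}, new SD = {sd_new:.2f}")
```

Under that assumption the standard deviation rises only slightly, because one additional score carries little weight among 25.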

###### 12 Days off at a mining plant

Workers at a particular mining site receive an average of 35 days paid vacation, which is lower than the national average. The manager of this plant is under pressure from a local union to increase the amount of paid time off. However, he does not want to give more days off to the workers because that would be costly. Instead he decides he should fire 10 employees in such a way as to raise the average number of days off that are reported by his employees. In order to achieve this goal, should he fire employees who have the most days off, the fewest days off, or about the average number of days off?

###### 13 Smoking habits of UK residents, Part I

A survey was conducted to study the smoking habits of UK residents. The histograms below display the distributions of the number of cigarettes smoked on weekdays and weekends, and they exclude data from people who identified themselves as non-smokers. Describe the two distributions and compare them.  5 Stats4Schools, Smoking.

Both distributions are right skewed and bimodal with modes at 10 and 20 cigarettes; note that people may be rounding their answers to half a pack or a whole pack. The median of each distribution is between 10 and 15 cigarettes. The middle 50% of the data (the IQR) appears to be spread equally in each group and have a width of about 10 to 15. There are potential outliers above 40 cigarettes per day. It appears that respondents who smoke only a few cigarettes (0 to 5) smoke more on the weekdays than on weekends.

###### 14 Stats scores, Part I

Below are the final exam scores of twenty introductory statistics students.

 79, 83, 57, 82, 94, 83, 72, 74, 73, 71, 66, 89, 78, 81, 78, 81, 88, 69, 77, 79

Draw a histogram of these data and describe the distribution.

###### 15 Smoking habits of UK residents, Part II

A random sample of 5 smokers from the data set discussed in Exercise 2.5.13 is provided below.

| gender | age | maritalStatus | grossIncome | smoke | amtWeekends | amtWeekdays |
|---|---|---|---|---|---|---|
| Female | 51 | Married | £2,600 to £5,200 | Yes | 20 cig/day | 20 cig/day |
| Male | 24 | Single | £10,400 to £15,600 | Yes | 20 cig/day | 15 cig/day |
| Female | 33 | Married | £10,400 to £15,600 | Yes | 20 cig/day | 10 cig/day |
| Female | 17 | Single | £5,200 to £10,400 | Yes | 20 cig/day | 15 cig/day |
| Female | 76 | Widowed | £5,200 to £10,400 | Yes | 20 cig/day | 20 cig/day |

1. Find the mean amount of cigarettes smoked on weekdays and weekends by these 5 respondents. Answer

$\bar{x}_{amtWeekends} = 20\text{,}$ $\bar{x}_{amtWeekdays} = 16\text{.}$

2. Find the standard deviation of the amount of cigarettes smoked on weekdays and on weekends by these 5 respondents. Is the variability higher on weekends or on weekdays? Answer

$s_{amtWeekends} = 0\text{,}$ $s_{amtWeekdays} = 4.18\text{.}$ In this very small sample, higher on weekdays.
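These values can be verified with Python's standard `statistics` module, whose `stdev` function uses the sample ($n-1$) formula. The lists below are copied from the last two columns of the table; the variable names are illustrative.

```python
from statistics import mean, stdev

# Cigarettes per day for the 5 sampled smokers, in table order
weekends = [20, 20, 20, 20, 20]
weekdays = [20, 15, 10, 15, 20]

mean_weekends, mean_weekdays = mean(weekends), mean(weekdays)
sd_weekends, sd_weekdays = stdev(weekends), stdev(weekdays)

print(f"weekends: mean = {mean_weekends}, SD = {sd_weekends:.2f}")
print(f"weekdays: mean = {mean_weekdays}, SD = {sd_weekdays:.2f}")
```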

###### 16 Factory defective rate

A factory quality control manager decides to investigate the percentage of defective items produced each day. Within a given work week (Monday through Friday) the percentage of defective items produced was 2%, 1.4%, 4%, 3%, 2.2%.

1. Calculate the mean for these data.

2. Calculate the standard deviation for these data, showing each step in detail.

###### 17 Medians and IQRs

For each part, compare distributions (1) and (2) based on their medians and IQRs. You do not need to calculate these statistics; simply state how the medians and IQRs compare. Make sure to explain your reasoning.

1. (1) 3, 5, 6, 7, 9

   (2) 3, 5, 6, 7, 20

   Both distributions have the same median and IQR.

2. (1) 3, 5, 6, 7, 9

   (2) 3, 5, 7, 8, 9

   Second distribution has a higher median and IQR.

3. (1) 1, 2, 3, 4, 5

   (2) 6, 7, 8, 9, 10

   Second distribution has higher median. IQRs are equal.

4. (1) 0, 10, 50, 60, 100

   (2) 0, 100, 500, 600, 1000

   Second distribution has higher median and larger IQR.
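Although the exercise asks for reasoning rather than calculation, the comparisons can be checked numerically. Quartiles of very small data sets depend on the convention used; the sketch below assumes the "inclusive" method of Python's `statistics.quantiles`, which for five sorted values takes the second and fourth values as $Q_1$ and $Q_3$. The helper name `median_and_iqr` is made up for this illustration.

```python
from statistics import median, quantiles


def median_and_iqr(data):
    """Median and IQR, with quartiles from the 'inclusive' method (an assumption)."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return median(data), q3 - q1


# Distributions (1) and (2) for each of the four parts
parts = [
    ([3, 5, 6, 7, 9], [3, 5, 6, 7, 20]),
    ([3, 5, 6, 7, 9], [3, 5, 7, 8, 9]),
    ([1, 2, 3, 4, 5], [6, 7, 8, 9, 10]),
    ([0, 10, 50, 60, 100], [0, 100, 500, 600, 1000]),
]

for number, (first, second) in enumerate(parts, start=1):
    print(number, median_and_iqr(first), median_and_iqr(second))
```

Other quartile conventions can give different IQRs for samples this small, so the numeric check is only as good as the convention chosen.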

###### 18 Means and SDs

For each part, compare distributions (1) and (2) based on their means and standard deviations. You do not need to calculate these statistics; simply state how the means and the standard deviations compare. Make sure to explain your reasoning. Hint: It may be useful to sketch dot plots of the distributions.

1. (1) 3, 5, 5, 5, 8, 11, 11, 11, 13

   (2) 3, 5, 5, 5, 8, 11, 11, 11, 20

2. (1) -20, 0, 0, 0, 15, 25, 30, 30

   (2) -40, 0, 0, 0, 15, 25, 30, 30

3. (1) 0, 2, 4, 6, 8, 10

   (2) 20, 22, 24, 26, 28, 30

4. (1) 100, 200, 300, 400, 500

   (2) 0, 50, 300, 550, 600

###### 19 Stats scores, Part II

Create a box plot for the final exam scores of twenty introductory statistics students given in Exercise 2.5.14. The five number summary provided below may be useful.

| Min | Q1 | Q2 (Median) | Q3 | Max |
|---|---|---|---|---|
| 57 | 72.5 | 78.5 | 82.5 | 94 |

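The five number summary above can be reproduced from the twenty scores listed in Exercise 2.5.14. The sketch below assumes the quartiles are the medians of the lower and upper halves of the sorted data, a convention that matches the table; the function name `five_number_summary` is illustrative.

```python
from statistics import median

# Final exam scores from Exercise 2.5.14
scores = [79, 83, 57, 82, 94, 83, 72, 74, 73, 71,
          66, 89, 78, 81, 78, 81, 88, 69, 77, 79]


def five_number_summary(data):
    """Min, Q1, median, Q3, max, with quartiles as medians of the two halves."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]  # for odd n the middle value is left out of both halves
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


print(five_number_summary(scores))
```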
###### 20 Infant mortality

The infant mortality rate is defined as the number of infant deaths per 1,000 live births. This rate is often used as an indicator of the level of health in a country. The relative frequency histogram below shows the distribution of estimated infant death rates in 2012 for 222 countries.  6 CIA Factbook, Country Comparison: Infant Mortality Rate, 2012.

1. Estimate Q1, the median, and Q3 from the histogram.

2. Would you expect the mean of this data set to be smaller or larger than the median? Explain your reasoning.

###### 21 Matching histograms and box plots

Describe the distribution in the histograms below and match them to the box plots.

Descriptions will vary a little.

(a) 2. Unimodal, symmetric, centered at 60, standard deviation of roughly 3.

(b) 3. Symmetric and approximately evenly distributed from 0 to 100.

(c) 1. Right skewed, unimodal, centered at about 1.5, with most observations falling between 0 and 3. A very small fraction of observations exceed a value of 5.

###### 22 Air quality

Daily air quality is measured by the air quality index (AQI) reported by the Environmental Protection Agency. This index reports the pollution level and what associated health effects might be a concern. The index is calculated for five major air pollutants regulated by the Clean Air Act and takes values from 0 to 300, where a higher value indicates lower air quality. AQI was reported for a sample of 91 days in 2011 in Durham, NC. The relative frequency histogram below shows the distribution of the AQI values on these days. 7 US Environmental Protection Agency, AirData, 2011.

1. Estimate the median AQI value of this sample.

2. Would you expect the mean AQI value of this sample to be higher or lower than the median? Explain your reasoning.

3. Estimate $Q_1\text{,}$ $Q_3\text{,}$ and IQR for the distribution.

###### 23 Histograms and box plots

Compare the two plots below. What characteristics of the distribution are apparent in the histogram and not in the box plot? What characteristics are apparent in the box plot but not in the histogram?

The histogram shows that the distribution is bimodal, which is not apparent in the box plot. The box plot makes it easy to identify more precise values of observations outside of the whiskers.

###### 24 Marathon winners

The histogram and box plots below show the distribution of finishing times for male and female winners of the New York Marathon between 1970 and 1999.

1. What features of the distribution are apparent in the histogram and not the box plot? What features are apparent in the box plot but not in the histogram?

2. What may be the reason for the bimodal distribution? Explain.

3. Compare the distribution of marathon times for men and women based on the box plot shown below.

4. The time series plot shown below is another way to look at these data. Describe what is visible in this plot but not in the others.

###### 25 ACS, Part II

The hollow histograms below show the distribution of incomes of respondents to the American Community Survey introduced in Exercise 2.5.1.

1. Compare the distributions of incomes of males and females. Answer

Both distributions are right skewed; however, the distribution of incomes of males has a much higher median (around $40K) compared to females (around $20K).

2. Suggest an alternative visualization for displaying and comparing the distributions of incomes of males and females. Answer

We could also use side-by-side box plots for displaying and easily comparing the distributions of incomes of males and females.

###### 26 AP Stats

The table below shows scores (out of 100) of twenty college students on a college level statistical reasoning test given at the beginning of the semester in their introductory statistics course. Ten of these students have taken AP Stats in high school, and the other ten have not taken AP Stats.

• Took AP stats: 52.5, 57.5, 60, 65, 70, 70, 72.5, 77.5, 80, 85

• Did not take AP stats: 40, 45, 45, 50, 52.5, 57.5, 57.5, 60, 65, 72.5

1. Create a relative frequency histogram of all students' scores on the statistical reasoning test.

2. What percent of all students scored above 50 on this test?

3. Compare the performances of students who did and did not take AP stats. The side-by-side box plots and the hollow histograms shown below might be helpful for this task.

###### 27 Distributions and appropriate statistics, Part I

For each of the following, describe whether you expect the distribution to be symmetric, right skewed, or left skewed. Also specify whether the mean or median would best represent a typical observation in the data and whether the variability of observations would be best represented using the standard deviation or IQR.

1. Number of pets per household. Answer

The distribution of number of pets per household is likely right skewed as there is a natural boundary at 0 and only a few people have many pets. Therefore the center would be best described by the median, and variability would be best described by the IQR.

2. Distance to work, i.e. number of miles between work and home. Answer

The distribution of distance to work is likely right skewed as there is a natural boundary at 0 and only a few people live a very long distance from work. Therefore the center would be best described by the median, and variability would be best described by the IQR.

3. Heights of males. Answer

The distribution of heights of males is likely symmetric. Therefore the center would be best described by the mean, and variability would be best described by the standard deviation.

###### 28 Distributions and appropriate statistics, Part II

For each of the following, describe whether you expect the distribution to be symmetric, right skewed, or left skewed. Also specify whether the mean or median would best represent a typical observation in the data, and whether the variability of observations would be best represented using the standard deviation or IQR.

1. Housing prices in a country where 25% of the houses cost below $350,000, 50% of the houses cost below $450,000, 75% of the houses cost below $1,000,000 and there are a meaningful number of houses that cost more than $6,000,000.

2. Housing prices in a country where 25% of the houses cost below $300,000, 50% of the houses cost below $600,000, 75% of the houses cost below $900,000 and very few houses cost more than $1,200,000.

3. Number of alcoholic drinks consumed by college students in a given week. Assume that most of these students don't drink since they are under 21 years old, and only a few drink excessively.

4. Annual salaries of the employees at a Fortune 500 company where only a few high level executives earn much higher salaries than all the other employees.

###### 29 TV watchers

Students in an AP Statistics class were asked how many hours of television they watch per week (including online streaming). This sample yielded an average of 4.71 hours, with a standard deviation of 4.18 hours. Is the distribution of number of hours students watch television weekly symmetric? If not, what shape would you expect this distribution to have? Explain your reasoning.

No, we would expect this distribution to be right skewed. There are two reasons for this: (1) there is a natural boundary at 0 (it is not possible to watch less than 0 hours of TV), (2) the standard deviation of the distribution is very large compared to the mean.

###### 30 Exam scores

The average on a history exam (scored out of 100 points) was 85, with a standard deviation of 15. Is the distribution of the scores on this exam symmetric? If not, what shape would you expect this distribution to have? Explain your reasoning.

###### 31 Facebook friends

Facebook data indicate that 50% of Facebook users have 100 or more friends, and that the average friend count of users is 190. What do these findings suggest about the shape of the distribution of number of friends of Facebook users?  8 Lars Backstrom. "Anatomy of Facebook". In: Facebook Data Teams Notes (2011).

The statement “50% of Facebook users have 100 or more friends” means that the median number of friends is 100, which is lower than the mean number of friends (190). This suggests a right skewed distribution for the number of friends of Facebook users.

###### 32 A new statistic

The statistic $\frac{\bar{x}}{median}$ can be used as a measure of skewness. Suppose we have a distribution where all observations are greater than 0, $x_i \gt 0\text{.}$ What is the expected shape of the distribution under the following conditions? Explain your reasoning.

1. $\frac{\bar{x}}{median} = 1$

2. $\frac{\bar{x}}{median} \lt 1$

3. $\frac{\bar{x}}{median} \gt 1$

###### 33 Income at the coffee shop, Part I

The first histogram below shows the distribution of the yearly incomes of 40 patrons at a college coffee shop. Suppose two new people walk into the coffee shop: one making $225,000 and the other $250,000. The second histogram shows the new income distribution. Summary statistics are also provided.

| | (1) | (2) |
|---|---|---|
| n | 40 | 42 |
| Min. | 60,680 | 60,680 |
| 1st Qu. | 63,620 | 63,710 |
| Median | 65,240 | 65,350 |
| Mean | 65,090 | 73,300 |
| 3rd Qu. | 66,160 | 66,540 |
| Max. | 69,890 | 250,000 |
| SD | 2,122 | 37,321 |

1. Would the mean or the median best represent what we might think of as a typical income for the 42 patrons at this coffee shop? What does this say about the robustness of the two measures? Answer

The median is better; the mean is substantially affected by the two extreme observations.

2. Would the standard deviation or the IQR best represent the amount of variability in the incomes of the 42 patrons at this coffee shop? What does this say about the robustness of the two measures? Answer

The IQR is better; the standard deviation, like the mean, is substantially affected by the two high salaries.

###### 34 Midrange

The midrange of a distribution is defined as the average of the maximum and the minimum of that distribution. Is this statistic robust to outliers and extreme skew? Explain your reasoning.

###### 35 Commute times, Part I

The histogram below shows the distribution of mean commute times in 3,143 US counties in 2010. Describe the distribution and comment on whether or not a log transformation may be advisable for these data.

The distribution is unimodal and symmetric with a mean of about 25 minutes and a standard deviation of about 5 minutes. There do not appear to be any counties with unusually high or low mean travel times. Since the distribution is already unimodal and symmetric, a log transformation is not necessary.

###### 36 Hispanic population, Part I

The histogram below shows the distribution of the percentage of the population that is Hispanic in 3,143 counties in the US in 2010. Also shown is a histogram of logs of these values. Describe the distribution and comment on why we might want to use log-transformed values in analyzing or modeling these data.

###### 37 Income at the coffee shop, Part II

Suppose each of the 40 people in the coffee shop in Exercise 2.5.33 got a 5% raise. What would the new mean, median, and the standard deviation of their incomes be?

mean = $65,090 \times 1.05 = 68,344.50\text{;}$ median = $65,240 \times 1.05 = 68,502\text{;}$ sd = $2,122 \times 1.05 = 2,228.10$
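The answer above relies on how summary statistics respond to a linear rescaling. As a sketch in generic notation (the symbols $a\text{,}$ $b\text{,}$ and $y_i$ are introduced here for illustration and are not part of the exercise), if every observation is transformed as $y_i = a x_i + b$ with $a \gt 0\text{,}$ then

\begin{equation*} \bar{y} = a\bar{x} + b, \qquad \text{median}_y = a \cdot \text{median}_x + b, \qquad s_y = a \cdot s_x \end{equation*}

A 5% raise corresponds to $a = 1.05$ and $b = 0\text{,}$ so the mean, median, and standard deviation are each multiplied by 1.05.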

###### 38 LA weather

The temperatures in June in Los Angeles have a mean of 77°F, with a standard deviation of 5°F. To convert from Fahrenheit to Celsius, we use the following conversion:

\begin{equation*} x_{C} = (x_{F} - 32) \frac{5}{9} \end{equation*}
1. What is the mean temperature in June in LA in degrees Celsius?

2. What is the standard deviation of temperatures in June in LA in degrees Celsius?

###### 39 Smoking habits of UK residents, Part III

The UK residents in Exercise 2.5.15 smoke on average 16 cigarettes per day on weekdays, with a standard deviation of 4.18. Suppose these residents participated in a smoking cessation program and at the end of the first week of the program reduced their weekday smoking by 3 cigarettes / day. Find the new mean and standard deviation of the number of cigarettes they smoke on weekdays.

mean = 16 - 3 = 13; sd = 4.18
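Subtracting a constant shifts the mean but leaves the spread unchanged. A small check, using the five weekday values from the table in Exercise 2.5.15 (the variable names are illustrative):

```python
from statistics import mean, stdev

# Weekday cigarettes per day for the 5 sampled smokers (Exercise 2.5.15)
weekdays = [20, 15, 10, 15, 20]

# Every respondent smokes 3 fewer cigarettes per weekday
reduced = [x - 3 for x in weekdays]

print(f"before: mean = {mean(weekdays)}, SD = {stdev(weekdays):.2f}")
print(f"after:  mean = {mean(reduced)}, SD = {stdev(reduced):.2f}")
```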

###### 40 Stats scores, Part III

The introductory statistics students in Exercise 2.5.14 scored on average 77.7 points, with a standard deviation of 8.44. The median score was 78.5, and the IQR was 9.5. Suppose these students completed an extra credit exercise that earned them an additional two points on their exams. Calculate the new mean, median, standard deviation, and IQR of their scores.

###### 41 Commute times, Part II

Exercise 2.5.35 displays histograms of mean commute times in 3,143 US counties in 2010. Describe the spatial distribution of commuting times using the map below.

Answers will vary. There are pockets of longer travel time around DC, Southeastern NY, Chicago, Minneapolis, Los Angeles, and many other big cities. There is also a large section of shorter average commute times that overlap with farmland in the Midwest. Many farmers' homes are adjacent to their farmland, so their commute would be 0 minutes, which may explain why the average commute time for these counties is relatively low.

###### 42 Hispanic population, Part II

Exercise 2.5.36 displays histograms of the distribution of the percentage of the population that is Hispanic in 3,143 counties in the US in 2010.

1. What features of this distribution are apparent in the map but not in the histogram?

2. What features are apparent in the histogram but not the map?

3. Is one visualization more appropriate or helpful than the other? Explain your reasoning.

###### 43 Antibiotic use in children

The bar plot and the pie chart below show the distribution of pre-existing medical conditions of children involved in a study on the optimal duration of antibiotic use in treatment of tracheitis, which is an upper respiratory infection.

1. What features are apparent in the bar plot but not in the pie chart? Answer

We see the order of the categories and the relative frequencies in the bar plot.

2. What features are apparent in the pie chart but not in the bar plot? Answer

There are no features that are apparent in the pie chart but not in the bar plot.

3. Which graph would you prefer to use for displaying these categorical data? Answer

We usually prefer to use a bar plot as we can also see the relative frequencies of the categories in this graph.

###### 44 Views on immigration

910 randomly sampled registered voters from Tampa, FL were asked if they thought workers who have illegally entered the US should be (i) allowed to keep their jobs and apply for US citizenship, (ii) allowed to keep their jobs as temporary guest workers but not allowed to apply for US citizenship, or (iii) lose their jobs and have to leave the country. The results of the survey by political ideology are shown below. 9 SurveyUSA, News Poll #18927, data collected Jan 27-29, 2012.

| Response (rows) / Political ideology (columns) | Conservative | Moderate | Liberal | Total |
|---|---|---|---|---|
| (i) Apply for citizenship | 57 | 120 | 101 | 278 |
| (ii) Guest worker | 121 | 113 | 28 | 262 |
| (iii) Leave the country | 179 | 126 | 45 | 350 |
| (iv) Not sure | 15 | 4 | 1 | 20 |
| Total | 372 | 363 | 175 | 910 |

1. What percent of these Tampa, FL voters identify themselves as conservatives? Answer

2. What percent of these Tampa, FL voters are in favor of the citizenship option? Answer

3. What percent of these Tampa, FL voters identify themselves as conservatives and are in favor of the citizenship option? Answer

4. What percent of these Tampa, FL voters who identify themselves as conservatives are also in favor of the citizenship option? What percent of moderates and liberal share this view? Answer

5. Do political ideology and views on immigration appear to be independent? Explain your reasoning. Answer
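The percentages asked for in parts 1 through 4 are ratios of cells, row totals, and column totals from the table. The sketch below shows one way to compute them; the dictionary layout and the names `counts`, `column_total`, and `pct_citizenship_given` are choices made for this illustration, and the figures are copied from the table.

```python
# Survey counts: response -> political ideology -> number of voters
counts = {
    "citizenship":  {"Conservative": 57,  "Moderate": 120, "Liberal": 101},
    "guest worker": {"Conservative": 121, "Moderate": 113, "Liberal": 28},
    "leave":        {"Conservative": 179, "Moderate": 126, "Liberal": 45},
    "not sure":     {"Conservative": 15,  "Moderate": 4,   "Liberal": 1},
}

total = sum(sum(row.values()) for row in counts.values())


def column_total(ideology):
    """Number of sampled voters with the given political ideology."""
    return sum(row[ideology] for row in counts.values())


# Parts 1-3: marginal and joint percentages use the grand total as the base
pct_conservative = 100 * column_total("Conservative") / total
pct_citizenship = 100 * sum(counts["citizenship"].values()) / total
pct_both = 100 * counts["citizenship"]["Conservative"] / total

# Part 4: conditional percentages use the ideology's column total as the base
pct_citizenship_given = {
    ideology: 100 * counts["citizenship"][ideology] / column_total(ideology)
    for ideology in ("Conservative", "Moderate", "Liberal")
}

print(f"{pct_conservative:.1f}% conservative, {pct_citizenship:.1f}% citizenship, "
      f"{pct_both:.1f}% both")
for ideology, pct in pct_citizenship_given.items():
    print(f"{ideology}: {pct:.1f}% favor the citizenship option")
```

Comparing the three conditional percentages is a natural starting point for part 5: if ideology and views were independent, they would be roughly equal.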

###### 45 Side effects of Avandia, Part I

Rosiglitazone is the active ingredient in the controversial type 2 diabetes medicine Avandia and has been linked to an increased risk of serious cardiovascular problems such as stroke, heart failure, and death. A common alternative treatment is pioglitazone, the active ingredient in a diabetes medicine called Actos. In a nationwide retrospective observational study of 227,571 Medicare beneficiaries aged 65 years or older, it was found that 2,593 of the 67,593 patients using rosiglitazone and 5,386 of the 159,978 using pioglitazone had serious cardiovascular problems. These data are summarized in the contingency table below.  10 D.J. Graham et al. “Risk of acute myocardial infarction, stroke, heart failure, and death in elderly Medicare patients treated with rosiglitazone or pioglitazone”. In: JAMA 304.4 (2010), p. 411. ISSN:0098-7484.

| Treatment (rows) / Cardiovascular problems (columns) | Yes | No | Total |
|---|---|---|---|
| Rosiglitazone | 2,593 | 65,000 | 67,593 |
| Pioglitazone | 5,386 | 154,592 | 159,978 |
| Total | 7,979 | 219,592 | 227,571 |

Determine if each of the following statements is true or false. If false, explain why. Be careful: The reasoning may be wrong even if the statement's conclusion is correct. In such cases, the statement should be considered false.

1. Since more patients on pioglitazone had cardiovascular problems (5,386 vs. 2,593), we can conclude that the rate of cardiovascular problems for those on a pioglitazone treatment is higher. Answer

False. Instead of comparing counts, we should compare percentages.

2. The data suggest that diabetic patients who are taking rosiglitazone are more likely to have cardiovascular problems since the rate of incidence was (2,593 / 67,593 = 0.038) 3.8% for patients on this treatment, while it was only (5,386 / 159,978 = 0.034) 3.4% for patients on pioglitazone. Answer

True.

3. The fact that the rate of incidence is higher for the rosiglitazone group proves that rosiglitazone causes serious cardiovascular problems. Answer

False. We cannot infer a causal relationship from an association in an observational study. However, we can say the drug a person is on affects his risk in this case, as he chose that drug and his choice may be associated with other variables, which is why part 2 is true. The difference in these statements is subtle but important.

4. Based on the information provided so far, we cannot tell if the difference between the rates of incidences is due to a relationship between the two variables or due to chance. Answer

True.
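Parts 1 and 2 turn on comparing rates rather than raw counts. A minimal sketch of that comparison, with counts copied from the contingency table and illustrative variable names:

```python
# Patients with serious cardiovascular problems, and group sizes, by treatment
problems = {"rosiglitazone": 2593, "pioglitazone": 5386}
patients = {"rosiglitazone": 67593, "pioglitazone": 159978}

# Rate of cardiovascular problems within each treatment group
rates = {drug: problems[drug] / patients[drug] for drug in problems}

for drug, rate in rates.items():
    print(f"{drug}: {problems[drug]} of {patients[drug]} = {rate:.3f}")
```

The pioglitazone group has more cases only because it is more than twice as large; its rate is the lower of the two.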

###### 46 Heart transplants

The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was designated an official heart transplant candidate, meaning that he was gravely ill and would most likely benefit from a new heart. Some patients got a transplant and some did not. The variable transplant indicates which group the patients were in; patients in the treatment group got a transplant and those in the control group did not. Another variable called survived was used to indicate whether or not the patient was alive at the end of the study. Of the 34 patients in the control group, 4 were alive at the end of the study. Of the 69 patients in the treatment group, 24 were alive. The contingency table below summarizes these results.  11 B. Turnbull et al. "Survivorship of Heart Transplant Data". In: Journal of the American Statistical Association 69 (1974), pp. 74-80.

| Outcome | Group: Control | Group: Treatment | Total |
|---|---|---|---|
| Alive | 4 | 24 | 28 |
| Dead | 30 | 45 | 75 |
| Total | 34 | 69 | 103 |

1. What proportion of patients in the treatment group and what proportion of patients in the control group died?

2. One approach for investigating whether or not the treatment is effective is to use a randomization technique.

1. What are the claims being tested?

2. The paragraph below describes the setup for such an approach, if we were to do it without using statistical software. Fill in the blanks with a number or phrase, whichever is appropriate.

We write alive on _____ cards representing patients who were alive at the end of the study, and dead on _____ cards representing patients who were not. Then, we shuffle these cards and split them into two groups: one group of size _____ representing treatment, and another group of size _____ representing control. We calculate the difference between the proportion of dead cards in the treatment and control groups (treatment - control) and record this value. We repeat this 100 times to build a distribution centered at _____. Lastly, we calculate the fraction of simulations where the simulated differences in proportions are _____. If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis (independence model) should be rejected in favor of the alternative.

3. What do the simulation results shown below suggest about the effectiveness of the transplant program?
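
The card-shuffling procedure described above can also be carried out in software. The following is a minimal sketch using only Python's standard library; the function and constant names are hypothetical, and the counts are read directly from the contingency table. It returns the simulated differences and leaves the final comparison against the observed difference to the reader.

```python
import random

# Counts taken from the contingency table above.
ALIVE, DEAD = 28, 75
N_TREATMENT, N_CONTROL = 69, 34


def dead_proportion_difference(treatment, control):
    """Proportion of 'dead' cards in treatment minus that in control."""
    return (treatment.count("dead") / len(treatment)
            - control.count("dead") / len(control))


def simulate_differences(reps=100, seed=None):
    """Shuffle the cards `reps` times and return the simulated differences."""
    rng = random.Random(seed)
    cards = ["alive"] * ALIVE + ["dead"] * DEAD
    differences = []
    for _ in range(reps):
        rng.shuffle(cards)
        # Deal the shuffled deck into a treatment pile and a control pile.
        treatment, control = cards[:N_TREATMENT], cards[N_TREATMENT:]
        differences.append(dead_proportion_difference(treatment, control))
    return differences


# Difference actually observed in the study (treatment - control).
observed = 45 / 69 - 30 / 34

if __name__ == "__main__":
    diffs = simulate_differences(reps=100, seed=1)
    print(f"observed difference: {observed:.3f}")
    print("first five simulated differences:",
          [round(d, 3) for d in diffs[:5]])
```

Passing a `seed` makes a run reproducible; omitting it gives a fresh set of shuffles each time.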

###### 47Side effects of Avandia, Part II

Exercise 2.5.45 introduces a study that compares the rates of serious cardiovascular problems for diabetic patients on rosiglitazone and pioglitazone treatments. The table below summarizes the results of the study.

| Treatment | Cardiovascular problems: Yes | Cardiovascular problems: No | Total |
|---|---|---|---|
| Rosiglitazone | 2,593 | 65,000 | 67,593 |
| Pioglitazone | 5,386 | 154,592 | 159,978 |
| Total | 7,979 | 219,592 | 227,571 |

Proportion of all patients who had cardiovascular problems: $\frac{7{,}979}{227{,}571} \approx 0.035$

2. If the type of treatment and having cardiovascular problems were independent, about how many patients in the rosiglitazone group would we expect to have had cardiovascular problems? Answer

The expected number of cardiovascular problems in the rosiglitazone group, if having cardiovascular problems and treatment were independent, can be calculated as the number of patients in that group multiplied by the overall rate of cardiovascular problems in the study: $67{,}593 \times \frac{7{,}979}{227{,}571} \approx 2{,}370$.

3. We can investigate the relationship between outcome and treatment in this study using a randomization technique. While in reality we would carry out the simulations required for randomization using statistical software, suppose we actually simulate using index cards. In order to simulate from the independence model, which states that the outcomes were independent of the treatment, we write whether or not each patient had a cardiovascular problem on cards, shuffle all the cards together, then deal them into two groups of size 67,593 and 159,978. We repeat this simulation 1,000 times and each time record the number of people in the rosiglitazone group who had cardiovascular problems. Below is a relative frequency histogram of these counts.

1. What are the claims being tested?

$H_0$: Independence model. The treatment and cardiovascular problems are independent. They have no relationship, and the difference in incidence rates between the rosiglitazone and pioglitazone groups is due to chance.

$H_A$: Alternative model. The treatment and cardiovascular problems are not independent. The difference in the incidence rates between the rosiglitazone and pioglitazone groups is not due to chance, and rosiglitazone is associated with an increased risk of serious cardiovascular problems.

2. Compared to the number calculated in part (b), which would provide more support for the alternative hypothesis, more or fewer patients with cardiovascular problems in the rosiglitazone group?

A higher number of patients with cardiovascular problems in the rosiglitazone group than expected under the assumption of independence would provide support for the alternative hypothesis. This would suggest that rosiglitazone is associated with an increased risk of such problems.

3. What do the simulation results suggest about the relationship between taking rosiglitazone and having cardiovascular problems in diabetic patients?

In the actual study, we observed 2,593 cardiovascular events in the rosiglitazone group. In the 1,000 simulations under the independence model, every simulated count was below 2,593, which suggests that the actual results did not come from the independence model. That is, the analysis provides strong evidence that the variables are not independent, and we reject the independence model in favor of the alternative. The study's results provide strong evidence that rosiglitazone is associated with an increased risk of cardiovascular problems.
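
The expected count and the index-card simulation from this exercise can be sketched in Python. This is an illustration under stated assumptions, not the procedure used to produce the histogram: the names are hypothetical, and instead of shuffling all 227,571 cards, the sketch picks random deck positions for the 7,979 "yes" cards and counts how many land among the first 67,593 positions, which is equivalent to shuffling and dealing.

```python
import random

# Counts taken from the table above.
TOTAL_PATIENTS = 227_571
ROSI_PATIENTS = 67_593
TOTAL_EVENTS = 7_979
OBSERVED_ROSI_EVENTS = 2_593

# Expected count under independence: group size times the overall event rate.
expected_rosi_events = ROSI_PATIENTS * TOTAL_EVENTS / TOTAL_PATIENTS


def simulate_rosi_counts(reps=1000, seed=None):
    """Simulate the number of events dealt to the rosiglitazone group."""
    rng = random.Random(seed)
    counts = []
    for _ in range(reps):
        # Random deck positions for the event cards; positions below
        # ROSI_PATIENTS belong to the rosiglitazone pile.
        event_positions = rng.sample(range(TOTAL_PATIENTS), TOTAL_EVENTS)
        counts.append(sum(pos < ROSI_PATIENTS for pos in event_positions))
    return counts


def fraction_at_least_observed(counts):
    """Share of simulated counts at or above the observed count."""
    extreme = sum(c >= OBSERVED_ROSI_EVENTS for c in counts)
    return extreme / len(counts)
```

Calling `simulate_rosi_counts(1000)` mirrors the 1,000 repetitions described in the exercise, and `fraction_at_least_observed` then summarizes how often chance alone produces a count as large as the one observed.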

###### 48Sinusitis and antibiotics, Part II

Researchers studying the effect of antibiotic treatment compared to symptomatic treatment for acute sinusitis randomly assigned 166 adults diagnosed with sinusitis into two groups (as discussed in Exercise 1.6.2). Participants in the antibiotic group received a 10-day course of an antibiotic, and the rest received symptomatic treatments as a placebo. These pills had the same taste and packaging as the antibiotic. At the end of the 10-day period patients were asked if they experienced improvement in symptoms since the beginning of the study. The distribution of responses is summarized below. 12 J.M. Garbutt et al. “Amoxicillin for Acute Rhinosinusitis: A Randomized Controlled Trial”. In: JAMA: The Journal of the American Medical Association 307.7 (2012), pp. 685-692.

| Treatment | Self-reported improvement in symptoms: Yes | Self-reported improvement in symptoms: No | Total |
|---|---|---|---|
| Antibiotic | 66 | 19 | 85 |
| Placebo | 65 | 16 | 81 |
| Total | 131 | 35 | 166 |

1. What type of a study is this?

2. Does this study make use of blinding?

3. At first glance, does antibiotic or placebo appear to be more effective for the treatment of sinusitis? Explain your reasoning using appropriate statistics.

4. There are two competing claims that this study is used to compare: the independence model and the alternative model. Write out these competing claims in easy-to-understand language and in the context of the application. Hint: The researchers are studying the effectiveness of antibiotic treatment.

5. Based on your finding in (c), does the evidence favor the alternative model? If not, then explain why. If so, what would you do to check whether this is strong evidence?