Section 3.4 Simulations
What is the probability of getting a sum greater than 16 in three rolls of a die? Finding all possible combinations that satisfy this would be tedious, but we could conduct a physical simulation or a computer simulation to estimate this probability. With modern computing power, simulations have become an important and powerful tool for data scientists. In this section, we will look at the concepts that underlie simulations.
Subsection 3.4.1 Learning objectives
Understand the purpose of a simulation and recognize the application of the long-run relative frequency interpretation of probability.
Understand how random digit tables work and how to assign digits to outcomes.
Be able to repeat a simulation a set number of trials or until a condition is true, and use the results to estimate the probability of interest.
Subsection 3.4.2 Setting up and carrying out simulations
In the previous section we saw how to apply the binomial formula to find the probability of exactly \(x\) successes in \(n\) independent trials when a success has probability \(p\text{.}\) Sometimes we have a problem we want to solve but we don't know the appropriate formula, or even worse, a formula may not exist. In this case, one common approach is to estimate the probability using simulations.
You may already be familiar with simulations. Want to know the probability of rolling a sum of 7 with a pair of dice? Roll a pair of dice many, many, many times and see what proportion of times the sum is 7. The more times you roll the pair of dice, the better the estimate will tend to be. Of course, such experiments can be time-consuming or even infeasible.
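The dice experiment described above can also be carried out on a computer. The following is a minimal Python sketch and is not part of the original text; the function name `estimate_sum_probability`, the seed, and the numbers of rolls are illustrative assumptions.

```python
import random

def estimate_sum_probability(target_sum, num_rolls, seed=None):
    """Estimate P(sum of two fair dice == target_sum) by relative frequency."""
    rng = random.Random(seed)  # seeded generator so a run can be repeated
    hits = 0
    for _ in range(num_rolls):
        total = rng.randint(1, 6) + rng.randint(1, 6)  # roll a pair of dice
        if total == target_sum:
            hits += 1
    return hits / num_rolls

if __name__ == "__main__":
    # More rolls should tend to give an estimate closer to the true value.
    for n in (100, 10_000, 100_000):
        print(n, estimate_sum_probability(7, n, seed=1))
```

For comparison, 6 of the 36 equally likely outcomes for a pair of dice sum to 7, so the exact probability is 6/36, or about 0.167.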
In this section, we consider simulations using random numbers. Random numbers (or technically, pseudo-random numbers) can be produced using a calculator or computer. Random digits are produced such that each digit, 0-9, is equally likely to come up in each spot. You'll find that occasionally the same digit appears twice in a row, sometimes even more times, but in the long run, each digit should appear 1/10th of the time.
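The long-run claim about random digits can be checked empirically. The sketch below is an illustration only; it assumes Python's standard `random` module as the pseudo-random number generator, and the helper name `digit_frequencies` is hypothetical.

```python
import random
from collections import Counter

def digit_frequencies(num_digits, seed=None):
    """Generate random digits 0-9 and return the relative frequency of each."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(10) for _ in range(num_digits))
    return {digit: counts[digit] / num_digits for digit in range(10)}

if __name__ == "__main__":
    # Each relative frequency should be close to 1/10 for a long sequence.
    for digit, freq in digit_frequencies(100_000, seed=0).items():
        print(digit, round(freq, 3))
```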
Table 3.4.1. Random digit table.
Row | Columns 1-5 | Columns 6-10 | Columns 11-15 | Columns 16-20
1 | 43087 | 41864 | 51009 | 39689
2 | 63432 | 72132 | 40269 | 56103
3 | 19025 | 83056 | 62511 | 52598
4 | 85117 | 16706 | 31083 | 24816
5 | 16285 | 56280 | 01494 | 90240
6 | 94342 | 18473 | 50845 | 77757
7 | 61099 | 14136 | 39052 | 50235
8 | 37537 | 58839 | 56876 | 02960
9 | 04510 | 16172 | 90838 | 15210
10 | 27217 | 12151 | 52645 | 96218
Example 3.4.2.
Mika's favorite brand of cereal is running a special where 20% of the cereal boxes contain a prize. Mika really wants that prize. If her mother buys 6 boxes of the cereal over the next few months, what is the probability Mika will get a prize?
To solve this problem using simulation, we need to be able to assign digits to outcomes. Each box should have a 20% chance of having a prize and an 80% chance of not having a prize. Therefore, a valid assignment would be:
0, 1 \(\rightarrow\) prize
2-9 \(\rightarrow\) no prize
Of the ten possible digits (0, 1, 2, ..., 8, 9), two of them, i.e. 20% of them, correspond to winning a prize, which exactly matches the probability that a cereal box contains a prize.
In Mika's simulation, one trial will consist of 6 boxes of cereal, and therefore a trial will require six digits (each digit will correspond to one box of cereal). We will repeat the simulation for 20 trials. Therefore we will need 20 sets of 6 digits. Let's begin on row 1 of the random digit table, shown in Table 3.4.1. If a trial consisted of 5 digits, we could use the first 5 digits going across: 43087. Because here a trial consists of 6 digits, it may be easier to read down the table, rather than read across. We will let trial 1 consist of the first 6 digits in column 1 (461819), trial 2 consist of the first 6 digits in column 2 (339564), etc. For this simulation, we will end up using the first 6 rows of each of the 20 columns.
In trial 1, there are two 1's, so we record that as a success; in this trial there were actually two prizes. In trial 2 there were no 0's or 1's, therefore we do not record this as a success. In trial 3 there were three prizes, so we record this as a success. The rest of this exercise is left as a Guided Practice problem for you to complete.
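The column-reading procedure can be expressed as a short program. This is a sketch under stated assumptions: it hard-codes rows 1-6 of the random digit table, treats the digits 0 and 1 as a prize, and the helper names `column_trials` and `prizes_in_trial` are hypothetical.

```python
# Rows 1-6 of the random digit table, copied with their 5-digit groupings.
ROWS = [row.replace(" ", "") for row in (
    "43087 41864 51009 39689",
    "63432 72132 40269 56103",
    "19025 83056 62511 52598",
    "85117 16706 31083 24816",
    "16285 56280 01494 90240",
    "94342 18473 50845 77757",
)]
PRIZE_DIGITS = {"0", "1"}  # 2 of the 10 digits, so each box has a 20% chance

def column_trials(rows):
    """Read down each column: one trial is the six digits in that column."""
    return ["".join(row[c] for row in rows) for c in range(len(rows[0]))]

def prizes_in_trial(trial):
    """Count how many boxes (digits) in one trial contain a prize."""
    return sum(digit in PRIZE_DIGITS for digit in trial)

trials = column_trials(ROWS)
successes = [prizes_in_trial(t) >= 1 for t in trials]  # at least one prize
estimate = sum(successes) / len(trials)

if __name__ == "__main__":
    for number, (trial, ok) in enumerate(zip(trials, successes), start=1):
        print(number, trial, "success" if ok else "no prize")
    print("estimated probability:", estimate)
```

The first three trials printed by this sketch are 461819, 339564, and 040123, matching the trials worked by hand above.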
Checkpoint 3.4.3.
Finish the simulation above and report the estimate for the probability that Mika will get a prize if her mother buys 6 boxes of cereal where each one has a 20% chance of containing a prize.
Checkpoint 3.4.4.
In the previous example, the probability that a box of cereal contains a prize is 20%. The question presented is equivalent to asking: what is the probability of getting at least one prize in six randomly selected boxes of cereal? This probability question can be solved explicitly using the method of complements. Find this probability. How does the estimate arrived at by simulation compare to this probability?
We can also use simulations to estimate quantities other than probabilities. Consider the following example.
Example 3.4.5.
Let's say that instead of buying exactly 6 boxes of cereal, Mika's mother agrees to buy boxes of this cereal until she finds one with a prize. On average, how many boxes of cereal would one have to buy until one gets a prize?
For this question, we can use the same digit assignment. However, our stopping rule is different. Each trial may require a different number of digits. For each trial, the stopping rule is: look at digits until we encounter a 0 or a 1. Then, record how many digits/boxes of cereal it took. Repeat the simulation for 20 trials, and then average the numbers from each trial.
Let's begin again at row 1. We can read across or down, depending upon what is most convenient. Since there are 20 columns and we want 20 trials, we will read down the columns. Starting at column 1, we count how many digits (boxes of cereal) we encounter until we reach a 0 or 1 (which represent a prize). For trial 1 we see 461, so we record 3. For trial 2 we see 3395641, so we record 7. For trial 3, we see 0, so we record 1. The rest of this exercise is left as a Guided Practice problem for you to complete.
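The stopping rule can be sketched in code as well. The example below is illustrative only: it hard-codes all ten rows of the random digit table, treats 0 and 1 as a prize, and uses a hypothetical helper `boxes_until_prize` that returns `None` if a column runs out of digits before a prize appears (which would mean a longer table is needed).

```python
# All ten rows of the random digit table, copied with their 5-digit groupings.
ROWS = [row.replace(" ", "") for row in (
    "43087 41864 51009 39689",
    "63432 72132 40269 56103",
    "19025 83056 62511 52598",
    "85117 16706 31083 24816",
    "16285 56280 01494 90240",
    "94342 18473 50845 77757",
    "61099 14136 39052 50235",
    "37537 58839 56876 02960",
    "04510 16172 90838 15210",
    "27217 12151 52645 96218",
)]
PRIZE_DIGITS = {"0", "1"}  # same digit assignment as before: 20% prize

def boxes_until_prize(column_digits):
    """Count digits read up to and including the first prize digit."""
    for count, digit in enumerate(column_digits, start=1):
        if digit in PRIZE_DIGITS:
            return count
    return None  # no prize found in this column; more digits would be needed

# One trial per column, reading down the column.
columns = ["".join(row[c] for row in ROWS) for c in range(len(ROWS[0]))]
counts = [boxes_until_prize(col) for col in columns]

if __name__ == "__main__":
    print("boxes per trial:", counts)
    if None not in counts:
        print("average:", sum(counts) / len(counts))
```

The first three entries of `counts` are 3, 7, and 1, in agreement with the trials worked by hand.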
Checkpoint 3.4.6.
Finish the simulation above and report your estimate for the average number of boxes of cereal one would have to buy until encountering a prize, where the probability of a prize in each box is 20%.
Example 3.4.7.
Now, consider a case where the probability of interest is not 20%, but rather 28%. Which digits should correspond to success and which to failure?
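This example is more complicated because with only 10 digits, there is no way to select exactly 28% of them. Therefore, each observation will have to consist of two digits. There are 100 equally likely pairs (00 through 99), so 28 of them should represent success. We can use two digits at a time and assign pairs of digits as follows:
00-27 \(\rightarrow\) success
28-99 \(\rightarrow\) failure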
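A two-digit assignment is easy to express in code. The sketch below assumes the pairs 00 through 27 represent success, which is one valid choice among many since any 28 of the 100 pairs would work; the names `SUCCESS_PAIRS` and `classify_pairs` are hypothetical.

```python
# Assumed assignment: the 28 pairs "00" through "27" represent success.
SUCCESS_PAIRS = {f"{n:02d}" for n in range(28)}

def classify_pairs(digits):
    """Split a digit string into consecutive two-digit pairs and label each.

    A leftover single digit at the end of the string is ignored.
    """
    pairs = [digits[i:i + 2] for i in range(0, len(digits) - 1, 2)]
    return [(p, "success" if p in SUCCESS_PAIRS else "failure") for p in pairs]

if __name__ == "__main__":
    # First ten digits of row 1 of the random digit table.
    for pair, outcome in classify_pairs("4308741864"):
        print(pair, outcome)
```

Reading the first ten digits of row 1 this way gives the pairs 43, 08, 74, 18, 64, of which 08 and 18 count as successes.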
Checkpoint 3.4.8.
Assume the probability of winning a particular casino game is 45%. We want to carry out a simulation to estimate the probability that we will win at least 5 times in 10 plays. We will use 30 trials of the simulation. Assign digits to outcomes. Also, how many total digits will we require to run this simulation?
Checkpoint 3.4.9.
Assume a carnival spinner has 7 slots. We want to carry out a simulation to estimate the probability that we will win at least 10 times in 60 plays. Repeat 100 trials of the simulation. Assign digits to outcomes. Also, how many total digits will we require to run this simulation?
Does anyone perform simulations like this? Sort of. Simulations are used a lot in statistics, and setting them up properly often requires the same principles covered in this section. The difference is in implementation after the setup. Rather than use a random number table, a data scientist will write a program that uses a pseudo-random number generator in a computer to run the simulations very quickly, oftentimes millions of trials each second, which provides much more accurate estimates than running a couple dozen trials by hand.
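As an illustration of the computer-based approach, here is a minimal Python sketch of the cereal-box simulation from Example 3.4.2 using a pseudo-random number generator instead of a table. The function name, default arguments, and seed are assumptions made for this example, and a plain Python loop like this one is written for clarity rather than speed.

```python
import random

def simulate_at_least_one_prize(num_boxes=6, prize_prob=0.2,
                                num_trials=100_000, seed=None):
    """Estimate P(at least one prize in num_boxes boxes) by simulation."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(num_trials):
        # Each box independently contains a prize with probability prize_prob;
        # any() stops checking boxes as soon as one prize is found.
        if any(rng.random() < prize_prob for _ in range(num_boxes)):
            successes += 1
    return successes / num_trials

if __name__ == "__main__":
    estimate = simulate_at_least_one_prize(seed=42)
    exact = 1 - 0.8 ** 6  # method of complements: 1 - P(no prize in 6 boxes)
    print("simulated:", round(estimate, 4), "exact:", round(exact, 4))
```

With 100,000 trials, the estimate should typically land within a fraction of a percentage point of the value from the method of complements, a much tighter agreement than 20 hand-run trials can be expected to give.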
Subsection 3.4.3 Section summary
When a probability is difficult to determine via a formula, one can set up a simulation to estimate the probability.
The relative frequency theory of probability and the Law of Large Numbers are the mathematical underpinnings of simulations. A larger number of trials should tend to produce better estimates.
The first step to setting up a simulation is to assign digits to represent outcomes. This should be done in such a way as to give the event of interest the correct probability. Then, using a random number table, calculator, or computer, generate random digits (outcomes). Repeat this for a specified number of trials or until a given stopping rule is met. When this is finished, count up how many times the event happened and divide that by the number of trials to get the estimate of the probability.
Exercises 3.4.4 Exercises
1. Smog check, Part I.
Suppose 16% of cars fail pollution tests (smog checks) in California. We would like to estimate the probability that an entire fleet of seven cars would pass using a simulation. We assume each car is independent. We only want to know if the entire fleet passed, i.e. none of the cars failed. What is wrong with each of the following simulations to represent whether an entire (simulated) fleet passed?
(a) Flip a coin seven times where each toss represents a car. A head means the car passed and a tail means it failed. If all cars passed, we report PASS for the fleet. If at least one car failed, we report FAIL.
(b) Read across a random number table starting at line 5. If a digit is a 0 or 1, let it represent a failed car. Otherwise the car passes. We report PASS if all cars passed and FAIL otherwise.
(c) Read across a random number table, looking at two digits for each simulated car. If a pair is in the range [00-16], then the corresponding car failed. If it is in [17-99], the car passed. We report PASS if all cars passed and FAIL otherwise.
(a) \(P(\text{fail}) = 0.5\text{,}\) but it should be 0.16.
(b) \(P(\text{fail}) = 0.2\text{,}\) instead of 0.16.
(c) \(P(\text{fail}) = 0.17\text{,}\) instead of 0.16.
2. Left-handed.
Studies suggest that approximately 10% of the world population is left-handed. Use ten simulations to answer each of the following questions. For each question, describe your simulation scheme clearly.
(a) What is the probability that at least one out of eight people is left-handed?
(b) On average, how many people would you have to sample until the first person who is left-handed?
(c) On average, how many left-handed people would you expect to find among a random sample of six people?
3. Smog check, Part II.
Consider the fleet of seven cars in Exercise 3.4.4.1. Remember that 16% of cars fail pollution tests (smog checks) in California, and that we assume each car is independent.
(a) Write out how to calculate the probability of the fleet failing, i.e. at least one of the cars in the fleet failing, via simulation.
(b) Simulate 5 fleets. Based on these simulations, estimate the probability at least one car will fail in a fleet.
(c) Compute the probability at least one car fails in a fleet of seven.
(a) Starting at row 3 of the random number table, we will read across the table two digits at a time. If the random number is between 00 and 15, the car will fail the pollution test. If the number is between 16 and 99, the car will pass the test. (Answers may vary.)
(b) Fleet 1: 18-52-97-32-85-95-29 \(\rightarrow\) P-P-P-P-P-P-P \(\rightarrow\) fleet passes
Fleet 2: 14-96-06-67-17-49-59 \(\rightarrow\) F-P-F-P-P-P-P \(\rightarrow\) fleet fails
Fleet 3: 05-33-67-97-58-11-81 \(\rightarrow\) F-P-P-P-P-F-P \(\rightarrow\) fleet fails
Fleet 4: 23-81-83-21-71-08-50 \(\rightarrow\) P-P-P-P-P-F-P \(\rightarrow\) fleet fails
Fleet 5: 82-84-39-31-83-14-34 \(\rightarrow\) P-P-P-P-P-F-P \(\rightarrow\) fleet fails
Estimate from the five simulated fleets: \(4 / 5 = 0.80\)
(c) \(1 - 0.84^7 \approx 0.705\)
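Five fleets give only a coarse estimate. The following Python sketch, which is not part of the original solution, repeats the same two-digit scheme (pairs 00-15 mean a car fails) for many simulated fleets; the function name, defaults, and seed are illustrative assumptions.

```python
import random

def estimate_fleet_failure(num_cars=7, fail_pairs=16,
                           num_trials=100_000, seed=None):
    """Estimate P(at least one car in the fleet fails) by simulation.

    Each car draws a two-digit number 00-99; values below fail_pairs
    (00-15 by default, i.e. 16 of 100 values) mean the car fails.
    """
    rng = random.Random(seed)
    failed_fleets = 0
    for _ in range(num_trials):
        if any(rng.randrange(100) < fail_pairs for _ in range(num_cars)):
            failed_fleets += 1
    return failed_fleets / num_trials

if __name__ == "__main__":
    estimate = estimate_fleet_failure(seed=7)
    exact = 1 - 0.84 ** 7  # complement of all seven cars passing
    print("simulated:", round(estimate, 4), "exact:", round(exact, 4))
```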
4. To catch a thief.
Suppose that at a retail store, \(1/5\) of all employees steal some amount of merchandise. The store would like to put an end to this practice, and one idea is to use lie detector tests to catch and fire thieves. However, there is a problem: lie detectors are not 100% accurate. Suppose it is known that a lie detector has a failure rate of 25%. A thief will slip by the test 25% of the time and an honest employee will only pass 75% of the time.
(a) Describe how you would simulate whether an employee is honest or is a thief using a random number table. Write your simulation very carefully so someone else can read it and follow the directions exactly.
(b) Using a random number table, simulate 20 employees working at this store and determine if they are honest or not. Make sure to record the random digits assigned to each employee as you will refer back to these in part (c).
(c) Determine the result of the lie detector test for each simulated employee from part (b) using a new simulation scheme.
(d) How many of these employees are “honest and passed” and how many are “honest and failed”?
(e) How many of these employees are “thief and passed” and how many are “thief and failed”?
(f) Suppose the management decided to fire everyone who failed the lie detector test. What percent of fired employees were honest? What percent of not fired employees were thieves?