Lab 5: Smarties

This worksheet was written for the University of York by Stephen Connor, based on work of Gustav Delius, and is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license.

This final lab is themed on smarties! It brings together many of the ideas that we’ve met throughout the semester, both in probability and statistics.

Smarties and probability

Waiting for a blue

We know that there are eight colours of smarties. Let’s start by storing these as a list in R.

colours_list <- c('red', 'green', 'blue', 'yellow', 
                  'orange', 'pink', 'violet', 'brown')

We continue to work with the model that each smartie has a random colour, and that the colours of different smarties are independent of each other. As we did in lectures, we write \[ p_B = \mathbb{P}\left(\text{a smartie is blue}\right) \] and so on.

Suppose that someone only likes blue smarties: they keep discarding smarties until they find the first blue one. How many smarties will they work through until they find the first blue one (including the blue one that they find, and eat!)? Since each smartie has probability \(p_B\) of being blue (and probability \(1-p_B\) of not being blue), we can model this as a sequence of independent Bernoulli trials, in which we’re waiting for the time of the first success (where finding a blue smartie is regarded as a “success”).

Letting \(X\) denote the number of smarties that this person works through (including the blue one), our assumptions imply that \(X\sim\mbox{\textup{Geom}}(p_B)\), and that \(\mathbb{E}\left[X\right] = 1/p_B\).
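If you want a quick numerical check of this, R has a built-in geometric sampler, rgeom(). Note that rgeom() counts the number of failures before the first success, so we add 1 to include the blue smartie itself. A minimal sketch, assuming the equal-colours model so that \(p_B = 1/8\):

# rgeom() counts failures before the first success,
# so add 1 to include the blue smartie itself
# (assuming the equal-colours model, p_B = 1/8)
mean(rgeom(10000, prob = 1/8) + 1)  # should be close to 1/p_B = 8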

Let’s get R to simulate the sequence of Bernoulli trials. We need to keep sampling from the vector colours_list until we see a blue. We’ve already seen (in Lab 4) how to use a for() loop to iterate through a large number of calculations, but that required us to tell R exactly how many iterations were required: here we don’t know how many samples will be needed! So we require a different kind of loop – one that will keep evaluating until some condition is satisfied. We shall use a while() loop:

num_smarties <- 0
new_smartie <- ''
while (new_smartie != 'blue') {
  # sample until get a blue smartie
  new_smartie <- sample(colours_list, 1)
  print(new_smartie) 
  # increase the number of smarties seen by 1
  num_smarties <- num_smarties + 1
}
num_smarties

Let’s pick this apart a little. At the start we set num_smarties to zero: this is going to count how many smarties we’ve seen. We also create a variable called new_smartie, which records the colour of the latest smartie that we’ve seen: this is initialised to be an empty string.

Now we get to the while() loop. R first of all evaluates the condition in the round brackets; that is, it looks at whether the expression new_smartie != 'blue' is TRUE or FALSE. (Recall from Lab 1 that the expression != means “is not equal”.) Since the empty string is not equal to the string 'blue', this test returns TRUE, and this tells R to evaluate the commands in the curly brackets.

At this point it samples a smartie colour from colours_list, and updates the value of new_smartie to equal this colour; the print command shows the new colour – this isn’t really necessary, but should help you to check that the code is functioning as expected. It then adds one to the num_smarties counter.

Now it goes back to the condition in the while() loop, and checks again to see whether new_smartie is 'blue'. We keep repeating the above steps until this test evaluates to FALSE (which will happen exactly when we have seen a blue smartie for the first time). When this happens, the code exits the loop and returns the final value of num_smarties: this tells us how many times the while() loop was evaluated, and therefore how many smarties were sampled.

We can now repeat this experiment multiple times, by combining it with a for() loop, as you’ve seen before. For example, to do this 10 times we could use the following code:

# set up vector of zeros, length 10
num_smarties <- rep(0, 10)
for (i in 1:10) { 
  # sample until get a blue smartie
  new_smartie <- ''
  while (new_smartie != 'blue') {
    new_smartie <- sample(colours_list, 1)
    # increase the number of smarties seen by 1
    num_smarties[i] <- num_smarties[i] + 1
  }
}
# look at the vector of smartie counts
num_smarties
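Under the equal-colours model we have \(p_B = 1/8\), so \(\mathbb{E}\left[X\right] = 8\). With only 10 repetitions the sample mean will be quite noisy, but we can still compare it with this theoretical value:

# compare the simulated mean with the theoretical mean
mean(num_smarties)  # theoretical value is 1/p_B = 8 under the equal-colours model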
Your turn

Answer quiz question 1.

Collecting a full set

Now suppose that, rather than waiting until we’ve seen the first blue smartie, we want to keep sampling smarties until we’ve got at least one of each colour. Let’s first use R to simulate this experiment, and then think a bit about the theory. We begin by creating a simple data frame of zeros, with one column for each colour.

collection <- data.frame(matrix(0, ncol = 8, nrow = 1))
colnames(collection) <- colours_list

We can now sample smarties, and keep track of how many we’ve seen of each colour by simply adding one to the column containing the colour that we observe. A simple way to do that is to get R to sample a number j from the vector 1:8, and then to add one to column j of collection.

But how long do we need to keep doing this? We don’t know how many samples will be needed, so we should again use a while() loop. The condition that we want R to check is whether or not there are any zeros in the data frame collection: if there is at least one zero, then there’s at least one colour smartie that we haven’t yet seen, and so we need to keep sampling.

We can check whether a particular element appears in a data frame (or vector, etc) using %in%. For example, to check whether the number 3 appears in the vector (1,2,3,4) we can do the following:

3 %in% c(1,2,3,4)

You can check that if we try the same thing with a vector not containing 3, we get an output of FALSE:

3 %in% c(1,2,5,4,6,6)

So we can put all of this together as follows:

collection <- data.frame(matrix(0, ncol = 8, nrow = 1))
colnames(collection) <- colours_list
set.seed(6)
while (0 %in% collection) { 
  # while there are any zeros in the data frame, keep sampling
  j <- sample(1:8, 1)
  collection[j] <- collection[j] + 1
}
collection
Your turn

Run the code above.
Now answer quiz question 2.

We can then look at the total number of smarties that we saw before observing at least one of each colour:

sum(collection)

This code gives us a single sample from the distribution of a random variable \(S\), which counts the total number of smarties that we see until we have observed at least one of each colour. By repeating this experiment we can estimate the distribution of \(S\).
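One possible approach (a sketch) is to wrap the while() loop inside a for() loop, just as we did when waiting for a blue smartie; here we collect 1000 samples of \(S\):

# a sketch: collect 1000 samples of S
S_samples <- rep(0, 1000)
for (k in 1:1000) {
  collection <- data.frame(matrix(0, ncol = 8, nrow = 1))
  colnames(collection) <- colours_list
  while (0 %in% collection) {
    j <- sample(1:8, 1)
    collection[j] <- collection[j] + 1
  }
  S_samples[k] <- sum(collection)
}
hist(S_samples)
mean(S_samples)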

Your turn

Answer quiz question 3.

The distribution of \(S\)

We can decompose \(S\) into a sum of random variables: \[ S = S_1 + \sum_{i=2}^8 (S_i - S_{i-1})\,. \] Here the random variable \(S_i\) records the first time that we have observed exactly \(i\) different colours of smarties. So \(S_1 = 1\) (with probability one), because once we’ve seen one smartie we will definitely have seen exactly one colour! \(S_2\) is the number of smarties until we’ve seen 2 different colours: that is random, and could take any value in the set \(\{2,3,4,\dots\}\). And of course \(S=S_8\), the time when we’ve first seen all 8 colours.

Now note that, once we’ve seen \(i-1\) colours, the number of additional smarties that we get through until we’ve seen exactly \(i\) colours has a geometric distribution: as we see each smartie we count it as a “success” if it’s a new colour, and a “failure” if it’s a colour that we’ve already observed. The probability of success, when we’ve already seen \(i-1\) colours, is \(1-(i-1)/8\). That is, \[ S_i - S_{i-1} \sim \mbox{\textup{Geom}}((9-i)/8)\,, \quad i=2,\dots,8\,. \] (Furthermore, the random variables \((S_i-S_{i-1})\) are independent. Convince yourself of this!)
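As a quick sanity check of the first of these claims, we can simulate the gap \(S_2 - S_1\): under the equal-colours model it should have a \(\mbox{\textup{Geom}}(7/8)\) distribution, with mean \(8/7 \approx 1.14\). A minimal sketch:

# a sketch: number of extra smarties needed to see a second colour
gaps <- replicate(10000, {
  first <- sample(colours_list, 1)
  extra <- 0
  new_smartie <- first
  while (new_smartie == first) {
    new_smartie <- sample(colours_list, 1)
    extra <- extra + 1
  }
  extra
})
mean(gaps)  # should be close to 8/7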

So \(\mathbb{E}\left[S_i - S_{i-1}\right]= 8/(9-i)\), and we can use linearity of expectation to deduce that

\[ \mathbb{E}\left[S\right] = 1 + \sum_{i=2}^8 \frac{8}{9-i} = 8 \left(1 + \frac{1}{2} + \frac{1}{3} + \dots + \frac{1}{8}\right) \approx 21.74 \,. \]
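We can evaluate this sum directly in R:

8 * sum(1 / (1:8))  # expected number of smarties needed, approximately 21.74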

Smarties and statistics

In Lab 4 we created random samples rather artificially, by taking samples from a large real-estate dataset. Let’s now look at a situation in which we naturally have a set of random samples, one from each student taking part in an experiment.

A few years ago, a colleague collected data from students by giving each one a small box of about 40 smarties; each box can be viewed as a sample from the total population of smarties. For each of the eight possible colours, students counted how many smarties of that colour were in their box, and these numbers were collected into a comma-separated value (.csv) file.

The data

Let’s first of all download and read in the data:

download.file("https://sbconnor.github.io/IPS/labs/ips-smarties.csv", 
              destfile = "ips-smarties.csv")
smarties <- read.csv("ips-smarties.csv")

This should create a data frame smarties that will show up in the Environment pane of RStudio. It will inform us that the data frame contains 282 observations of 12 variables. The first thing we always do after loading some data is to look at the first few observations:

head(smarties)

So we see that the data frame has one column for each colour as well as for the student’s first name and the year, hour and minute when the data was entered into the spreadsheet. Here we want to concentrate on the colour counts, so we take only the last 8 columns and assign them to a new data frame that we choose to call counts.

counts <- smarties[, 5:12]
head(counts)

Note how R allows us to address all rows of columns 5 to 12 of the smarties data frame with the notation smarties[, 5:12]. Remember (see Lab 2) that the empty slot before the comma means we want all rows, and the 5:12 after the comma means we want columns 5 to 12.

Tip

We could alternatively have written counts <- smarties[, -(1:4)], which would have told R to drop columns 1 to 4.
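You can check that the two approaches give exactly the same data frame:

identical(smarties[, 5:12], smarties[, -(1:4)])  # should return TRUE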

Next we want to know the sizes of the samples. For that we just have to add up the counts of all colours, i.e., we have to sum across each row.

n <- rowSums(counts)

To get a quick impression of the distribution of sample sizes we can look at a boxplot:

boxplot(n)

We notice that there are some extreme outliers! Let us sort the list of sample sizes to get a better idea:

sort(n)

It looks like some students were too hungry to do the count before starting to eat. Furthermore, one student has clearly mistyped: there could not possibly be 238 smarties in a box! Let us look at this entry in detail.

counts[n==238, ]

(Note how we were able to select only the row that corresponds to the observation with n=238.) It seems clear that the student typed 200 when they meant 2 for the number of red smarties. Let’s fix this.

counts$red[n==238] <- 2
n <- rowSums(counts)
boxplot(n)
Warning

Of course, tampering with data like this is dangerous. If you do it, always document what you have changed and why.

We might also want to consider removing some other outliers, but for now we’ll keep them.

Estimating probabilities

Next we calculate the estimated probabilities for the different colours that can be obtained from each sample, by dividing the counts by the sample size.

ratios <- counts/n
head(ratios)

Note again how easily R can work with entire data frames at once. Here we are dividing the data frame counts by the vector n and R has no difficulties understanding what we mean.
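To convince yourself that the division really does act row-wise (each row divided by its own sample size), you could check one column explicitly:

# each entry of column 'red' should equal the count divided by that box's size
all(ratios$red == counts$red / n)  # should return TRUE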

We see that, as expected, each sample gives different estimates for the probabilities. We can get a quick overview of the distribution of these estimates by making a boxplot.

boxplot(ratios, col = colours_list)

There is a very noticeable outlier with an unusually high proportion of blue smarties. Let’s look at the size of the sample that this observation comes from.

n[ratios$blue >= 0.4]

This is one of the unusually small samples. Let us look at the plot when we exclude the unusually small and large samples.

ratios_clean <- ratios[abs(n - mean(n)) < 10, ]
boxplot(ratios_clean, col = colours_list)

That removes a few more outliers: we’re now left with 278 observations.

We get the best possible estimate of the probabilities by combining all our samples into one big sample. To do this we just need to sum along the columns.

counts_total <- colSums(counts)
sort(counts_total)

We see that blue and violet are particularly abundant. We again get the estimates for the probabilities by dividing the counts by the sample size.

n_total <- sum(counts_total)
probs <- counts_total/n_total
barplot(probs, col = colours_list)
lines(c(0, 10), c(1/8, 1/8), lty=2) # add dashed line at height 1/8

Not all colours are equal

The sample size is now large enough to allow us to say that it is unlikely that our data would have arisen if all colours had been equally likely.

For example, let’s look at the total number of blue smarties, which came out as 1589. If the true probability were \(p_B=1/8\) then the number of blue smarties among the total of 11103 smarties would be distributed as \(\mbox{\textup{Bin}}(11103, 1/8)\). Hence we can use the distribution function of the binomial distribution to calculate the probability of getting a count as large as, or larger than, the observed one:

# P(X >= observed) = 1 - P(X <= observed - 1)
1 - pbinom(counts_total['blue'] - 1, n_total, 1/8)

We see that getting such a large number of blue smarties would be highly unlikely if the true probability for a smartie to be blue were equal to 1/8.

Your turn

The total number of brown smarties is quite low. How likely is it to get a number that low, or lower, if the probability for a smartie to be brown is 1/8?
Answer quiz question 4.

Sampling distributions

Now let us take a closer look at the sampling distributions for the probabilities. Let’s start with the yellow smarties:

hist(ratios$yellow, probability = TRUE, col = 'yellow', ylim = c(0,8))

We expect this to be approximately an i.i.d. sample from the distribution of the sample mean \(\bar{Y}_{40}\). We say “approximately” because not all samples had size exactly equal to 40, but close enough. \(\bar{Y}_{40}\) has an expectation of \(p_Y\) and a variance of \(p_Y(1-p_Y)/40\). So the histogram of the observed values should be close to the density of the normal distribution with this mean and this variance. Let us plot this density on top of the histogram:

x <- seq(min(ratios$yellow), max(ratios$yellow), length.out = 100)
lines(x, dnorm(x, mean=probs['yellow'], 
               sd=sqrt(probs['yellow']*(1-probs['yellow'])/40)))

That looks pretty reasonable. Just to reassure ourselves that it did not particularly matter that not all samples were of size exactly 40, let us also plot the densities corresponding to sample sizes 37 and 43.

lines(x, dnorm(x, mean=probs['yellow'], 
               sd=sqrt(probs['yellow']*(1-probs['yellow'])/37)))
lines(x, dnorm(x, mean=probs['yellow'], 
               sd=sqrt(probs['yellow']*(1-probs['yellow'])/43)))

These densities are all fairly similar.

Now let’s use a for loop to make such a histogram for the sampling distribution of every colour:

par(mfcol=c(2,4))
for (i in 1:8) {
    hist(ratios[[i]], probability = TRUE, col = colours_list[i], ylim = c(0,8),
         xlab = colours_list[i], main = "" )
    x <- seq(min(ratios[[i]]), max(ratios[[i]]), length.out = 100)
    lines(x, dnorm(x, mean=probs[i], sd=sqrt(probs[i]*(1-probs[i])/40) ))
}
par(mfcol=c(1,1))

The theoretical considerations agree pretty well with our data. You could also produce Q-Q plots to check the quantiles of each distribution against those of the normal distribution: see Lab 3 for a reminder of how to do this.
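For example, a minimal sketch for the yellow ratios:

qqnorm(ratios$yellow)
qqline(ratios$yellow)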

Correlations

Our model also makes predictions about the covariances between the counts. The theoretical result for the covariance between the numbers of smarties of two different colours, as we calculated in a lecture, is \(-n\) times the product of the probabilities of the two colours. That is, if \(Y\) denotes the number of yellow smarties, and \(B\) denotes the number of blue smarties, then \(\textup{Cov}\left(B,Y\right) = -np_Y p_B\), where \(p_B\) is the probability of a random smartie being blue. In particular, if our model is correct then the covariances should all be negative.

Note

This makes intuitive sense! If we observe a large number of one colour, we should expect to see fewer of another colour, given that we expect each box to contain approximately 40 smarties.

R can estimate all these covariances from the data in one go.

cov(counts)
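We can compare one of these empirical covariances with the theoretical value \(-np_Bp_Y\). The box sizes vary, so as a rough sketch we use the average box size in place of \(n\):

cov(counts)['blue', 'yellow']               # empirical covariance
-mean(n) * probs['blue'] * probs['yellow']  # theoretical value, using the average box size for n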

We see that some estimated covariances involving the number of brown smarties are positive. However, we have to take into account that our dataset is not really large enough to make very reliable statements, given the small number of brown smarties. If these positive covariances persisted in a larger dataset then we would have to modify our model and drop the assumption of independence between individual smarties. It is quite conceivable that the mixing of smarties in the factory is not perfect, so that on the conveyor belt that fills the smarties into their boxes, smarties of similar colours are still clustered together, making it more likely that consecutive smarties are of the same colour.

We can also calculate the correlation coefficients among the ratios:

cor(ratios)

Of course it is also possible that there were observational errors, due to similarities between different colours. So perhaps the strong negative correlation between orange and red is due to the misclassification of some red smarties as orange smarties.

Your turn

Extract the rows of data corresponding to counts that were made in the year 2019. Using only these samples, estimate the probabilities of the different colours.
Answer quiz question 5.

Take a look at the probability that we would observe so many blue smarties if all colours had been equally likely. There is definitely something strange going on!

Confidence intervals

Finally, we turn our attention to deriving a confidence interval for the probability of a random smartie being yellow. Under the assumption of independence between the smarties, we have that the number of yellow smarties in a box, \(Y\), has a \(\mbox{\textup{Bin}}(n, p_Y)\) distribution, where \(p_Y\) is the probability that an individual smartie is yellow, and \(n\) is the number of smarties in a box. We already know that \(Y/n\) is an unbiased estimator for the probability \(p_Y\). We now want to derive a confidence interval for \(p_Y\).

We can write \(Y\) as the sum of independent and identically distributed random variables, \[ Y = \sum_{i=1}^n Y_i\, , \] where \(Y_i\) is the indicator random variable for the event that the \(i\)th smartie in the box is yellow. Therefore we know from the central limit theorem rule of thumb that \(Y\) is approximately normally distributed \[ Y \sim\mbox{\textup{N}}(np_Y, np_Y(1 - p_Y))\, . \] We can transform this to obtain a random variable which approximately follows a standard normal distribution:

\[ Z = \frac{Y - np_Y}{\sqrt{np_Y(1 - p_Y)}} \sim \mbox{\textup{N}}(0,1)\,. \]

It follows that \[ \mathbb{P}\left(-z_{\alpha/2} < Z < z_{\alpha/2}\right) \approx 1 - \alpha\,. \] We now only have to translate the condition on \(Z\) into a condition on \(p_Y\) to get our approximate \(100(1-\alpha)\%\) confidence interval for \(p_Y\).

\[ \begin{split} &-z_{\alpha/2}<Z<z_{\alpha/2}\\ \Leftrightarrow\ &Z^2<z_{\alpha/2}^2\\ \Leftrightarrow\ &\frac{(Y-np_Y)^2}{np_Y(1-p_Y)}<z_{\alpha/2}^2\\ \Leftrightarrow\ &(Y-np_Y)^2-np_Y(1-p_Y)z_{\alpha/2}^2<0\\ \Leftrightarrow\ &L<p_Y<U\,, \end{split} \]

where \(L\) and \(U\) are the solutions of the quadratic equation \[ (Y-np_Y)^2-np_Y(1-p_Y)z_{\alpha/2}^2=0\,. \] Solving this we get \[ \begin{split} U&=\frac{Y+\frac{1}{2} z_{\alpha/2}^2+\sqrt{\frac{1}{4}z_{\alpha/2}^4+z_{\alpha/2}^2(Y-Y^2/n)}}{n+z_{\alpha/2}^2},\\ L&=\frac{Y+\frac{1}{2} z_{\alpha/2}^2-\sqrt{\frac{1}{4}z_{\alpha/2}^4+z_{\alpha/2}^2(Y-Y^2/n)}}{n+z_{\alpha/2}^2}. \end{split} \] Let’s take our observed counts of yellow smarties, and use these to calculate approximate 95% confidence intervals for \(p_Y\).

# extract counts of yellow smarties
y <- counts$yellow
# observed proportion of yellows in each box
prop_y <- y/n

# limits of 95% confidence interval for p_Y, using binomial model
z <- qnorm(0.975)
u <- (y+z^2/2+sqrt(z^4/4+z^2*(y-y^2/n)))/(n+z^2)
l <- (y+z^2/2-sqrt(z^4/4+z^2*(y-y^2/n)))/(n+z^2)

# plot these estimates and their confidence intervals
plot(NULL, type = "l", xlab = "p_y", ylab = "", 
     xlim = c(0.0, 0.45), ylim = c(0, 10))
for (i in 1:10) {
    lines(c(l[i], u[i]), c(i, i))
    points(prop_y[i], i, pch = 20)
}

Here we have calculated the values of \(u\) and \(l\) for each of the first 10 boxes of smarties, and then plotted these values as a stack of horizontal lines. We’ve added a single point on each line to show the observed proportion of yellow smarties in each box; note that the confidence intervals are not symmetric about this value.

If we wanted to see how many of these confidence intervals contain the value 0.2 (for example), we could do this as follows:

sum(l[1:10] < 0.2 & 0.2 < u[1:10])
Your turn

Edit the code above to plot 90% confidence intervals for the first 100 smartie boxes. What proportion of these intervals contain the value \(1/8\)?
Answer quiz question 6.

Finally, suppose that we group all our smartie data together and use it to estimate \(p_Y\). Our overall smartie counts were as follows:

counts_total
#>    red  green   blue yellow orange   pink violet  brown 
#>   1325   1347   1589   1410   1398   1252   1547   1235

and so our point estimate of \(p_Y\) is just \(1410/11103 \approx 0.127\). We can construct a single 95% confidence interval for \(p_Y\), using the complete set of data, as follows:

# total y count
y_total <- sum(y)
# estimate for p
p_y <- y_total/n_total 
# 95% confidence interval
z <- qnorm(0.975)
u <- (y_total+z^2/2+sqrt(z^4/4+z^2*(y_total-y_total^2/n_total)))/(n_total+z^2) 
l <- (y_total+z^2/2-sqrt(z^4/4+z^2*(y_total-y_total^2/n_total)))/(n_total+z^2) 
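# display the interval limits
c(l, u)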

This gives us an approximate 95% confidence interval of \((0.121, 0.134)\).

Your turn

Use all of the data, as above, to construct an approximate 95% confidence interval for \(p_B\), the probability of a smartie to be blue.
Answer quiz question 7.