What we have dealt with so far are all absolute quantities, such as the number of emails we received in a given time window and the body temperatures of healthy people. In many other situations, relative quantities are more meaningful. In this lecture, we introduce a new population parameter: the population proportion, representing the proportion (or fraction, percentage, etc.) of outcomes of interest in the population. To be consistent with the notation we are using in this course, we use the Greek letter $\pi$ to denote the population proportion.

Population Proportion $\boldsymbol{\pi}$

Let’s use the correct answers of multiple-choice questions (MCQs) as an example. MCQs are a common type of question in many exams. Typically, four choices (a), (b), (c) and (d) are provided and only one of them is correct. When I was a student, we were told that the test makers made sure that the correct answers were uniformly distributed among (a), (b), (c) and (d). In this case, the population can be thought of as an abstract collection of all MCQs in the world. Let’s say we are interested in those questions whose correct answer is (a). We will talk about how to deal with (a), (b), (c) and (d) altogether in Lecture 34. For now, we only focus on those questions with (a) as the correct answer. We are interested in “how many” questions there are in the whole population whose correct answer is (a). I think you can appreciate that the absolute number here is not really useful when taken out of context. It is only meaningful when compared to the other numbers, that is, the numbers of questions whose correct answer is (b), (c) or (d). Therefore, when we say “how many”, what we really mean is the fraction or proportion of questions in the population whose correct answer is (a). This population proportion, denoted by $\pi$, is the new population parameter we are analysing in this lecture.

In practice, $\pi$ is usually unknown. What we can do is draw a sample and use the data from the sample to estimate $\pi$. Let’s say we take a random sample of size $n$, and it turns out that $x$ questions in it have (a) as the correct answer. You guessed it: the point estimate for $\pi$ is simply the sample proportion. Again, to be consistent with our naming convention, we use the capital letter $P$ to denote the random variable for the sample proportion, and the lowercase letter $p$ to denote the value taken by $P$. Do not confuse these with the notation for probability.

We have:

$$P=\cfrac{X}{n}$$
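For example, with some made-up numbers: if we take a sample of $n = 50$ MCQs and find that $x = 12$ of them have (a) as the correct answer, then the observed sample proportion is $p = x/n = 12/50 = 0.24$, which is our point estimate of $\pi$.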

In order to use $P$ to get some generalised knowledge about the population parameter $\pi$, we need to know its behaviour, that is, the distribution of $P$, which we call the sampling distribution of the sample proportion.

Indicator Variable

Let’s analyse the distribution of $P$. First, we take a sample of size $n$ from the population with the parameter $\pi$. That is, the proportion of MCQs whose correct answer is (a) is $\pi$. As we discussed before, the best way of thinking about a sample is to treat it as $n$ i.i.d. random variables. In this case, we denote the random variables $I_1, I_2, \cdots, I_{n-1}, I_n$. They are slightly different from the random variables we talked about previously. Each of these random variables takes an outcome (an MCQ) from the population and returns $1$ if the correct answer is (a) and $0$ otherwise. Now think about this question: what distribution does each $I_i$ follow? Well … by definition, they all follow a Bernoulli distribution:

$$I_1, I_2, \cdots, I_{n-1}, I_n \sim Ber(\pi)$$

We call this type of random variable an indicator variable, hence the symbol $I$. Indicator variables are very useful when we want to binarise the data. Based on the properties of the Bernoulli distribution, we know the population mean and variance are $\pi$ and $\pi(1-\pi)$, respectively:

$$ \begin{aligned} \mathbb{E}[I_1] = \mathbb{E}[I_2] = \cdots = \mathbb{E}[I_{n-1}] = \mathbb{E}[I_n] &= \pi \\ \mathbb{V}\textmd{ar}(I_1) = \mathbb{V}\textmd{ar}(I_2) = \cdots = \mathbb{V}\textmd{ar}(I_{n-1}) = \mathbb{V}\textmd{ar}(I_n) &= \pi(1-\pi) \end{aligned} $$
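As a minimal sketch of how indicator variables binarise raw data (the answer labels, the value of $\pi$ and the sample size below are all made up for illustration), we can simulate a sample of MCQ correct answers and check that the empirical mean and variance of the indicators are close to $\pi$ and $\pi(1-\pi)$:

```python
import numpy as np

# A minimal sketch: binarise a made-up sample of recorded correct answers
# into indicator variables (1 if the correct answer is (a), 0 otherwise).
rng = np.random.default_rng(42)
pi = 0.25                # assumed true population proportion (illustrative)
n = 10_000               # a deliberately large sample, just for illustration
answers = rng.choice(["a", "b", "c", "d"], size=n, p=[pi, 0.25, 0.25, 0.25])

I = (answers == "a").astype(int)   # indicator variables I_1, ..., I_n

# The empirical mean and variance should be close to pi and pi * (1 - pi).
print(I.mean())   # roughly 0.25
print(I.var())    # roughly 0.25 * 0.75 = 0.1875
```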

Now let’s also introduce two other random variables: the sum $(Y)$ and the sample mean $(\bar{I})$, which will be useful soon:

$$\bar{I} = \cfrac{1}{n}\sum_{i=1}^n I_i \textmd{ and } Y = \sum_{i=1}^n I_i$$

Before we go any further, let’s look at the meanings of these two random variables. Since each random variable $I_i$ takes the value $1$ if the correct answer is (a) and $0$ otherwise, the sum $Y$ is simply the total number of MCQs in the sample whose correct answer is (a). The sample mean $\bar{I}$ is then the proportion of MCQs in the sample whose correct answer is (a), which is exactly the sample proportion we want to analyse; that is, $P=\bar{I}$. This tells us that the sample proportion is just a special type of sample mean.
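Here is a tiny, self-contained illustration with made-up $0/1$ data, showing that the sum gives the count and the mean gives the proportion:

```python
import numpy as np

# A tiny example with made-up 0/1 data: each entry indicates whether the
# correct answer of one sampled MCQ is (a).
I = np.array([1, 0, 0, 1, 0, 1, 0, 0, 0, 1])
n = len(I)

Y = I.sum()     # how many of the n questions have (a) as the correct answer (4 here)
P = I.mean()    # sample proportion, identical to Y / n (0.4 here)
print(Y, P, Y / n)
```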

The central limit theorem tells us that regardless of the population distribution, as long as the sample size $n$ is large enough, the sample mean is approximately normally distributed. Since $P$ is itself a sample mean (of the indicator variables), we have:

$$P \ \dot\sim \ \mathcal{N}\left( \mu_{P} = \pi, \, \sigma_{P}^2 = \cfrac{\pi(1-\pi)}{n} \right)$$
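As a rough check of this approximation (the values of $\pi$, $n$ and the number of repetitions below are purely illustrative), we can repeatedly draw samples of indicator variables and compare the mean and standard deviation of the simulated sample proportions with $\pi$ and $\sqrt{\pi(1-\pi)/n}$:

```python
import numpy as np

rng = np.random.default_rng(0)
pi, n, reps = 0.25, 200, 50_000   # illustrative values only

# Each row is one sample of n indicator variables;
# each row mean is one realisation of the sample proportion P.
samples = rng.binomial(1, pi, size=(reps, n))
P = samples.mean(axis=1)

print(P.mean())                    # close to pi = 0.25
print(P.std())                     # close to sqrt(pi * (1 - pi) / n)
print(np.sqrt(pi * (1 - pi) / n))  # about 0.0306
```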

Note the subscript $P$, indicating that the distribution is about the sample proportion. In the next lecture, we will see how to use this result to calculate probabilities and make interval estimates.

Normal Approximation To The Binomial

Now let’s turn our attention to the random variable $Y$, which is the sum of the $I_i$; this sum is the total number of MCQs in the sample whose correct answer is (a). In more general terms, the sum represents the number of outcomes that satisfy our criterion out of $n$ outcomes.

Since we know that $I_1, I_2, \cdots, I_{n-1}, I_n$ are i.i.d. Bernoulli random variables with the parameter $\pi$, by definition the sum $Y$ follows a binomial distribution:

$$Y \sim B(n,\pi)$$

In addition, the central limit theorem tells us that, as long as $n$ is large enough, we should expect:

$$Y \ \dot\sim \ \mathcal{N}(\mu = n\pi, \sigma^2 = n\pi(1-\pi))$$

Therefore, we can use the normal distribution to approximate the binomial distribution when $n$ is large. This is a common trick for quickly calculating probabilities. We will do some demonstrations during the lecture, and you will do some on your own in the homework.
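As a minimal sketch of such a calculation (the values of $n$, $\pi$ and the cut-off $k$ below are made up), we can compare the exact binomial probability $\Pr(Y \le k)$ with its normal approximation; the version using $k + 0.5$ applies the usual continuity correction, a common refinement when approximating a discrete distribution with a continuous one:

```python
from math import comb, erf, sqrt

def binom_cdf(k, n, pi):
    """Exact P(Y <= k) for Y ~ B(n, pi)."""
    return sum(comb(n, i) * pi**i * (1 - pi)**(n - i) for i in range(k + 1))

def norm_cdf(x, mu, sigma):
    """P(Z <= x) for Z ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# Made-up values for illustration.
n, pi, k = 400, 0.25, 110
mu, sigma = n * pi, sqrt(n * pi * (1 - pi))   # 100 and about 8.66

print(binom_cdf(k, n, pi))            # exact binomial probability
print(norm_cdf(k, mu, sigma))         # plain normal approximation
print(norm_cdf(k + 0.5, mu, sigma))   # with the usual continuity correction
```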