So far we have focused on hypothesis testing for a single population parameter, such as the population proportion $\pi$, the population mean $\mu$ and the population variance $\sigma^2$. We take one random sample from the population and test whether the parameter is equal to, greater than, or less than a specific value. This is called a one-sample test. Most of the time the specific value comes from our prior knowledge about the population.
In practice, we very often need to compare the proportions or means of two different groups. This is an even more useful test, called a two-sample test. As usual, we start with the easier case: in this lecture, we introduce how to compare two population proportions to see whether they are equal, or whether one is higher than the other. For example, suppose we want to see whether a drug can effectively treat a certain disease. We compare two groups of people, one given a placebo and the other given the drug. In this case, we are comparing the proportions of cured patients between the placebo group and the drug group. The two samples here represent two populations: one is the population treated with the placebo, the other the population treated with the drug. We want to figure out whether the proportions of cured people in the two populations are equal.
To this end, we introduce a new population parameter $\delta$, representing the difference between the parameters of interest in the two populations. In this lecture, it is the difference between the two population proportions. Therefore, our null and alternative hypotheses for a two-sided test are:
$$ \boldsymbol{H_0:} \ \;\delta = \pi_1 - \pi_2 = 0 \\ \boldsymbol{H_1:} \ \; \delta = \pi_1 - \pi_2 \neq 0 $$
Intuitively, the estimator for $\delta$ is basically the difference of the sample proportions:
$$ \boldsymbol{D} = \boldsymbol{P}_1 - \boldsymbol{P}_2 $$
Once we have our samples, we have an estimate: $d = p_1 - p_2$. Now, to (indirectly) figure out whether the difference we observed in our samples ($d$) is due to sampling variation or to a true difference between the populations, we need to compute the p-value, which for a two-sided test is:
$$ p\textmd{-value} = \mathbb{P}\left( |\boldsymbol{D}| \geqslant |d| \mid H_0 \textmd{ is true} \right) $$
Clearly, in order to calculate that, we need to figure out the distribution of $\boldsymbol{D}$, which is called the sampling distribution of the difference of the sample proportions.
Let’s say we have two samples: sample 1, with size $n_1$, comes from population 1 with parameter $\pi_1$, and sample 2, with size $n_2$, comes from population 2 with parameter $\pi_2$. From previous lectures, we know that under some assumptions, we have
$$ \boldsymbol{P}_1 \sim \boldsymbol{\mathcal{N}}\left( \pi_1, \dfrac{\pi_1(1-\pi_1)}{n_1} \right) \textmd{ and } \boldsymbol{P}_2 \sim \boldsymbol{\mathcal{N}}\left( \pi_2, \dfrac{\pi_2(1-\pi_2)}{n_2} \right) $$
Since a linear function of a normal random variable is still normal, we can see that $-\boldsymbol{P}_2 = (-1) \cdot \boldsymbol{P}_2 \sim \boldsymbol{\mathcal{N}}\left( -\pi_2, \dfrac{\pi_2(1-\pi_2)}{n_2} \right)$. Since the sum of two independent normal random variables is still normal, $\boldsymbol{D} = \boldsymbol{P}_1 - \boldsymbol{P}_2$ is also a normal random variable. We only need to figure out its mean and variance.
For the mean:
$$ \mathbb{E}[\boldsymbol{D}] = \mathbb{E}[\boldsymbol{P}_1 - \boldsymbol{P}_2] = \mathbb{E}[\boldsymbol{P}_1] - \mathbb{E}[\boldsymbol{P}_2] = \pi_1 - \pi_2 $$
Note we have two independent samples, so for the variance:
$$ \begin{aligned} \mathbb{V}\textmd{ar}(\boldsymbol{D}) &= \mathbb{V}\textmd{ar}(\boldsymbol{P}_1 - \boldsymbol{P}_2) = \mathbb{V}\textmd{ar}\left(\boldsymbol{P}_1 + [-\boldsymbol{P}_2]\right) = \mathbb{V}\textmd{ar}(\boldsymbol{P}_1) + \mathbb{V}\textmd{ar}(-\boldsymbol{P}_2) \\[7.5pt] &= \dfrac{\pi_1(1-\pi_1)}{n_1} + \dfrac{\pi_2(1-\pi_2)}{n_2} \end{aligned} $$
Therefore, we have figured out the distribution of $\boldsymbol{D}$:
$$ \boldsymbol{D} \sim \boldsymbol{\mathcal{N}}\left( \pi_1-\pi_2, \dfrac{\pi_1(1-\pi_1)}{n_1} + \dfrac{\pi_2(1-\pi_2)}{n_2} \right) $$
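We can convince ourselves of this result with a quick simulation. The following sketch (the population proportions and sample sizes are made-up illustrative values, not from the lecture) repeatedly draws two samples, records $d = p_1 - p_2$, and compares the empirical mean and variance of $\boldsymbol{D}$ with the theoretical ones:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative values (assumptions, not from the lecture)
pi1, pi2 = 0.30, 0.20
n1, n2 = 500, 400
trials = 100_000

# Simulate many pairs of samples and record d = p1 - p2 for each pair
p1_hat = rng.binomial(n1, pi1, size=trials) / n1
p2_hat = rng.binomial(n2, pi2, size=trials) / n2
d = p1_hat - p2_hat

# Theoretical mean and variance of D from the derivation above
mean_theory = pi1 - pi2
var_theory = pi1 * (1 - pi1) / n1 + pi2 * (1 - pi2) / n2

print(f"mean: simulated {d.mean():.5f} vs theory {mean_theory:.5f}")
print(f"var:  simulated {d.var():.7f} vs theory {var_theory:.7f}")
```

A histogram of `d` would also look bell-shaped, matching the normal approximation.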
Then we have:
$$ \dfrac{(\boldsymbol{P}_1 - \boldsymbol{P}_2) - (\pi_1-\pi_2)}{\sqrt{\frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n_2}}} \sim \boldsymbol{\mathcal{N}}(0,1) $$
Now we can calculate the p-value based on the standard normal distribution. When we calculate the p-value, we compute a conditional probability in a universe where the null hypothesis is true, that is, $\delta = \pi_1 - \pi_2 = 0$. Under $H_0$ the two proportions are equal, so let’s denote this common proportion simply as $\pi$. Once we have the samples, our test statistic would be:
$$ z = \dfrac{p_1 - p_2}{\sqrt{\frac{\pi(1-\pi)}{n_1} + \frac{\pi(1-\pi)}{n_2}}} = \dfrac{p_1 - p_2}{\sqrt{\pi(1-\pi) \cdot \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \sim \boldsymbol{\mathcal{N}}(0,1) $$
We have one problem now: we do not know the value of $\pi$. The intuitive thing to do is to replace it with a sample estimate. Now the question becomes: what is the best estimate of $\pi$? Remember that when we first entered the section on inferential statistics, we said that in terms of sample size, bigger is always better. Since under $H_0$ we assume the two population proportions are the same, the best estimate is the one obtained by combining the two samples:
$$ p = \dfrac{n_1p_1 + n_2p_2}{n_1 + n_2} $$
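As a quick sanity check (with made-up counts), the weighted-average form above is the same as simply pooling the raw success counts from the two samples:

```python
# Hypothetical counts (illustrative only): x1 successes out of n1, x2 out of n2
x1, n1, x2, n2 = 60, 200, 40, 200
p1, p2 = x1 / n1, x2 / n2

# Weighted-average form from the lecture...
p_weighted = (n1 * p1 + n2 * p2) / (n1 + n2)
# ...equals pooling the raw counts, since n1 * p1 = x1 and n2 * p2 = x2
p_pooled = (x1 + x2) / (n1 + n2)

print(p_weighted, p_pooled)
```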
Replacing $\pi$ with $p$, we can compute the test statistic and then the p-value.
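The whole procedure can be sketched in a few lines of Python. The function name and the drug-vs-placebo counts below are my own illustrative choices, not from the lecture; the standard normal CDF is obtained from the error function in the standard library:

```python
from math import sqrt, erf


def phi(t):
    """Standard normal CDF, expressed via the error function."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))


def two_prop_z_test(x1, n1, x2, n2):
    """Two-sided two-proportion z-test using the pooled estimate of pi.

    x1, x2 are the success counts; n1, n2 are the sample sizes.
    Returns the test statistic z and the two-sided p-value.
    """
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)  # pooled estimate of pi under H0: pi1 = pi2
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2.0 * (1.0 - phi(abs(z)))  # two-sided
    return z, p_value


# Hypothetical data (illustrative only): 60 of 200 cured with the drug,
# 40 of 200 cured with the placebo
z, p_val = two_prop_z_test(60, 200, 40, 200)
print(f"z = {z:.4f}, p-value = {p_val:.4f}")
```

At the usual $\alpha = 0.05$, a p-value below 0.05 from such a test would lead us to reject $H_0$ and conclude the two population proportions differ.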