Lecture 17 Maximum Likelihood Estimation (MLE)

Previously we always assumed that we somehow knew the population parameters, such as the mean ($\mu$) and the variance ($\sigma^2$). Starting with this lecture, we have finally reached a stage where we do NOT know the population parameters. Instead, we have a representative sample. Using the information from the sample, we “make a guess”, that is, we estimate the parameters of the population.

In the simplest case, if we are interested in the population mean $\mu$ or variance $\sigma^2$, we can just provide a single number to represent our best “guess”, or estimate, of the mean or variance. This is called point estimation. Now the question becomes: how do we provide a reasonable estimate? Intuitively, we would probably use the sample mean $\bar{x}$ as the estimate for the population mean $\mu$ and the sample variance $s^2$ as the estimate for the population variance $\sigma^2$. This is consistent with our common sense, but why does it work?

Let’s set that question aside and introduce a systematic way of making an estimate.

Maximum Likelihood Estimation (MLE)

In this lecture we will introduce an intuitive way of making an estimate: maximum likelihood estimation (MLE). The basic idea of MLE is straightforward. Once again, let’s stick to our basic principle: whenever we start to do something new, always, always start with something simple to get an intuition. We can start with a coin flipping example:

We have a coin with an unknown $\mathbb{P}(H)$. The coin is then flipped 10 times, independently. The result turns out to be “7 Hs and 3 Ts”. Based on this result we want to infer $\mathbb{P}(H)$.

We may never be sure about the true value of $\mathbb{P}(H)$. However, we have learnt quite a lot in the probability section. Using that knowledge, we can easily see that $0.5$ is more likely to be the true value of $\mathbb{P}(H)$ than, say, $0.00001$. I think you see the point here. Even though we may never be sure about the true value of $\mathbb{P}(H)$, what we can do is find the value of $\mathbb{P}(H)$ between 0 and 1 that maximises the probability, or I should use the word likelihood (explained later), of observing the result “7 Hs and 3 Ts”. We use that value as our best guess, or the estimate, for $\mathbb{P}(H)$.

That’s it. That is the basic idea of maximum likelihood estimation.
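To make the idea concrete, here is a minimal sketch that tries many candidate values of $\mathbb{P}(H)$ and keeps the one under which “7 Hs and 3 Ts” is most probable. The function name `likelihood` and the grid of candidates are illustrative choices, not part of the lecture.

```python
from math import comb

def likelihood(p, heads=7, tails=3):
    """Probability of seeing `heads` Hs and `tails` Ts in independent flips when P(H) = p."""
    n = heads + tails
    # comb(n, heads) counts the orderings; it does not depend on p,
    # so it has no effect on which p gives the maximum.
    return comb(n, heads) * p**heads * (1 - p)**tails

# Candidate values 0.00, 0.01, ..., 1.00 for P(H).
candidates = [i / 100 for i in range(101)]
best_p = max(candidates, key=likelihood)

print(likelihood(0.5))      # likelihood when P(H) = 0.5
print(likelihood(0.00001))  # far smaller than the value above
print(best_p)               # the candidate with the largest likelihood
```

On this grid the winning candidate is the observed proportion of heads, $7/10$.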

The Likelihood Function

Now that we have an intuition for the basic idea of MLE, we need to formalise it and solve it in a quantitative and systematic way.

Suppose we have some observations $x_1,x_2,\cdots,x_{n-1},x_n$ from a random sample, that is, $n$ i.i.d. random variables $X_1,X_2,\cdots,X_{n-1},X_n$ drawn from a population with some unknown parameter(s) $\theta$. We define a likelihood function denoted by:

$$\mathcal{L}(\theta;x_1,x_2,\cdots,x_{n-1},x_n)$$

It means the likelihood that the parameter(s) take the value(s) $\theta$, given that we have observed $x_1,x_2,\cdots,x_{n-1},x_n$. If we are dealing with a discrete case, the likelihood function is defined as:

$$\mathcal{L}(\theta;x_1,x_2,\cdots,x_{n-1},x_n) = \mathbb{P}(X_1=x_1,X_2=x_2,\cdots,X_{n-1}=x_{n-1},X_n=x_n)$$

Before we go further, let’s have a look at the meaning of the likelihood function. In the discrete case, the likelihood is defined as the probability of observing that $X_1$ takes the value $x_1$ AND $X_2$ takes the value $x_2$ … AND $X_n$ takes the value $x_n$, computed under the assumption that the parameter(s) equal $\theta$. Therefore, our task becomes finding the value(s) of $\theta$ that maximise the likelihood function $\mathcal{L}$.

Since $X_1,X_2,\cdots,X_{n-1},X_n$ are i.i.d., the above equation becomes:

$$ \begin{aligned} \mathcal{L}(\theta;x_1,x_2,\cdots,x_{n-1},x_n) &= \mathbb{P}(X_1 = x_1) \cdot \mathbb{P}(X_2 = x_2) \cdot \cdots \cdot \mathbb{P}(X_n=x_n) \\ &= \mathbb{P}(x_1;\theta) \cdot \mathbb{P}(x_2;\theta) \cdot \cdots \cdot \mathbb{P}(x_n;\theta)\\ &= \prod_{i=1}^{n}\mathbb{P}(x_i;\theta) \end{aligned} $$

where $\mathbb{P}(x_i;\theta)$ is the PMF of each $X_i$. The PMF is the same for every $i$, because each $X_i$ follows the population distribution.

Similarly, when we are dealing with continuous cases, we replace the PMF with PDF. Therefore, the likelihood function is generally written as follows, regardless of the type of the random variable:

$$\mathcal{L}(\theta;x_1,x_2,\cdots,x_{n-1},x_n) = \prod_{i=1}^{n}f(x_i;\theta)$$

The multiplication in the formula is kind of annoying to compute. In practice, we often look at the log likelihood function, which is the logarithm of $\mathcal{L}$. The base is not important, but we often use the natural log:

$$\ell(\theta;x_1,x_2,\cdots,x_{n-1},x_n) = \ln\mathcal{L} = \ln\prod_{i=1}^{n}f(x_i;\theta) = \sum_{i=1}^{n}\ln f(x_i;\theta)$$
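The identity above can be checked numerically. The sketch below assumes a Bernoulli population (the coin example) with a made-up sample `flips`; the helper names `pmf`, `likelihood` and `log_likelihood` are hypothetical. It also shows that, on a grid of candidates, the product and the sum of logs peak at the same value.

```python
from math import log, prod

# Hypothetical sample: 1 = heads, 0 = tails (7 Hs and 3 Ts).
flips = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]

def pmf(x, p):
    """Bernoulli PMF f(x; p)."""
    return p if x == 1 else 1 - p

def likelihood(p, data):
    """Product of the PMF over all observations."""
    return prod(pmf(x, p) for x in data)

def log_likelihood(p, data):
    """Sum of the log PMF over all observations."""
    return sum(log(pmf(x, p)) for x in data)

# Exclude p = 0 and p = 1 so that the logarithm is always defined.
candidates = [i / 100 for i in range(1, 100)]
p_from_L = max(candidates, key=lambda p: likelihood(p, flips))
p_from_ell = max(candidates, key=lambda p: log_likelihood(p, flips))

print(p_from_L, p_from_ell)
```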

Because the logarithm is monotonically increasing, $\mathcal{L}$ takes its maximum at the same value(s) of $\theta$ at which $\ell$ takes its maximum. Therefore, in practice we often want to find the value(s) of $\theta$ that maximise the log likelihood function $\ell$. How do we do that then? We set the derivative to zero:

$$\cfrac{\mathrm{d}\ell}{\mathrm{d}\theta} = 0$$

and we solve for $\theta$. We make sure the derivative is positive to the left of the solution AND negative to the right of it. Then we know that $\ell$, and hence $\mathcal{L}$, takes its maximum at the solution. We use a few examples during the lectures to demonstrate the method and you will also be asked to practise in the homework. They are all relatively easy to compute by hand. The purpose is to get you familiar with the idea, not the computation per se.
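As an illustration of this recipe, here is a sketch for the coin example with 7 Hs and 3 Ts. We write $p$ for $\mathbb{P}(H)$ (a symbol introduced only for this sketch) and treat the ten flips as one particular sequence, so no constant factor appears:

$$ \begin{aligned} \mathcal{L}(p) &= p^{7}(1-p)^{3} \\ \ell(p) &= 7\ln p + 3\ln(1-p) \\ \cfrac{\mathrm{d}\ell}{\mathrm{d}p} &= \cfrac{7}{p} - \cfrac{3}{1-p} = 0 \;\Rightarrow\; 7(1-p) = 3p \;\Rightarrow\; \hat{p} = 0.7 \end{aligned} $$

For $p < 0.7$ the derivative is positive and for $p > 0.7$ it is negative, so $\ell$ does take its maximum at $\hat{p} = 0.7$.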

Notations & Terminologies

The symbols “$;$” vs “$|$”

You sometimes will see the likelihood function written like this:

$$\mathcal{L}(\theta|x_1,x_2,\cdots,x_{n-1},x_n) = \prod_{i=1}^{n}f(x_i|\theta)$$

It has the same meaning as the notation we use here. We choose to use the semicolon “$;$” instead of the pipe “$|$”. The reason is that we already used the pipe symbol $|$ in conditional probability, such as $\mathbb{P}(A|B)$. You see, what follows a pipe symbol is a probabilistic event. Here, we are talking about parameters ($\theta$) of a distribution. The parameters are not really probabilistic events. Therefore, we stick to using the semicolon “$;$”.

Derivative Notations

When we are dealing with derivatives, there are generally three different notations:

1. Lagrange’s Notation

This is the one we first learnt in high school: $f^{\prime}$, read as “f prime”. The advantage of this notation is its simplicity. It is most commonly seen in single-variable functions. The second, third, … derivatives can be simply written as $f^{\prime\prime}$, $f^{\prime\prime\prime}$ etc.

2. Leibniz’s Notation

The derivative for $y=f(x)$ is expressed as: $\cfrac{\mathrm{d}y}{\mathrm{d}x}$, read as “d y d x”. We can also write it as: $\cfrac{\mathrm{d}f(x)}{\mathrm{d}x}$ or $\cfrac{\mathrm{d}}{\mathrm{d}x}f(x)$. The advantage of this notation is that it makes very clear which variable we are differentiating with respect to. This is especially helpful when dealing with multi-variable functions, such as:

$$z = x^2 + y^2$$

If we take the partial derivative with respect to $x$, we write $\cfrac{\partial z}{\partial x}$; if we take the partial derivative with respect to $y$, we write $\cfrac{\partial z}{\partial y}$. The variable in each case is very clear to us. Note that the $\partial$ symbol plays the same role as $\mathrm{d}$ in the single-variable case: we differentiate with respect to one variable while treating the others as constants. We simply use $\partial$ when we take a derivative of a multi-variable function.
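For the function $z = x^2 + y^2$ above, treating the other variable as a constant in each case gives:

$$ \begin{aligned} \cfrac{\partial z}{\partial x} &= 2x \\ \cfrac{\partial z}{\partial y} &= 2y \end{aligned} $$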

3. Newton’s Notation

It is written as $\dot y$, $\dot f$ etc. It is mostly used for derivatives with respect to time, for example in physics.

Estimator & Estimate

We put a hat symbol " $\hat{}$ " on top of a parameter to denote the estimator or estimate of the parameter. That is, $\hat{\theta}$ is an estimator or estimate of $\theta$. Using the population mean $\mu$ as an example, $\hat{\mu}$ is an estimator or estimate of $\mu$. During the lecture, we see that with maximum likelihood estimation, we have:

$$\boldsymbol{\hat{\mu}} = \cfrac{1}{n}\sum_{i=1}^{n}X_i \text{ or } \hat{\mu} = \cfrac{1}{n}\sum_{i=1}^{n}x_i$$

On the left is an estimator, which contains the instructions for making an estimate. Because it is a function of the random variables $X_1,\cdots,X_n$, an estimator is itself a random variable. If an estimator is obtained using maximum likelihood estimation, then it is a maximum likelihood estimator. On the right is an estimate. That is, if you follow the instructions in the estimator by putting the observed values in, you will get a number out, which is the estimate. You can see that an estimate is just the value that the estimator takes for a particular sample.
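The distinction can be sketched in code: the estimator is the rule (a function), and an estimate is the number that rule returns for one particular sample. The population used here (normal with mean 5 and standard deviation 2) and the name `mu_hat` are assumptions made purely for illustration.

```python
import random

def mu_hat(sample):
    """Estimator: the rule 'average the observations'."""
    return sum(sample) / len(sample)

random.seed(0)  # fixed seed so the run is repeatable

# Two independent samples of size 30 from the same hypothetical population.
sample_a = [random.gauss(5, 2) for _ in range(30)]
sample_b = [random.gauss(5, 2) for _ in range(30)]

# Estimates: the numbers the estimator produces for each sample.
estimate_a = mu_hat(sample_a)
estimate_b = mu_hat(sample_b)

print(estimate_a, estimate_b)
```

The same rule gives a different number for each sample, which is what it means for the estimator to be a random variable.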

When we have an estimator $\boldsymbol{\hat{\theta}}$, if $E[\boldsymbol{\hat{\theta}}]=\theta$, we say $\boldsymbol{\hat{\theta}}$ is an unbiased estimator for the parameter $\theta$.
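Unbiasedness can be illustrated, though not proved, by simulation: if we repeat the sampling many times, the average of the resulting estimates should land close to the true parameter. The sketch below assumes a normal population with $\mu = 5$ and $\sigma = 2$ and uses the sample mean as the estimator; all names and numbers are illustrative.

```python
import random
from statistics import fmean

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population parameters and sample size.
mu, sigma, n = 5.0, 2.0, 10

def sample_mean():
    """Draw one sample of size n and return its mean (one estimate of mu)."""
    return fmean(random.gauss(mu, sigma) for _ in range(n))

# Repeat the experiment many times and average the estimates.
estimates = [sample_mean() for _ in range(20000)]
average_estimate = fmean(estimates)

print(average_estimate)  # should be close to mu
```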

Likelihood vs Probability

Most of the time, these two terms are used interchangeably. However, statisticians like to distinguish between them. We use probability to talk about probabilistic events. The word likelihood is often used to describe probability distributions and models, which very often involves figuring out some parameters of the distribution. I also quote the definition from Wolfram:

Likelihood is the hypothetical probability that an event that has already occurred would yield a specific outcome. The concept differs from that of a probability in that a probability refers to the occurrence of future events, while a likelihood refers to past events with known outcomes.

References