Ronald Fisher introduced the method of maximum likelihood estimation for estimating the parameters of a distribution or population. However, it was Carl Friedrich Gauss who first developed the idea of maximum likelihood. In fact, the normal PDF can be derived using the idea of maximum likelihood. We will investigate this in this lecture.

As mentioned in Lecture 12, the first question everybody has when they first encounter the normal PDF is: where does it come from? The PDF is beautiful. Why are two of the most important constants, $\pi$ and $e$, there? The eureka moment comes after reading The Evolution of the Normal Distribution by Prof. Saul Stahl and, of course, the translated version of Gauss’ Theory of the Motion of the Heavenly Bodies Moving about the Sun in Conic Sections. The most intuitive way of deriving the normal PDF is actually to follow the history and see what was known in Gauss’ time and how he derived the normal PDF as a curve to describe errors in measurements, that is, the error curve.

“Inventing” or finding the error curve is difficult. Many tried and got it wrong. Here are a few examples from Thomas Simpson and Pierre Simon Laplace:

However, following Gauss’ line of thought to derive the normal PDF is not that difficult. We just need to get familiar with the historical context and be very clear about the assumptions people had back then about errors.

Assumptions About Errors Around & Before Gauss’ Time

We have talked about some history in previous lectures. You might notice that many of the historical figures we introduced were working on astronomy. Back then, many astronomers were trying to measure and predict the trajectories and positions of different planets and stars. They were well aware that their measurements had errors due to the observers, the instruments and many other factors. Even for the same object, they would obtain different measurements. In this context, an error refers to the difference between the measurement and the real location. Around the end of the 16th century, Tycho Brahe, the astronomer known for his accurate astronomical observations, incorporated repeated measurements into his methodology, a practice adopted by many astronomers. Surprisingly, he did not really specify how he converted those different measurements into a single representative value for the true location of an object. Did he use the mean? Did he use the median? Or did he just choose one that he saw fit? We do not know.

Galileo Galilei is thought to be the first person to propose a systematic analysis of observational errors in writing. In his work Dialogue Concerning the Two Chief World Systems, Ptolemaic and Copernican in 1632, Galileo informally discussed what he called “observational errors”, which is basically what we call the “distribution of random errors” today. He provided a few statements or assumptions based on accumulated astronomical data. The ones relevant to this lecture, which I took from Hald (1986), are:

Assumptions
  1. The observations are distributed symmetrically about the true value; that is, the errors are distributed symmetrically about zero.
  2. Small errors occur more frequently than large errors.

Note that there were no formal, solid proofs of those assumptions. They were thought to be true because the data accumulated in the past seemed to support them. In addition, they were intuitive as well. For example, if an instrument is designed meticulously to measure the location of a planet or star, we are more likely to obtain a measurement around the true location. Just like when we draw a good random sample from a population, the sample is a micro-version of the population and we should not expect the sample to deviate too much from the population. If errors are random, they should be equally likely to deviate positively or negatively from the true value.

There seemed to be some pattern to the errors. In order to analyse errors systematically, people wanted to find the error curve to describe them. Using the terms we learnt during the lectures, they wanted to find the PDF of the error, since the error is clearly a continuous quantity.

How Did Gauss Do It?

Gauss was one of those people trying to find the error curve. In his seminal work Theoria motus corporum coelestium in sectionibus conicis solem ambientium (Theory of the Motion of the Heavenly Bodies Moving about the Sun in Conic Sections), Gauss used the idea of MLE (not exactly1) and his educated guess to derive the following prototype:

$$\cfrac{h}{\sqrt{\pi}}\,e^{-h^2\Delta^2}$$

where $\Delta$ is the error, which is basically $x-\mu$ in modern terms, and $h$ is a constant that “can be considered as the measure of precision of the observations”2.

Now let’s use what we learnt in Lecture 17 to briefly and partially3 reproduce Gauss’ derivation.

First, we let the true location of an object of interest be $m$. Then we have some repeated measurements $X_1,X_2,\cdots,X_{n-1},X_n$, and we know they all have some errors. Let’s denote the error using the capital Greek letter $\Epsilon$ (Epsilon): $\Epsilon_i = X_i - m$. We think the errors have some pattern that can be described by an unknown PDF: $f_{\Epsilon}(\varepsilon)$.

Based on the two assumptions mentioned above, we already know some properties of the PDF. Assumption 1 tells us that:

$$f_{\Epsilon}(\varepsilon) = f_{\Epsilon}(-\varepsilon)$$

which means $f_{\Epsilon}$ is an even function. Assumption 2 tells us that the closer $\varepsilon$ is to zero, the more likely it is to occur. We can roughly see the shape of the function as something like these curves:

Now suppose we have a set of repeated measurements $x_1,x_2,\cdots,x_{n-1},x_n$ that yield a set of errors $\varepsilon_i = x_i - m$. Using what we learnt from Lecture 17, the likelihood function on this set of data is:

$$\mathcal{L} = \prod_{i=1}^n f_{\Epsilon}(\varepsilon_i)$$

The log likelihood function is basically:

$$\ell = \sum_{i=1}^{n}\ln f_{\Epsilon}(\varepsilon_i)$$
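Just to make the notation concrete, here is a minimal numerical sketch (not part of Gauss’ derivation) showing that taking the log turns the product in $\mathcal{L}$ into the sum in $\ell$. The simulated data, the candidate error PDF and the value of $m$ below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical repeated measurements with a true location of 10
# (both the data and the candidate PDF below are purely illustrative).
x = 10.0 + rng.normal(scale=0.5, size=15)

def candidate_f(e):
    # Some candidate error PDF; here the standard normal, just as an example.
    return np.exp(-e**2 / 2) / np.sqrt(2 * np.pi)

m = 9.8         # a candidate value for the true location
eps = x - m     # the errors implied by this choice of m

L = np.prod(candidate_f(eps))            # likelihood: a product over the data
ell = np.sum(np.log(candidate_f(eps)))   # log likelihood: a sum over the data

print(L, ell, np.isclose(np.log(L), ell))  # ln(L) and ell agree
```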

Basically, we want to maximise the log likelihood function $\ell$, which, given the data, is a function of the unknown true value $m$. Therefore, we should let $\ell^{\prime} = \cfrac{\mathrm{d}\ell}{\mathrm{d}m}=0$. Recall that $\left(\ln x\right)^{\prime} = \cfrac{1}{x}$. Using the chain rule (note that $\cfrac{\mathrm{d}\varepsilon_i}{\mathrm{d}m} = -1$), we have $\cfrac{\mathrm{d}}{\mathrm{d}m}\ln f_{\Epsilon}(x_i - m) = -\cfrac{f_{\Epsilon}^{\prime}(x_i - m)}{f_{\Epsilon}(x_i - m)}$. Then the derivative of our log likelihood function becomes:

$$\ell^{\prime} = \cfrac{\mathrm{d}}{\mathrm{d}m}\left[ \sum_{i=1}^n \ln f_{\Epsilon}(\varepsilon_i) \right] = -\sum_{i=1}^{n}\cfrac{f_{\Epsilon}^{\prime}(\varepsilon_i)}{f_{\Epsilon}(\varepsilon_i)} $$

For simplicity, we let $g_{\Epsilon}(\varepsilon) = \cfrac{f_{\Epsilon}^{\prime}(\varepsilon)}{f_{\Epsilon}(\varepsilon)}$. Since we only care about where $\ell^{\prime} = 0$, the overall minus sign does not matter, and $-\ell^{\prime}$ becomes:

$$ \begin{aligned} -\ell^\prime &= \sum_{i=1}^{n}g_{\Epsilon}(\varepsilon_i) \\ &= g_{\Epsilon}(\varepsilon_1) + g_{\Epsilon}(\varepsilon_2) + \cdots + g_{\Epsilon}(\varepsilon_{n-1}) + g_{\Epsilon}(\varepsilon_n) \\ &= g_{\Epsilon}(x_1 - m) + g_{\Epsilon}(x_2 - m) + \cdots + g_{\Epsilon}(x_{n-1} - m) + g_{\Epsilon}(x_n - m) \tag{1} \end{aligned} $$

In Lecture 17, what we practised was this: we knew the PDF, and we just needed to figure out $m$ such that $\ell^\prime = 0$. In this case, we also want to let $\ell^\prime = 0$, but we know neither the PDF ($\boldsymbol{g}$ or $\boldsymbol{f}$) nor $m$. What should we do here?

Well, Gauss postulated that, in order for $\ell^\prime = 0$, $m$ must take the value of the arithmetic mean of the measurements we have. Using the terms we have been using, it means:

$$\hat{m} = \cfrac{1}{n}\sum_{i=1}^n x_i$$

There was no formal proof that the above statement was true. It was just Gauss’ educated guess. Here I quote Gauss’ original words (well … not really “original”, but the translated version, from page 258):

… It has been customary certainly to regard as an axiom the hypothesis that if any quantity has been determined by several direct observations, made under the same circumstances and with equal care, the arithmetical mean of the observed values affords the most probable value, if not rigorously, yet very nearly at least, so that it is always most safe to adhere to it …

You see, as I said in the very first lecture of this course, mathematics is not always about calculations. Sometimes, you need some good intuition and guesses.

You may not think it was a big deal for Gauss to make that guess, because it is … intuitive …, right? Actually, back then it was not so obvious or intuitive that the mean should be used. The history of people getting used to the mean is long and complicated. For now, you can check the last article in the References section to get an idea.
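Before moving on, here is a quick numerical sanity check of Gauss’ guess (again, not part of the historical derivation): if the error PDF has the shape $Ae^{-h^2\varepsilon^2}$ that we will end up deriving, the value of $m$ that maximises the log likelihood is indeed the arithmetic mean. The data and the value of $h$ below are made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)

# Made-up repeated measurements of a true location of 5 (illustrative only).
x = 5.0 + rng.normal(scale=0.3, size=25)

h = 2.0  # an arbitrary "precision" constant

def neg_log_likelihood(m):
    eps = x - m
    # Error PDF of the form A * exp(-h^2 * eps^2); the constant A only adds
    # a term that does not depend on m, so it is dropped here.
    return -np.sum(-h**2 * eps**2)

best_m = minimize_scalar(neg_log_likelihood).x
print(best_m, x.mean())  # the maximiser coincides with the arithmetic mean
```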

Now back to our work. We need to let $\ell^\prime = 0$. Using Gauss’ guess to replace $m$ with $\bar{x}$ in equation $(1)$, we have:

$$ g_{\Epsilon}(x_1 - \bar{x}) + g_{\Epsilon}(x_2 - \bar{x}) + \cdots + g_{\Epsilon}(x_{n-1} - \bar{x}) + g_{\Epsilon}(x_n - \bar{x}) = 0 \tag{2}$$

where $\bar{x}$ is the arithmetic mean of our measurements.

Equation $(2)$ must hold for any $n$ measurements, whatever their values. Therefore, let us choose:

$$ x_1 = m \textmd{, \ \ and \ \ } x_2 = x_3 = x_4 = \cdots = x_{n-1} = x_n = m - nN$$

where $N$ is just some constant number. Then we have:

$$ \begin{aligned} \bar{x} &= \cfrac{x_1 + x_2 + \cdots + x_{n-1} + x_n}{n}\\[10pt] &= \cfrac{m + (m - nN) + (m - nN) + \cdots + (m - nN)}{n}\\[10pt] &= \cfrac{m + (n-1) \cdot (m-nN) }{n}\\[10pt] &= m-(n-1)N \end{aligned} $$
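If you want to double-check the algebra above, and the differences $x_i - \bar{x}$ that we are about to use, a few lines of sympy will do; the symbols are just placeholders.

```python
import sympy as sp

m, N, n = sp.symbols('m N n', positive=True)

# Mean of one measurement equal to m and (n - 1) measurements equal to m - n*N.
xbar = (m + (n - 1) * (m - n * N)) / n
print(sp.simplify(xbar - (m - (n - 1) * N)))   # 0, so xbar = m - (n-1)N

# The differences that appear in equation (2) after substituting xbar:
print(sp.simplify(m - xbar))                    # (n - 1)*N
print(sp.simplify((m - n * N) - xbar))          # -N
```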

Putting those values back into equation $(2)$, we have:

$$ \begin{aligned} & g_{\Epsilon}(x_1 - \bar{x}) + g_{\Epsilon}(x_2 - \bar{x}) + \cdots + g_{\Epsilon}(x_{n-1} - \bar{x}) + g_{\Epsilon}(x_n - \bar{x}) = 0 \\[12.5pt] \Rightarrow \ & g_{\Epsilon}[m - (m-(n-1)N)] + g_{\Epsilon}[(m - nN) - (m-(n-1)N)] + \cdots \\ & \quad + g_{\Epsilon}[(m - nN) - (m-(n-1)N)] + g_{\Epsilon}[(m - nN) - (m-(n-1)N)] = 0 \\[12.5pt] \Rightarrow \ & g_{\Epsilon}[(n-1)N] + g_{\Epsilon}(-N) + \cdots + g_{\Epsilon}(-N) + g_{\Epsilon}(-N) = 0 \end{aligned} $$

Note that $g_{\Epsilon}(x_2-\bar{x}) = g_{\Epsilon}(x_3-\bar{x}) = \cdots = g_{\Epsilon}(x_n-\bar{x}) = g_{\Epsilon}(-N)$. There are $n-1$ of them, so we have:

$$g_{\Epsilon}[(n-1)N] + (n-1)g_{\Epsilon}(-N) = 0 \tag{3}$$

Now recall that Assumption 1 tells us that $f_{\Epsilon}(\varepsilon) = f_{\Epsilon}(-\varepsilon)$, meaning that it is an even function, so we have4:

$$f_{\Epsilon}^{\prime}(\varepsilon) = -f_{\Epsilon}^{\prime}(-\varepsilon)$$

Dividing the left-hand side by $f_{\Epsilon}(\varepsilon)$ and the right-hand side by $f_{\Epsilon}(-\varepsilon)$ (which is allowed because the two are equal by Assumption 1), we have:

$$\cfrac{f_{\Epsilon}^{\prime}(\varepsilon)}{f_{\Epsilon}(\varepsilon)} = - \cfrac{f_{\Epsilon}^{\prime}(-\varepsilon)}{f_{\Epsilon}(-\varepsilon)}$$

That means $g_{\Epsilon}(\varepsilon) = -g_{\Epsilon}(-\varepsilon)$, and hence $g_{\Epsilon}(-N) = -g_{\Epsilon}(N)$. Putting this back to equation $(3)$:

$$g_{\Epsilon}[(n-1)N] - (n-1)g_{\Epsilon}(N) = 0$$

Move the second term to the other side of the equation:

$$g_{\Epsilon}[(n-1)N] = (n-1)g_{\Epsilon}(N)$$

Now we have a relation of the form $g_{\Epsilon}(a N) = a \cdot g_{\Epsilon}(N)$ that holds for every positive integer $a = n - 1$ and every value of $N$. Together with the fact that $g_{\Epsilon}$ is odd (and assuming it is continuous), this forces the ratio $\cfrac{g_{\Epsilon}(\varepsilon)}{\varepsilon}$ to be a constant. Therefore, $g_{\Epsilon}$ must be a linear function of the form:

$$g_{\Epsilon}(\varepsilon) = k \cdot \varepsilon$$

where $k$ is some constant. Since $g_{\Epsilon}(\varepsilon) = \cfrac{f_{\Epsilon}^{\prime}(\varepsilon)}{f_{\Epsilon}(\varepsilon)}$, we have:

$$ \cfrac{f_{\Epsilon}^{\prime}(\varepsilon)}{f_{\Epsilon}(\varepsilon)} = k \cdot \varepsilon \tag{4} $$

Again, recall that $ \left[ \ln f(x) \right] ^{\prime} = \cfrac{f^{\prime}(x)}{f(x)}$, so if we integrate both sides of equation $(4)$ with respect to $\varepsilon$, we have:

$$ \begin{aligned} & \int \cfrac{f_{\Epsilon}^{\prime}(\varepsilon)}{f_{\Epsilon}(\varepsilon)}\, \mathrm{d}\varepsilon = \int k \cdot \varepsilon \mathrm{d}\varepsilon\\[12.5pt] \Rightarrow & \ln f_{\Epsilon}(\varepsilon) = \cfrac{k}{2} \, \varepsilon^2 + c \end{aligned} $$

where $k,c$ are some constants. Therefore, we have:

$$ f_{\Epsilon}(\varepsilon) = e^{\frac{k}{2}\varepsilon^2 + c} = e^c \cdot e^{\frac{k}{2}\varepsilon^2} = Ae^{\frac{k}{2}\varepsilon^2} $$

where $A, k$ are some constants.
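As a sanity check on the integration step above, we can ask sympy to solve the differential equation $f_{\Epsilon}^{\prime}(\varepsilon) = k\,\varepsilon\, f_{\Epsilon}(\varepsilon)$ directly; this is only a sketch, and `C1` in the output is simply sympy’s name for the arbitrary constant (our $A$).

```python
import sympy as sp

eps, k = sp.symbols('varepsilon k')
f = sp.Function('f')

# f'(eps) / f(eps) = k * eps  <=>  f'(eps) = k * eps * f(eps)
ode = sp.Eq(f(eps).diff(eps), k * eps * f(eps))
print(sp.dsolve(ode, f(eps)))
# expect something equivalent to f(varepsilon) = C1*exp(k*varepsilon**2/2)
```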

Finally, we are actually getting somewhere. Now we need to figure out the constants $A, k$.

Assumption 2 tells us that the smaller $|\varepsilon|$ is, the higher the chance it will occur. Therefore, $f_{\Epsilon}(\varepsilon)$ must decrease as $|\varepsilon|$ grows, which means the coefficient $\cfrac{k}{2}$ in the exponent must be a negative number (this is also required for the curve to be integrable, i.e. to be a valid PDF). Then we can let $\cfrac{k}{2} = -h^2$, where $h$ is some non-zero constant:

$$f_{\Epsilon}(\varepsilon) = Ae^{-h^2\varepsilon^2} \tag{5}$$

Since equation $(5)$ is a PDF, when we integrate from $-\infty$ to $\infty$, it must be equal to $1$. That is:

$$ \begin{aligned} \int_{-\infty}^{\infty} f_{\Epsilon}(\varepsilon) \mathrm{d}\varepsilon &= 1 \\[10pt] \int_{-\infty}^{\infty} Ae^{-h^2\varepsilon^2} \mathrm{d}\varepsilon &= 1 \\[10pt] A\int_{-\infty}^{\infty} e^{-h^2\varepsilon^2} \mathrm{d}\varepsilon &= 1 \end{aligned} $$

When Gauss was doing the derivation, Laplace had already worked out the following integrals, which we call the Gaussian integrals today:

$$ \int_{-\infty}^{\infty} e^{-x^2} \mathrm{d}x = \sqrt{\pi} \textmd{ \ \ and \ \ } \int_{-\infty}^{\infty} e^{-a^2x^2} \mathrm{d}x = \cfrac{\sqrt{\pi}}{a} $$

There are many proofs out there. One of the most popular proofs is done by squaring the integral, turning it into a double integral and converting to polar coordinates. It is highly likely that you have already come across it in your calculus class. If not, it is easy to find on the internet.
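If you would rather trust a computer for now, here is a quick numerical check of both forms of the integral (the value of $a$ is arbitrary):

```python
import numpy as np
from scipy.integrate import quad

# First form: integral of e^{-x^2} over the whole real line equals sqrt(pi).
val1, _ = quad(lambda x: np.exp(-x**2), -np.inf, np.inf)
print(val1, np.sqrt(np.pi))

# Second form: integral of e^{-a^2 x^2} equals sqrt(pi)/a, e.g. for a = 1.7.
a = 1.7
val2, _ = quad(lambda x: np.exp(-a**2 * x**2), -np.inf, np.inf)
print(val2, np.sqrt(np.pi) / a)
```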

Anyway, Gauss used Laplace’s result with $a = h$: we need $A \cdot \cfrac{\sqrt{\pi}}{h} = 1$, and hence $A = \cfrac{h}{\sqrt{\pi}}$. Therefore, the error curve that everybody had been looking for was:

$$\cfrac{h}{\sqrt{\pi}}\,e^{-h^2\varepsilon^2}$$

We can take a moment to ponder how beautiful the process and the final product are.

The Modern Form of The Normal PDF

The above PDF is still a bit different from what we are using today as the normal distribution. We still need to do a bit of work to figure out the value of $h$. That’s where the variance $\sigma^2$ comes in.

By definition, the expected value of the error under the above PDF is:

$$\mathbb{E}[\Epsilon] = \int_{-\infty}^{\infty} \varepsilon \cdot \cfrac{h}{\sqrt{\pi}}\,e^{-h^2\varepsilon^2} \mathrm{d}\varepsilon$$

It is easy to see that the integrand above is an odd function. If we integrate from $-\infty$ to $\infty$, the area under the curve on the negative side and the positive side will cancel out5, so $\mathbb{E}[\Epsilon] = 0$.

Now we use $\sigma^2$ to denote the variance of the error: $\textmd{var}(\Epsilon) = \sigma^2$. Again, by the common trick of calculating the variance, we have $\textmd{var}(\Epsilon) = \mathbb{E}[\Epsilon^2] - (\mathbb{E}[\Epsilon])^2 = \mathbb{E}[\Epsilon^2] -0^2 = \mathbb{E}[\Epsilon^2]$. Basically, the variance is $\mathbb{E}[\Epsilon^2]$. Therefore, we have the following:

$$ \begin{aligned} \mathbb{E}[\Epsilon^2] &= \sigma^2 \\ \int_{-\infty}^{\infty} \varepsilon^2 \cdot \cfrac{h}{\sqrt{\pi}}\,e^{-h^2\varepsilon^2} \mathrm{d}\varepsilon &= \sigma^2 \\[10pt] \cfrac{h}{\sqrt{\pi}} \int_{-\infty}^{\infty} \varepsilon^2 \cdot e^{-h^2\varepsilon^2} \mathrm{d}\varepsilon &= \sigma^2 \end{aligned} $$

Note that $\varepsilon^2 \cdot e^{-h^2\varepsilon^2}$ is an even function. Therefore, the above equation becomes6:

$$\cfrac{2h}{\sqrt{\pi}} \int_{0}^{\infty} \varepsilon^2 \cdot e^{-h^2\varepsilon^2} \mathrm{d}\varepsilon = \sigma^2 \tag{6}$$

Recall the integration-by-parts technique tells us:

$$\int\boldsymbol{u}(x)\boldsymbol{v}^{\prime}(x)\mathrm{d}x = \boldsymbol{u}(x)\boldsymbol{v}(x) - \int \boldsymbol{u}^{\prime}(x)\boldsymbol{v}(x)\mathrm{d}x$$

Also note that $\left( e^{-h^2\varepsilon^2} \right)^{\prime} = -2h^2 \varepsilon e^{-h^2\varepsilon^2}$. Now we could rewrite equation $(6)$ as follows:

$$ \cfrac{2h}{\sqrt{\pi}} \int_{0}^{\infty} \left( -\cfrac{1}{2h^2}\, \varepsilon \right) \cdot \left( -2h^2 \varepsilon e^{-h^2\varepsilon^2} \right) \mathrm{d}\varepsilon = \sigma^2 \tag{7} $$

Now we can let $\boldsymbol{u}(\varepsilon) = -\cfrac{1}{2h^2}\, \varepsilon$ and $\boldsymbol{v}^{\prime}(\varepsilon) = -2h^2 \varepsilon e^{-h^2\varepsilon^2}$, and hence $\boldsymbol{v}(\varepsilon) = e^{-h^2\varepsilon^2}$. Using integration by parts, equation $(7)$ becomes:

$$ \begin{aligned} \cfrac{2h}{\sqrt{\pi}} \left( \left[ \left( -\cfrac{1}{2h^2}\, \varepsilon \right) \cdot \left( e^{-h^2\varepsilon^2} \right) \right]_{0}^{\infty} - \int_{0}^{\infty} -\cfrac{1}{2h^2} \cdot e^{-h^2\varepsilon^2} \mathrm{d}\varepsilon \right) &= \sigma^2 \\[10pt] \cfrac{2h}{\sqrt{\pi}} \left( \left[ -\cfrac{\varepsilon}{2h^2 e^{h^2\varepsilon^2}} \right]_{0}^{\infty} + \cfrac{1}{2h^2} \int_{0}^{\infty} e^{-h^2\varepsilon^2} \mathrm{d}\varepsilon \right)&= \sigma^2 \tag{8} \end{aligned} $$

Note that the second term in the parentheses contains a Gaussian integral over half of the real line, so we can see that $\cfrac{1}{2h^2}\int_{0}^{\infty} e^{-h^2\varepsilon^2} \mathrm{d}\varepsilon = \cfrac{1}{2h^2} \cdot \cfrac{\sqrt{\pi}}{2h} = \cfrac{\sqrt{\pi}}{4h^3}$. For the first term in the parentheses, we can do:

$$ \begin{aligned} \left[ -\cfrac{\varepsilon}{2h^2 e^{h^2\varepsilon^2}} \right]_{0}^{\infty} &= \lim_{t \to \infty} \left[ -\cfrac{\varepsilon}{2h^2 e^{h^2\varepsilon^2}} \right]_{0}^{t} \\[12.5pt] &= \lim_{t \to \infty} \left( -\cfrac{t}{2h^2 e^{h^2t^2}} \right) - \left( -\cfrac{0}{2h^2 e^{h^2 \cdot 0^2}} \right) \\[12.5pt] &= - \lim_{t \to \infty} \left( \cfrac{t}{2h^2 e^{h^2t^2}} \right) - 0 \\[12.5pt] &= -\lim_{t \to \infty} \left( \cfrac{t}{2h^2 e^{h^2t^2}} \right) \end{aligned} $$

Here we have a situation of $\cfrac{\infty}{\infty}$. Using L’Hopital’s rule:

$$ \begin{aligned} \left[ -\cfrac{\varepsilon}{2h^2 e^{h^2\varepsilon^2}} \right]_{0}^{\infty} &= -\lim_{t \to \infty} \left( \cfrac{t}{2h^2 e^{h^2t^2}} \right) = -\lim_{t \to \infty} \left( \cfrac{t^\prime}{(2h^2 e^{h^2t^2})^\prime} \right) \\[12.5pt] &= -\lim_{t \to \infty} \left( \cfrac{1}{4h^4t e^{h^2t^2}} \right) = 0 \end{aligned} $$
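The same limit can be confirmed with sympy, treating $h$ as a positive constant:

```python
import sympy as sp

t, h = sp.symbols('t h', positive=True)

# The limit handled by L'Hopital's rule above; it should come out as 0.
print(sp.limit(t / (2 * h**2 * sp.exp(h**2 * t**2)), t, sp.oo))
```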

Putting those results back into equation $(8)$, we have:

$$ \begin{aligned} \cfrac{2h}{\sqrt{\pi}} \cdot \left( 0 + \cfrac{1}{4h^3} \sqrt{\pi} \right) &= \sigma^2 \\ \cfrac{1}{2h^2} &= \sigma^2\\ h &= \cfrac{1}{\sqrt{2}\sigma} \end{aligned} $$
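Here is a quick numerical check of the relation $\mathbb{E}[\Epsilon^2] = \cfrac{1}{2h^2}$ that we just used, with an arbitrary value of $h$:

```python
import numpy as np
from scipy.integrate import quad

h = 0.8  # an arbitrary positive value of the precision constant

pdf = lambda e: (h / np.sqrt(np.pi)) * np.exp(-h**2 * e**2)

# E[eps^2] under the error curve should equal 1 / (2 h^2), i.e. sigma^2.
second_moment, _ = quad(lambda e: e**2 * pdf(e), -np.inf, np.inf)
print(second_moment, 1 / (2 * h**2))

# Equivalently, h recovered from the variance is 1 / (sqrt(2) * sigma).
sigma = np.sqrt(second_moment)
print(1 / (np.sqrt(2) * sigma), h)
```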

Putting the value of $h$ back into Gauss’ error curve, we have:

$$\cfrac{h}{\sqrt{\pi}}\,e^{-h^2\varepsilon^2} = \cfrac{1}{\sqrt{2\pi}\sigma}\,e^{-\frac{\varepsilon^2}{2\sigma^2}}$$

Note that the error is basically $x - m$, where $m$ is the true location. In our modern terms, it is basically the true mean $\mu$. Therefore, we have the general form of the normal PDF that we are using today:

$$f_{X}(x) = \cfrac{1}{\sqrt{2\pi}\sigma}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Amazing, isn’t it?
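As a final sanity check (just a sketch), the formula we arrived at agrees with the implementation of the normal PDF in scipy, for arbitrary made-up values of $\mu$ and $\sigma$:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.5, 2.0           # arbitrary mean and standard deviation
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 9)

# Our derived formula versus scipy's normal PDF.
ours = np.exp(-(x - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
scipys = norm.pdf(x, loc=mu, scale=sigma)

print(np.allclose(ours, scipys))  # True
```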


  1. There seems to be some confusion about whether Gauss used maximum likelihood estimation or least squares to derive the normal PDF. The two methods lead to the same estimate ($\bar{x}$), but they are based on different concepts. I’m not qualified to comment on this … yet … See Gauss’s Derivation of the Normal Distribution and the Method of Least Squares, 1809 in the References for more details. For now, let’s go with MLE to see how to derive the normal PDF. ↩︎

  2. Those are Gauss’ original words translated. See the paper in the References section. ↩︎

  3. The full and rigorous proof is actually more complicated than the content presented in this post. You can check the papers in the References section. ↩︎

  4. Think about the shape of an even function if you have trouble understanding this. ↩︎

  5. Strictly speaking, what we should do here is $\int_{-\infty}^{\infty} \varepsilon \cdot \cfrac{h}{\sqrt{\pi}}\,e^{-h^2\varepsilon^2} \mathrm{d}\varepsilon = \int_{-\infty}^{0} \varepsilon \cdot \cfrac{h}{\sqrt{\pi}}\,e^{-h^2\varepsilon^2} \mathrm{d}\varepsilon + \int_{0}^{\infty} \varepsilon \cdot \cfrac{h}{\sqrt{\pi}}\,e^{-h^2\varepsilon^2} \mathrm{d}\varepsilon$ and show they converge. ↩︎

  6. Again, strictly speaking, we should do $\int_{-\infty}^{\infty} \varepsilon^2 \cdot e^{-h^2\varepsilon^2} \mathrm{d}\varepsilon = \int_{-\infty}^{0} \varepsilon^2 \cdot e^{-h^2\varepsilon^2} \mathrm{d}\varepsilon + \int_{0}^{\infty} \varepsilon^2 \cdot e^{-h^2\varepsilon^2} \mathrm{d}\varepsilon$ and show they converge. ↩︎