Numerical Measures
Lecture 3 concludes the section on descriptive statistics. In the previous lectures we saw that graphs are a good way of presenting data: a good graph gives an immediate rough idea of what the data look like. However, when data are presented in graphs, the quantitative information is lost. It would therefore be useful if we could summarise the data with just a few numbers. That is what Lecture 3 was about.
We introduced some numerical measures that summarise the data. In particular, we focused on the two most frequently used measures, which you will see repeatedly in this course and in real life:
- the mean, a measure of central tendency
- the variance and the standard deviation, measures of dispersion
In addition, some practical skills and data visualisation tricks were also introduced along the way.
The Mean
There are actually different kinds of means. The one we use most often is called the arithmetic mean, which is simply the average of the data. This is the concept you are already familiar with:
$$ \bar{x} = \cfrac{1}{n}\sum_{i=1}^{n}x_i $$
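As a quick sanity check, the formula above can be sketched in a few lines of Python (the data values here are made up for illustration):

```python
# A minimal sketch of the arithmetic mean, using only built-ins.
# The data values are made up for illustration.
data = [2.0, 4.0, 9.0]

x_bar = sum(data) / len(data)  # (2 + 4 + 9) / 3 = 5.0
```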
Another type of mean is called the geometric mean, which is defined as:
$$ GM = \sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n} = \left( \prod_{i=1}^{n}x_i \right)^{\frac{1}{n}}$$
The geometric mean is especially useful in chemistry and biology, for quantities such as the pH of a buffer, the growth rate of bacteria, or the length (in base pairs) of a gene. In those cases it is the geometric mean that is relevant. Note that computing the product directly is numerically troublesome: with many terms it quickly overflows or underflows. Therefore, the geometric mean is rarely calculated this way in practice. When we have to compute the product of many terms, one trick is to take the log:
$$ \begin{aligned} \log GM &= \log \left( \prod_{i=1}^{n}x_i \right)^{\frac{1}{n}} = \cfrac{1}{n} \log \prod_{i=1}^{n}x_i = \cfrac{1}{n} \log (x_1 \cdot x_2 \cdot x_3 \cdots x_n) \\ &= \cfrac{1}{n} ( \log x_1 + \log x_2 + \log x_3 + \cdots + \log x_n ) \\[10pt] &= \cfrac{1}{n} \sum_{i=1}^n \left( \log x_i \right) \end{aligned} $$
As you can see, the geometric mean on the original scale can be obtained indirectly by calculating the arithmetic mean on the log scale and then exponentiating back. The base of the log does not matter. From now on, when we say "the mean", we mean the arithmetic mean. Also, taking the log is a general trick when dealing with data that span several orders of magnitude. We will see this trick a lot in later sections.
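The derivation above can be checked numerically. The sketch below compares the direct definition with the log trick on a small made-up data set (the values are hypothetical, e.g. gene lengths in base pairs); both routes should agree up to floating-point error:

```python
import math

# Hypothetical data spanning several orders of magnitude.
x = [150, 2_000, 45_000, 1_200]
n = len(x)

# Direct definition: n-th root of the product
# (fine for a short list, but prone to overflow for long ones).
gm_direct = math.prod(x) ** (1 / n)

# Log trick: arithmetic mean of the logs, then exponentiate back.
gm_log = math.exp(sum(math.log(v) for v in x) / n)
```

The base of the log is irrelevant here, as noted above: using `math.log10` and `10 ** (...)` instead would give the same result.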
Weighted Average
There is one more thing to discuss about calculating means. Say we have the following data:
$$ 11, 13, 13, 11, 17, 11, 11, 13, 11, 13, 13, 11, 13, 13, 16, 11, 11, 11, 16, 11 $$
As you can see, some of the numbers repeat. We can organise them into a frequency table:
| Number | Absolute Frequency |
|---|---|
| 11 | 10 |
| 13 | 7 |
| 16 | 2 |
| 17 | 1 |
The total sum is:
$$ 11 \times 10 + 13 \times 7 + 16 \times 2 + 17 \times 1 $$
The total number of data points is:
$$ 10 + 7 + 2 + 1 $$
By definition, the mean is:
$$ \begin{aligned} \bar{x} &= \cfrac{11 \times 10 + 13 \times 7 + 16 \times 2 + 17 \times 1}{10 + 7 + 2 + 1} \\[10pt] &= \cfrac{11 \times 10 + 13 \times 7 + 16 \times 2 + 17 \times 1}{20} \end{aligned} $$
We could re-write it as:
$$ \bar{x} = \cfrac{10}{20} \times 11 + \cfrac{7}{20} \times 13 + \cfrac{2}{20} \times 16 + \cfrac{1}{20} \times 17 $$
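We can verify numerically that this frequency-weighted form gives the same answer as averaging the raw data directly:

```python
from collections import Counter

# The raw data from the example above.
data = [11, 13, 13, 11, 17, 11, 11, 13, 11, 13,
        13, 11, 13, 13, 16, 11, 11, 11, 16, 11]

freq = Counter(data)      # {11: 10, 13: 7, 16: 2, 17: 1}
n = sum(freq.values())    # 20 data points in total

mean_direct = sum(data) / len(data)
mean_weighted = sum(f / n * x for x, f in freq.items())
# Both equal 12.5.
```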
This is just each unique value multiplied by its relative frequency, summed together. More generally, suppose we have data in which the distinct value $x_i$ appears $f_i$ times, for $i = 1, \dots, n$. We can calculate the mean in the following way:
$$ \begin{aligned} \bar{x} &= \cfrac{\sum_{i=1}^n{f_i x_i}}{\sum_{i=1}^n{f_i}} \\[10pt] &= \cfrac{f_1}{\sum_{i=1}^n{f_i}} \cdot x_1 + \cfrac{f_2}{\sum_{i=1}^n{f_i}} \cdot x_2 + \cfrac{f_3}{\sum_{i=1}^n{f_i}} \cdot x_3 + \cdots + \cfrac{f_n}{\sum_{i=1}^n{f_i}} \cdot x_n \\[15pt] &= w_1 \cdot x_1 + w_2 \cdot x_2 + w_3 \cdot x_3 + \cdots w_n \cdot x_n \\[10pt] &= \sum_{i=1}^n w_i x_i \end{aligned} $$
It doesn’t matter if there are no repeated values in the data: in that case each $f_i$ is just $1$. The quantity $\sum_{i=1}^n w_i x_i$ is called the weighted average. You may ask why we introduce a new term, weight, instead of just saying relative frequency. The reason is that the term “weight” is more general. The weight $w_i$ indicates how important we think $x_i$ is in our calculation. Sometimes it can be subjective; as long as it is reasonable, we are fine with it. Very often all the weights sum up to $1$. In the previous example, relative frequencies were the intuitive and reasonable weights for calculating the mean. In other cases, relative frequencies do not apply. For example, the pixel luminance on a screen is essentially a weighted average of RGB values (see here). The weights in this case are not relative frequencies but our opinion on how our eyes perceive the brightness of each colour.
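The luminance example can be sketched as a weighted average where the weights reflect perceived brightness rather than frequency. The coefficients below are the Rec. 601 luma weights, used here as a plausible illustration (other standards, e.g. Rec. 709, use slightly different values), and the pixel is made up:

```python
# Luma as a weighted average of RGB channels.
# Weights: Rec. 601 luma coefficients (an assumption for illustration).
weights = {"R": 0.299, "G": 0.587, "B": 0.114}   # sum to 1
pixel = {"R": 200, "G": 120, "B": 40}            # a made-up pixel

luma = sum(weights[c] * pixel[c] for c in "RGB")
```

Note that green carries by far the largest weight, matching the idea that our eyes are most sensitive to green light.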
I hope you get used to this interpretation of the mean as the weighted average, because this will become very useful when we come across the calculation of the average outcome of a probabilistic event.
The Variance And The Standard Deviation
Apart from the central tendency, we also care about the dispersion of the data. The most common measure is the variance. The sample variance is calculated as:
$$ s^2 = \cfrac{1}{n-1} \sum_{i=1}^{n}(x_i - \bar{x})^2 $$
The standard deviation $s$, which has the same unit as the original data, is simply the square root of the variance. We will discuss why we divide by $n-1$ in Lecture 18. Now you might ask why we perform the calculation in this way. Why not use $\frac{1}{n} \sum_{i=1}^n|x_i - \bar{x}|$? Well… you can. However, in statistics, the fact that you can define a measure does not necessarily mean it is useful. For a measure to be useful, it needs to have nice properties that we can exploit to make inferences. In future lectures, you will see that the sample variance has such properties.
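Putting the formula into code makes the $n-1$ denominator explicit. The sketch below computes the sample variance and standard deviation for the frequency-table data from earlier in this section:

```python
# Sample variance (n - 1 denominator) and standard deviation from scratch.
data = [11, 13, 13, 11, 17, 11, 11, 13, 11, 13,
        13, 11, 13, 13, 16, 11, 11, 11, 16, 11]

n = len(data)
x_bar = sum(data) / n                               # 12.5
s2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)  # sample variance
s = s2 ** 0.5                                       # standard deviation
```

If you use NumPy, be aware that `np.var(data)` divides by $n$ by default; pass `ddof=1` to get the sample variance above.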
- Unix & Perl Primer for Biologists (Only the Unix part)
- How THIS wallpaper kills your phone by Mrwhosetheboss. Links: YouTube or Bilibili