With the data we have, we can actually do more interesting things. One obvious question is: if we get data from new individuals, can we guess which populations they come from? For the sake of practice and demonstration, let's build a machine learning model for this prediction.

First, we have some data from phase 1 of the 1000 Genomes Project, and some of those individuals appear only in phase 1. Therefore, we can use the phase 3 data above to train a model, then predict the superpopulation of the phase 1 individuals.
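To make the train-on-one-cohort, predict-on-another idea concrete, here is a minimal, dependency-free sketch. Everything in it is an assumption for illustration: the toy data are random points rather than real genotypes, the two numeric features per individual and the helper names (`make_cohort`, `fit_centroids`, `predict`) are hypothetical, and the nearest-centroid classifier is just a simple stand-in for whatever model is actually used later.

```python
import random
from collections import defaultdict
from math import dist


def make_cohort(centers, n_per_pop, spread, rng):
    """Toy stand-in for real data: points scattered around one center per superpopulation."""
    features, labels = [], []
    for pop, center in centers.items():
        for _ in range(n_per_pop):
            features.append([rng.gauss(c, spread) for c in center])
            labels.append(pop)
    return features, labels


def fit_centroids(features, labels):
    """'Train' by averaging the feature vectors of each labelled group."""
    grouped = defaultdict(list)
    for x, y in zip(features, labels):
        grouped[y].append(x)
    return {
        pop: [sum(col) / len(rows) for col in zip(*rows)]
        for pop, rows in grouped.items()
    }


def predict(centroids, features):
    """Assign each individual to the superpopulation with the closest centroid."""
    return [min(centroids, key=lambda pop: dist(centroids[pop], x)) for x in features]


rng = random.Random(0)

# Hypothetical 2-D cluster centers; real features would be derived from genotype data.
centers = {"AFR": (-4.0, 0.0), "EUR": (4.0, 3.0), "EAS": (4.0, -3.0)}

# The training cohort plays the role of phase 3, the held-out cohort the role of phase 1.
train_X, train_y = make_cohort(centers, n_per_pop=50, spread=0.5, rng=rng)
test_X, test_y = make_cohort(centers, n_per_pop=10, spread=0.5, rng=rng)

centroids = fit_centroids(train_X, train_y)
predicted = predict(centroids, test_X)

# Fraction of held-out individuals whose predicted label matches the known one.
accuracy = sum(p == t for p, t in zip(predicted, test_y)) / len(test_y)
print(f"accuracy on held-out cohort: {accuracy:.2f}")
```

The point of the sketch is the split: the model only ever sees labels from the training cohort, and the held-out cohort is used purely to check the predictions against known superpopulations.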

Now let's make some plots and predictions.

Practice