As mentioned in our first lecture, our genome is extremely conserved. People say that the difference between any two people is about 0.1% (source). That is the number when we only talk about single nucleotide polymorphisms (SNPs). If we take insertion, deletion, inversion, translocation etc. into account, about 1.6% of our genome is different (source), which is also small. However, remember that our genome is huge: 3 billion ($3 \times 10^9$) base pairs for a haploid genome. Therefore, we can say that the difference among individuals is huge, which makes sense and follows our observation in real life. It is that small percentage of the genome that makes each of us unique.
Did you know that the study of forensics is essentially using the information from those 1.6% of the genome to identify individuals from the crime scenes? The Federal Bureau of Investigation (FBI) are routinely using 13 locations 1 in our genome to identify individuals in the United States, and Interpol is using 10 standard locations for the United Kingdom and Europe (source)2. The point here is that genomic variants are important. That’s why projects, like the 1000 Genome Project, are trying to produce a catalog of genetic variants among different populations.
In this course, we only focus on SNPs. They already tell us a lot about ourselves. Very often, those SNPs have different distributions among different populations. For example, check the SNP rs3916235
from dbSNP. You will see they have completely different allele frequencies in East Asian, South Asian, European, African and American populations. Now some of you might wonder: how many SNPs do we need to discriminate those five populations? Or a simpler question: whether the 214 loci you are testing during the course are enough to discriminate those five populations? Let’s find out.
The purpose here is to group each individual based on similarities of those SNPs, and visualise the overall structure among individuals. Here, we describe each individual as 214 SNPs, so we have a high-dimension data: 214 dimensions (well … not exactly 214, see later section). We are not good at visualising and imagining data at > 3-dimensional space. A common practice is to utilise a Dimensionality Reduction technique to reduce the dimension of the data while still preserve some sort of structure. Principal component analysis, or PCA, is pretty much the go-to method when it comes to dimensionality reduction. There are many publications using PCA on the genotype data to show population structures, which seems to be a standard thing to do. However, it is not entirely clear to me how exactly the PCA is done. For example, PCA works on numerical data. How do you convert the categorical genotype data into numerical data? Here, we are documenting exactly how we are doing this to keep a record for future reference. Again, all the analysis is done inside the bios201
directory unless stated otherwise.
1. Decide on the method
Let’s start with the method of converting the genotype data to numerical data. I found the method section by Padakanti et al. very informative. I quote from the method:
We download the variant calling files (VCF) from the phase 3 data of the 1000 Genomes Project. We select the loci according to the following criteria: (1) they are located on autosomes, (2) they are biallelic among the subjects, (3) they have valid entries in more than 90% of the subjects. 30,761,503 loci meet those conditions. The resulting data is a 30,761,503 $\times$ 2504 matrix with phased, biallelic entries $(0|0,\ 0|1,\ 1|0,\ 1|1)$, where 0 and 1 denote major and minor alleles respectively. We further add the biallelic entries of each locus to form a genotype matrix $\boldsymbol{X}$ which take values in {$0,\ 1,\ 2$}. The genotype matrix is used in PCA, and the phased data is used in tract inference analysis.
That looks very clear and makes sense. We don’t really need all of those SNPs, just the 214 SNPs we are investigating in this course.
2. Get the data from the 1000 Genome Project
Now let’s get some data from the 1000 Genome Project. Go to the IGSR FTP site, or the NCBI FTP counterpart. Here you will find the 2504 individuals in their phase 3 study. The meta data is in the integrated_call_samples_v3.20130502.ALL.panel
file. You will also see ALL.{chr}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
files. Those are the VCF, one file per chromosome. The file sizes are big, and we only need the 214 SNPs we are interested in. In theory, we could directly extract those SNPs from the VCF file using bcftools
or tabix
without downloading them. However, the internet connection in our university is not stable and we might need them in future, so let’s download them to your local machine. We use the NCBI FTP since it is faster in our university. We can organised all the url into a text file, and download the data using wget
:
wget -c -P 1kg_phase3_release_20130502_vcf \
-i 1kg_phase3_release_20130502_vcf_url.txt
We could extract those SNPs based on the locations, which should be very quick with bcftools
. However, it seems we get more accurate results if we just use the dbSNP ID. Therefore, we are doing this in a slower way:
# get the dbSNP IDs for the 214 SNPs which are basically the fasta header
grep '>' MGI358.SNP.fa | cut -c 2- > MGI358.SNP.id.txt
# print the recordes in the VCF from the 1000 Genome Project that has the dbSNP IDs of our interest
zcat 1kg_phase3_release_20130502_vcf/*.gz | grep -w -f MGI358.SNP.id.txt > tmp
# add the VCF header to the output
cat <(zcat 1kg_phase3_release_20130502_vcf/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | head -n 253 | tail -n 1) tmp > MGI358_in_1kg_phase3.tsv
# remove intermediate file
rm tmp
Now we could convert the genotypes of each individual from the MGI358_in_1kg_phase3.tsv
file into a matrix by simply doing:
python output_1kg_gt_matrix.py > 1kg_phase3_MGI358_genotype_matrix.tsv
Here is the content of output_1kg_gt_matrix.py
:
with open('MGI358_in_1kg_phase3.tsv') as fh:
header = fh.readline().strip().split('\t')
print('%s\t%s' % (header[2], '\t'.join(header[9:])))
for line in fh:
rec = line.strip().split('\t')
num = []
for gt in rec[9:]:
a1, a2 = gt.split('|')
num.append(str(int(a1) + int(a2)))
print('%s\t%s' % (rec[2], '\t'.join(num)))
3. Perform PCA
It turns out that 199 out of 214 SNPs from the MGI358.SNP.fa
are in the 2504 individuals. Therefore, the 1kg_phase3_genotype_matrix.tsv
file is a 199 $\times$ 2504 matrix. Once we have this file, we are finally ready to perform PCA using the SNPs in this course to see if they can discriminate different populations.
Like always, we need some tools for that purpose. In this case, we need specific python packages, including numpy
, pandas
and scikit-learn
. We also need matplotlib
and seaborn
for the plotting. In addition, we want to use jupyterlab
to document the code and output for the purpose of reproducibility. In this case, the exact version does not really matter, so you can just install them by:
conda install numpy pandas scikit-learn matplotlib seaborn jupyterlab
Once the packages are installed, you start with:
jupyter lab
Then a window from your browser should be open. If not, check the output from your terminal, you can simply put the url there into your web browser. The url should look like this http://localhost:8888/lab?token=59d7953433af9676733290895a80bc38c9babc2a3a985969
. Then, you can create a Python3 notebook and start coding.
First, import the necessary packages:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
When it comes to plotting, I like my figures and plots in a specific way, so I set them as (you don’t have to do this):
mpl.rcParams['axes.titlesize'] = 19
mpl.rcParams['axes.labelsize'] = 16
mpl.rcParams['legend.fontsize'] = 13
mpl.rcParams['legend.markerscale'] = 1
mpl.rcParams['xtick.labelsize'] = 14
mpl.rcParams['ytick.labelsize'] = 14
mpl.rcParams['figure.dpi'] = 100
Now we load the genotype data (1kg_phase3_MGI358_genotype_matrix.tsv
) and the meta data (integrated_call_samples_v3.20130502.ALL.panel
). We also need to make sure the samples, i.e. individuals, appear in the same order in both data:
phase3_genotype = pd.read_csv('1kg_phase3_MGI358_genotype_matrix.tsv', sep = '\t', index_col = 0)
phase3_info = pd.read_csv('1kg_phase3_release_20130502_vcf/integrated_call_samples_v3.20130502.ALL.panel',
sep = '\t', index_col = 0, usecols=[0,1,2,3])
phase3_info = phase3_info.loc[phase3_genotype.columns, :]
We could perform PCA using the scikit-learn
package. The good thing about the models in scikit-learn
is that they all pretty much have the same “grammar” to use. You always create an instance of your model. Then you use the methods associated with the instance to do certain things. Very often, it is .fit
to let the model to learn the data, then .transform
to apply dimensionality reduction.
In terms of PCA, this is how it is done:
pca = PCA(n_components=2) # create the instance
pca.fit(phase3_genotype.T) # fit the model
X_r = pca.transform(phase3_genotype.T) # apply dimensionality reduction
phase3_info['PC1'] = X_r[:, 0]
phase3_info['PC2'] = X_r[:, 1]
Now we could make the visualisation:
fig, axs = plt.subplots(figsize = (14,7), ncols=2, sharex=True, sharey=True)
sns.scatterplot(data = phase3_info, x = 'PC1', y = 'PC2', hue = 'super_pop',
edgecolor = None, palette = 'Set1', s = 5, ax = axs[0])
sns.scatterplot(data = phase3_info, x = 'PC1', y = 'PC2', hue = 'gender',
edgecolor = None, palette = 'Set1', s = 5, ax = axs[1])
for ax in axs:
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.legend(title='', frameon = False)
plt.savefig('PCA_by_superpop_gender.pdf', bbox_inches = 'tight')
Pratice
- Describe the results
- What can you say from the PCA results
The good thing about PCA is that we could investigate what causes the patterns on each PC. It seems PC1 separates African from non-African populations. Now let’s look at what features contribute to this separations:
# first we sort the loading of each feature
pca.components_[0, :].argsort()
# the above result indicates that the 67th feature contributes the most to the negative side (non-AFR)
# the 78th feature contributes the most to the positive side (AFR)
print(phase3_genotype.index[67])
print(phase3_genotype.index[78])
We get rs3916235
and rs2814778
. We could look at their values in each individual:
phase3_info['rs3916235'] = phase3_genotype.loc['rs3916235',].astype(str)
phase3_info['rs2814778'] = phase3_genotype.loc['rs2814778',].astype(str)
fig, axs = plt.subplots(figsize = (14,7), ncols=2, sharex=True, sharey=True)
sns.scatterplot(data = phase3_info, x = 'PC1', y = 'PC2', hue = 'rs3916235',
edgecolor = None, palette = 'Set1', s = 5, ax = axs[0])
sns.scatterplot(data = phase3_info, x = 'PC1', y = 'PC2', hue = 'rs2814778',
edgecolor = None, palette = 'Set2', s = 5, ax = axs[1])
for ax in axs:
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.legend(title='', frameon = False)
axs[0].set_title('rs3916235')
axs[1].set_title('rs2814778')
plt.savefig('PCA_by_SNPs.pdf', bbox_inches = 'tight')
Practice
- Check those two SNPs in dbSNP to see what they are
- Can you do similar things on PC2
All the above codes and plots can be found in this notebook. You can download it and run by yourself. Or you could have a look at an HTML version of it. They also contain a bit more stuff where we use different models to predict the super population of new data.
I hope this is helpful.
-
Note that these 13 locations are not SNPs. They are called short tandem repeats (STR). ↩︎
-
Check this perspective: Legal and public policy issues in DNA forensics ↩︎