Section 2: Sequencing Data Analysis

Now that you have the sequencing data of yourself, let’s identify the variants in your genome compared to the reference genome. In this section, we will only focus on single nucleotide polymorphisms (SNPs). Once we did that, we could look up in the public databases and literatures to see if the variants have any effect.

2.1 Computing environment setup

The first step of doing anything is always to set up our computing environment, which involes in installing the software tools we need for our purposes. There are quite a few tools that can do what we want. Here, we just choose the following ones:

During the lecture, we talked about using the following sequence of actions to install softwares:

./configure    # collect necessary information from your machine
make           # compile the source code to binary machine code
make install   # copy the programs to your $PATH

However, this can be very difficult for beginners. Therefore, we are going to use an easier route, that is, using a package management system. Conda is one of the most popular package management systems and environment management systems across different platforms, such as Windows, MacOS and Linux. In this course, we are using a minimal and light version of it: Miniconda. Go to the download page and get the installer for your system. If you are a Windows user, it is better to get the Windows Subsystem for Linux (WSL) and install the Linux version of Miniconda uder the WSL. If you are a Mac user, it is better to install the command line tools by typing xcode-select --install in your terminal, and then download the bash version of the miniconda installer.

Then, you can simply install Miniconda by (using the Linux installer as an example):

bash Miniconda3-latest-Linux-x86_64.sh

Just accept the default settings, and you should not have any problems with the installation. Once that is done, we can create an environment by simply typing conda create -n BIOS201 in your terminal. Then, we could activate the environment by conda activate BIOS201. If this is successful, you should see the string (BIOS201) in front of your prompt, like this:

(BIOS201) xichen@DESKTOP-QAKDLCA:~$

Now, install the tools we need by simply type the following in the terminal:

conda install -c bioconda bwa samtools bcftools tabix

Those tools, together with some commonly used commands in Linux, are all we need for this section.

2.2 Read alignment and variant calling

Once the required softwares are installed, you are ready to align your reads to the reference genome. First, create a directory called bios201 by typing mkdir bios201. You can do everything inside that directory. Then, get the reference fasta file MGI358.SNP.fa.zip (password protected due to administrative reasons), and put it in your directory. Unzip it by unzip MGI358.SNP.fa.zip. Then, build the index of the reference by typing the following command, which is required for the read alignment step later:

bwa index MGI358.SNP.fa

Once it is done, you should have some new files in the directory. Your directory should look like this:

bios201
├── MGI358.SNP.fa
├── MGI358.SNP.fa.amb
├── MGI358.SNP.fa.ann
├── MGI358.SNP.fa.bwt
├── MGI358.SNP.fa.pac
└── MGI358.SNP.fa.sa

Now you can put your own sequencing reads (the .fastq.gz file) to the bios201 directory and start the read alignment. If your experiment has failed and you do not have your fastq file, you can just use one of the following data from the 1000 Genome Project:

Those data are truncated to just contain a few thousand reads around the locations we are intested in. It is small so that your laptop can handle them without any problems. I will use HG00303_EUR.fastq.gz for the demonstration. Align the reads using the following steps:

# align the reads to the reference
bwa mem -M -Y -t 2 MGI358.SNP.fa HG00303_EUR.fastq.gz > HG00303_EUR_to_MGI.sam

# filter and convert sam to bam
samtools view -SbF 3844 HG00303_EUR_to_MGI.sam > HG00303_EUR_to_MGI.bam

# sort the bam file
samtools sort HG00303_EUR_to_MGI.bam -T tmp -o HG00303_EUR_to_MGI_sorted.bam

# index the bam file
samtools index HG00303_EUR_to_MGI_sorted.bam

If you are familiar with Unix command lines, those steps can be “piped” together and done in fewer lines of code without creating those intermediate files:

# align, filter, convert and sort
bwa mem -M -Y -t 2 MGI358.SNP.fa HG00303_EUR.fastq.gz | \
    samtools view -SuF 3844 - | \
    samtools sort - -T tmp -o HG00303_EUR_to_MGI_sorted.bam

# index the bam file
samtools index HG00303_EUR_to_MGI_sorted.bam

Finally, we could identify SNPs between your DNA and the reference using bcftools:

bcftools mpileup -f MGI358.SNP.fa HG00303_EUR_to_MGI_sorted.bam | \
    bcftools call -mv - > HG00303_EUR_to_MGI_all.vcf

The output is in a Variant Call Format (VCF). You can check the specification by clicking here. Your output file should look like this (I omitted a lot of header lines here):

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.15.1+htslib-1.15.1
##bcftoolsCommand=mpileup -f MGI358.SNP.fa HG00303_EUR_to_MGI_sorted.bam
##reference=file://MGI358.SNP.fa
##contig=<ID=rs1490413,length=2084>
##contig=<ID=rs5745448,length=2078>
##contig=<ID=rs3737576,length=2079>
.
.
.
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG00303_EUR_to_MGI_sorted.bam
rs1490413       1000    .       C       T       10.7923 .       DP=1;SGB=-0.379885;FS=0;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=60      GT:PL   1/1:40,3,0
rs1490413       1024    .       G       A       225.417 .       DP=253;VDB=0;SGB=-0.693147;FS=0;MQ0F=0;AC=2;AN=2;DP4=0,0,253,0;MQ=59    GT:PL   1/1:255,255,0

Your experiments are done using a set of primers for targeted sequencing. It means you know what they are. Actually, the name of the record in the MGI358.SNP.fa file is the ID from the database dbSNP. You can see that the first column of your VCF file is basically the SNP ID (rsXXXX). This allows you to simply search the ID in the dbSNP database to see if we already have some knowledge about them.

Practice:

Examine the first 2 records in your VCF file in the dbSNP database.

2.3 Whole Genome Sequencing (WGS)

Another common scenario is that you are doing this type of experiment in a “de novo” way. You sequence your whole genome, try to identify all the SNPs in the genome and investigate if they have any effects. Now let’s perform the analysis in this scenario.

In order to do this, we need the reference sequence of the entire human genome. You can downlaod the fasta file from the UCSC Genome Browser, but the file size is very big and the interenet speed at our university is slow. I already prepared them for you. You can download the GRCh38.tar.gz file (file size: 4.1 GB) located on my workstation (campus access only) to your bios201 directory. Extract the file by simply typing tar zxvf GRCh38.tar.gz. Once it is done, you will get a directory called GRCh38 and inside it, there are seven files (a total of 8 GB):

GRCh38.fa
GRCh38.fa.amb
GRCh38.fa.ann
GRCh38.fa.bwt
GRCh38.fa.fai
GRCh38.fa.pac
GRCh38.fa.sa

The bwa index files are already there, so you don’t need to index the reference fasta again. Now, let’s pretend the fastq file you have is from the whole genome sequencing of yourself and map the reads to the whole genome and call variants. Basically, we are repreating the previous procedures using the whole genome as the reference:

# align, filter, convert and sort
bwa mem -M -Y -t 2 GRCh38/GRCh38.fa HG00303_EUR.fastq.gz | \
    samtools view -SuF 3844 - | \
    samtools sort - -T -o HG00303_EUR_to_GRCh38_sorted.bam

# index the bam file
samtools index HG00303_EUR_to_GRCh38_sorted.bam

# call variants
bcftools mpileup -f GRCh38/GRCh38.fa HG00303_EUR_to_GRCh38_sorted.bam | \
    bcftools call -mv - > HG00303_EUR_to_GRCh38_all.vcf

Now the new VCF file HG00303_EUR_to_GRCh38_all.vcf looks like this:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.9+htslib-1.9
##bcftoolsCommand=mpileup -f GRCh38/GRCh38.fa HG00303_EUR_to_GRCh38_sorted.bam
##reference=file://GRCh38/GRCh38.fa
##contig=<ID=chr1,length=248956422>
##contig=<ID=chr10,length=133797422>
##contig=<ID=chr11,length=135086622>
##contig=<ID=chr12,length=133275309>
##contig=<ID=chr13,length=114364328>
##contig=<ID=chr14,length=107043718>
##contig=<ID=chr15,length=101991189>
##contig=<ID=chr16,length=90338345>
##contig=<ID=chr17,length=83257441>
##contig=<ID=chr18,length=80373285>
##contig=<ID=chr19,length=58617616>
##contig=<ID=chr2,length=242193529>
##contig=<ID=chr20,length=64444167>
##contig=<ID=chr21,length=46709983>
##contig=<ID=chr22,length=50818468>
##contig=<ID=chr3,length=198295559>
##contig=<ID=chr4,length=190214555>
##contig=<ID=chr5,length=181538259>
##contig=<ID=chr6,length=170805979>
##contig=<ID=chr7,length=159345973>
##contig=<ID=chr8,length=145138636>
##contig=<ID=chr9,length=138394717>
##contig=<ID=chrM,length=16569>
##contig=<ID=chrX,length=156040895>
##contig=<ID=chrY,length=57227415>
.
.
.
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG00303_EUR_to_GRCh38_sorted.bam
chr1    76344917        .       A       C       74      .       DP=3;VDB=0.431015;SGB=-0.511536;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=0,0,2,1;MQ=60       GT:PL   1/1:104,9,0
chr1    196145102       .       G       A       132     .       DP=6;VDB=0.0602347;SGB=-0.590765;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=0,0,4,1;MQ=60      GT:PL   1/1:162,15,0

The content in the current VCF file is slightly different from the previous one. The first column is now the chromosome names. The second column indicates the position of the SNP. Those are useful and important pieces of information. However, do the variants have any effect on our phenotypes? Do we know anything about them? In order to figue out that, we need to compare the variants to a database containing many SNPs that we know about. If the variant in your genome is present in the database, it is relative easy to find out what it does based on literatures.

You probably noticed that the values in the third column of the current VCF file are all .. Now let’s annotate the variant with the rsXXXX ID in the dbSNP database. First, we need the variants in the database in the VCF format. Go to the dbSNP database, then click the FTP Download at the right hand side. Go all the way to the organisms -> human_9606_b151_GRCh38p7 -> VCF directory, and the 00-All.vcf.gz file is what you need. However, it is really big, and I have prepared a subset that is enough for us to use in this section. You can download the truncated version (dbSNP_subset_formatted.vcf) to your own compouter and put it under the bios201 directory.

Before we could perform the annotation, we need to compress and index the VCF files using bgzip and tabix:

# compress with bgzip
bgzip HG00303_EUR_to_GRCh38_all.vcf
bgzip dbSNP_subset_formatted.vcf

# index the file
tabix -s 1 -b 2 -e 2 -p vcf HG00303_EUR_to_GRCh38_all.vcf.gz
tabix -s 1 -b 2 -e 2 -p vcf dbSNP_subset_formatted.vcf.gz

You will see the VCF files are compressed into the .gz files and for each of them, there is a .tbi index file. Now, you are ready to annotate your own VCF using bcftools:

bcftools annotate -c ID -O v \
                  -a dbSNP_subset_formatted.vcf.gz \
                  -o HG00303_EUR_to_GRCh38_all_annotated.vcf \
                  HG00303_EUR_to_GRCh38_all.vcf.gz

The program will look at the variants in your own VCF and the dbSNP database VCF files and compare them to see if they are the same. If yes, then the program will replace the . in the third column of your VCF with the ID from the dbSNP VCF (i.e. the rsXXXX ID). The advantage is that those IDs are unique and everybody is using them. You can easily communicate with others and look at the potential effect in the database.

Now look inside your annoated VCF file, do you notice something different? You probably see that the third column of a lot of the records is replaced by the rsXXXX ID, while some of them still remain as ., which means it does not match any record in the dbSNP database. There are many reasons for this, and you can think about this by yourself.

Practice

Find the SNP rs1821380 in your annotated VCF file, can you find it or not? If yes, what does the record tell you? If not, can you think of possible reasons?
Look it up in the dbSNP database to see if we know anything about this variant
What are the frequencies of the reference and alternative alleles, respectively

2.4 Visualisation of the read alignment

Very often, the most efficient way of getting a sense of your data is the least scientific way, that is, looking at it by eyes. To visualise our sequencing reads in the genome, you can use IGV: Integrative Genomics Viewer. Go to the Download page and get the program (with Java included) for your system. Lauch the program and load the bam file generated in the previous step. As long as the bam index (the bai file) is located in the same directory of your bam file, it should be loaded without any problem.

Practice

Look at the alignments around the following SNPs and see if your DNA is different from the reference genome:

rs1821380 (chr15:39021201)
rs1800414 (chr15:27951891)
rs671 (chr12:111803962)
rs885479 (chr16:89919746)