Introduction

Pretty much most of the exercise is obtained from:

Helminth Bioinformatics (Asia) - part of Wellcome Genome Campus Advanced Course

in module six Genetic Diversity

In the first part of the exercise using sandbox.io, you should have done a good job in tacking some terminal basics, mapping using bowtie2, and have a rough idea of what a sam/bam file look like (my god! it even has a wiki page). We now move onto the second part of the exercise where you start to deal with the population genetic data which are commonly analysed using R tools.

Overview and Aim

The re-sequencing of a genome typically aims to capture information on Single Nucleotide Polymorphisms (SNPs), INsertions and DELetions (INDELs) and Copy Number Variants (CNVs) between representatives of the same species, usually in cases where a reference genome already exists (at least for a very closely related species). Whether one is dealing with different bacterial isolates, with different strains of single-celled parasites, or indeed with genomes of different human individuals, the principles are essentially the same. Instead of assembling the newly generated sequence reads de novo to produce a new genome sequence, it is easier and much faster to align or map the new sequence data to the reference genome (please note that we will use the terms “aligning” and “mapping” interchangeably). One can then readily identify SNPs, INDELs, and CNVs that distinguish closely related populations or individual organisms and may thus learn about genetic differences that may cause drug resistance or increased virulence in pathogens, or changed susceptibility to disease in humans. One important prerequisite for the mapping of sequence data to work is that the reference and the re-sequenced subject have the same genome architecture.

As shown in Fig. 1, the aims of this module are to familiarize you with tools and concepts that will allow you to:

(First part) 1. map high-throughput sequencing reads to a genome; 2. bioinformatically identify and filter single nucleotide polymorphisms in your samples; 3. visualize sequencing reads and genetic variants in your samples;

(Second part) 1. analyze patterns of genetic diversity in your data, and link these patterns to metadata to uncover biological insights in your species.

Figure 1 - Sequencing pipeline



In this exercise, you will be analyzing genetic variation in the gastrointestinal helminth Haemonchus contortus. H. contortus is an important pathogen of wild and domesticated ruminants worldwide, and has a major impact on the health and economic viability of sheep and goat farming in particular. It is also a genetically tractable model used for drug discovery, vaccine development, and anthelmintic resistance research. A chromosome-scale reference genome assembly and manually curated genome annotation are both available to download and explore at WormBase Parasite. The sequencing data you will be using in this module is from two recently published studies - Salle et al 2019 Nature Communications and Doyle et al. 2020 Communications Biology - which describe the global and genome-wide genetic diversity of H. contortus, all of which was generated at the Wellcome Sanger Institute.

Analysis of global diversity allows you to understand aspects of the species biology, such as how different populations are connected (which may be important to understand the spread of a pathogen or ongoing transmission), whether populations are growing or declining (perhaps in response to drug treatment), or the impact of selection on regions or specific genes throughout the genome. Although whole genome sequencing data was generated for these samples, we have extracted only the mitochondrial DNA-derived reads for you to work with. The main reason for this is that at this scale, the data should be able to be analysed efficiently on your computer without the need for high performance computing infrastructure and/or capacity.

Installation

You need to first install the packages successfully for the whole tutorial to work

# Try install the following:
#If there's an option during the install, type yes.
#for example, do you want to install from sources the package which needs compilation? (Yes/no/cancel) 
#type yes and enter

# In the more intermediate level of R, you will find that you use quite a lot of packages that 
# are already available for a given task. A package may also consists of additional packages. So it 
# likely that you will install a lot of packages in the end.

install.packages("vcfR")


install.packages("tidyverse")
install.packages("reshape2")
install.packages("RColorBrewer")
install.packages("patchwork")
install.packages("ggrepel")
install.packages("poppr")
install.packages("maps")
install.packages("mapplots")


# for adegenet to work in Mac users, Please make sure: 
# 1. you use R-4.1.1.pkg without the ARM64 version
# 2. Also download and install gfortran 11.2 for BigSur (macOS 11) Intel
#    from https://github.com/fxcoudert/gfortran-for-macOS/releases
install.packages("adegenet")


# Sometimes there are additional package managers for specific disciplines which allow you to install
# packages that is not available using the install.packages command. In our case, 
# Bioconductor would be a good place to search for packages in 

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ggtree")

Initiate libraries

library(tidyverse)
library(ggplot2)
library(vcfR)
library(adegenet)
library(RColorBrewer)
library(patchwork)
library(ggrepel)
library(poppr)
library(ggtree)
library(reshape2)
library(maps)
library(mapplots)

Data import

Now we want to read in a vcf file into R. You should have already seen from the sandbox.io exercise what a vcf (note: bcf is the compressed version of vcf) file looks like. Figure 2 gives some explanation of this file.

Figure 2 - Exploring the VCF file format



# Lets specify your input files that we will load into R
#-- the vcf contain the SNP data your generated with bcftools
vcf_file  <-  "https://raw.githubusercontent.com/ishengtsai/IntroToGenomics/master/exercises/all_samples.filtered.recode.vcf"


#-- metadata that describes information about the samples, such as country of origin, and GPS coordinates
metadata_file <- "https://raw.githubusercontent.com/ishengtsai/IntroToGenomics/master/exercises/sample_metadata.txt"


# Read the actual data into R
vcf <- read.vcfR(vcf_file, verbose = FALSE)
metadata <- read.table(metadata_file, header = TRUE)


# Analysis packages tend to convert files to their own formats so we will use can interpret easily.
# We will use a “genlight” format, which is good for storing variant call data. 
vcf.gl <- vcfR2genlight(vcf)

# In order to use the function properly, like the case of vcfR2genlight, a lot of parameters need 
# to be specified.
# We will add the country IDs 
# We will also set the ploidy to 1
pop(vcf.gl) <- metadata$country
ploidy(vcf.gl) <- 1

Let’s have a look at how the data is formatted

vcf.gl
##  /// GENLIGHT OBJECT /////////
## 
##  // 176 genotypes,  1,656 binary SNPs, size: 478.4 Kb
##  0 (0 %) missing data
## 
##  // Basic content
##    @gen: list of 176 SNPbin
##    @ploidy: ploidy of each individual  (range: 1-1)
## 
##  // Optional content
##    @ind.names:  176 individual labels
##    @loc.names:  1656 locus labels
##    @chromosome: factor storing chromosomes of the SNPs
##    @position: integer storing positions of the SNPs
##    @pop: population of each individual (group size range: 5-23)
##    @other: a list containing: elements without names

Have a close look at how the data is store in this object, for example

# here we use the function head() to wrap around the object, so it's only printing a subset of data
# you can try to inspect vcf.gl@ind.names and vcf.gl@pop without head
head(vcf.gl@ind.names)
## [1] "AUS_WAL_OA_001.sorted.bam" "AUS_WAL_OA_002.sorted.bam"
## [3] "AUS_WAL_OA_003.sorted.bam" "AUS_WAL_OA_004.sorted.bam"
## [5] "AUS_WAL_OA_005.sorted.bam" "AUS_WAL_OA_006.sorted.bam"
head(vcf.gl@pop)
## [1] Australia Australia Australia Australia Australia Australia
## 17 Levels: Australia Benin Brazil Cape_Verde DRC France ... United_Kingdom

Ask yourself: What does the file metadata_file look like?

Principal component analysis of genetic diversity

# Perform a PCA analysis, and we’ll have a look at it
vcf.pca <- glPca(vcf.gl, nf = 10)
vcf.pca

Figure 3 - Understanding the data generated by PCA

Next, we will extract the scores for each PC in preparation for making some figures, and add the country information to allow us to explore the data a little better

# convert scores of vcf.pca into a tibble
vcf.pca.scores <- as_tibble(vcf.pca$scores)

# add the country data into a column of vcf.pca.scores tibble
vcf.pca.scores$country <- metadata$country 


# We will also determine the variance each PC contributes the data, which will help us understand potential drivers of patterns in our dataset. Lets plot the eigenvectors to try an understand this a bit more.

barplot(100 * vcf.pca$eig / sum(vcf.pca$eig), col="green")
title(ylab = "Percent of variance explained") 
title(xlab = "Eigenvalues")

Figure 4 - Loading plot from the PCA

# Lets extract the variance associated with the top 4 PCs, so we can use them in our plots.

# first we sum all the eigenvalues
eig.total <- sum(vcf.pca$eig)

# sum the variance
PC1.variance <- formatC(head(vcf.pca$eig)[1]/eig.total * 100)
PC2.variance <- formatC(head(vcf.pca$eig)[2]/eig.total * 100)
PC3.variance <- formatC(head(vcf.pca$eig)[3]/eig.total * 100)
PC4.variance <- formatC(head(vcf.pca$eig)[4]/eig.total * 100)


# Lets check that this has worked

PC1.variance 
## [1] "36.96"

This suggests that PC1 describes 36.96% of the variance in the data, which is consistent with our previous plot.

OK, time to visualize our data and make some plots! Let’s build a plot of your data using ggplot, and explore how to incorporate additional information into the plot to make it more informative. Ggplot works by adding layers of information (hence the “+”) to build the plot.

# plot PC1 and PC2 with each sample as a point
plot12 <- ggplot(vcf.pca.scores, aes(PC1, PC2)) + geom_point()
plot12

# We’ll add some axis labels, and incorporate the variance information to describe 
# the relative importance of the spread of the data
# paste0() esseentially connects the strings together
xlabel <- paste0("PC1 variance = ",PC1.variance,"%")
ylabel <- paste0("PC2 variance = ", PC2.variance, "%")
plot12 <- plot12 + labs(x = xlabel, y = ylabel)
plot12

Finally, let’s make the figures more informative by coloring by Country

# We need some labels to describe the country of origin. 
# We will also set some colours 
cols <- colorRampPalette(brewer.pal(8, "Set1"))(17)
plot12 <- plot12 + geom_point(aes(col = country)) + scale_colour_manual(values=cols) 
plot12

Figure 5 - Viewing your PCA analysis

Now we are starting to get somewhere. Lets have a look and see what the data is telling us so far.

# Lets quickly look at PC3/PC4, and compare to the first plot.

plot34 <- ggplot(vcf.pca.scores, aes(PC3, PC4)) + 
    geom_point(aes(col = country)) + 
    labs(x = paste0("PC3 variance = ", PC3.variance,"%"), y = paste0("PC4 variance = ", PC4.variance, "%")) + 
    scale_colour_manual(values = cols) 

plot12 + plot34

Note: You may have to change the plot dimension size by dragging the window size to make it wider.

Question: How do these plots compare? What is the relative contribution of variance in the PC3/PC4 plot compared to the PC1/PC2 plot?

Some patterns are starting to emerge regarding the genetic relatedness within and between countries. However, it may be difficult to see some of the subtle features of the diversity that may be important. Lets explore the data in a slightly different way.

# Calculate the mean value of the principal components for each country. 
# We can use this to make some labels for our plots
means <- vcf.pca.scores %>% group_by(country) %>% 
  summarize(meanPC1 = mean(PC1), meanPC2 = mean(PC2),meanPC3 = mean(PC3), meanPC4 = mean(PC4))
# Lets make a slightly different plot that our first comparison of PC1 and PC2, 
plot12.2 <- ggplot(vcf.pca.scores, aes(PC1, PC2, col =  country)) + 
    labs(x = paste0("PC1 variance = ", PC1.variance, "%"), y = paste0("PC2 variance = ", PC2.variance, "%")) + 
    scale_colour_manual(values = cols) +
    stat_ellipse(level = 0.95, size = 1) +
    geom_label_repel(data = means,
    aes(means$meanPC1, means$meanPC2, col = means$country, label = means$country))

plot12 + plot12.2

In our new plot, we have added an ellipse that describes how the individual samples per country are distributed in the plot. We have also added country labels, which are positioned on the plot using the mean PC values we calculated earlier. We expect that if all samples within a country are genetically similar, we should see a small ellipse. However, if samples a not genetically similar, we will see a large ellipse.

Exploring genetic data using phylogenetic trees

PCA is a great way to explore complex datasets, including genomics data, and can help to identify drivers (sometimes even technical biases) that are shaping genetic differences between samples. However, it is a data reduction approach, and sometimes interpreting PCAs can be cryptic. Moreover, it is not a direct measure of genetic differentiation.

A more common approach to directly compare samples is to perform a pairwise analysis of genetic differences, and to visualise them using a phylogenetic tree. This is the next step in our analysis, and we will compare these results to the PCAs.

# Generated pairwise distances between samples that we will plot in a tree format
tree_data <- aboot(vcf.gl, tree = "upgma", distance = bitwise.dist, sample = 100, showtree = F, cutoff = 50) 
## 
Running bootstraps:       100 / 100
## Calculating bootstrap values... done.
#--- make and plot the tree 
tree_plot <- ggtree(tree_data) + 
    geom_tiplab(size = 2, color = cols[pop(vcf.gl)]) + 
    xlim(-0.1, 0.3) + 
    geom_nodelab(size = 2, nudge_x = -0.006, nudge_y = 1) + 
    theme_tree2(legend.position = 'centre')

tree_plot

Figure 6 - Analysis of pairwise distance using a tree

Integrating genetic and geographic data: maps

Here, we will make a map of the sampling locations, and plot the allele frequency data on it. This or similar may be used to explore how populations may be connected to each other. We will explore this by plotting SNPs that seem to have the most effect in driving the variance in the PCA plot.

Note that we will only be looking at one variant at a time, and the genetic signal that differentiate these populations is made up of many variants. However, it should give you an idea of what could be done integrating these data types.

First, lets calculate allele frequencies per country, and integrate this with the latitude and longitude coordinates to prepare to plot.

# Calculate allele frequencies per country
myDiff_pops <- genetic_diff(vcf,pops = vcf.gl@pop)
AF_data <- myDiff_pops[,c(1:19)]
AF_data <- melt(AF_data)
colnames(AF_data) <- c("CHROM","POS","country","allele_frequency")
AF_data$country <- gsub("Hs_","", AF_data$country)


# extract the latitude and longitude for each country from the metadata file
coords <- data.frame(metadata$country, metadata$latitude, metadata$longitude)
coords <- unique(coords)
colnames(coords) <- c("country","latitude","longitude")


# join the allele frequency data and the latitude/longitude data together
AF_data_coords <- dplyr::left_join(AF_data, coords, by = "country")


# lets have a look at the new data.
head(AF_data_coords)
##                              CHROM POS   country allele_frequency  latitude
## 1 hcontortus_chr_mtDNA_arrow_pilon  24 Australia             0.00 -28.90242
## 2 hcontortus_chr_mtDNA_arrow_pilon  32 Australia             0.18 -28.90242
## 3 hcontortus_chr_mtDNA_arrow_pilon  33 Australia             0.18 -28.90242
## 4 hcontortus_chr_mtDNA_arrow_pilon  48 Australia             0.00 -28.90242
## 5 hcontortus_chr_mtDNA_arrow_pilon  51 Australia             0.00 -28.90242
## 6 hcontortus_chr_mtDNA_arrow_pilon  54 Australia             0.48 -28.90242
##   longitude
## 1  151.8764
## 2  151.8764
## 3  151.8764
## 4  151.8764
## 5  151.8764
## 6  151.8764

Lets make a map, and plot the sampling locations on it. Your map should look a bit like the one below.

# map() function is slighly different to other functions. Here the code is adding one layer at a time,
# first the map, then axes, then points, and finally legends
par(fg = "black")
map("world", col = "grey85", fill = TRUE, border = FALSE)
map.axes()
points(metadata$longitude, metadata$latitude, cex = 1.5, pch = 20, col = cols[pop(vcf.gl)])
legend( x = "left", legend = unique(pop(vcf.gl)), 
        col = cols[unique(pop(vcf.gl))], lwd = "1", lty = 0,    pch = 20, box.lwd = 0, cex = 1)

We need to decide on which SNP(s) we want to plot. One approach might be to identify the variants that seem to have the greatest influence on the PC1 and PC2 variance. We can identify these in the “loadings” data set that was generated when we ran the PCA.

# we can find the loadings in the PCA of our SNP data
vcf.pca 

Figure 7 - Extracting SNP loadings from the PCA

We will make a new tibble, containing the SNP names and the loadings for the first two PCs..

snp_loadings <- as_tibble(data.frame(vcf.gl@loc.names, vcf.pca$loadings[,1:2]))
snp_loadings
## # A tibble: 1,656 × 3
##    vcf.gl.loc.names                         Axis1     Axis2
##    <chr>                                    <dbl>     <dbl>
##  1 hcontortus_chr_mtDNA_arrow_pilon_24   0.00685   0.00305 
##  2 hcontortus_chr_mtDNA_arrow_pilon_32  -0.00866  -0.0330  
##  3 hcontortus_chr_mtDNA_arrow_pilon_33  -0.0482   -0.0111  
##  4 hcontortus_chr_mtDNA_arrow_pilon_48  -0.000319 -0.00944 
##  5 hcontortus_chr_mtDNA_arrow_pilon_51  -0.00590  -0.0122  
##  6 hcontortus_chr_mtDNA_arrow_pilon_54   0.00288   0.000312
##  7 hcontortus_chr_mtDNA_arrow_pilon_60  -0.0478   -0.00200 
##  8 hcontortus_chr_mtDNA_arrow_pilon_78  -0.0485   -0.00560 
##  9 hcontortus_chr_mtDNA_arrow_pilon_111 -0.0120   -0.0434  
## 10 hcontortus_chr_mtDNA_arrow_pilon_117 -0.00151  -0.0329  
## # … with 1,646 more rows

and can arrange the SNP loadings by the Axis 1 using the following:

snp_loadings %>% arrange(desc(Axis1))
## # A tibble: 1,656 × 3
##    vcf.gl.loc.names                        Axis1     Axis2
##    <chr>                                   <dbl>     <dbl>
##  1 hcontortus_chr_mtDNA_arrow_pilon_7859  0.0514  0.00863 
##  2 hcontortus_chr_mtDNA_arrow_pilon_5006  0.0507  0.0102  
##  3 hcontortus_chr_mtDNA_arrow_pilon_13524 0.0474  0.00795 
##  4 hcontortus_chr_mtDNA_arrow_pilon_6881  0.0473  0.00577 
##  5 hcontortus_chr_mtDNA_arrow_pilon_6152  0.0472  0.00836 
##  6 hcontortus_chr_mtDNA_arrow_pilon_8428  0.0468  0.00678 
##  7 hcontortus_chr_mtDNA_arrow_pilon_12307 0.0466 -0.000488
##  8 hcontortus_chr_mtDNA_arrow_pilon_9172  0.0449  0.000482
##  9 hcontortus_chr_mtDNA_arrow_pilon_2947  0.0436  0.00466 
## 10 hcontortus_chr_mtDNA_arrow_pilon_8332  0.0432  0.0102  
## # … with 1,646 more rows

The SNP at position 7859 has the highest correlation to PC1. We will then look and see how the allele frequency of this SNP is distributed on our map.

# select a SNP of interest based on its position 
AF_SNP_coords <- AF_data_coords[AF_data_coords$POS == "7859",]


# Remake your map, but this time, we’ll add a pie chart describing 
# the population allele frequency per country. 
par(fg = "black")
map("world", col = "grey85", fill = TRUE, border = FALSE)
map.axes()
points(metadata$longitude, metadata$latitude, cex = 1.5, pch = 20, col = cols[pop(vcf.gl)])

for (i in 1:nrow(AF_SNP_coords)){ 
    add.pie(z = c(AF_SNP_coords$allele_frequency[i], 
    1-AF_SNP_coords$allele_frequency[i]), 
    x = AF_SNP_coords$longitude[i]+10, 
    y = AF_SNP_coords$latitude[i], 
    radius = 5, col = c(alpha("orange", 0.5), alpha("blue", 0.5)), labels = "") 
    }

legend(title="Country", x = "topleft", 
    legend = unique(pop(vcf.gl)), 
    col = cols[unique(pop(vcf.gl))], pch = 20, 
    box.lwd = 0, cex = 0.9)
  
legend(title="Allele frequency", x = "bottomleft", 
    legend = c("reference","variant"), 
    col = c(alpha("blue", 0.5), alpha("orange", 0.5)), pch = 15, box.lwd = 0, cex = 0.9)

You should now have a map containing both sampling locations and pie charts showing the variant allele frequency of the SNP at position 7859.

Summary

In this module, we have shown you how to:

  • map and call variants from Illumina sequencing data in a single sample and a cohort of samples
  • visualize this data in the genome browser Artemis
  • Perform some basic data exploration and population genetics using R to understand the genetic relatedness within and between samples