DEseq2 Tutorial

Introduction

Pretty much most of the information are obtained from:

Analyzing RNA-seq data with DESeq2 by Michael I. Love, Simon Anders, and Wolfgang Huber Last updated 05/19/2021

as well as some more detailed explanations in

Transcriptomics module from Helminth Bioinformatics (Asia), 2021

Another more comprehensive paper on RNAseq analysis is Dundal et al’s Introduction to differential gene expression analysis using RNA-seq

One of the most common uses of transcriptomic data is possibly for differential gene expression study, which will be covered in this course. However, the extensive and high-throughput nature of the transcriptomic data means there are other potential usages. For example, it can be used to profile total RNA (e.g. miRNA and mRNA) in exosomes and other secretory products; help identify different splice isoforms; provide evidence for gene annotation and improve quality of reference genomes. Meta-transcriptome (combining and re-analysing pool of transcriptomics data from multiple experiments), and comparative gene expression between species could be seen as an extension of differential gene expression. Furthermore, genetic variation particularly SNP calling could use information from transcriptomics data which would carry SNPs from transcribed genes.

At the end of the practical, ou will have hands-on experience in visualising transcriptomic profiles and using R packages to identify differentially expressed genes and finding patterns in the data.

Installation

You need to first install the packages successfully for the whole tutorial to work

# Takes about 10 minutes
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("pasilla")
BiocManager::install("DESeq2")
BiocManager::install("apeglm")
BiocManager::install("DEGreport")
install.packages("pheatmap")
install.packages("tidyverse")

Initiate libraries

library("pasilla")
library("tidyverse")
library("DESeq2")
library("pheatmap")
library("RColorBrewer")
library("apeglm")

library("DEGreport")
#Note: In some versions of Mac DEGreport will not be loaded successfully,
#Try install Xquarz, restart Mac and R studio, and try to load again
# https://www.xquartz.org/

About DESeq2

This is an R package for performing differential expression analysis (PMID: 25516281; last time I checked it’s been cited 30k times!). It can take read count data in various forms, one of those is read count tables from HTSeq-count. The tool provides simple command lines for formatting read count data, normalization, exploring variances between samples, and performing differential expression analysis. It is one of the tools widely used for RNA-seq data analysis and it also provide detailed manual which help make it more user-friendly (http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html)

Alternative softwares

In addition to DESeq2, there are a variety of programs for detecting differentially expressed genes from tables of RNA-seq read counts. Some of these tools work in R, while some require Unix interface. Examples of these tools include EdgeR (PMID: 19910308), BaySeq (PMID: 20698981), Cuffdiff (PMID: 23222703), Sleuth PMID: 28581496 (an accompanying tool for read count data from Kallisto).

Data import

To demonstate the use of DESeqDataSetFromMatrix, we will read in count data from the pasilla package. We read in a count matrix, which we will name cts, and the sample information table, which we will name coldata.

library("pasilla")
pasCts <- system.file("extdata",
                      "pasilla_gene_counts.tsv",
                      package="pasilla", mustWork=TRUE)
pasAnno <- system.file("extdata",
                       "pasilla_sample_annotation.csv",
                       package="pasilla", mustWork=TRUE)
cts <- as.matrix(read.csv(pasCts,sep="\t",row.names="gene_id"))
coldata <- read.csv(pasAnno, row.names=1)
coldata <- coldata[,c("condition","type")]
coldata$condition <- factor(coldata$condition)
coldata$type <- factor(coldata$type)

What does the original data look like?

# gene raw counts
as_tibble(cts)

## # A tibble: 14,599 × 7
##    untreated1 untreated2 untreated3 untreated4 treated1 treated2 treated3
##         <int>      <int>      <int>      <int>    <int>    <int>    <int>
##  1          0          0          0          0        0        0        1
##  2         92        161         76         70      140       88       70
##  3          5          1          0          0        4        0        0
##  4          0          2          1          2        1        0        0
##  5       4664       8714       3564       3150     6205     3072     3334
##  6        583        761        245        310      722      299      308
##  7          0          1          0          0        0        0        0
##  8         10         11          3          3       10        7        5
##  9          0          1          0          0        0        1        1
## 10       1446       1713        615        672     1698      696      757
## # … with 14,589 more rows

#metadata
coldata

##              condition        type
## treated1fb     treated single-read
## treated2fb     treated  paired-end
## treated3fb     treated  paired-end
## untreated1fb untreated single-read
## untreated2fb untreated single-read
## untreated3fb untreated  paired-end
## untreated4fb untreated  paired-end

Note that these are not in the same order with respect to samples!

It is absolutely critical that the columns of the count matrix and the rows of the column data (information about samples) are in the same order. DESeq2 will not make guesses as to which column of the count matrix belongs to which row of the column data, these must be provided to DESeq2 already in consistent order.

rownames(coldata)

## [1] "treated1fb"   "treated2fb"   "treated3fb"   "untreated1fb" "untreated2fb"
## [6] "untreated3fb" "untreated4fb"

colnames(cts)

## [1] "untreated1" "untreated2" "untreated3" "untreated4" "treated1"  
## [6] "treated2"   "treated3"

# first check if the row names in coldata are the same as the column names of cts data
# Should return the result false
all(rownames(coldata) %in% colnames(cts))

## [1] FALSE

#We need to chop off the "fb" of the row names of coldata, so the naming is consistent.
rownames(coldata) <- sub("fb", "", rownames(coldata))
#check again
all(rownames(coldata) %in% colnames(cts))

## [1] TRUE

As the names of columns are not in the correct order as given, we need to re-arrange one or the other so that they are consistent in terms of sample order (if we do not, later functions would produce an error).

# check if order is the same
all(rownames(coldata) == colnames(cts))

## [1] FALSE

# reorder cts's columns based on row order of metadata (coldata)  
cts <- cts[, rownames(coldata)]
all(rownames(coldata) == colnames(cts))

## [1] TRUE

With the count matrix, cts, and the sample information, coldata, we can construct a DESeqDataSet:

dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design = ~ condition)
dds

## class: DESeqDataSet 
## dim: 14599 7 
## metadata(1): version
## assays(1): counts
## rownames(14599): FBgn0000003 FBgn0000008 ... FBgn0261574 FBgn0261575
## rowData names(0):
## colnames(7): treated1 treated2 ... untreated3 untreated4
## colData names(2): condition type

If you have additional feature data, it can be added to the DESeqDataSet by adding to the metadata columns of a newly constructed object. (Here we add redundant data just for demonstration, as the gene names are already the rownames of the dds.)

featureData <- data.frame(gene=rownames(cts))
mcols(dds) <- DataFrame(mcols(dds), featureData)
mcols(dds)

## DataFrame with 14599 rows and 1 column
##                    gene
##             <character>
## FBgn0000003 FBgn0000003
## FBgn0000008 FBgn0000008
## FBgn0000014 FBgn0000014
## FBgn0000015 FBgn0000015
## FBgn0000017 FBgn0000017
## ...                 ...
## FBgn0261571 FBgn0261571
## FBgn0261572 FBgn0261572
## FBgn0261573 FBgn0261573
## FBgn0261574 FBgn0261574
## FBgn0261575 FBgn0261575

Pre-filtering

While it is not necessary to pre-filter low count genes before running the DESeq2 functions, there are two reasons which make pre-filtering useful: by removing rows in which there are very few reads, we reduce the memory size of the dds data object, and we increase the speed of the transformation and testing functions within DESeq2. Here we perform a minimal pre-filtering to keep only rows that have at least 10 reads total. Note that more strict filtering to increase power is automatically applied via independent filtering on the mean of normalized counts within the results function.

keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

Note on factor levels

By default, R will choose a reference level for factors based on alphabetical order. Then, if you never tell the DESeq2 functions which level you want to compare against (e.g. which level represents the control group), the comparisons will be based on the alphabetical order of the levels. There are two solutions: you can either explicitly tell results which comparison to make using the contrast argument (this will be shown later), or you can explicitly set the factors levels. In order to see the change of reference levels reflected in the results names, you need to either run DESeq or nbinomWaldTest/nbinomLRT after the re-leveling operation. Setting the factor levels can be done in two ways, either using factor:

dds$condition <- factor(dds$condition, levels = c("untreated","treated"))

…or using relevel, just specifying the reference level:

dds$condition <- relevel(dds$condition, ref = "untreated")

About the pasilla dataset

We continue with the pasilla data constructed from the count matrix method above. This data set is from an experiment on Drosophila melanogaster cell cultures and investigated the effect of RNAi knock-down of the splicing factor pasilla (Brooks et al. 2011). The detailed transcript of the production of the pasilla data is provided in the vignette of the data package pasilla

Differential expression analysis

The standard differential expression analysis steps are wrapped into a single function, DESeq. The estimation steps performed by this function are described below, in the manual page for ?DESeq and in the Methods section of the DESeq2 publication (Love, Huber, and Anders 2014).

Results tables are generated using the function results, which extracts a results table with log2 fold changes, p values and adjusted p values. With no additional arguments to results, the log2 fold change and Wald test p value will be for the last variable in the design formula, and if this is a factor, the comparison will be the last level of this variable over the reference level (see previous note on factor levels). However, the order of the variables of the design do not matter so long as the user specifies the comparison to build a results table for, using the name or contrast arguments of results.

Details about the comparison are printed to the console, directly above the results table. The text, condition treated vs untreated, tells you that the estimates are of the logarithmic fold change log2(treated/untreated).

In its simplest view, differential expression analysis test for differences between 2 or more groups when we have one or more factors that could contribute to the observed differences. However, in transcriptomic data, we often have a much lower number of replicates, data are not normally distributed, and we are testing differences in many genes at the same time. Methods and tools for analysing differential gene expression; therefore, have some specific terminology and concepts as we shall see next!

Genes differentially expressed between conditions

Using the DESeq dds object we created earlier, we can look at the differentially expressed genes using results() function. By default, DESeq2 perform pair-wise comparison of the first and the last variable in the experimental design variables and provide a result table.

## probably the most important command :)
dds <- DESeq(dds)

res <- results(dds)
res

## log2 fold change (MLE): condition treated vs untreated 
## Wald test p-value: condition treated vs untreated 
## DataFrame with 9921 rows and 6 columns
##               baseMean log2FoldChange     lfcSE       stat    pvalue      padj
##              <numeric>      <numeric> <numeric>  <numeric> <numeric> <numeric>
## FBgn0000008   95.14429     0.00227644  0.223729   0.010175 0.9918817  0.997211
## FBgn0000014    1.05652    -0.49512039  2.143186  -0.231021 0.8172987        NA
## FBgn0000017 4352.55357    -0.23991894  0.126337  -1.899041 0.0575591  0.288002
## FBgn0000018  418.61048    -0.10467391  0.148489  -0.704927 0.4808558  0.826834
## FBgn0000024    6.40620     0.21084779  0.689588   0.305759 0.7597879  0.943501
## ...                ...            ...       ...        ...       ...       ...
## FBgn0261570 3208.38861      0.2955329  0.127350  2.3206264  0.020307  0.144240
## FBgn0261572    6.19719     -0.9588230  0.775315 -1.2366888  0.216203  0.607848
## FBgn0261573 2240.97951      0.0127194  0.113300  0.1122634  0.910615  0.982657
## FBgn0261574 4857.68037      0.0153924  0.192567  0.0799327  0.936291  0.988179
## FBgn0261575   10.68252      0.1635705  0.930911  0.1757102  0.860522  0.967928

The result table contains several columns, of which the most relevant are:

Rowname: indicating gene id
Column 1: baseMean, average expression level across all samples normalised by sequencing depth
Column 2: log2FoldChange, in this table above, of treated vs untreated
Column 6: padj, adjusted p-value, p-value corrected for multiple testing

Note that we could have specified the coefficient or contrast we want to build a results table for, using either of the following equivalent commands:

res <- results(dds, name="condition_treated_vs_untreated")
res <- results(dds, contrast=c("condition","treated","untreated"))

Log fold change shrinkage for visualization and ranking

Shrinkage of effect size (LFC estimates) is useful for visualization and ranking of genes. To shrink the LFC, we pass the dds object to the function lfcShrink. Below we specify to use the apeglm method for effect size shrinkage (Zhu, Ibrahim, and Love 2018), which improves on the previous estimator.

We provide the dds object and the name or number of the coefficient we want to shrink, where the number refers to the order of the coefficient as it appears in resultsNames(dds).

resultsNames(dds)

## [1] "Intercept"                      "condition_treated_vs_untreated"

resLFC <- lfcShrink(dds, coef="condition_treated_vs_untreated", type="apeglm")
# noticed any differences to the values in lfcSE compared to res ?
resLFC

## log2 fold change (MAP): condition treated vs untreated 
## Wald test p-value: condition treated vs untreated 
## DataFrame with 9921 rows and 5 columns
##               baseMean log2FoldChange     lfcSE    pvalue      padj
##              <numeric>      <numeric> <numeric> <numeric> <numeric>
## FBgn0000008   95.14429     0.00119920  0.151897 0.9918817  0.997211
## FBgn0000014    1.05652    -0.00473412  0.205468 0.8172987        NA
## FBgn0000017 4352.55357    -0.18989990  0.120377 0.0575591  0.288002
## FBgn0000018  418.61048    -0.06995753  0.123901 0.4808558  0.826834
## FBgn0000024    6.40620     0.01752715  0.198633 0.7597879  0.943501
## ...                ...            ...       ...       ...       ...
## FBgn0261570 3208.38861     0.24110290 0.1244469  0.020307  0.144240
## FBgn0261572    6.19719    -0.06576173 0.2141351  0.216203  0.607848
## FBgn0261573 2240.97951     0.01000619 0.0993764  0.910615  0.982657
## FBgn0261574 4857.68037     0.00843552 0.1408267  0.936291  0.988179
## FBgn0261575   10.68252     0.00809101 0.2014704  0.860522  0.967928

p-values and adjusted p-values

We can order our results table by the smallest p value:

resOrdered <- res[order(res$pvalue),]

We can summarize some basic tallies using the summary function.

summary(res)

## 
## out of 9921 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 518, 5.2%
## LFC < 0 (down)     : 536, 5.4%
## outliers [1]       : 1, 0.01%
## low counts [2]     : 1539, 16%
## (mean count < 6)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

The results function contains a number of arguments to customize the results table which is generated. You can read about these arguments by looking up ?results. Note that the results function automatically performs independent filtering based on the mean of normalized counts for each gene, optimizing the number of genes which will have an adjusted p value below a given FDR cutoff, alpha. By default the argument alpha is set to 0.1. If the adjusted p value cutoff will be a value other than 0.1, alpha should be set to that value:

res05 <- results(dds, alpha=0.05)
summary(res05)

## 
## out of 9921 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 407, 4.1%
## LFC < 0 (down)     : 431, 4.3%
## outliers [1]       : 1, 0.01%
## low counts [2]     : 1347, 14%
## (mean count < 5)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

sum(res05$padj < 0.05, na.rm=TRUE)

## [1] 838

More exaplanation: Log2 fold change

Fold change is calculated from a ratio of normalised read counts between two conditions of interest. However, level of gene expression changes are often shown as log2 fold change. Using log2 value become particularly helpful for visualising the gene expression changes. Furthermore, it eventually become intuitive that log2FC of 1 means expression level double in a given condition, and a negative log2FC means the gene is down-regulated in a given condition.

More exaplanation: P-values, q-values and multiple-testing problem

We might be familiar with using p-value as a statistically significant cut-off (such as at p-value = 0.05) when you study one gene. Essentially, p-value of 0.05 means there is a 5% chance that the difference observed is due to chance (false-positive). Now, when we talk about transcriptome , we can be looking at 10,000 genes at once and the chance of false-positive hits, collectively, then become 0.05 * 10,000 = 500 hits! That is too much to obtain reliable results, and this is known as a multiple-testing problem - the more you carry out a test, the more likely you would end up with some positive results just by chance. To overcome this, we will, instead, use q-value, also known as adjusted p-value, which can be calculated from the p-value, its distribution, and taking the number of tests into account. Thankfully, we often don’t need to do the calculation ourselves, as data analysis tools often provide the adjusted p-value. Therefore it is an adjusted p-value of less than 0.05 (or other cut-off) that we should be looking for when asking whether a gene is differentially expressed.

Exploring and exporting results

MA-plot

In DESeq2, the function plotMA shows the log2 fold changes attributable to a given variable over the mean of normalized counts for all the samples in the DESeqDataSet. Points will be colored red if the adjusted p value is less than 0.1. Points which fall out of the window are plotted as open triangles pointing either up or down.

plotMA(res, ylim=c(-2,2))

It is more useful visualize the MA-plot for the shrunken log2 fold changes, which remove the noise associated with log2 fold changes from low count genes without requiring arbitrary filtering thresholds.

plotMA(resLFC, ylim=c(-2,2))

Plot counts

It can also be useful to examine the counts of reads for a single gene across the groups. A simple function for making this plot is plotCounts, which normalizes counts by sequencing depth and adds a pseudocount of 1/2 to allow for log scale plotting. The counts are grouped by the variables in intgroup, where more than one variable can be specified. Here we specify the gene which had the smallest p value from the results table created above. You can select the gene to plot by rowname or by numeric index.

plotCounts(dds, gene=which.min(res$padj), intgroup="condition")

For customized plotting, an argument returnData specifies that the function should only return a data.frame for plotting with ggplot.

d <- plotCounts(dds, gene=which.min(res$padj), intgroup="condition", 
                returnData=TRUE)
library("ggplot2")
ggplot(d, aes(x=condition, y=count)) + 
  geom_point(position=position_jitter(w=0.1,h=0)) + 
  scale_y_log10(breaks=c(25,100,400))

More information on results columns

Information about which variables and tests were used can be found by calling the function mcols on the results object.

mcols(res)$description

## [1] "mean of normalized counts for all samples"             
## [2] "log2 fold change (MLE): condition treated vs untreated"
## [3] "standard error: condition treated vs untreated"        
## [4] "Wald statistic: condition treated vs untreated"        
## [5] "Wald test p-value: condition treated vs untreated"     
## [6] "BH adjusted p-values"

For a particular gene, a log2 fold change of -1 for condition treated vs untreated means that the treatment induces a multiplicative change in observed gene expression level of 2−1=0.5

compared to the untreated condition. If the variable of interest is continuous-valued, then the reported log2 fold change is per unit of change of that variable.

Note on p-values set to NA: some values in the results table can be set to NA for one of the following reasons:

If within a row, all samples have zero counts, the baseMean column will be zero, and the log2 fold change estimates, p value and adjusted p value will all be set to NA.
If a row contains a sample with an extreme count outlier then the p value and adjusted p value will be set to NA. These outlier counts are detected by Cook’s distance.
If a row is filtered by automatic independent filtering, for having a low mean normalized count, then only the adjusted p value will be set to NA.

Exporting results to CSV files

A plain-text file of the results can be exported using the base R functions write.csv or write.delim. We suggest using a descriptive file name indicating the variable and levels which were tested.

write.csv(as.data.frame(resOrdered), 
          file="condition_treated_results.csv")

Exporting only the results which pass an adjusted p value threshold can be accomplished with the subset function, followed by the write.csv function.

resSig <- subset(resOrdered, padj < 0.1)
resSig

## log2 fold change (MLE): condition treated vs untreated 
## Wald test p-value: condition treated vs untreated 
## DataFrame with 1054 rows and 6 columns
##              baseMean log2FoldChange     lfcSE      stat       pvalue
##             <numeric>      <numeric> <numeric> <numeric>    <numeric>
## FBgn0039155   730.568       -4.61874 0.1691240  -27.3098 3.24447e-164
## FBgn0025111  1501.448        2.89995 0.1273576   22.7701 9.07165e-115
## FBgn0029167  3706.024       -2.19691 0.0979154  -22.4368 1.72030e-111
## FBgn0003360  4342.832       -3.17954 0.1435677  -22.1466 1.12417e-108
## FBgn0035085   638.219       -2.56024 0.1378126  -18.5777  4.86845e-77
## ...               ...            ...       ...       ...          ...
## FBgn0037073  973.1016      -0.252146 0.1009872  -2.49681    0.0125316
## FBgn0029976 2312.5885      -0.221127 0.0885764  -2.49645    0.0125443
## FBgn0030938   24.8064       0.957645 0.3836454   2.49617    0.0125542
## FBgn0039260 1088.2766      -0.259253 0.1038739  -2.49585    0.0125656
## FBgn0034753 7775.2711       0.393515 0.1576749   2.49574    0.0125696
##                     padj
##                <numeric>
## FBgn0039155 2.71919e-160
## FBgn0025111 3.80147e-111
## FBgn0029167 4.80595e-108
## FBgn0003360 2.35542e-105
## FBgn0035085  8.16049e-74
## ...                  ...
## FBgn0037073    0.0999489
## FBgn0029976    0.0999489
## FBgn0030938    0.0999489
## FBgn0039260    0.0999489
## FBgn0034753    0.0999489

Data transformations and visualizationå

Count data transformations

In order to test for differential expression, we operate on raw counts and use discrete distributions as described in the previous section on differential expression. However for other downstream analyses – e.g. for visualization or clustering – it might be useful to work with transformed versions of the count data.

Extracting transformed values

These transformation functions return an object of class DESeqTransform which is a subclass of RangedSummarizedExperiment. For ~20 samples, running on a newly created DESeqDataSet. The assay function is used to extract the matrix of normalized values.

vsd <- vst(dds)
head(assay(vsd), 3)

##              treated1  treated2  treated3 untreated1 untreated2 untreated3
## FBgn0000008  7.207377  7.486014  7.191410   7.156899   7.249802   7.497780
## FBgn0000014  5.520677  5.142084  5.142084   5.648302   5.323248   5.142084
## FBgn0000017 11.915971 12.003507 11.992355  12.024976  12.267055  12.440309
##             untreated4
## FBgn0000008   7.283112
## FBgn0000014   5.142084
## FBgn0000017  12.057108

Data quality assessment by sample clustering and visualization

Data quality assessment and quality control (i.e. the removal of insufficiently good data) are essential steps of any data analysis. These steps should typically be performed very early in the analysis of a new data set, preceding or in parallel to the differential expression testing.

We define the term quality as fitness for purpose. Our purpose is the detection of differentially expressed genes, and we are looking in particular for samples whose experimental treatment suffered from an anormality that renders the data points obtained from these particular samples detrimental to our purpose.

Gene heatmap We may want to look at expression profiles of multiple genes at the same time. Heatmaps can be useful for this purpose; it essentially help turn a table of numbers into a mode visual form and it is versatile. The table could be normalised counts of top differentially expressed genes in a given comparison, or it could be genes known to be involved in a specific pathway, or it can be log2FC values instead of read counts. Previously, we use heatmap to visualise matrix of distances between each samples.

select <- order(rowMeans(counts(dds,normalized=TRUE)),
                decreasing=TRUE)[1:20]
df <- as.data.frame(colData(dds)[,c("condition","type")])
pheatmap(assay(vsd)[select,], cluster_rows=FALSE, show_rownames=FALSE,
         cluster_cols=FALSE, annotation_col=df)

Heatmap of the sample-to-sample distances

Another use of the transformed data is sample clustering. Here, we apply the dist function to the transpose of the transformed count matrix to get sample-to-sample distances.

sampleDists <- dist(t(assay(vsd)))

A heatmap of this distance matrix gives us an overview over similarities and dissimilarities between samples. We have to provide a hierarchical clustering hc to the heatmap function based on the sample distances, or else the heatmap function would calculate a clustering based on the distances between the rows/columns of the distance matrix.

sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(vsd$condition, vsd$type, sep="-")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         col=colors)

Principal component plot of the samples

Related to the distance matrix is the PCA plot, which shows the samples in the 2D plane spanned by their first two principal components. This type of plot is useful for visualizing the overall effect of experimental covariates and batch effects.

The Principal Components Analysis (PCA) plot shows the relationship between the samples in two dimensions. Hopefully, technical and biological replicates would show similar expression patterns (i.e. they are grouped together tightly on the PCA plot); whereas, samples from different experimental conditions, or distinctive time points, would form separate groups. In some cases we can identify outliers, e.g. samples which do not agree with other replicates and these can be excluded. If we don’t have many replicates, it is hard to detect outliers and our power to detect differentially expressed genes is reduced.

I also highly recommend this short video (Principal Component Analysis (PCA) clearly explained (2015)) to help understand the basis of the approach.

plotPCA is a function in DESeq2 package. R also have internal PCA function, but the usage is more complicated intgroup indicate how we want to present each data point on the plot. In this case, we label data points based on their groups under the column condition and type of the meta data

plotPCA(vsd, intgroup=c("condition", "type"))

r4444s13g6yqｘｅ一誒ㄐㄤｘｅ一誒ㄐㄤ

Additional analysis

There are many packages that can work with DEseq2 objects. We will use DEGreport as an example for additional plot

QCs, figures and analyses after differential expression with DESeq2 or other similar tool

Contrasts

DEGSet is a class to store the DE results like the one from results function. DESeq2 offers multiple way to ask for contrasts/coefficients. With degComps is easy to get multiple results in a single object:

degs <- degComps(dds, combs = "condition",
                 contrast = list( c("condition", "untreated", "treated") ))
names(degs)

## [1] "condition_treated_vs_untreated" "condition_untreated_vs_treated"

degs contains 2 elements, one for each contrast/coefficient asked for. It contains the results output in the element raw and the output of lfcShrink in the element shrunken. To obtain the results from one of them, use the method dge:

# the first result set which is -> condition_treated_vs_untreated
deg(degs[[1]])

## log2 fold change (MAP): condition treated vs untreated 
## Wald test p-value: condition treated vs untreated 
## DataFrame with 9921 rows and 6 columns
##              baseMean log2FoldChange     lfcSE       stat       pvalue
##             <numeric>      <numeric> <numeric>  <numeric>    <numeric>
## FBgn0039155   730.568       -3.68605 0.1262456   -27.3098 3.24447e-164
## FBgn0025111  1501.448        2.52894 0.1105982    22.7701 9.07165e-115
## FBgn0029167  3706.024       -2.01869 0.0898859   -22.4368 1.72030e-111
## FBgn0003360  4342.832       -2.66235 0.1204731   -22.1466 1.12417e-108
## FBgn0035085   638.219       -2.17784 0.1160933   -18.5777  4.86845e-77
## ...               ...            ...       ...        ...          ...
## FBgn0261356   1.25994     -0.0013631 0.0658556 -0.0206111     0.983556
## FBgn0261393   6.06584      0.0190945 0.1278096  0.1506049     0.880287
## FBgn0261508   3.33369     -0.0381505 0.0968627 -0.4008888     0.688502
## FBgn0261514   1.56905     -0.0767881 0.0730535 -1.1420953     0.253414
## FBgn0261529   3.29260      0.0867154 0.0935156  0.9261367     0.354375
##                     padj
##                <numeric>
## FBgn0039155 2.71919e-160
## FBgn0025111 3.80147e-111
## FBgn0029167 4.80595e-108
## FBgn0003360 2.35542e-105
## FBgn0035085  8.16049e-74
## ...                  ...
## FBgn0261356           NA
## FBgn0261393           NA
## FBgn0261508           NA
## FBgn0261514           NA
## FBgn0261529           NA

By default it would output the shrunken table always, as defined by degDefault, that contains the default table to get. To get the original results table, use the parameter as this:

deg(degs[[1]], "raw", "tibble")

## # A tibble: 9,921 × 7
##    gene        baseMean log2FoldChange  lfcSE  stat    pvalue      padj
##    <chr>          <dbl>          <dbl>  <dbl> <dbl>     <dbl>     <dbl>
##  1 FBgn0039155     731.          -4.62 0.169  -27.3 3.24e-164 2.72e-160
##  2 FBgn0025111    1501.           2.90 0.127   22.8 9.07e-115 3.80e-111
##  3 FBgn0029167    3706.          -2.20 0.0979 -22.4 1.72e-111 4.81e-108
##  4 FBgn0003360    4343.          -3.18 0.144  -22.1 1.12e-108 2.36e-105
##  5 FBgn0035085     638.          -2.56 0.138  -18.6 4.87e- 77 8.16e- 74
##  6 FBgn0039827     262.          -4.16 0.233  -17.9 1.27e- 71 1.78e- 68
##  7 FBgn0034736     226.          -3.51 0.215  -16.3 4.37e- 60 5.23e- 57
##  8 FBgn0029896     490.          -2.44 0.152  -16.1 4.68e- 58 4.90e- 55
##  9 FBgn0000071     342.           2.68 0.182   14.7 7.46e- 49 6.95e- 46
## 10 FBgn0051092     153.           2.33 0.177   13.1 2.07e- 39 1.73e- 36
## # … with 9,911 more rows

Note that the format of the output can be changed to tibble, or data.frame with a third parameter tidy. The table will be always sorted by padj.

And easy way to get significant genes is:

significants(degs, fc = 0, fdr = 0.05, full = TRUE)

## Warning: Unquoting language objects with `!!!` is deprecated as of rlang 0.4.0.
## Please use `!!` instead.
## 
##   # Bad:
##   dplyr::select(data, !!!enquo(x))
## 
##   # Good:
##   dplyr::select(data, !!enquo(x))    # Unquote single quosure
##   dplyr::select(data, !!!enquos(x))  # Splice list of quosures
## 
## This warning is displayed once per session.

## # A tibble: 841 × 7
##    gene        log2FoldChange     padj log2FoldChange_cond… log2FoldChange_cond…
##    <chr>                <dbl>    <dbl>                <dbl>                <dbl>
##  1 FBgn0000043          0.544 1.87e- 5                0.544               -0.579
##  2 FBgn0000064          0.273 4.17e- 2                0.273               -0.285
##  3 FBgn0000071          2.06  6.95e-46                2.06                -2.33 
##  4 FBgn0000079          0.675 2.75e- 2               -0.490                0.675
##  5 FBgn0000116          0.974 7.74e-10                0.974               -1.11 
##  6 FBgn0000139          0.381 5.82e- 3               -0.361                0.381
##  7 FBgn0000146          0.449 3.54e- 5                0.449               -0.469
##  8 FBgn0000147          0.356 3.02e- 2                0.356               -0.385
##  9 FBgn0000244          0.367 3.33e- 2               -0.341                0.367
## 10 FBgn0000256          0.418 7.95e- 3               -0.390                0.418
## # … with 831 more rows, and 2 more variables:
## #   padj_condition_treated_vs_untreated <dbl>,
## #   padj_condition_untreated_vs_treated <dbl>

Gene plots

Plot top genes coloring by group.

# example
degPlot(dds = dds, res = res, n = 6, xs = "condition")

## No genes were mapped to rowData. check ann parameter values.

## Using gene as id variables

## `geom_smooth()` using formula 'y ~ x'

## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.

## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.

Additional questions

How about try to obtain these genes without additional packages?

Can you convert res into tibble?

rownames_to_column(as.data.frame(res), var ="Gene") %>% as_tibble()

## # A tibble: 9,921 × 7
##    Gene        baseMean log2FoldChange lfcSE    stat      pvalue       padj
##    <chr>          <dbl>          <dbl> <dbl>   <dbl>       <dbl>      <dbl>
##  1 FBgn0000008    95.1         0.00228 0.224  0.0102 0.992        0.997    
##  2 FBgn0000014     1.06       -0.495   2.14  -0.231  0.817       NA        
##  3 FBgn0000017  4353.         -0.240   0.126 -1.90   0.0576       0.288    
##  4 FBgn0000018   419.         -0.105   0.148 -0.705  0.481        0.827    
##  5 FBgn0000024     6.41        0.211   0.690  0.306  0.760        0.944    
##  6 FBgn0000032   990.         -0.0918  0.148 -0.621  0.534        0.854    
##  7 FBgn0000037    14.1         0.463   0.490  0.944  0.345        0.744    
##  8 FBgn0000042 82208.          0.315   0.141  2.24   0.0253       0.167    
##  9 FBgn0000043 31162.          0.620   0.124  5.01   0.000000540  0.0000187
## 10 FBgn0000044    27.0         0.415   0.413  1.00   0.315        0.720    
## # … with 9,911 more rows

Can you arrange and finid the most significantly DEG?
Can you find the expression of the most significantly DEG?
Can you plot the expression of the most significantly DEG?

Optional: using ReportingTools in an Analysis of RNA-seq Data

One of the advantages of R is that there are a lot of well-written packages that reuse other packages to make your life easier.

BiocManager::install("ReportingTools")

## Bioconductor version 3.12 (BiocManager 1.30.16), R 4.0.4 (2021-02-15)

## Installing package(s) 'ReportingTools'

## also installing the dependencies 'jpeg', 'checkmate', 'htmlwidgets', 'BiocFileCache', 'Formula', 'latticeExtra', 'htmlTable', 'viridis', 'dichromat', 'Rhtslib', 'biomaRt', 'ProtGenerics', 'lazyeval', 'graph', 'RBGL', 'GO.db', 'AnnotationForge', 'Rgraphviz', 'gridExtra', 'reshape2', 'Hmisc', 'biovizBase', 'Biostrings', 'Rsamtools', 'GenomicAlignments', 'BSgenome', 'VariantAnnotation', 'rtracklayer', 'GenomicFeatures', 'OrganismDbi', 'GGally', 'ensembldb', 'AnnotationFilter', 'hwriter', 'Category', 'GOstats', 'PFAM.db', 'GSEABase', 'ggbio'

## 
## The downloaded binary packages are in
##  /var/folders/lv/x6hj85cn66bgmdxq_jtwjmzr0000gn/T//RtmpC0RzbK/downloaded_packages

## installing the source packages 'GO.db', 'PFAM.db'

library("ReportingTools")

## Loading required package: knitr

##

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

##

des2Report <- HTMLReport(shortName = 'RNAseq_analysis_with_DESeq2',
title = 'RNA-seq analysis of differential expression using DESeq2',
reportDirectory = "./reports")

## Get a DESeqDataSet object
mockRna.dse <- DESeq(dds)

## using pre-existing size factors

## estimating dispersions

## found already estimated dispersions, replacing these

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

colData(mockRna.dse)$conditions <- dds$condition

# this will take a while
publish(mockRna.dse,des2Report, pvalueCutoff=0.05,
annotation.db="fly.db0", factor = colData(mockRna.dse)$conditions,
reportDir="./reports")
finish(des2Report)

## [1] "./reports/RNAseq_analysis_with_DESeq2.html"

More analysis to try!

There are many, many resources online for you to try. If you have run through all the exercises, maybe try to go through DGE_workshop designed by Harvard Chan Bioinformatics Core.

If you want to see how more information can be retrieved from the res object, then maybe try to go through lession 6 but using our data (res, dds)