Introduction
Pretty much most of the information are obtained from:
as well as some more detailed explanations in
Transcriptomics module from Helminth Bioinformatics (Asia), 2021
Another more comprehensive paper on RNAseq analysis is Dundal et al’s Introduction to differential gene expression analysis using RNA-seq
One of the most common uses of transcriptomic data is possibly for differential gene expression study, which will be covered in this course. However, the extensive and high-throughput nature of the transcriptomic data means there are other potential usages. For example, it can be used to profile total RNA (e.g. miRNA and mRNA) in exosomes and other secretory products; help identify different splice isoforms; provide evidence for gene annotation and improve quality of reference genomes. Meta-transcriptome (combining and re-analysing pool of transcriptomics data from multiple experiments), and comparative gene expression between species could be seen as an extension of differential gene expression. Furthermore, genetic variation particularly SNP calling could use information from transcriptomics data which would carry SNPs from transcribed genes.
At the end of the practical, ou will have hands-on experience in visualising transcriptomic profiles and using R packages to identify differentially expressed genes and finding patterns in the data.
Installation
You need to first install the packages successfully for the whole tutorial to work
# Takes about 10 minutes
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
::install("pasilla")
BiocManager::install("DESeq2")
BiocManager::install("apeglm")
BiocManager::install("DEGreport")
BiocManagerinstall.packages("pheatmap")
install.packages("tidyverse")
Initiate libraries
library("pasilla")
library("tidyverse")
library("DESeq2")
library("pheatmap")
library("RColorBrewer")
library("apeglm")
library("DEGreport")
#Note: In some versions of Mac DEGreport will not be loaded successfully,
#Try install Xquarz, restart Mac and R studio, and try to load again
# https://www.xquartz.org/
About DESeq2
This is an R package for performing differential expression analysis (PMID: 25516281; last time I checked it’s been cited 30k times!). It can take read count data in various forms, one of those is read count tables from HTSeq-count. The tool provides simple command lines for formatting read count data, normalization, exploring variances between samples, and performing differential expression analysis. It is one of the tools widely used for RNA-seq data analysis and it also provide detailed manual which help make it more user-friendly (http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html)
Alternative softwares
In addition to DESeq2, there are a variety of programs for detecting differentially expressed genes from tables of RNA-seq read counts. Some of these tools work in R, while some require Unix interface. Examples of these tools include EdgeR (PMID: 19910308), BaySeq (PMID: 20698981), Cuffdiff (PMID: 23222703), Sleuth PMID: 28581496 (an accompanying tool for read count data from Kallisto).
Data import
To demonstate the use of DESeqDataSetFromMatrix
, we will read in count data from the pasilla package. We read in a count matrix, which we will name cts
, and the sample information table, which we will name coldata
.
library("pasilla")
<- system.file("extdata",
pasCts "pasilla_gene_counts.tsv",
package="pasilla", mustWork=TRUE)
<- system.file("extdata",
pasAnno "pasilla_sample_annotation.csv",
package="pasilla", mustWork=TRUE)
<- as.matrix(read.csv(pasCts,sep="\t",row.names="gene_id"))
cts <- read.csv(pasAnno, row.names=1)
coldata <- coldata[,c("condition","type")]
coldata $condition <- factor(coldata$condition)
coldata$type <- factor(coldata$type) coldata
What does the original data look like?
# gene raw counts
as_tibble(cts)
## # A tibble: 14,599 × 7
## untreated1 untreated2 untreated3 untreated4 treated1 treated2 treated3
## <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 0 0 1
## 2 92 161 76 70 140 88 70
## 3 5 1 0 0 4 0 0
## 4 0 2 1 2 1 0 0
## 5 4664 8714 3564 3150 6205 3072 3334
## 6 583 761 245 310 722 299 308
## 7 0 1 0 0 0 0 0
## 8 10 11 3 3 10 7 5
## 9 0 1 0 0 0 1 1
## 10 1446 1713 615 672 1698 696 757
## # … with 14,589 more rows
#metadata
coldata
## condition type
## treated1fb treated single-read
## treated2fb treated paired-end
## treated3fb treated paired-end
## untreated1fb untreated single-read
## untreated2fb untreated single-read
## untreated3fb untreated paired-end
## untreated4fb untreated paired-end
Note that these are not in the same order with respect to samples!
It is absolutely critical that the columns of the count matrix and the rows of the column data (information about samples) are in the same order. DESeq2 will not make guesses as to which column of the count matrix belongs to which row of the column data, these must be provided to DESeq2 already in consistent order.
rownames(coldata)
## [1] "treated1fb" "treated2fb" "treated3fb" "untreated1fb" "untreated2fb"
## [6] "untreated3fb" "untreated4fb"
colnames(cts)
## [1] "untreated1" "untreated2" "untreated3" "untreated4" "treated1"
## [6] "treated2" "treated3"
# first check if the row names in coldata are the same as the column names of cts data
# Should return the result false
all(rownames(coldata) %in% colnames(cts))
## [1] FALSE
#We need to chop off the "fb" of the row names of coldata, so the naming is consistent.
rownames(coldata) <- sub("fb", "", rownames(coldata))
#check again
all(rownames(coldata) %in% colnames(cts))
## [1] TRUE
As the names of columns are not in the correct order as given, we need to re-arrange one or the other so that they are consistent in terms of sample order (if we do not, later functions would produce an error).
# check if order is the same
all(rownames(coldata) == colnames(cts))
## [1] FALSE
# reorder cts's columns based on row order of metadata (coldata)
<- cts[, rownames(coldata)]
cts all(rownames(coldata) == colnames(cts))
## [1] TRUE
With the count matrix, cts
, and the sample information, coldata
, we can construct a DESeqDataSet:
<- DESeqDataSetFromMatrix(countData = cts,
dds colData = coldata,
design = ~ condition)
dds
## class: DESeqDataSet
## dim: 14599 7
## metadata(1): version
## assays(1): counts
## rownames(14599): FBgn0000003 FBgn0000008 ... FBgn0261574 FBgn0261575
## rowData names(0):
## colnames(7): treated1 treated2 ... untreated3 untreated4
## colData names(2): condition type
If you have additional feature data, it can be added to the DESeqDataSet
by adding to the metadata columns of a newly constructed object. (Here we add redundant data just for demonstration, as the gene names are already the rownames of the dds
.)
<- data.frame(gene=rownames(cts))
featureData mcols(dds) <- DataFrame(mcols(dds), featureData)
mcols(dds)
## DataFrame with 14599 rows and 1 column
## gene
## <character>
## FBgn0000003 FBgn0000003
## FBgn0000008 FBgn0000008
## FBgn0000014 FBgn0000014
## FBgn0000015 FBgn0000015
## FBgn0000017 FBgn0000017
## ... ...
## FBgn0261571 FBgn0261571
## FBgn0261572 FBgn0261572
## FBgn0261573 FBgn0261573
## FBgn0261574 FBgn0261574
## FBgn0261575 FBgn0261575
Pre-filtering
While it is not necessary to pre-filter low count genes before running the DESeq2 functions, there are two reasons which make pre-filtering useful: by removing rows in which there are very few reads, we reduce the memory size of the dds
data object, and we increase the speed of the transformation and testing functions within DESeq2. Here we perform a minimal pre-filtering to keep only rows that have at least 10 reads total. Note that more strict filtering to increase power is automatically applied via independent filtering on the mean of normalized counts within the results
function.
<- rowSums(counts(dds)) >= 10
keep <- dds[keep,] dds
Note on factor levels
By default, R will choose a reference level for factors based on alphabetical order. Then, if you never tell the DESeq2 functions which level you want to compare against (e.g. which level represents the control group), the comparisons will be based on the alphabetical order of the levels. There are two solutions: you can either explicitly tell results which comparison to make using the contrast
argument (this will be shown later), or you can explicitly set the factors levels. In order to see the change of reference levels reflected in the results names, you need to either run DESeq or nbinomWaldTest/nbinomLRT after the re-leveling operation. Setting the factor levels can be done in two ways, either using factor:
$condition <- factor(dds$condition, levels = c("untreated","treated")) dds
…or using relevel, just specifying the reference level:
$condition <- relevel(dds$condition, ref = "untreated") dds
About the pasilla dataset
We continue with the pasilla data constructed from the count matrix method above. This data set is from an experiment on Drosophila melanogaster cell cultures and investigated the effect of RNAi knock-down of the splicing factor pasilla (Brooks et al. 2011). The detailed transcript of the production of the pasilla data is provided in the vignette of the data package pasilla
Differential expression analysis
The standard differential expression analysis steps are wrapped into a single function, DESeq. The estimation steps performed by this function are described below, in the manual page for ?DESeq
and in the Methods section of the DESeq2 publication (Love, Huber, and Anders 2014).
Results tables are generated using the function results, which extracts a results table with log2 fold changes, p values and adjusted p values. With no additional arguments to results, the log2 fold change and Wald test p value will be for the last variable in the design formula, and if this is a factor, the comparison will be the last level of this variable over the reference level (see previous note on factor levels). However, the order of the variables of the design do not matter so long as the user specifies the comparison to build a results table for, using the name or contrast arguments of results.
Details about the comparison are printed to the console, directly above the results table. The text, condition treated vs untreated
, tells you that the estimates are of the logarithmic fold change log2(treated/untreated).
In its simplest view, differential expression analysis test for differences between 2 or more groups when we have one or more factors that could contribute to the observed differences. However, in transcriptomic data, we often have a much lower number of replicates, data are not normally distributed, and we are testing differences in many genes at the same time. Methods and tools for analysing differential gene expression; therefore, have some specific terminology and concepts as we shall see next!
Genes differentially expressed between conditions
Using the DESeq dds object we created earlier, we can look at the differentially expressed genes using results() function. By default, DESeq2 perform pair-wise comparison of the first and the last variable in the experimental design variables and provide a result table.
## probably the most important command :)
<- DESeq(dds)
dds
<- results(dds)
res res
## log2 fold change (MLE): condition treated vs untreated
## Wald test p-value: condition treated vs untreated
## DataFrame with 9921 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## FBgn0000008 95.14429 0.00227644 0.223729 0.010175 0.9918817 0.997211
## FBgn0000014 1.05652 -0.49512039 2.143186 -0.231021 0.8172987 NA
## FBgn0000017 4352.55357 -0.23991894 0.126337 -1.899041 0.0575591 0.288002
## FBgn0000018 418.61048 -0.10467391 0.148489 -0.704927 0.4808558 0.826834
## FBgn0000024 6.40620 0.21084779 0.689588 0.305759 0.7597879 0.943501
## ... ... ... ... ... ... ...
## FBgn0261570 3208.38861 0.2955329 0.127350 2.3206264 0.020307 0.144240
## FBgn0261572 6.19719 -0.9588230 0.775315 -1.2366888 0.216203 0.607848
## FBgn0261573 2240.97951 0.0127194 0.113300 0.1122634 0.910615 0.982657
## FBgn0261574 4857.68037 0.0153924 0.192567 0.0799327 0.936291 0.988179
## FBgn0261575 10.68252 0.1635705 0.930911 0.1757102 0.860522 0.967928
The result table contains several columns, of which the most relevant are:
- Rowname: indicating gene id
- Column 1: baseMean, average expression level across all samples normalised by sequencing depth
- Column 2: log2FoldChange, in this table above, of treated vs untreated
- Column 6: padj, adjusted p-value, p-value corrected for multiple testing
Note that we could have specified the coefficient or contrast we want to build a results table for, using either of the following equivalent commands:
<- results(dds, name="condition_treated_vs_untreated")
res <- results(dds, contrast=c("condition","treated","untreated")) res
Log fold change shrinkage for visualization and ranking
Shrinkage of effect size (LFC estimates) is useful for visualization and ranking of genes. To shrink the LFC, we pass the dds object to the function lfcShrink
. Below we specify to use the apeglm method for effect size shrinkage (Zhu, Ibrahim, and Love 2018), which improves on the previous estimator.
We provide the dds object and the name or number of the coefficient we want to shrink, where the number refers to the order of the coefficient as it appears in resultsNames(dds)
.
resultsNames(dds)
## [1] "Intercept" "condition_treated_vs_untreated"
<- lfcShrink(dds, coef="condition_treated_vs_untreated", type="apeglm")
resLFC # noticed any differences to the values in lfcSE compared to res ?
resLFC
## log2 fold change (MAP): condition treated vs untreated
## Wald test p-value: condition treated vs untreated
## DataFrame with 9921 rows and 5 columns
## baseMean log2FoldChange lfcSE pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric>
## FBgn0000008 95.14429 0.00119920 0.151897 0.9918817 0.997211
## FBgn0000014 1.05652 -0.00473412 0.205468 0.8172987 NA
## FBgn0000017 4352.55357 -0.18989990 0.120377 0.0575591 0.288002
## FBgn0000018 418.61048 -0.06995753 0.123901 0.4808558 0.826834
## FBgn0000024 6.40620 0.01752715 0.198633 0.7597879 0.943501
## ... ... ... ... ... ...
## FBgn0261570 3208.38861 0.24110290 0.1244469 0.020307 0.144240
## FBgn0261572 6.19719 -0.06576173 0.2141351 0.216203 0.607848
## FBgn0261573 2240.97951 0.01000619 0.0993764 0.910615 0.982657
## FBgn0261574 4857.68037 0.00843552 0.1408267 0.936291 0.988179
## FBgn0261575 10.68252 0.00809101 0.2014704 0.860522 0.967928
p-values and adjusted p-values
We can order our results table by the smallest p value:
<- res[order(res$pvalue),] resOrdered
We can summarize some basic tallies using the summary
function.
summary(res)
##
## out of 9921 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 518, 5.2%
## LFC < 0 (down) : 536, 5.4%
## outliers [1] : 1, 0.01%
## low counts [2] : 1539, 16%
## (mean count < 6)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
The results function contains a number of arguments to customize the results table which is generated. You can read about these arguments by looking up ?results
. Note that the results function automatically performs independent filtering based on the mean of normalized counts for each gene, optimizing the number of genes which will have an adjusted p value below a given FDR cutoff, alpha
. By default the argument alpha is set to 0.1. If the adjusted p value cutoff will be a value other than 0.1, alpha should be set to that value:
<- results(dds, alpha=0.05)
res05 summary(res05)
##
## out of 9921 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 407, 4.1%
## LFC < 0 (down) : 431, 4.3%
## outliers [1] : 1, 0.01%
## low counts [2] : 1347, 14%
## (mean count < 5)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
sum(res05$padj < 0.05, na.rm=TRUE)
## [1] 838
More exaplanation: Log2 fold change
Fold change is calculated from a ratio of normalised read counts between two conditions of interest. However, level of gene expression changes are often shown as log2 fold change. Using log2 value become particularly helpful for visualising the gene expression changes. Furthermore, it eventually become intuitive that log2FC of 1 means expression level double in a given condition, and a negative log2FC means the gene is down-regulated in a given condition.
More exaplanation: P-values, q-values and multiple-testing problem
We might be familiar with using p-value as a statistically significant cut-off (such as at p-value = 0.05) when you study one gene. Essentially, p-value of 0.05 means there is a 5% chance that the difference observed is due to chance (false-positive). Now, when we talk about transcriptome , we can be looking at 10,000 genes at once and the chance of false-positive hits, collectively, then become 0.05 * 10,000 = 500 hits! That is too much to obtain reliable results, and this is known as a multiple-testing problem - the more you carry out a test, the more likely you would end up with some positive results just by chance. To overcome this, we will, instead, use q-value, also known as adjusted p-value, which can be calculated from the p-value, its distribution, and taking the number of tests into account. Thankfully, we often don’t need to do the calculation ourselves, as data analysis tools often provide the adjusted p-value. Therefore it is an adjusted p-value of less than 0.05 (or other cut-off) that we should be looking for when asking whether a gene is differentially expressed.
Exploring and exporting results
MA-plot
In DESeq2, the function plotMA shows the log2 fold changes attributable to a given variable over the mean of normalized counts for all the samples in the DESeqDataSet. Points will be colored red if the adjusted p value is less than 0.1. Points which fall out of the window are plotted as open triangles pointing either up or down.
plotMA(res, ylim=c(-2,2))
It is more useful visualize the MA-plot for the shrunken log2 fold changes, which remove the noise associated with log2 fold changes from low count genes without requiring arbitrary filtering thresholds.
plotMA(resLFC, ylim=c(-2,2))
Plot counts
It can also be useful to examine the counts of reads for a single gene across the groups. A simple function for making this plot is plotCounts
, which normalizes counts by sequencing depth and adds a pseudocount of 1/2 to allow for log scale plotting. The counts are grouped by the variables in intgroup
, where more than one variable can be specified. Here we specify the gene which had the smallest p value from the results table created above. You can select the gene to plot by rowname or by numeric index.
plotCounts(dds, gene=which.min(res$padj), intgroup="condition")
For customized plotting, an argument returnData
specifies that the function should only return a data.frame for plotting with ggplot.
<- plotCounts(dds, gene=which.min(res$padj), intgroup="condition",
d returnData=TRUE)
library("ggplot2")
ggplot(d, aes(x=condition, y=count)) +
geom_point(position=position_jitter(w=0.1,h=0)) +
scale_y_log10(breaks=c(25,100,400))
More information on results columns
Information about which variables and tests were used can be found by calling the function mcols
on the results object.
mcols(res)$description
## [1] "mean of normalized counts for all samples"
## [2] "log2 fold change (MLE): condition treated vs untreated"
## [3] "standard error: condition treated vs untreated"
## [4] "Wald statistic: condition treated vs untreated"
## [5] "Wald test p-value: condition treated vs untreated"
## [6] "BH adjusted p-values"
For a particular gene, a log2 fold change of -1 for condition treated vs untreated means that the treatment induces a multiplicative change in observed gene expression level of 2−1=0.5
compared to the untreated condition. If the variable of interest is continuous-valued, then the reported log2 fold change is per unit of change of that variable.
Note on p-values set to NA: some values in the results table can be set to NA for one of the following reasons:
- If within a row, all samples have zero counts, the baseMean column will be zero, and the log2 fold change estimates, p value and adjusted p value will all be set to NA.
- If a row contains a sample with an extreme count outlier then the p value and adjusted p value will be set to NA. These outlier counts are detected by Cook’s distance.
- If a row is filtered by automatic independent filtering, for having a low mean normalized count, then only the adjusted p value will be set to NA.
Exporting results to CSV files
A plain-text file of the results can be exported using the base R functions write.csv
or write.delim
. We suggest using a descriptive file name indicating the variable and levels which were tested.
write.csv(as.data.frame(resOrdered),
file="condition_treated_results.csv")
Exporting only the results which pass an adjusted p value threshold can be accomplished with the subset function, followed by the write.csv
function.
<- subset(resOrdered, padj < 0.1)
resSig resSig
## log2 fold change (MLE): condition treated vs untreated
## Wald test p-value: condition treated vs untreated
## DataFrame with 1054 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## FBgn0039155 730.568 -4.61874 0.1691240 -27.3098 3.24447e-164
## FBgn0025111 1501.448 2.89995 0.1273576 22.7701 9.07165e-115
## FBgn0029167 3706.024 -2.19691 0.0979154 -22.4368 1.72030e-111
## FBgn0003360 4342.832 -3.17954 0.1435677 -22.1466 1.12417e-108
## FBgn0035085 638.219 -2.56024 0.1378126 -18.5777 4.86845e-77
## ... ... ... ... ... ...
## FBgn0037073 973.1016 -0.252146 0.1009872 -2.49681 0.0125316
## FBgn0029976 2312.5885 -0.221127 0.0885764 -2.49645 0.0125443
## FBgn0030938 24.8064 0.957645 0.3836454 2.49617 0.0125542
## FBgn0039260 1088.2766 -0.259253 0.1038739 -2.49585 0.0125656
## FBgn0034753 7775.2711 0.393515 0.1576749 2.49574 0.0125696
## padj
## <numeric>
## FBgn0039155 2.71919e-160
## FBgn0025111 3.80147e-111
## FBgn0029167 4.80595e-108
## FBgn0003360 2.35542e-105
## FBgn0035085 8.16049e-74
## ... ...
## FBgn0037073 0.0999489
## FBgn0029976 0.0999489
## FBgn0030938 0.0999489
## FBgn0039260 0.0999489
## FBgn0034753 0.0999489
Data transformations and visualizationå
Count data transformations
In order to test for differential expression, we operate on raw counts and use discrete distributions as described in the previous section on differential expression. However for other downstream analyses – e.g. for visualization or clustering – it might be useful to work with transformed versions of the count data.
Extracting transformed values
These transformation functions return an object of class DESeqTransform which is a subclass of RangedSummarizedExperiment. For ~20 samples, running on a newly created DESeqDataSet
. The assay
function is used to extract the matrix of normalized values.
<- vst(dds)
vsd head(assay(vsd), 3)
## treated1 treated2 treated3 untreated1 untreated2 untreated3
## FBgn0000008 7.207377 7.486014 7.191410 7.156899 7.249802 7.497780
## FBgn0000014 5.520677 5.142084 5.142084 5.648302 5.323248 5.142084
## FBgn0000017 11.915971 12.003507 11.992355 12.024976 12.267055 12.440309
## untreated4
## FBgn0000008 7.283112
## FBgn0000014 5.142084
## FBgn0000017 12.057108
Data quality assessment by sample clustering and visualization
Data quality assessment and quality control (i.e. the removal of insufficiently good data) are essential steps of any data analysis. These steps should typically be performed very early in the analysis of a new data set, preceding or in parallel to the differential expression testing.
We define the term quality as fitness for purpose. Our purpose is the detection of differentially expressed genes, and we are looking in particular for samples whose experimental treatment suffered from an anormality that renders the data points obtained from these particular samples detrimental to our purpose.
Gene heatmap We may want to look at expression profiles of multiple genes at the same time. Heatmaps can be useful for this purpose; it essentially help turn a table of numbers into a mode visual form and it is versatile. The table could be normalised counts of top differentially expressed genes in a given comparison, or it could be genes known to be involved in a specific pathway, or it can be log2FC values instead of read counts. Previously, we use heatmap to visualise matrix of distances between each samples.
<- order(rowMeans(counts(dds,normalized=TRUE)),
select decreasing=TRUE)[1:20]
<- as.data.frame(colData(dds)[,c("condition","type")])
df pheatmap(assay(vsd)[select,], cluster_rows=FALSE, show_rownames=FALSE,
cluster_cols=FALSE, annotation_col=df)
Heatmap of the sample-to-sample distances
Another use of the transformed data is sample clustering. Here, we apply the dist function to the transpose of the transformed count matrix to get sample-to-sample distances.
<- dist(t(assay(vsd))) sampleDists
A heatmap of this distance matrix gives us an overview over similarities and dissimilarities between samples. We have to provide a hierarchical clustering hc to the heatmap function based on the sample distances, or else the heatmap function would calculate a clustering based on the distances between the rows/columns of the distance matrix.
<- as.matrix(sampleDists)
sampleDistMatrix rownames(sampleDistMatrix) <- paste(vsd$condition, vsd$type, sep="-")
colnames(sampleDistMatrix) <- NULL
<- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
colors pheatmap(sampleDistMatrix,
clustering_distance_rows=sampleDists,
clustering_distance_cols=sampleDists,
col=colors)
Principal component plot of the samples
Related to the distance matrix is the PCA plot, which shows the samples in the 2D plane spanned by their first two principal components. This type of plot is useful for visualizing the overall effect of experimental covariates and batch effects.
The Principal Components Analysis (PCA) plot shows the relationship between the samples in two dimensions. Hopefully, technical and biological replicates would show similar expression patterns (i.e. they are grouped together tightly on the PCA plot); whereas, samples from different experimental conditions, or distinctive time points, would form separate groups. In some cases we can identify outliers, e.g. samples which do not agree with other replicates and these can be excluded. If we don’t have many replicates, it is hard to detect outliers and our power to detect differentially expressed genes is reduced.
I also highly recommend this short video (Principal Component Analysis (PCA) clearly explained (2015)) to help understand the basis of the approach.
plotPCA
is a function in DESeq2 package. R also have internal PCA function, but the usage is more complicated intgroup indicate how we want to present each data point on the plot. In this case, we label data points based on their groups under the column condition
and type
of the meta data
plotPCA(vsd, intgroup=c("condition", "type"))
r4444s13g6yqxe一誒ㄐㄤxe一誒ㄐㄤ
Additional analysis
There are many packages that can work with DEseq2 objects. We will use DEGreport
as an example for additional plot
QCs, figures and analyses after differential expression with DESeq2 or other similar tool
Contrasts
DEGSet
is a class to store the DE results like the one from results
function. DESeq2 offers multiple way to ask for contrasts/coefficients. With degComps is easy to get multiple results in a single object:
<- degComps(dds, combs = "condition",
degs contrast = list( c("condition", "untreated", "treated") ))
names(degs)
## [1] "condition_treated_vs_untreated" "condition_untreated_vs_treated"
degs
contains 2 elements, one for each contrast/coefficient asked for. It contains the results output in the element raw and the output of lfcShrink
in the element shrunken
. To obtain the results from one of them, use the method dge
:
# the first result set which is -> condition_treated_vs_untreated
deg(degs[[1]])
## log2 fold change (MAP): condition treated vs untreated
## Wald test p-value: condition treated vs untreated
## DataFrame with 9921 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## FBgn0039155 730.568 -3.68605 0.1262456 -27.3098 3.24447e-164
## FBgn0025111 1501.448 2.52894 0.1105982 22.7701 9.07165e-115
## FBgn0029167 3706.024 -2.01869 0.0898859 -22.4368 1.72030e-111
## FBgn0003360 4342.832 -2.66235 0.1204731 -22.1466 1.12417e-108
## FBgn0035085 638.219 -2.17784 0.1160933 -18.5777 4.86845e-77
## ... ... ... ... ... ...
## FBgn0261356 1.25994 -0.0013631 0.0658556 -0.0206111 0.983556
## FBgn0261393 6.06584 0.0190945 0.1278096 0.1506049 0.880287
## FBgn0261508 3.33369 -0.0381505 0.0968627 -0.4008888 0.688502
## FBgn0261514 1.56905 -0.0767881 0.0730535 -1.1420953 0.253414
## FBgn0261529 3.29260 0.0867154 0.0935156 0.9261367 0.354375
## padj
## <numeric>
## FBgn0039155 2.71919e-160
## FBgn0025111 3.80147e-111
## FBgn0029167 4.80595e-108
## FBgn0003360 2.35542e-105
## FBgn0035085 8.16049e-74
## ... ...
## FBgn0261356 NA
## FBgn0261393 NA
## FBgn0261508 NA
## FBgn0261514 NA
## FBgn0261529 NA
By default it would output the shrunken
table always, as defined by degDefault
, that contains the default table to get. To get the original results table, use the parameter as this:
deg(degs[[1]], "raw", "tibble")
## # A tibble: 9,921 × 7
## gene baseMean log2FoldChange lfcSE stat pvalue padj
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 FBgn0039155 731. -4.62 0.169 -27.3 3.24e-164 2.72e-160
## 2 FBgn0025111 1501. 2.90 0.127 22.8 9.07e-115 3.80e-111
## 3 FBgn0029167 3706. -2.20 0.0979 -22.4 1.72e-111 4.81e-108
## 4 FBgn0003360 4343. -3.18 0.144 -22.1 1.12e-108 2.36e-105
## 5 FBgn0035085 638. -2.56 0.138 -18.6 4.87e- 77 8.16e- 74
## 6 FBgn0039827 262. -4.16 0.233 -17.9 1.27e- 71 1.78e- 68
## 7 FBgn0034736 226. -3.51 0.215 -16.3 4.37e- 60 5.23e- 57
## 8 FBgn0029896 490. -2.44 0.152 -16.1 4.68e- 58 4.90e- 55
## 9 FBgn0000071 342. 2.68 0.182 14.7 7.46e- 49 6.95e- 46
## 10 FBgn0051092 153. 2.33 0.177 13.1 2.07e- 39 1.73e- 36
## # … with 9,911 more rows
Note that the format of the output can be changed to tibble, or data.frame with a third parameter tidy
. The table will be always sorted by padj.
And easy way to get significant genes is:
significants(degs, fc = 0, fdr = 0.05, full = TRUE)
## Warning: Unquoting language objects with `!!!` is deprecated as of rlang 0.4.0.
## Please use `!!` instead.
##
## # Bad:
## dplyr::select(data, !!!enquo(x))
##
## # Good:
## dplyr::select(data, !!enquo(x)) # Unquote single quosure
## dplyr::select(data, !!!enquos(x)) # Splice list of quosures
##
## This warning is displayed once per session.
## # A tibble: 841 × 7
## gene log2FoldChange padj log2FoldChange_cond… log2FoldChange_cond…
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 FBgn0000043 0.544 1.87e- 5 0.544 -0.579
## 2 FBgn0000064 0.273 4.17e- 2 0.273 -0.285
## 3 FBgn0000071 2.06 6.95e-46 2.06 -2.33
## 4 FBgn0000079 0.675 2.75e- 2 -0.490 0.675
## 5 FBgn0000116 0.974 7.74e-10 0.974 -1.11
## 6 FBgn0000139 0.381 5.82e- 3 -0.361 0.381
## 7 FBgn0000146 0.449 3.54e- 5 0.449 -0.469
## 8 FBgn0000147 0.356 3.02e- 2 0.356 -0.385
## 9 FBgn0000244 0.367 3.33e- 2 -0.341 0.367
## 10 FBgn0000256 0.418 7.95e- 3 -0.390 0.418
## # … with 831 more rows, and 2 more variables:
## # padj_condition_treated_vs_untreated <dbl>,
## # padj_condition_untreated_vs_treated <dbl>
Gene plots
Plot top genes coloring by group.
# example
degPlot(dds = dds, res = res, n = 6, xs = "condition")
## No genes were mapped to rowData. check ann parameter values.
## Using gene as id variables
## `geom_smooth()` using formula 'y ~ x'
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
Additional questions
How about try to obtain these genes without additional packages?
- Can you convert
res
into tibble?
rownames_to_column(as.data.frame(res), var ="Gene") %>% as_tibble()
## # A tibble: 9,921 × 7
## Gene baseMean log2FoldChange lfcSE stat pvalue padj
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 FBgn0000008 95.1 0.00228 0.224 0.0102 0.992 0.997
## 2 FBgn0000014 1.06 -0.495 2.14 -0.231 0.817 NA
## 3 FBgn0000017 4353. -0.240 0.126 -1.90 0.0576 0.288
## 4 FBgn0000018 419. -0.105 0.148 -0.705 0.481 0.827
## 5 FBgn0000024 6.41 0.211 0.690 0.306 0.760 0.944
## 6 FBgn0000032 990. -0.0918 0.148 -0.621 0.534 0.854
## 7 FBgn0000037 14.1 0.463 0.490 0.944 0.345 0.744
## 8 FBgn0000042 82208. 0.315 0.141 2.24 0.0253 0.167
## 9 FBgn0000043 31162. 0.620 0.124 5.01 0.000000540 0.0000187
## 10 FBgn0000044 27.0 0.415 0.413 1.00 0.315 0.720
## # … with 9,911 more rows
- Can you
arrange
and finid the most significantly DEG? - Can you find the expression of the most significantly DEG?
- Can you plot the expression of the most significantly DEG?
Optional: using ReportingTools in an Analysis of RNA-seq Data
One of the advantages of R is that there are a lot of well-written packages that reuse other packages to make your life easier.
::install("ReportingTools") BiocManager
## Bioconductor version 3.12 (BiocManager 1.30.16), R 4.0.4 (2021-02-15)
## Installing package(s) 'ReportingTools'
## also installing the dependencies 'jpeg', 'checkmate', 'htmlwidgets', 'BiocFileCache', 'Formula', 'latticeExtra', 'htmlTable', 'viridis', 'dichromat', 'Rhtslib', 'biomaRt', 'ProtGenerics', 'lazyeval', 'graph', 'RBGL', 'GO.db', 'AnnotationForge', 'Rgraphviz', 'gridExtra', 'reshape2', 'Hmisc', 'biovizBase', 'Biostrings', 'Rsamtools', 'GenomicAlignments', 'BSgenome', 'VariantAnnotation', 'rtracklayer', 'GenomicFeatures', 'OrganismDbi', 'GGally', 'ensembldb', 'AnnotationFilter', 'hwriter', 'Category', 'GOstats', 'PFAM.db', 'GSEABase', 'ggbio'
##
## The downloaded binary packages are in
## /var/folders/lv/x6hj85cn66bgmdxq_jtwjmzr0000gn/T//RtmpC0RzbK/downloaded_packages
## installing the source packages 'GO.db', 'PFAM.db'
library("ReportingTools")
## Loading required package: knitr
##
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
<- HTMLReport(shortName = 'RNAseq_analysis_with_DESeq2',
des2Report title = 'RNA-seq analysis of differential expression using DESeq2',
reportDirectory = "./reports")
## Get a DESeqDataSet object
<- DESeq(dds) mockRna.dse
## using pre-existing size factors
## estimating dispersions
## found already estimated dispersions, replacing these
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
colData(mockRna.dse)$conditions <- dds$condition
# this will take a while
publish(mockRna.dse,des2Report, pvalueCutoff=0.05,
annotation.db="fly.db0", factor = colData(mockRna.dse)$conditions,
reportDir="./reports")
finish(des2Report)
## [1] "./reports/RNAseq_analysis_with_DESeq2.html"
More analysis to try!
There are many, many resources online for you to try. If you have run through all the exercises, maybe try to go through DGE_workshop designed by Harvard Chan Bioinformatics Core.
If you want to see how more information can be retrieved from the res
object, then maybe try to go through lession 6 but using our data (res
, dds
)