Introduction

The code contained in this .Rmd document performs differential expression (DE) and enrichment analysis. It is performed after processing raw sequencing reads to analysis ready raw counts, e.g. with nfcore/rnaseq as we did in Part 1 of this workshop.

How to use this file:

Set up

Yes, more set up!

Load the R-libraries

Load all the R libraries below.

suppressMessages({
library("DESeq2")
library("edgeR")
library("limma")
library("RColorBrewer")
library("gplots")
library("ggplot2")
library("factoextra")
library("devtools")
library("rstudioapi")
library("dplyr")
library("tibble")
library("tidyverse")
library("pheatmap")
library("biomaRt")
library("annotables")
library("org.Mm.eg.db")
library("biobroom")
library("clusterProfiler")
library("pathfindR")
})

Install pheatmap: There is one package (pheatmap) which has not been installed in our singularity container (whoopsie!). You should see a yellow banner at the top of this page warning you about this - please click install

Set the current working directory

Set the path so that R knows where our data lives and where we will write our output files.

setwd("/home/training/base_directory/working_directory")

Load the count matrix file

The count matrix file contains raw (unnormalized) counts for every gene (rows, i) in mm10 and every sample (columns, j). The value in the i-th row and the j-th column of the matrix tells how many raw reads were assigned to gene i in sample j.

We created a count matrix file with subsetted data in Part 1 of this workshop using nf-core/rnaseq (salmon.merged.gene_counts.tsv). In Part 2, we will use a count matrix file produced from the full dataset (including more descriptive sample identifiers!), in order to have enough data to perform functional enrichment analysis.

Load the count matrix file and format the data frame:

# Read in the full count matrix file
# This file was included in the data we downloaded for Part 1
counttable_original <- read.delim("FULL_count_matrix.txt", 
                                header = T, 
                                row.names = 1)

# This is not in the exact format required by the functions used later in the analysis
# Data format is very important to ensure that functions read and analyse data correctly!

# Put gene symbol in the first column
counttable <- counttable_original[,c("Symbol","WT1","WT2","WT3","KO1","KO2","KO3")]

# We don't need the ensembl IDs - get rid of the rownames
row.names(counttable) <- NULL

# Make the gene symbol column rownames instead
rownames(counttable) <- counttable$Symbol
counttable <- counttable[,c("WT1","WT2","WT3","KO1","KO2","KO3")]

We now have our count matrix file (genes - rows, samples - columns) ready for analysis.

In the console, inspect counttable

Experimental design metadata

DE requires some metadata that tells our R libraries about the experimental design of the study, so that it knows how to handle the data. In this analysis, we have two experimental groups, the wildtype (“Wild”) and the knockout (“KO”) groups.

We will create and store this metadata in a specific format required by the R libraries that we will use later. The samples are in rows (sample IDs as rownames), and columns are the experimental groupings. You can have more than one column, but need a minimum of one that describes your experimental groups.

# Define a condition variable, ensuring they match the order of sample IDs in counttable
condition = c("Wild","Wild","Wild","KO","KO","KO")

# Create a dataframe called meta with condition and sample IDs as rownames (taken from counttable)
meta <- data.frame(row.names = colnames(counttable), condition)

In the console, inspect meta

Exploratory analysis

Before performing any DE analysis, it is good to explore and visualise our data. This helps us to understand and get a sense of the quality of our data at this stage of the analysis.

This session will be done in breakout rooms. Work through the code and challenge questions with your facilitators.

Total library size

For each sample, check the total library size. This is essentially the total number of reads which we have aligned to each sample.

# Sum raw gene counts for each sample (each column)
colSums(counttable)
##      WT1      WT2      WT3      KO1      KO2      KO3 
## 25726848 20480345 20655220 18571112 28027328 20111698

How do you think the differences in total library size could affect DE analysis?

Raw data distribution

Here we plot a boxplot of gene level raw counts (y axis) for each sample (x axis).

boxplot(counttable,
        col = "red")

How would you describe what you are seeing in the plot?

# Add 1 to make sure all values are > 0
boxplot(log2((counttable) +1),
        col = "red")

The distribution of counts across samples is not comparable (although not too dissimilar in this case!). This is a consideration to take if you plan to use any statistical tests that assume an equal distribution of counts across samples.

DESeq2

The DESeq2 package contains functions that perform normalization, data transformation, visualization and DE. This is a highly regarded and popular tool. We will use some of its functions to perform exploratory analysis.

This is a large package and we will only scratch the surface of key concepts in this workshop. We recommend you read the DESeq2 paper and manual before performing your own analysis.

Experimental design and the DESeqDataSet object

In order for DESeq2 to perform DE, we need to store data in a DESeqDataSet object (dds) which contains:

  • Our count matrix file
  • Our experimental information (“meta”) file
  • Our design formula

For exploratory analysis, we set design = ~ 1 which tells DESeq2 to be blind to experimental groups. We do not want DESeq2 to account for any within group variability during exploratory analysis and quality checking. This will allow us to observe for any unexpected batch effects.

We will spend more time understanding the dds object later in this workshop.

# We will call this object by name 'dds' as this is a standard practice
dds <- DESeqDataSetFromMatrix(countData = counttable, 
                              colData = meta, 
                              design = ~1)

Data transformation

Count data is transformed with regularized log (rlog) or variance stabilising transformation (vst), required before performing exploratory data analysis such as visualisation and clustering (e.g. PCA). Both methods produce data on the log2 scale, and normalize for other factors such as library size.

rlog performs slightly better, but can take a lot longer than vst if you have many samples. We will set blind = TRUE so that DESeq2 is blind to experimental groups, for the same reasons as previously described.

# Calculate rlog and store it in a dds-like object
rlog <- rlog(dds, blind = TRUE)
rlog.data <- assay(rlog)

We can check the effect of transformation by again using boxplots.

boxplot(rlog.data,
        col = "red")

Notice that the count distribution across samples is much more comparable with rlog transformed data.

Principal component analysis

Principal component analysis (PCA) is a dimensionality reduction method that summarises high dimensionality data into a number of principal components (PCs). For RNA-Seq, our highly dimensional data is our per sample gene-expression data and the variance that exists across samples. We can observe the relationship between these in a 2D space, by plotting two components at a time (usually the top two that account for most of the variance).

Create a PCA plot using rlog transformed data. By default, DESeq2::plotPCA() will colour each dot by the sample’s experimental group, but we have included some additional code to remove these for our discussion :-).

# DESeq2's plotPCA function will create a PCA plot using an object that has rlog or vst values
pcaData <- plotPCA(rlog, returnData=TRUE)

# Convert to percentages
percentVar <- round(100 * attr(pcaData, "percentVar"))

# Plot table
ggplot(pcaData, aes(PC1, PC2)) +
  geom_point(size = 3) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  coord_fixed()

Each dot represents one sample. Samples that appear closer together have a more similar gene expression profile. Can you speculate which samples belong to the same experimental group?

Let’s recreate the plot, now colouring samples by their experimental group.

# DESeq2's plotPCA function will create a PCA plot using an object that has rlog or vst values
DESeq2::plotPCA(rlog)

Can you comment on how the samples cluster together in the plot?

If you saw one red dot cluster more closely with the blue dots, what might this suggest?

Apart from experimental groups, what other relationships might be revealed when looking at PCA plots?

How much of the overall variance is explained by PC1 & PC2?

Plotting the other PCs is something you may want to do until you have explored most of the variation in the dataset and what their potential sources might be. We will not have time to cover this in the workshop, but do recommend you look into scree plots and observing the genes contributing to each PC.

Sample-to-sample distances heatmap

Another way to visualise how similar or dissimilar our samples are is to plot sample distances in a heatmap and a hierarchical cluster.

# dist() calculates eucleadean distance, which requires data to be in a specific format
sampleDists <- dist(t(assay(rlog)))
sampleDistMatrix <- as.matrix(sampleDists)

# Get some pretty blue colours
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)

# Plot the sampleDistMatrix in a heatmap
# pheatmap also calculates and plots hierachical clusters 
pheatmap(sampleDistMatrix,
         clustering_distance_rows = sampleDists,
         clustering_distance_cols = sampleDists,
         col = colors)

Dark blue indicates samples that are more similar (distance = 0).

What do you notice about sample KO3?

This is the end of the breakout room session

Differential expression analysis

We are happy with what we have observed in our exploratory analysis and are finally ready to start DE analysis.

Experimental design

In order for DESeq2 to perform DE, we need to revisit the DESeqDataSet object (dds), this time telling it our experimental design. In our case, this will be the column “condition”, taken from “meta”.

Note: design formulas can be much, much more complex! This gives you the power to model and account for other variation (e.g. you could model batch effects using something like ~ condition + batch)

dds <- DESeqDataSetFromMatrix(countData = counttable,
                              colData = meta,
                              design = ~ condition)

Let’s stop here and take some time to understand dds. In the console, type in the code below:

dds

  • Notice dim - can you tell from this how many genes are are analysing?

counts(dds)

  • This extracts the count matrix out of dds

colData(dds)

  • This extracts our experimental design metadata out of dds

design(dds)

  • This extracts our design formula out of dds

Explicitly set the factors levels

When we perform differential expression and interpret the results, we want to see what changed in the knockout mice (“treatment”) compared to the wildtype mice (“control”) - not the other way around!

The experimental design information in dds is stored as a factor in R (check by running class(dds$condition - without the backslash). By default, R will choose a reference level for factors based on alphabetical order. That means, the knockout group is currently our baseline (check by typing in the console: dds$condition, without the backslash).

Note: the backslashes are required to escape the “$” as they are interpreted differently in Markdown vs R.

We will need to explicitly set “Wild” as the baseline level for easier interpretation of results.

# Set Wild to base level, using relevel
dds$condition <- relevel(dds$condition, "Wild")

# Check that Wild appears as the first level
dds$condition
## [1] Wild Wild Wild KO   KO   KO  
## Levels: Wild KO

Differential expression (DE) analysis

Before we commence with DE, there are some key concepts that you should know.

All DE methods account for the above in their own way. In this workshop, we will use and explore DESeq2’s method.

The DESeq() function

We are finally ready to perform DE analysis with DESeq2’s DESeq() function. This performs a number of steps required to perform DE - the console output gives you a clue as to what these steps are doing.

# Perform DE and store the results back in the dds object
dds <- DESeq(dds)
# Save the results to res
res <- results(dds)

In brief, by default, DESeq2 is:

  • estimating size factors (i.e. total library size), required to normalize data. DESEq2 uses the median of ratios method. There are many other normalization methods, each with their pros and cons, and each to be used and interpreted differently.
  • transforming the data by estimating dispersion (DESeq2’s way of quantifying within group variability). DESeq2 uses a negative binomial distribution model.
  • performing independent filtering to reduce the number of statistical tests to perform. DESeq2 will automatically do this. A common method to do this is by removing lowly expressed genes as these don’t have enough data confidently test for DE (DESeq2 actually recommends this to also reduce the size and memory required by DESeq())

Inspecting the results

Get a summary of results by running the code below.

summary(res)
## 
## out of 19859 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 1860, 9.4%
## LFC < 0 (down)     : 1531, 7.7%
## outliers [1]       : 8, 0.04%
## low counts [2]     : 5006, 25%
## (mean count < 8)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Order by the smallest adjusted p value, and have a look at the top 5/bottom 5 DE genes.

res <- res[order(res$padj), ]
res
## log2 fold change (MLE): condition KO vs Wild 
## Wald test p-value: condition KO vs Wild 
## DataFrame with 19859 rows and 6 columns
##             baseMean log2FoldChange     lfcSE      stat       pvalue
##            <numeric>      <numeric> <numeric> <numeric>    <numeric>
## Sp110       11885.49        7.86072  0.303680   25.8849 9.85207e-148
## Echdc3       1406.80        4.52230  0.218622   20.6854  4.68414e-95
## Krt2        32524.23       -4.47847  0.218163  -20.5281  1.20752e-93
## Tie1         5235.08        3.58214  0.175835   20.3722  2.94985e-92
## Dhtkd1       2395.38        4.34892  0.215641   20.1674  1.89282e-90
## ...              ...            ...       ...       ...          ...
## Ttll8       0.571604       1.180333   3.05766  0.386024     0.699479
## Trav6-4     0.133636       1.050453   4.08047  0.257434     0.796844
## Gm5313      2.446997       0.623172   1.28897  0.483467     0.628764
## Vmn1r-ps66  0.202760       1.050453   4.08047  0.257434     0.796844
## Gm8546      0.520375       2.543885   2.60963  0.974808     0.329656
##                    padj
##               <numeric>
## Sp110      1.46254e-143
## Echdc3      3.47680e-91
## Krt2        5.97521e-90
## Tie1        1.09476e-88
## Dhtkd1      5.61979e-87
## ...                 ...
## Ttll8                NA
## Trav6-4              NA
## Gm5313               NA
## Vmn1r-ps66           NA
## Gm8546               NA

From the above, we can see that DE was performed for KO vs Wild samples for 19,859 genes and 6 columns (6 samples). We then see a table of DE results. The column headers include:

  • baseMean: this is an average of the normalized count values, dividing by size factors, taken over all samples. This gives you a general idea of how many reads were detected over all samples present for any one gene.
  • log2FoldChange: This measures the magnitude of differential expression of a gene. A positive value indicates that the KO expression was higher than Wild (remember the fuss about setting factor levels?). This number is on the logarithmic scale to base 2, e.g. log2 fold change of 1.5 means that the gene’s expression is increased by 2^1.5 = 2.82 times.
  • lfcSE: this is the standard error of the log2FoldChange estimate
  • stat: Wald statistic
  • p-value: Wald test p-value
  • padj: p-value adjusted for multiple testing. This is sometimes referred to as the false discovery rate or FDR. By default, DESeq2 performs this with the Benjamini Hochberg method. Note - DESeq2 will report “NA” (not available) values if multiple testing was not applied for this gene, usually because the counts for these gene were too low or the gene was an extreme outlier.

Defining significance

Differentially expressed genes are usually defined by cut-offs for two metrics, which are the adjusted p-value and the fold change value. We commonly see differential expression defined as genes with:

  • adjusted p-value of < 0.05 (sometimes < 0.1)
  • fold change of 2 (log2 fold change = 1)

This is somewhat arbitrary - we need to have just the right number of differentially expressed genes to perform functional enrichment analysis (around 100 - 3,000 is a general guide). Gene expression should be thought of in a biological context - we care about the “top” most differentially expressed genes.

Subset the data and write out results

Here we will use padj < 0.05 as our cut-off value for significance and use these genes for enrichment analysis.

# Redefine the significance cut-off used for independent filtering (default = 0.1). 
# This should be done if we want to use p adj to a value other than 0.1 
res_padj0.05 <- results(dds, alpha = 0.05)

# Subset the results and write these to an output file
resSig005_subset <- subset(res_padj0.05, padj < 0.05)
write.table(resSig005_subset, 
            "res_DeSeq2_FDR0.05_comparison_Wild_vs_KO_FUllMatrix.tab", 
            sep = "\t", 
            col.names = NA, 
            quote = F)

# Reformat the output results into a data.frame
resSig005_subset <- data.frame(genes = row.names(resSig005_subset), resSig005_subset)

# We can also order padj-filtered results by log fold change
resSig005_subset_lfc <- resSig005_subset[order(resSig005_subset$log2FoldChange), ]

# Notice how our summary of results has changed slightly now
summary(res_padj0.05)
## 
## out of 19859 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 1353, 6.8%
## LFC < 0 (down)     : 984, 5%
## outliers [1]       : 8, 0.04%
## low counts [2]     : 5006, 25%
## (mean count < 8)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Normalized count data can be used for visualisation/other analyses. The code below extracts and prints normalized counts to file.

# Extract normalized counts from dds
normalised_counts <- counts(dds, normalized = TRUE)

# Save normalized counts (tab separated) to file
write.table(normalised_counts, 
            "normalised_all_samples_DeSeq2_FUllMatrix.tab", 
            sep = "\t", 
            col.names = NA, 
            quote = F)

Visualise results

Volcano plot

The volcano plot is a scatterplot that shows magnitude of change (fold change, x axis) against statistical significance (p-value, y axis). It provides an overall visual snapshot of the number of up and downregulated genes that are statistically significant.

# Create a basic volcano plot (scatter plot) with x axis = LogFC, y axis = -log10(pvalue)
resdata <- as.data.frame(res)

# Define whether genes are significantly DE or not and store this in a new column called DE
resdata$Significant <- "No"
resdata$Significant[resdata$log2FoldChange > 1 & resdata$pvalue < 0.05 ] <- "Upregulated"
resdata$Significant[resdata$log2FoldChange < -1 & resdata$pvalue < 0.05 ] <- "Downregulated"

# Create the volcano plot
p <- ggplot(data=resdata,
       aes(x=log2FoldChange, y=-log10(pvalue), col=Significant)) + geom_point()

# Add significance lines at log2FoldChange -1, 1 and pvalue 0.05
p2 <- p + geom_vline(xintercept=c(-1, 1), col = "red") +
    geom_hline(yintercept=-log10(0.05), col = "red")

# Print the plot
p2

Visualise some DE genes

We have applied low read-count filtering followed by appropriate statistical tests using the DESeq2 package for identification of the differentially expressed geens across our conditions of interest.

However we recommend that you visualise a few genes (of specific interest or otherwise) to check if the identification of these genes is supported by sufficient read-counts.

Use plotCounts function to plot normalized counts for a single gene of interest. Here we plot

plotCounts(dds, 
           gene="Dip2b", 
           intgroup="condition")

Choose a significant gene that is downregulated in the knockout mice. Enter the plotCounts code in the grey box below to plot the normalized counts for each sample for the gene you have chosen.

Diagnostic plots

Before we get too excited about our results, we need to confirm that DESeq2’s assumptions were met and that statistical analysis was performed appropriately. We will explore a few plots and concepts to better understand what is happening under the hood.

MA plot

The MA plot provides an overview of the relationship between significantly differentially expressed genes, gene expression and log fold change in the form of a scatter plot. Each dot on the plot represents a single gene, significant genes are coloured as a blue dot. The average gene expression is on the x axis (expressed as a mean of normalized counts) and the log fold change is on the Y axis.

# There is another R function called plotMA, so we need to specify to use DESeq2's plotMA
DESeq2::plotMA(res, ylim = c(-12, 12))

There are a few things we notice:

  • genes with a lower mean expression have higher variable log fold changes (heteroskedatic - as we expected)
  • that gene expression is symmetrical around log fold change 0.

The above describe the expected characteristics of an MA plot - if you have something that looks different, you will need to investigate further and confirm you have no unwanted technical/biological artefacts skewing the results.

Dispersion estimates

The dispersion plot is useful to examine whether your data is meeting DESeq2’s assumptions around heteroskedasticity and that the data fits DESeq2’s model well. Dispersion is how DESeq2 quantifies variability in the data. It considers variance and mean expression within each experimental group.

Let’s use plotDispEsts() to generate the dispersion plot and discuss what this means.

# Plot dispersion estimates using dds
# Note - we have set our experimental design to ~ condition and it is using this to estimate dispersion
plotDispEsts(dds)

There are a few things to note:

  • Dispersion is higher for genes with small mean of normalized counts, and lower for genes with high mean of normalized counts. If you see any other trend, this is a sign that you should not trust DESeq2’s results and that you need to investigate further
  • To transform the data, we need to use the variation observed within each experimental group. We cannot do this accurately with few biological replicates (e.g n =3 for KO, n = 3 for wildtype). DESeq2 assumes that genes with similar expression have a similar level of dispersion to get a more accurate estimation of variability - one of its many benefits! A model (the red curve) is calculated by DESeq2 with this information, using a method called shrinkage. In other words, the red line represents the expected dispersion for any given level of expression
  • The black dots represent each gene and their own dispersion (using within group variance as described above)
  • The gene-wise dispersion estimate (black dots) need to be shrunken towards the red line. This helps to reduce false positive results in our differential expression analysis

There is a lot happening here, but the main point is that our dispersion plot looks as expected and plots should generally appear like this. Check this website for a deeper explanation of this concept, and for examples of what bad dispersion plots look like.

Histogram of p-values

Remember that for every gene, we perform a statistical test to determine whether gene expression is significantly different in the knockout samples, compared to the wildtype. This results in thousands (~20,000 genes in the mouse genome) of p-values. We can look at the histogram of p-values to see how our well our statistical test behaves before we apply correction for multiple testing.

# Bin frequency of p-value counts by 0.05 incremets (i.e plot 20 columns from p-value of 0 to 1)
hist(res$pvalue, 
     breaks = 20, 
     col = "grey")

A nice histogram of p-values will have a peak at the 0.05 end, and a uniform frequency at all other p-value bins. Think back to your null and alternate hypothesis. Under the null hypothesis, there is a 5% chance of genes will fall under p-value 0.05, 10 % for p-value under 0.1, etc. The high peak at the first bin (p-value 0 - 0.5) represents genes that reject the null hypothesis (in addition to all the false discoveries - hence our need to adjust for multiple testing!).

A histogram of p-values that looks anything other than what is described above means that something weird has happened and you may need to contact your local statistician/bioinformatician.

This blog post has a nice explanation of each scenario if you want to explore this further.

Functional enrichment analysis

We now have a list of significant DE genes. To gain greater biological insights on the DE genes we can determine if there is enrichment of known biological functions, pathways or interactions.

There are different methods for identifying functional enrichments, e.g.:

Enrichment analysis can be done using both- freely available web-tools or R-based tools. Today we will use a R-package “clusterprofiler” to perform Over representation analysis (ORA) against the category gene ontology.

Gene ontology (GO) enrichment

One of the most widely-used categorizations is Gene Ontology (GO) established by the Gene Ontology project. To describe the functional roles of genes and gene products, they are categorized into GO terms. GO terms are organized into three independent controlled vocabularies (ontologies) in a species-independent manner: - Biological process: Refers to the biological role involving the gene or gene product, and could include “transcription”, “signal transduction”, and “apoptosis”. A biological process generally involves a chemical or physical change of the starting material or input. - Molecular function: Represents the biochemical activity of the gene product, such activities could include “ligand”, “GTPase”, and “transporter”. - Cellular component: Refers to the location in the cell of the gene product. Cellular components could include “nucleus”, “lysosome”, and “plasma membrane”.

GO enrichment analysis tools will determine GO terms that are enriched when you supply it will a list of DE genes.

clusterProfiler R-package

The clusterProfiler package implements methods to analyze and visualize functional profiles of genes. The clusterProfiler R-library supports functional characteristics of both coding and non-coding genomics data for thousands of species with up-to-date gene annotation. It provides a tidy interface to access, manipulate, and visualize enrichment results to help users achieve efficient data interpretation.

Prepare the DE gene results for enrichment analysis

We will use clusterProfiler to perform enrichment analysis for upregulated and downregulated significant DE genes separately. We can use the resSig005_subset_lfc dataframe created earlier and prepare the data for clusterProfiler.

# We will use the resSig005_subset_lfc dataframe created earlier
# This has been filtered for padj < 0.05 and |LFC| > 1 DE genes

# Upregulated genes
# Filter for significant upregulated genes by log2 fold change > 1. Remove NAs.
sig.up <- resSig005_subset_lfc[resSig005_subset_lfc$log2FoldChange > 1, ]
sig.up <- na.omit(sig.up)

# Create list by decreasing log fold change for upregulated genes
sig.up.LFC <- sig.up$log2FoldChange
names(sig.up.LFC) <- rownames(sig.up)

# Sort by LFC, decreasing
sig.up.LFC <- sort(sig.up.LFC, decreasing = TRUE)

# Downregulated genes - let's do the same thing
sig.dn <- resSig005_subset_lfc[resSig005_subset_lfc$log2FoldChange < -1, ]

# Filter for significant upregulated genes by log2 fold change < -1. Remove NAs.
sig.dn <- na.omit(sig.dn)

# Create list by decreasing log fold change for upregulated genes
sig.dn.LFC <- sig.dn$log2FoldChange
names(sig.dn.LFC) <- rownames(sig.dn)

# Sort by LFC, decreasing
sig.dn.LFC <- sort(sig.dn.LFC, decreasing = TRUE)

You can check that you have correctly prepared the data by inspecting sig.up.LFC and sig.dn.LFC in the console. Use class() to check what type of R class we have stored the data in.

GO enrichment of upregulated genes

clusterprofiler implements the function enrichGO() for gene ontology over-representation test. Let’s run the analysis for up-regulated genes.

You can check what the different parameters in the function do by running the command ?enrichGO. The command will take a few minutes to run.

ego.up <- enrichGO(gene = names(sig.up.LFC),
                   OrgDb = org.Mm.eg.db, 
                   keyType = 'SYMBOL',
                   readable = FALSE,
                   ont = "ALL",
                   pAdjustMethod = "BH",
                   pvalueCutoff = 0.05, 
                   qvalueCutoff = 0.05)

Bar plot of upregulated enriched GO terms

The bar plot is a commonly used method to visualize enriched terms. It depicts the enrichment scores (e.g. p-adj values) and gene count or ratio as bar height and colour.

barplot(ego.up, 
        showCategory = 20)

Dot plot of upregulated enriched GO terms

A dot plot is similar to a scatter plot and bar plot with the capability to encode another score as dot size. In R the dot plot displays the index (each category) in the vertical axis and the corresponding value in the horizontal axis, so you can see the value of each observation following a horizontal line from the label.

dotplot(ego.up, 
        showCategory = 20,
        font.size = 8)

cnetplot of upregulated enriched GO terms

Both the barplot and dotplot only displayed most significant enriched terms, while users may want to know which genes are involved in these significant terms. The cnetplot depicts the linkages of genes and biological concepts (e.g. GO terms or KEGG pathways) as a network.

cnetplot(ego.up, 
         categorySize = "pvalue", 
         foldChange = sig.up.LFC,
         cex_label_gene = 0.7,
         showCategory = 5,
         cex_label_category = 1.2,
         shadowtext = 'category')

Heatmap-like functional classification of upregulated enriched GO terms

The heatplot is similar to cnetplot, while displaying the relationships as a heatmap. The gene-concept network may become too complicated if user want to show a large number significant terms. The heatplot can simplify the result and more easy to identify expression patterns.

heatplot(ego.up,
         showCategory = 20,
         foldChange = sig.up.LFC)

GO enrichment of downregulated genes

A breakout room task: Try re-doing the above steps for downregulated genes. Please remember to rename the variable differently so as to not over-write them e.g. ego.dn in place of ego.up etc…*

This will be a good time to familiarise yourself with the various functions you have used. Remember, you can try and run ?Function_NAME e.g. ?enrichGO to get the help manual pages for that function. You can play with pvalue and qvalue cutoffs, categories etc. Try and interpret the results. We will discuss them at the end of this session.

Other enrich functions in clusterprofiler

ClusterProfiler has other functions:

  • enrichKEGG KEGG
  • enrichWP WikiPathways
  • enrichPathway Reactome
  • enrichr Enrichr resource

Other resources for enrichment analysis

Open source options

And many more …

Commercial options

Interesting read

Urgent need for consistent standards in functional enrichment analysis https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009935#sec007

sessionInfo()

It is good practice to record the version of R and all tools you are using for reproducibility purposes (and for the methods section in your paper!). R’s sessionInfo() prints all of this information.

sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] pathfindR_1.6.1             pathfindR.data_1.1.1       
##  [3] clusterProfiler_4.2.2       biobroom_1.26.0            
##  [5] broom_0.7.9                 org.Mm.eg.db_3.14.0        
##  [7] AnnotationDbi_1.56.2        annotables_0.1.91          
##  [9] biomaRt_2.50.3              pheatmap_1.0.12            
## [11] forcats_0.5.1               stringr_1.4.0              
## [13] purrr_0.3.4                 readr_2.0.0                
## [15] tidyr_1.1.3                 tidyverse_1.3.1            
## [17] tibble_3.1.3                dplyr_1.0.7                
## [19] rstudioapi_0.13             devtools_2.4.2             
## [21] usethis_2.0.1               factoextra_1.0.7           
## [23] ggplot2_3.3.5               gplots_3.1.1               
## [25] RColorBrewer_1.1-2          edgeR_3.36.0               
## [27] limma_3.50.3                DESeq2_1.34.0              
## [29] SummarizedExperiment_1.24.0 Biobase_2.54.0             
## [31] MatrixGenerics_1.6.0        matrixStats_0.62.0-9000    
## [33] GenomicRanges_1.46.1        GenomeInfoDb_1.30.1        
## [35] IRanges_2.28.0              S4Vectors_0.32.4           
## [37] BiocGenerics_0.40.0        
## 
## loaded via a namespace (and not attached):
##   [1] utf8_1.2.2             tidyselect_1.1.1       RSQLite_2.2.7         
##   [4] grid_4.1.0             BiocParallel_1.28.3    scatterpie_0.1.6      
##   [7] munsell_0.5.0          codetools_0.2-18       withr_2.4.2           
##  [10] colorspace_2.0-2       GOSemSim_2.20.0        filelock_1.0.2        
##  [13] highr_0.9              knitr_1.33             DOSE_3.20.1           
##  [16] labeling_0.4.2         GenomeInfoDbData_1.2.7 polyclip_1.10-0       
##  [19] bit64_4.0.5            farver_2.1.0           rprojroot_2.0.2       
##  [22] downloader_0.4         treeio_1.18.1          vctrs_0.3.8           
##  [25] generics_0.1.0         xfun_0.25              BiocFileCache_2.2.1   
##  [28] doParallel_1.0.16      R6_2.5.0               graphlayouts_0.7.1    
##  [31] locfit_1.5-9.4         bitops_1.0-7           cachem_1.0.5          
##  [34] fgsea_1.20.0           DelayedArray_0.20.0    assertthat_0.2.1      
##  [37] scales_1.1.1           ggraph_2.0.5           enrichplot_1.14.2     
##  [40] gtable_0.3.0           processx_3.5.2         tidygraph_1.2.0       
##  [43] rlang_0.4.11           genefilter_1.76.0      splines_4.1.0         
##  [46] lazyeval_0.2.2         BiocManager_1.30.16    yaml_2.2.1            
##  [49] reshape2_1.4.4         modelr_0.1.8           backports_1.2.1       
##  [52] qvalue_2.26.0          tools_4.1.0            ellipsis_0.3.2        
##  [55] jquerylib_0.1.4        sessioninfo_1.1.1      Rcpp_1.0.7            
##  [58] plyr_1.8.6             progress_1.2.2         zlibbioc_1.40.0       
##  [61] RCurl_1.98-1.3         ps_1.6.0               prettyunits_1.1.1     
##  [64] viridis_0.6.1          haven_2.4.3            ggrepel_0.9.1         
##  [67] fs_1.5.0               magrittr_2.0.1         data.table_1.14.0     
##  [70] DO.db_2.9              reprex_2.0.1           ggnewscale_0.4.5      
##  [73] pkgload_1.2.1          hms_1.1.0              patchwork_1.1.1       
##  [76] evaluate_0.14          xtable_1.8-4           XML_3.99-0.6          
##  [79] readxl_1.3.1           gridExtra_2.3          testthat_3.0.4        
##  [82] compiler_4.1.0         shadowtext_0.0.8       KernSmooth_2.23-20    
##  [85] crayon_1.4.1           htmltools_0.5.1.1      ggfun_0.0.2           
##  [88] tzdb_0.1.2             geneplotter_1.72.0     aplot_0.0.6           
##  [91] lubridate_1.7.10       DBI_1.1.1              tweenr_1.0.2          
##  [94] dbplyr_2.1.1           MASS_7.3-54            rappdirs_0.3.3        
##  [97] Matrix_1.3-3           cli_3.0.1              parallel_4.1.0        
## [100] igraph_1.2.6           pkgconfig_2.0.3        rvcheck_0.1.8         
## [103] foreach_1.5.1          xml2_1.3.2             ggtree_3.2.1          
## [106] annotate_1.72.0        bslib_0.2.5.1          XVector_0.34.0        
## [109] rvest_1.0.1            yulab.utils_0.0.5.001  callr_3.7.0           
## [112] digest_0.6.27          Biostrings_2.62.0      rmarkdown_2.10        
## [115] cellranger_1.1.0       fastmatch_1.1-3        tidytree_0.3.4        
## [118] curl_4.3.2             gtools_3.9.2           lifecycle_1.0.0       
## [121] nlme_3.1-152           jsonlite_1.7.2         desc_1.3.0            
## [124] viridisLite_0.4.0      fansi_0.5.0            pillar_1.6.2          
## [127] lattice_0.20-44        KEGGREST_1.34.0        fastmap_1.1.0         
## [130] httr_1.4.2             pkgbuild_1.2.0         survival_3.2-11       
## [133] GO.db_3.14.0           glue_1.4.2             remotes_2.4.0         
## [136] iterators_1.0.13       png_0.1-7              bit_4.0.4             
## [139] ggforce_0.3.3          stringi_1.7.3          sass_0.4.0            
## [142] blob_1.2.2             caTools_1.18.2         memoise_2.0.0         
## [145] ape_5.5

Authors and acknowledgement

The authors of the code contained in this .Rmd document are:

We hope you have found it useful - you are welcome to use and modify for this for your own analysis. If you have found it helpful, please support us by acknowledging us.

The authors acknowledge bioinformatics support and advanced computing resources provided by the Sydney Informatics Hub, a Core Research Facility at the University of Sydney, Pawsey Supercomputing Research Centre, Queensland Cyberinfrastructure Foundation (QCIF) and Australia’s National Research Education Network (AARNet) enabled through the Australian BioCommons (NCRIS via Bioplatforms Australia).

Knit your document

If you want to knit this .Rmd file to a HTML file, click Knit > Knit to HTML. You will then need to copy the HTML file from Nimbus to your local computer.

Transfer your html document to local computer

  1. Open a terminal on you local computer. Navigate to a path of your choice.

  2. Run the following command in the terminal on your local computer (make sure to replace the IP address with your own)

scp training@146.118.64.117:~/base_directory/working_directory/RNASeq_Part2_*.html ./
  1. Open the html document using a local browser such as Chrome.