0. Working directory

Ensure the ‘workshop’ directory is your current working directory:

getwd()
## [1] "/home/user3/workshop"

1. Load input data and extract ranked gene list

We will once again load the Pezzini RNAseq dataset.

data <- read_tsv("Pezzini_DE.txt", col_names = TRUE, show_col_types = FALSE)
head(data)

This time, instead of filtering for DEGs, we will extract all genes, and sort them by fold change (largest to smallest) to make our ranked gene list for GSEA.

What class of R object does our ranked gene list need to be in?

?gseKEGG

We can see geneList = order ranked geneList. Not informative!

Unfortunately, the required type of object is detailed under the enrichKEGG function, not the gseKEGG function! For enrichKEGG, the parameter gene is described as requiring “a vector of entrez gene id”, yet for gseKEGG, the description for geneList is “order ranked geneList”. There is a little bit of sleuthing required at times!

# Ranked gene list vector for GSEA
ranked <- setNames(data$Log2FC[order(-data$Log2FC)], data$Gene.ID[order(-data$Log2FC)])

# Inspect the vector                
head(ranked)
## ENSG00000079931 ENSG00000205403 ENSG00000169903 ENSG00000165973 ENSG00000163273 
##       10.263151        9.472641        9.395927        9.019168        8.983136 
## ENSG00000169908 
##        8.793216
tail(ranked)
## ENSG00000151615 ENSG00000125538 ENSG00000250366 ENSG00000012223 ENSG00000108688 
##       -7.436121       -7.685643       -7.759835       -7.829832       -7.887188 
## ENSG00000169429 
##       -8.350671
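
As a quick sanity check on the shape of this object, the sketch below builds a ranked vector from a small made-up table and confirms it is a named numeric vector sorted from largest to smallest. The `toy` data frame and its gene names are hypothetical and are not taken from the Pezzini dataset.

```r
# Toy differential expression table (hypothetical values, for illustration only)
toy <- data.frame(
  Gene.ID = c("geneA", "geneB", "geneC", "geneD"),
  Log2FC  = c(-1.2, 3.4, 0.5, -2.8),
  stringsAsFactors = FALSE
)

# Compute the ordering once, then reuse it for both values and names
ord <- order(toy$Log2FC, decreasing = TRUE)
toy_ranked <- setNames(toy$Log2FC[ord], toy$Gene.ID[ord])

# Checks worth running before GSEA
stopifnot(
  is.numeric(toy_ranked),              # values are numeric
  !is.unsorted(rev(toy_ranked)),       # sorted largest to smallest
  !anyDuplicated(names(toy_ranked)),   # gene IDs are unique
  !anyNA(toy_ranked)                   # no missing fold changes
)
toy_ranked
```

The same four checks can be run on `ranked` itself.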

2. Check gseKEGG function arguments and requirements

Review the parameters:

?gseKEGG

Most of those defaults look suitable to start.

Our data is human, so the default organism = "hsa" argument is correct. If you were working with a species other than human, you would first need to obtain your organism code. You can derive this from KEGG Organisms or using the clusterProfiler function search_kegg_organism.

Pick your favourite species and search for the KEGG organism code by editing the variable ‘fave’ then executing the code chunk:

fave <- "horse"

# search by common_name or scientific_name or kegg.code 
search_kegg_organism(fave, by = "common_name")

Now back to the parameters - the defaults for P value correction and filtering, and gene set size limits are acceptable. We don’t want to use internal data (i.e. we want to search against the latest KEGG online) so FALSE is apt here, and we do want to use the fgsea algorithm for analysis.

For the seed parameter, we should provide a value, to ensure results are the same each time the command is run. By setting a seed, you fix the sequence of random numbers generated within the GSEA algorithm.
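
To see what fixing a seed does, here is a minimal base R sketch that is independent of GSEA. It uses `sample()` as a stand-in for any procedure that draws random numbers.

```r
# Same seed, same sequence of random draws
set.seed(123)
first_run <- sample(10)

set.seed(123)
second_run <- sample(10)

identical(first_run, second_run)  # TRUE: the two runs match exactly

# Without resetting the seed, the generator simply continues its sequence
third_run <- sample(10)
```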

We also need to check the keyType (gene namespace) against the input data that we have. From the help page, we can see the supported namespaces for the KEGG database are one of ‘kegg’, ‘ncbi-geneid’, ‘ncbi-proteinid’ or ‘uniprot’.

Going back to where we loaded the input data and ran head to view the first few rows, we can see our input has ENSEMBL gene IDs as well as official gene symbols. ENSEMBL gene IDs are generally preferable for bioinformatics analyses because they are more stable and less ambiguous than gene symbols.

Since our input data does not match any of the valid namespaces, we need to convert gene IDs! clusterProfiler has the bitr function to do this. BiomaRt is also a popular R package for this task.
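
A quick way to confirm which namespace a set of IDs belongs to is a pattern check. The sketch below is an informal heuristic, not part of clusterProfiler: it assumes human ENSEMBL gene IDs look like `ENSG` followed by 11 digits and that Entrez IDs are purely numeric. The first three IDs are copied from output shown in this notebook; `TP53` is included only as an example of a gene symbol.

```r
ids <- c("ENSG00000079931", "ENSG00000205403", "26002", "TP53")

# Heuristic patterns (assumptions, see text above)
is_ensembl_gene <- grepl("^ENSG[0-9]{11}$", ids)
is_entrez_like  <- grepl("^[0-9]+$", ids)

data.frame(id = ids, is_ensembl_gene, is_entrez_like)
```

Running `table(grepl("^ENSG[0-9]{11}$", names(ranked)))` on the real vector would show whether every name fits the ENSEMBL pattern.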

3. Convert gene IDs

Check the usage for the bitr function:

?bitr

We need to understand what the valid fromType and toType values are, and it turns out we need an OrgDb object to use bitr! This comes from a Bioconductor annotation package, of which there are currently only 20. So while the gseKEGG function supports all organisms in KEGG, performing gene ID conversion within clusterProfiler may not be possible for non-model species and you would need to seek a different method.

We have already loaded the org.Hs.eg.db annotation library. We can use this to search keytypes:

keytypes(org.Hs.eg.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
## [16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
## [21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [26] "UNIPROT"

Converting to ENTREZ will be compatible with kegg. So our fromType is ENSEMBL and toType is ENTREZID. Note that these are case sensitive!

If we were to run enrichGO or gseGO, which require a Bioconductor OrgDb package, we would not need to do a conversion, as both of the gene ID types within our input data (ENSEMBL and SYMBOL) are natively supported.

Gene ID conversion often results in duplicates. The below code performs the reformatting and handles duplicates by keeping only the first occurrence of duplicate Entrez IDs within the input. Note that this is not ideal and for a real experiment you should print out a list of duplicates, carefully review these and choose which and how to retain based on your biological context.

First, convert the ENSEMBL gene IDs from our ‘ranked’ list to ENTREZ gene IDs:

# Convert the Ensembl IDs to Entrez IDs, dropping NAs
converted_ids <- bitr(names(ranked), 
                      fromType = "ENSEMBL", 
                      toType = "ENTREZID", 
                      OrgDb = org.Hs.eg.db, 
                      drop = TRUE)
## 'select()' returned 1:many mapping between keys and columns
## Warning in bitr(names(ranked), fromType = "ENSEMBL", toType = "ENTREZID", :
## 0.78% of input gene IDs are fail to map...

<1% failing to map is pretty good.

The 1:many mappings warning means that some of our gene IDs matched more than 1 ENTREZ ID. We need to ensure that the final list we provide to GSEA does not contain duplicates. This needs to happen at two stages: first, ensuring that each ENSEMBL ID is mapped to only one ENTREZ ID, and then, once the final converted vector has been created, checking it for duplicated ENTREZ IDs, which could occur when two different ENSEMBL IDs from our input map to the same ENTREZ ID.

The below code does this by selecting the first of each set of duplicates. This is not ideal. In a real experiment, you should print out the duplicates and directly manage how to handle them by reviewing the gene IDs involved and deciding whether it is valid to select one ID over another, or at times you may choose to merge the values for duplicate genes.

# Handle any `1:many mapping` (ENSEMBL ID matches >1 ENTREZID) by keeping only the first Entrez ID for each Ensembl ID (not ideal, see note above)
converted_ids <- converted_ids[!duplicated(converted_ids$ENSEMBL), ]

# Keep only the rows with valid Entrez IDs
converted_ids <- converted_ids[!is.na(converted_ids$ENTREZID), ]

# Filter the ranked gene list to include only the successfully mapped Ensembl IDs
ranked_entrez <- ranked[names(ranked) %in% converted_ids$ENSEMBL]

# Replace the Ensembl IDs with the corresponding Entrez IDs in the ranked vector
names(ranked_entrez) <- converted_ids$ENTREZID[match(names(ranked_entrez), converted_ids$ENSEMBL)]

# Remove duplicates from the vector by keeping only the first occurrence of each Entrez IDs (not ideal, see note above) 
ranked_entrez <- ranked_entrez[!duplicated(names(ranked_entrez))]

# Inspect the vector                    
head(ranked_entrez)
##     26002      3426      7104      4745      4880      4071 
## 10.263151  9.472641  9.395927  9.019168  8.983136  8.793216
tail(ranked_entrez)
##      5458      3553 100507043      4057      6354      3576 
## -7.436121 -7.685643 -7.759835 -7.829832 -7.887188 -8.350671

Looks good! Now we are ready to enrich!
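
Before moving on, here is one way the duplicate review recommended above could be sketched. The `toy_map` table is hypothetical and only mimics the two-column shape of bitr output; on real data you would run the same filters on the unfiltered bitr result.

```r
# Toy mapping table in the same shape as bitr() output (hypothetical IDs)
toy_map <- data.frame(
  ENSEMBL  = c("ENSG_A", "ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"),
  ENTREZID = c("101",    "102",    "103",    "103",    "104"),
  stringsAsFactors = FALSE
)

# ENSEMBL IDs that map to more than one Entrez ID (1:many)
dup_ensembl  <- toy_map$ENSEMBL[duplicated(toy_map$ENSEMBL)]
multi_entrez <- toy_map[toy_map$ENSEMBL %in% dup_ensembl, ]

# Entrez IDs reached from more than one ENSEMBL ID (many:1)
dup_entrez    <- toy_map$ENTREZID[duplicated(toy_map$ENTREZID)]
multi_ensembl <- toy_map[toy_map$ENTREZID %in% dup_entrez, ]

# Print both sets so they can be reviewed by hand
print(multi_entrez)
print(multi_ensembl)
```

Both tables show every row involved in a duplicate, not just the second occurrence, which makes it easier to decide which ID to keep.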

4. Run GSEA over KEGG database

Now run GSEA over the KEGG database with the gseKEGG function, setting a seed value and kegg as our gene ID type. This may take a minute or two to run.

gsea_kegg <- gseKEGG( 
  ranked_entrez, # ranked gene list vector object
  organism = "hsa", # species
  keyType = "kegg", # gene ID type/namespace
  exponent = 1, # weight of each step
  minGSSize = 10, # minimum gene set size
  maxGSSize = 500, # maximum gene set size
  eps = 1e-10, # sets the boundary for calculating the p value
  pvalueCutoff = 0.05, # adjusted P value cutoff for enriched terms
  pAdjustMethod = "BH", # multiple testing correction with BH method 
  verbose = TRUE, # print output as it runs
  use_internal_data = FALSE, # use latest online KEGG data not local data
  seed = 123, # set seed for random number generation, for reproducibility 
  by = "fgsea" # GSEA algorithm 
)
## Reading KEGG annotation online: "https://rest.kegg.jp/link/hsa/pathway"...
## Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/hsa"...
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.01% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : For some of the pathways the P-values were likely overestimated. For
## such pathways log2err is set to NA.
## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : For some pathways, in reality P-values are less than 1e-10. You can
## set the `eps` argument to zero for better estimation.
## leading edge analysis...
## done...

We have 3 warnings. The first is about ties in the ranked list, which we could resolve manually by reviewing the raw data, or simply ignore, as ties affect only 0.01% of the list.

The second warning (P values likely overestimated for some pathways) may be resolved by increasing the number of permutations with nPermSimple = 10000. How frustrating that this parameter is not described within the gseKEGG help menu, nor the clusterProfiler PDF at all! The third warning suggests setting eps to zero.

Let’s rerun following both suggestions, increasing the permutations (nPermSimple = 10000) and setting eps = 0. We will keep the rest of the arguments the same as before.

gsea_kegg <- gseKEGG( 
  ranked_entrez, 
  organism = "hsa", 
  keyType = "kegg", 
  exponent = 1, 
  minGSSize = 10, 
  maxGSSize = 500, 
  eps = 0, # changed to zero 
  pvalueCutoff = 0.05, 
  pAdjustMethod = "BH",
  verbose = TRUE, 
  use_internal_data = FALSE, 
  seed = 123, 
  by = "fgsea",
  nPermSimple = 10000 # added permutations
)
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.01% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## leading edge analysis...
## done...

Great, those last 2 warnings have resolved and we only have the expected one about tied rankings.
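
If you did want to review the tied genes, a base R sketch along the following lines would list them. The `toy_stats` vector is made up, and the percentage computed here is simply the share of genes with a tied value, which may not be calculated in exactly the same way as the figure in the fgsea warning.

```r
# Toy ranked vector with two pairs of tied values (hypothetical)
toy_stats <- c(g1 = 2.5, g2 = 1.0, g3 = 1.0, g4 = -0.3, g5 = -0.3, g6 = -4.1)

# Flag every gene whose value occurs more than once
is_tied <- duplicated(toy_stats) | duplicated(toy_stats, fromLast = TRUE)
tied <- toy_stats[is_tied]
tied

# Share of the list involved in ties
pct_tied <- 100 * length(tied) / length(toy_stats)
pct_tied
```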

5. Visualise GSEA KEGG results

Tabular results

First, let’s preview the results.

print(gsea_kegg)
## #
## # Gene Set Enrichment Analysis
## #
## #...@organism     hsa 
## #...@setType      KEGG 
## #...@keytype      kegg 
## #...@geneList     Named num [1:14294] 10.26 9.47 9.4 9.02 8.98 ...
##  - attr(*, "names")= chr [1:14294] "26002" "3426" "7104" "4745" ...
## #...nPerm     
## #...pvalues adjusted by 'BH' with cutoff <0.05 
## #...43 enriched terms found
## 'data.frame':    43 obs. of  11 variables:
##  $ ID             : chr  "hsa04110" "hsa04657" "hsa03040" "hsa04062" ...
##  $ Description    : chr  "Cell cycle" "IL-17 signaling pathway" "Spliceosome" "Chemokine signaling pathway" ...
##  $ setSize        : int  154 65 129 129 100 31 36 52 182 279 ...
##  $ enrichmentScore: num  -0.562 -0.684 -0.54 -0.531 -0.523 ...
##  $ NES            : num  -2.32 -2.47 -2.18 -2.15 -2.04 ...
##  $ pvalue         : num  6.13e-12 6.52e-10 2.36e-09 6.71e-09 1.35e-06 ...
##  $ p.adjust       : num  2.08e-09 1.11e-07 2.67e-07 5.69e-07 9.14e-05 ...
##  $ qvalue         : num  1.64e-09 8.76e-08 2.12e-07 4.50e-07 7.23e-05 ...
##  $ rank           : num  1957 251 5019 1534 5331 ...
##  $ leading_edge   : chr  "tags=42%, list=14%, signal=37%" "tags=18%, list=2%, signal=18%" "tags=66%, list=35%, signal=43%" "tags=22%, list=11%, signal=20%" ...
##  $ core_enrichment: chr  "11200/9587/10971/10735/23244/64682/1387/5933/9126/9232/27085/7027/10274/994/5591/8379/9134/4175/5347/7465/80174"| __truncated__ "2353/4318/7128/6347/5743/3627/2920/4312/7124/3553/6354/3576" "9716/6637/57819/6429/4116/4809/10569/84991/6626/25804/22827/11017/9092/9984/9775/102724594/51729/151903/22938/2"| __truncated__ "2870/7074/9475/57580/408/4790/1794/53358/107/4792/6011/109/10663/6376/2786/157/9844/6373/9547/6366/6347/114/362"| __truncated__ ...
## #...Citation
## S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang, W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize multiomics data. Nature Protocols. 2024, doi:10.1038/s41596-024-01020-z

At the line ## 'data.frame': 43 obs. of 11 variables, 43 is the number of significant enriched terms, and 11 is the number of columns in the output dataframe.

Extract results to a TSV file. This will print the significant enrichments, sorted by adjusted P value.

write.table(gsea_kegg, file = "clusterProfiler_gseKEGG_results.tsv", sep = "\t", quote = FALSE, row.names = FALSE)

You can view the table in the editor pane by clicking on it from the Files pane.

Plots

One of the key advantages of using R over web tools is flexibility with visualisations. Next we will produce a range of plot types from the enrichplot package, developed by the same authors as clusterProfiler.

In this package, some of the plots can be used to plot ORA or GSEA results, and others only for one type or the other.

To determine which plot functions are compatible with which FEA results type, you can check the help menu. If you see ## S4 method for signature 'enrichResult' this plot is compatible with ORA results object. If you see ## S4 method for signature 'gseaResult' this plot is compatible with GSEA results object.

Review the help menu for the dotplot function, which can plot both ORA and GSEA:

?enrichplot::dotplot

And review the help menu for ridgeplot which can only plot GSEA results:

?enrichplot::ridgeplot

Unfortunately, enrichplot is only compatible with the enrichment results from the packages produced by this development team, although the desire to use enrichplot with the output of other tools is widespread. The R package multienrichjam has a function enrichDF2enrichResult that converts ORA dataframe results from other FEA tools to the format required for enrichplot.

multienrichjam has a lot of dependencies and has not been installed on these VMs so we will not be performing this today. However, this functionality and flexibility is pretty cool, so if you wanted to install this on your own computer outside the workshop, below is the code for installing :-)

# github remotes package required:
#install.packages("remotes")
#library(remotes)

# install multienrichjam:
#remotes::install_github("jmw86069/multienrichjam", dependencies=TRUE)
#library(multienrichjam)

# check function help: 
#?enrichDF2enrichResult

dotplot

A dotplot is easy to interpret. The size of each dot is proportional to the number of genes (Count) in that term. A higher GeneRatio indicates that a larger proportion of the gene set is associated with the observed changes in gene expression.

We can change the number of top terms that are plotted with the showCategory parameter.

enrichplot::dotplot(
  gsea_kegg,
  x = "geneRatio",
  color = "p.adjust",
  orderBy = "x",
  showCategory = 15,
  font.size = 8
  ) +
  ggtitle("GSEA KEGG")
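
To make GeneRatio concrete, the sketch below recomputes it by hand for one term. It assumes that, for GSEA results, enrichplot takes Count as the number of core enrichment genes and GeneRatio as Count divided by setSize; that is our reading of the plot rather than something documented in this notebook. The values are copied from the ‘IL-17 signaling pathway’ (hsa04657) row printed earlier.

```r
# Values copied from the hsa04657 row of the printed results
core_enrichment <- "2353/4318/7128/6347/5743/3627/2920/4312/7124/3553/6354/3576"
set_size <- 65

# Split the "/"-separated string into individual Entrez IDs
core_genes <- strsplit(core_enrichment, "/", fixed = TRUE)[[1]]
count <- length(core_genes)

# Assumed definition: GeneRatio = Count / setSize
gene_ratio <- count / set_size
round(100 * gene_ratio)  # 18, in line with "tags=18%" in the leading_edge column
```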

upsetplot

The upset plot shows the pattern of shared genes among the enriched terms (since a gene can belong to multiple terms or pathways).

For GSEA, the plot includes fold change distributions on the Y axis.

n controls the number of terms plotted - here we have restricted to 15 for ease of viewing. Plot view is quite small within a notebook, but if you wanted to plot more terms, you could print to an A4 size output file for enhanced clarity.

# n = the number of top enrichments to plot
enrichplot::upsetplot(gsea_kegg, n = 15)
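
Printing to an A4-sized file could look like the sketch below. It uses a placeholder base R plot and a temporary file so that it runs on its own; the file name is hypothetical. In the notebook you would replace the `plot()` call with the enrichplot call, wrapped in `print()` if it sits inside a function or loop.

```r
# A4 landscape is roughly 11.7 x 8.3 inches
out_file <- file.path(tempdir(), "example_plot_A4.pdf")

pdf(out_file, width = 11.7, height = 8.3)
plot(1:10, main = "Placeholder plot")  # swap in the plot of interest here
dev.off()

file.exists(out_file)
```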

emapplot

The enrichment map plot organises enriched terms into a network with edges connecting overlapping gene sets. Overlapping gene sets cluster together.

It is required to run the function pairwise_termsim before producing the plot.

When running pairwise_termsim, we can overwrite the existing object rather than saving the output to a new one, because the similarity information is added to the object and does not change any other attributes.

Each node is a term, and the number of genes associated with the term is shown by the dot size, with P values by dot colour.

# calculate pairwise similarities between the enriched terms
gsea_kegg <- enrichplot::pairwise_termsim(gsea_kegg)

# plot
enrichplot::emapplot(gsea_kegg, showCategory = 25)
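
To illustrate what a pairwise term similarity is, here is a base R sketch using the Jaccard coefficient (shared genes divided by total distinct genes). We believe Jaccard is the default method in pairwise_termsim, but treat that as an assumption and check ?pairwise_termsim. The gene sets are made up.

```r
# Toy gene sets (hypothetical)
toy_sets <- list(
  termA = c("g1", "g2", "g3", "g4"),
  termB = c("g3", "g4", "g5"),
  termC = c("g6", "g7")
)

# Jaccard coefficient: size of intersection over size of union
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Fill a symmetric term-by-term similarity matrix
idx <- seq_along(toy_sets)
sim <- outer(idx, idx, Vectorize(function(i, j) jaccard(toy_sets[[i]], toy_sets[[j]])))
dimnames(sim) <- list(names(toy_sets), names(toy_sets))
sim
```

Here termA and termB share 2 of 5 distinct genes (similarity 0.4), so they would be joined by an edge, while termC shares nothing with either.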

treeplot

The treeplot provides the same information as the emapplot but presented in a different way.

Terms that share more genes or biological functions will be closer together in the tree structure. Clades are colour coded and ‘cluster tags’ assigned. You can control the number of words in the tag (default is 4). The user guide describes the argument nWords however running that will throw an error (it says ‘warning’ but it is fatal so to me that’s an error!):

“Warning: Use ‘cluster.params = list(label_words_n = your_value)’ instead of ‘nWords’. The nWords parameter will be removed in the next version.”

This plot also requires the pairwise similarity matrix calculation that emapplot requires. Since we have already run it, it is commented out in the code chunk below.

Plot view within the notebook affects the clarity of the cluster tags, however printing this plot to a file would resolve that.

# calculate pairwise similarities between the enriched terms
#gsea_kegg <- enrichplot::pairwise_termsim(gsea_kegg)

enrichplot::treeplot(gsea_kegg, showCategory = 15, color = "p.adjust", cluster.params = list(label_words_n = 5))

cnetplot

The cnetplot depicts the linkages of genes and terms as a network. This is helpful for understanding which genes are involved in the enriched terms, adding a level of detail not offered by the plots we have generated so far.

For GSEA, where all genes (not just DEGs) are used, only the ‘core’ enriched genes are used to create the network plot. These are the ‘leading edge genes’, those genes up to the point where the Enrichment Score (ES) gets maximised from the base zero. In other words, the subset of genes that are most strongly associated with a specific term.
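
The idea of a leading edge can be sketched with a simplified running enrichment score. This is an illustration of the weighted running-sum statistic only, not the fgsea implementation: it computes no P values or NES, and the function name `running_es` and the toy data are hypothetical.

```r
# Simplified running enrichment score for one gene set (illustrative only)
running_es <- function(stats, gene_set, exponent = 1) {
  stats <- sort(stats, decreasing = TRUE)
  in_set <- names(stats) %in% gene_set

  # Hits step up in proportion to |statistic|^exponent; misses step down equally
  hit_weight <- ifelse(in_set, abs(stats)^exponent, 0)
  step <- hit_weight / sum(hit_weight) - (!in_set) / sum(!in_set)
  running <- cumsum(step)

  # ES is the running-sum value furthest from zero
  peak <- which.max(abs(running))
  es <- running[[peak]]

  # Leading edge: set members up to the peak (positive ES)
  # or from the trough onward (negative ES)
  position <- seq_along(stats)
  keep <- if (es >= 0) in_set & position <= peak else in_set & position >= peak

  list(running = running, es = es, leading_edge = names(stats)[keep])
}

# Toy ranked list and gene set (hypothetical)
toy_stats <- c(g1 = 3, g2 = 2.5, g3 = 2, g4 = 1, g5 = 0.5,
               g6 = -0.5, g7 = -1, g8 = -2, g9 = -2.5, g10 = -3)
toy_set <- c("g1", "g2", "g8")

res <- running_es(toy_stats, toy_set)
res$es
res$leading_edge
```

In this toy case the score peaks after g2, so g1 and g2 form the leading edge, while g8, which sits far down the list, is excluded.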

There are a few parameters to play around with here to get a readable plot.

Try changing the number of terms plotted (showCategory).

Try changing the node_label which controls whether labels are put on terms (category), genes (item), or both (all).

Because the information being plotted is complex, having a large number of terms and attempting to label everything won’t look very informative! If you want to plot both gene IDs and term names, you will need to plot a small number of categories.

# cex_label_gene to reduce the font size for gene ID
# node_label = one of ’all’, ’none’, ’category’ and ’item’
# showCategory = number of terms to plot

enrichplot::cnetplot(gsea_kegg, showCategory = 8, cex_label_gene = 0.5, cex_label_category = 0.8, colorEdge = TRUE, node_label = "category")

With 8 terms and no gene IDs, we don’t get any more detail than from the treeplot or emapplot.

This plot is really useful for showing a detailed look at a small number of terms. Just plotting the top 3 terms may not look very helpful (try plotting the top 3!)

A useful application is plotting the interaction between specific terms. This is helpful to obtain gene ID level resolution for term interactions of interest.

The 3 terms listed below are for ‘IL-17 signaling pathway’, ‘Viral protein interaction with cytokine and cytokine receptor’ and ‘Chemokine signaling pathway’ which are among the top 10 enrichments with a relationship of shared genes, evident from the plot above.

Run the code below or select a handful of terms of your choosing from the results table we printed earlier. We need the KEGG ID (column 1).

# Select terms of interest 
select_terms <- c("hsa04657", "hsa04061", "hsa04062") 

# Filter the gseaResult object based on the term IDs
# must keep it as a gsea result type and not as a dataframe when filtering
gsea_kegg_select <- gsea_kegg
gsea_kegg_select@result <- gsea_kegg@result[gsea_kegg@result$ID %in% select_terms, ]

# Plot with cnetplot
enrichplot::cnetplot(gsea_kegg_select, cex_label_gene = 0.5, cex_label_category = 0.8, colorEdge = TRUE, node_label = "all")

Now we can clearly see the gene IDs for this network.

heatplot

Like the upset plot, the heatplot shows shared genes across enriched terms. When there are a lot of genes, it can be poorly readable, especially within the notebook. This function does not include a parameter to restrict to ‘core’ genes like the cnetplot or ridgeplot so even when plotting a very small number of enriched terms, the X-axis is too cramped to be readable.

enrichplot::heatplot(gsea_kegg, showCategory = 3)

The gseKEGG result contains a data column containing the core/leading edge genes:

colnames(gsea_kegg@result)
##  [1] "ID"              "Description"     "setSize"         "enrichmentScore"
##  [5] "NES"             "pvalue"          "p.adjust"        "qvalue"         
##  [9] "rank"            "leading_edge"    "core_enrichment"

So we can extract those and provide these to the heatmap to try to improve clarity. The downside of course is that we then lose some of the information about which genes are shared across terms.

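# create an object that contains the fold change values of the core genes across all enriched terms
# core_enrichment holds "/"-separated Entrez IDs per term, so split it before matching against the Entrez-named vector
core_ids <- unique(unlist(strsplit(gsea_kegg@result$core_enrichment, "/", fixed = TRUE)))
ranked_core_genes <- ranked_entrez[names(ranked_entrez) %in% core_ids]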

# Plot with just core genes
enrichplot::heatplot(gsea_kegg, foldChange = ranked_core_genes, showCategory = 3)
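
Because the core genes are stored as one "/"-separated string per term, a little parsing is needed to see which genes are shared. The base R sketch below does this on made-up strings in the same format; `toy_core` and its IDs are hypothetical.

```r
# Toy core_enrichment strings, one per term (hypothetical IDs)
toy_core <- c(
  termA = "101/102/103",
  termB = "102/103/104",
  termC = "103/105"
)

# Split each string into a vector of gene IDs, keeping the term names
genes_by_term <- strsplit(toy_core, "/", fixed = TRUE)

# Count how many terms each gene appears in
membership <- table(unlist(genes_by_term))
membership

# Genes that belong to the core of more than one term
shared <- names(membership)[membership > 1]
shared
```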

Saving to a file only improves things slightly.

png("clusterprofiler_gseKEGG_heatplot.png", width = 11.7, height = 8.3, units = "in", res = 300)
#enrichplot::heatplot(gsea_kegg, showCategory = 3)
enrichplot::heatplot(gsea_kegg, foldChange = ranked_core_genes, showCategory = 3)
dev.off()
## png 
##   2

ridgeplot

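This function is specific to GSEA. It does not apply to ORA, since it plots the distribution of the ranking metric (here, fold change) for the genes in each enriched term, which requires a ranked gene list.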

The height of the ridge in the ridge plot represents the density of genes from the gene set at different points in the ranked list.

The area under the curve represents the distribution of these genes across the ranked list.

Higher peaks and more concentrated areas under the curve at the top of the list indicate stronger enrichment.

enrichplot::ridgeplot(gsea_kegg, 
  showCategory = 15, 
  fill = "p.adjust", 
  core_enrichment = TRUE, 
  label_format = 30, 
  orderBy = "NES", 
  decreasing = FALSE )
## Picking joint bandwidth of 0.601

gseaplot

Another GSEA-specific plot, which shows the running enrichment score (ES) as the analysis walks down the ranked gene list, along with where the genes from the set fall within that list.

The gseaplot can plot one term at a time. You would run GSEA plots for key terms of interest in your results.

The term ‘hsa04020’ (Calcium signaling pathway) has a positive NES, an adjusted P value of 0.00046 and a set size of 279.

gene_set_id <- "hsa04020"

enrichplot::gseaplot( gsea_kegg,
  gene_set_id, 
  by = "all", 
  title = paste("GSEA KEGG result:", gene_set_id),
  color = "black", 
  color.line = "green", 
  color.vline = "#FA5860"
)

The term ‘hsa04062’ (Chemokine signaling pathway) has a negative NES, a highly significant adjusted P value of 5.69E-07 and a set size of 129.

gene_set_id <- "hsa04062"

enrichplot::gseaplot( gsea_kegg,
  gene_set_id, 
  by = "all", 
  title = paste("GSEA KEGG result:", gene_set_id),
  color = "black", 
  color.line = "green", 
  color.vline = "#FA5860"
)

The term ‘hsa05163’ has a slightly negative NES and a borderline adjusted P value of 0.046.

gene_set_id <- "hsa05163"

enrichplot::gseaplot( gsea_kegg,
  gene_set_id, 
  by = "all", 
  title = paste("GSEA KEGG result:", gene_set_id),
  color = "black", 
  color.line = "green", 
  color.vline = "#FA5860"
)

volcano plot

The volplot function within enrichplot does not support GSEA result objects, but we can use ggplot2 for this.

ggplot2 is a highly flexible R package for visualisations, and unlike enrichplot is not constrained to a specific plot type.

The volcano plot has the advantage of showing whether the genes in the enriched terms were predominantly from the top (upregulated) or bottom (downregulated) end of the list.

ggplot2::ggplot(gsea_kegg@result, aes(x = enrichmentScore, y = -log10(p.adjust), color = p.adjust)) +
  geom_point(alpha = 0.7, size = 2) +  # Adjust point size
  scale_color_gradient(low = "blue", high = "red") +  # Color by p.adjust values
  theme_minimal() +
  labs(title = "GSEA KEGG",
       x = "Enrichment Score (ES)",
       y = "-log10(Adjusted P-value)",
       color = "Adjusted P-value") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +  # Rotate x-axis labels for readability
  geom_vline(xintercept = 0, linetype = "dashed", color = "black") +  # Add vertical line at x=0
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "black") +  # Add horizontal line at p=0.05 cutoff
  geom_text(aes(label = Description),
            hjust = 0.5,
            vjust = -0.5,  # Move labels higher off the points
            size = 3,
            check_overlap = TRUE,
            alpha = 0.7)  # Add labels for each pathway term

6. Save versions and session details

Database query dates

Unlike gprofiler, clusterProfiler does not have a function to list the version of the queried databases.

For this reason, we will save the analysis date to our rendered notebook, so the external database version could be back-calculated from the date if required:

cat("Date of analysis:\n")
## Date of analysis:
print(Sys.Date())
## [1] "2024-11-20"

R version and R package versions

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
##  [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
##  [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] ggplot2_3.5.1          ggupset_0.4.0          dplyr_1.1.4           
##  [4] org.Hs.eg.db_3.20.0    AnnotationDbi_1.68.0   IRanges_2.40.0        
##  [7] S4Vectors_0.44.0       Biobase_2.66.0         BiocGenerics_0.52.0   
## [10] biomaRt_2.62.0         readr_2.1.5            enrichplot_1.26.2     
## [13] clusterProfiler_4.14.3
## 
## loaded via a namespace (and not attached):
##   [1] DBI_1.2.2               gson_0.1.0              httr2_1.0.1            
##   [4] rlang_1.1.3             magrittr_2.0.3          DOSE_4.0.0             
##   [7] ggridges_0.5.6          compiler_4.4.2          RSQLite_2.3.7          
##  [10] png_0.1-8               vctrs_0.6.5             reshape2_1.4.4         
##  [13] stringr_1.5.1           pkgconfig_2.0.3         crayon_1.5.2           
##  [16] fastmap_1.1.1           dbplyr_2.5.0            XVector_0.46.0         
##  [19] labeling_0.4.3          utf8_1.2.4              rmarkdown_2.26         
##  [22] tzdb_0.4.0              UCSC.utils_1.2.0        purrr_1.0.2            
##  [25] bit_4.0.5               xfun_0.43               zlibbioc_1.52.0        
##  [28] cachem_1.0.8            aplot_0.2.3             GenomeInfoDb_1.42.0    
##  [31] jsonlite_1.8.8          progress_1.2.3          blob_1.2.4             
##  [34] highr_0.10              BiocParallel_1.40.0     prettyunits_1.2.0      
##  [37] parallel_4.4.2          R6_2.5.1                bslib_0.7.0            
##  [40] stringi_1.8.4           RColorBrewer_1.1-3      jquerylib_0.1.4        
##  [43] GOSemSim_2.32.0         Rcpp_1.0.12             knitr_1.46             
##  [46] ggtangle_0.0.4          R.utils_2.12.3          Matrix_1.7-1           
##  [49] splines_4.4.2           igraph_2.0.3            tidyselect_1.2.1       
##  [52] qvalue_2.38.0           rstudioapi_0.16.0       yaml_2.3.8             
##  [55] codetools_0.2-19        curl_5.2.1              lattice_0.22-5         
##  [58] tibble_3.2.1            plyr_1.8.9              withr_3.0.0            
##  [61] treeio_1.30.0           KEGGREST_1.46.0         evaluate_0.23          
##  [64] gridGraphics_0.5-1      BiocFileCache_2.14.0    xml2_1.3.6             
##  [67] Biostrings_2.74.0       filelock_1.0.3          pillar_1.9.0           
##  [70] ggtree_3.14.0           ggfun_0.1.7             generics_0.1.3         
##  [73] vroom_1.6.5             hms_1.1.3               munsell_0.5.1          
##  [76] scales_1.3.0            tidytree_0.4.6          glue_1.7.0             
##  [79] lazyeval_0.2.2          tools_4.4.2             ggnewscale_0.5.0       
##  [82] data.table_1.15.4       fgsea_1.32.0            fs_1.6.4               
##  [85] fastmatch_1.1-4         cowplot_1.1.3           grid_4.4.2             
##  [88] tidyr_1.3.1             ape_5.8                 colorspace_2.1-0       
##  [91] nlme_3.1-165            GenomeInfoDbData_1.2.13 patchwork_1.3.0        
##  [94] cli_3.6.2               rappdirs_0.3.3          fansi_1.0.6            
##  [97] gtable_0.3.5            R.methodsS3_1.8.2       yulab.utils_0.1.8      
## [100] sass_0.4.9              digest_0.6.35           ggrepel_0.9.6          
## [103] ggplotify_0.1.2         farver_2.1.2            memoise_2.0.1          
## [106] htmltools_0.5.8.1       R.oo_1.26.0             lifecycle_1.0.4        
## [109] httr_1.4.7              GO.db_3.20.0            bit64_4.0.5

RStudio version

Typically, we would simply run RStudio.Version() to print the version details. However, when we knit this document to HTML, the RStudio.Version() function is not available and will cause an error. So to make sure our version details are saved to our static record of the work, we will save to a file, then print the file contents back into the notebook.

# Get RStudio version information
rstudio_info <- RStudio.Version()

# Convert the version information to a string
rstudio_version_str <- paste(
  "RStudio Version Information:\n",
  "Version: ", rstudio_info$version, "\n",
  "Release Name: ", rstudio_info$release_name, "\n",
  "Long Version: ", rstudio_info$long_version, "\n",
  "Mode: ", rstudio_info$mode, "\n",
  "Citation: ", rstudio_info$citation,
  sep = ""
)

# Write the output to a text file
writeLines(rstudio_version_str, "rstudio_version.txt")
# Read the saved version information from the file
rstudio_version_text <- readLines("rstudio_version.txt")

# Print the version information to the document
rstudio_version_text
## [1] "RStudio Version Information:"                                                                                                                                                                                                                                                                           
## [2] "Version: 2023.6.1.524"                                                                                                                                                                                                                                                                                  
## [3] "Release Name: Mountain Hydrangea"                                                                                                                                                                                                                                                                       
## [4] "Long Version: 2023.06.1+524"                                                                                                                                                                                                                                                                            
## [5] "Mode: server"                                                                                                                                                                                                                                                                                           
## [6] "Citation: list(title = \"RStudio: Integrated Development Environment for R\", author = list(list(given = \"Posit team\", family = NULL, role = NULL, email = NULL, comment = NULL)), organization = \"Posit Software, PBC\", address = \"Boston, MA\", year = \"2023\", url = \"http://www.posit.co/\")"
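
An alternative is to guard the call so the chunk runs both interactively and when knitting. This is a sketch, assuming that RStudio.Version is visible on the search path inside an RStudio session and absent elsewhere; the helper name `get_rstudio_version` is hypothetical.

```r
# Return a one-line version string, or a fallback when not inside RStudio
get_rstudio_version <- function() {
  if (exists("RStudio.Version", mode = "function")) {
    info <- RStudio.Version()
    paste("RStudio version:", as.character(info$version))
  } else {
    "RStudio version: not available (not running inside RStudio)"
  }
}

rstudio_line <- get_rstudio_version()
print(rstudio_line)
```

Note that with this approach the knitted HTML would record only the fallback message, so the write-to-file method above remains the way to capture the real version in the static copy.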

7. Knit workbook to HTML

The last task is to knit the notebook. Our notebook is editable, and can be changed. Deleting code deletes the output, so we could lose valuable details. If we knit the notebook to HTML, we have a permanent static copy of the work.

On the editor pane toolbar, under Preview, select Knit to HTML.

If you have already run Preview, you will see Knit instead of Preview.

The HTML file will be saved in the same directory as the notebook, and with the same filename, but the .Rmd extension will be replaced by .html. The knit HTML will typically open automatically once complete. If you receive a popup blocker error, click cancel, and in the Files pane of RStudio, single click the notebook’s .html file and select View in Web Browser.

Note that the notebook will only successfully knit if there are no errors in the code. You can ‘preview’ HTML with code errors.