0. Working directory

Ensure the ‘workshop’ directory is your current working directory:

getwd()
## [1] "/home/user2/workshop"

1. Import emapper axolotl annotation file, GO ontology file and KEGG Pathways file

1.1 Public data sources used for this notebook

1.2 Data processing

The raw reads were processed to counts matrix and DE results broadly following https://github.com/Sydney-Informatics-Hub/RNASeq-DE using the reference genome from www.axolotl-omics.org. Differential gene expression analysis was performed in R with DESeq2 v 1.46.0, filtering for genes with at least a count of 10 in at least 2 samples. The data comprises 2 groups (proximal blastema and distal blastema) and 2 replicates per group.

A predicted proteome was created by extracting the predicted peptide sequences from the GTF file from www.axolotl-omics.org then filtering for longest isoform per gene with AGAT v 1.4.0. The predicted proteome was annotated against GO and KEGG with eggNOG emapper v 2.1.12.

1.3 Import annotation files

1.3.1 emapper proteome annotation

The emapper annotation output contains results against a number of databases including GO and KEGG, which we will focus on today.

eggnog_anno <- read_tsv("AmexG_v6.0-DD.emapper.annotations.txt", show_col_types = FALSE) 
head(eggnog_anno)

The raw annotation file provides us with ‘term ID to gene ID’ mappings for our species. We also need ‘term ID to term description’ mappings. These files are not organism-specific: we will extract only the terms that are found within our custom species annotation, to make our organism-specific version.

1.3.2 GO Core Ontology

For GO, we will use the GO ‘core’ ontology file, downloaded from https://purl.obolibrary.org/obo/go.obo and included in the data files you downloaded to your workshop directory earlier.

We will use the ontologyIndex package to retrieve ontology info and save to an object named ontology for later use creating the required custom database files for clusterProfiler and WebGestaltR.

ontology <- ontologyIndex::get_ontology(file = "go.obo",
  propagate_relationships = "is_a", #propagates relationships from parent terms to children
  extract_tags = "everything", # retrieve all available details for each term
  merge_equivalent_terms = TRUE) # avoid unnecessary redundancy

1.3.3 KEGG Pathways

For KEGG, we have both map and ko IDs in our emapper annotation.

ko terms (https://www.genome.jp/kegg/ko.html) represent orthologous groups of genes, which are assigned based on evolutionary relationships and functional similarity. They can therefore provide more precise functional categorisation, which is particularly useful when working with novel species that lack curated pathway information.

map terms (https://www.genome.jp/kegg/pathway.html) are manually drawn pathway maps representing KEGG database of molecular interaction, reaction and relation networks for: Metabolism, Genetic Information Processing, Environmental Information Processing, Cellular Processes, Organismal Systems, Human Diseases, and Drug Development.

Today we will be working with the map terms due to database download restrictions.

Free access to the KEGG FTP downloads requires an academic subscription, for which you must confirm that you are the “only user of the KEGG FTP Data”. The pathways list, by contrast, was freely available. As a single user, you can request academic access here https://www.pathway.jp/en/academic.html.

There is an alternate method for using ko IDs, which uses the KEGG ontology information available through the clusterProfiler functions enrichKEGG and gseKEGG. An example of the R code can be found here https://github.com/dadrasarmin/enrichment_analysis_for_non_model_organism. However, this poses a problem: as the novel species gene IDs are assigned to KEGG terms, gene:name duplicate records are identified, and duplicates must be removed in order to avoid errors running the enrichment. This loss of data will have a real impact on the results, with the importance of some terms being underestimated.

Given these considerations, we will proceed with map pathway terms :-)

The KEGG map pathway list was downloaded from https://rest.kegg.jp/list/pathway and saved to your workshop directory.

kegg_pathways <- read.table("kegg_pathways_2024-11-13.txt", header = FALSE, sep = "\t", col.names = c("term", "name"))
head(kegg_pathways)

2. Import axolotl DE results file and extract gene lists for ORA and GSEA

2.1 Import axolotl DE data

Load the DE results file for axolotl:

de_matrix <- read_tsv("axolotl_DE_results.txt", col_names = TRUE, show_col_types = FALSE)
head(de_matrix)

2.2 Create the ranked gene list for GSEA

Recall from the last 2 activities that clusterProfiler requires a vector object for GSEA, while WebGestaltR requires a 2-column dataframe. Since we intend to use both tools, let’s create both now:

# Create ranked vector for clusterProfiler GSEA
ranked_vector <- setNames(de_matrix$log2FoldChange, de_matrix$geneID) %>% sort(decreasing = TRUE)  # Named vector

# check
head(ranked_vector)
## AMEX60DD020778 AMEX60DD020772 AMEX60DD020773 AMEX60DD005693 AMEX60DD020780 
##      10.596655      10.583791      10.281349      10.067803       9.789798 
## AMEX60DD030633 
##       9.691689
tail(ranked_vector)
## AMEX60DD051201 AMEX60DD028124 AMEX60DD044496 AMEX60DD020182 AMEX60DD053589 
##      -7.886591      -7.939796      -8.070068      -8.255430      -9.226586 
## AMEX60DD007432 
##      -9.918767
# Create ranked dataframe for WebGestaltR GSEA
ranked_df <- de_matrix %>%
  arrange(desc(log2FoldChange)) %>%
  dplyr::select(geneID, log2FoldChange)

# check
head(ranked_df)
tail(ranked_df)

2.3 Create gene lists for ORA

For ORA, both tools require vector class gene lists. We will filter for an adjusted P value of 0.01 or less and an absolute log2 fold change of 1.5 or more.

Genes with very low counts have already been filtered out of the matrix, so we take all genes present as the background.

# Filter for DEGs and save gene IDs as vector  
degs <- de_matrix %>%
  filter(padj <= 0.01 & abs(log2FoldChange) >= 1.5) %>%
  pull(geneID)  # Extract the gene IDs as a character vector

# Extract the background gene list vector 
background <- de_matrix %>%
  pull(geneID)  

# Check number of genes: 
cat("Number of DEGs:", length(degs), "\n")         # Number of DEGs
## Number of DEGs: 247
cat("Number of background genes:", length(background), "\n")   # Number of background genes
## Number of background genes: 24419
# Check format: 
head(degs)
## [1] "AMEX60DD000080" "AMEX60DD000147" "AMEX60DD001144" "AMEX60DD001307"
## [5] "AMEX60DD001377" "AMEX60DD001828"
head(background)
## [1] "AMEX60DD000001" "AMEX60DD000002" "AMEX60DD000003" "AMEX60DD000004"
## [5] "AMEX60DD000005" "AMEX60DD000006"

Note the large drop in gene numbers: 100K in GTF, 48K in predicted proteome, 24K expressed in the blastema! By restricting the background to the genes that are expressed in the studied tissue, we avoid artificially inflated significance and so reduce false positives within our list of enriched terms.
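
To see why the background matters, here is a minimal sketch using base R's phyper on made-up numbers (every count below is hypothetical, not taken from the axolotl data). It runs the same one-sided hypergeometric test twice for a single term, changing only the background size and assuming, for simplicity, that the term size stays fixed.

```r
# Hypothetical counts for a single term (illustration only)
degs_in_term <- 7     # DEGs annotated to the term
term_size    <- 45    # background genes annotated to the term
n_degs       <- 145   # DEGs entering the test

# One-sided hypergeometric P value: P(X >= degs_in_term)
ora_p <- function(background_size) {
  phyper(degs_in_term - 1, term_size, background_size - term_size,
         n_degs, lower.tail = FALSE)
}

p_expressed <- ora_p(15000)  # background = genes expressed in the tissue
p_all_genes <- ora_p(48000)  # background = every predicted protein

# The larger, less relevant background makes the same overlap look more significant
p_all_genes < p_expressed
```

In real data the term size would also change with the background, so treat this as an intuition aid rather than a simulation of the workshop results.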

2.4 Save gene lists

Saving any outputs generated from R code is vital to reproducibility! You should include all analysed gene lists within the supplementary materials of your manuscript.

# Save DEGs
write.table(degs, file = "Axolotl_DEGs.txt", quote = FALSE, col.names = FALSE, row.names = FALSE, sep = "\t")
# Save background
write.table(background, file = "Axolotl_background.txt", quote = FALSE, col.names = FALSE, row.names = FALSE, sep = "\t")
# Save ranked
write.table(ranked_df, file = "Axolotl_rankedFC.txt", sep = "\t", quote = FALSE, row.names = FALSE, col.names = TRUE)

3. Reformat annotation files for clusterProfiler GO and KEGG analysis

Now we have annotation files and gene lists, we will bring those together to create the custom database files required for R FEA!

3.1 Create TERM2GENE files

These are 2 column text files with the term ID (one per line) alongside the ID of the gene that maps to the term. A gene can map to many terms and thus be present on multiple lines. A term can be mapped to more than one gene and thus be present on many lines.
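
As a rough illustration of that layout, the sketch below builds a tiny TERM2GENE table in base R from a toy emapper-style table. The gene IDs and the column name query are made up for this example; the real notebook code uses dplyr and tidyr on the `#query` and GOs columns instead.

```r
# Toy emapper-style table: one row per gene, comma-delimited GO IDs ("-" = none)
toy_anno <- data.frame(
  query = c("geneA", "geneB", "geneC"),
  GOs   = c("GO:0000001,GO:0000002", "-", "GO:0000002"),
  stringsAsFactors = FALSE
)

# Drop unannotated genes, then split each GO string into a vector of IDs
annotated <- toy_anno[toy_anno$GOs != "-", ]
go_split  <- strsplit(annotated$GOs, ",", fixed = TRUE)

# Repeat each gene ID once per term to build the long 2-column table
toy_term2gene <- unique(data.frame(
  term = unlist(go_split),
  gene = rep(annotated$query, lengths(go_split)),
  stringsAsFactors = FALSE
))
toy_term2gene
```

Here geneA appears on two lines (two terms), GO:0000002 appears on two lines (two genes), and the unannotated geneB is absent.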

Check the column names of the emapper annotation file so we know which are the GO and KEGG column names:

colnames(eggnog_anno)
##  [1] "#query"         "seed_ortholog"  "evalue"         "score"         
##  [5] "eggNOG_OGs"     "max_annot_lvl"  "COG_category"   "Description"   
##  [9] "Preferred_name" "GOs"            "EC"             "KEGG_ko"       
## [13] "KEGG_Pathway"   "KEGG_Module"    "KEGG_Reaction"  "KEGG_rclass"   
## [17] "BRITE"          "KEGG_TC"        "CAZy"           "BiGG_Reaction" 
## [21] "PFAMs"

We need GOs and KEGG_Pathway columns.

3.1.1 GO TERM2GENE

Next, we will extract the GO IDs from the emapper annotation file, and wrangle into the correct format for clusterProfiler TERM2GENE.

There are several steps to this - comments have been included to outline what each step is doing.

go_term2gene <- eggnog_anno %>%
    dplyr::select(GOs, `#query`) %>% # select the GO column and the query column (axolotl gene ID) 
    dplyr::filter(GOs != "-") %>% # filter out rows where the GO ID is "-" ie no GO annotation for this gene
    separate_rows(GOs, sep = ",") %>% # split comma-delimited list of many GO terms for a gene into separate rows
    dplyr::select(GOs, `#query`) %>% # keep the GO and query columns
    distinct() %>% # remove any duplicate rows 
    drop_na() # remove rows with missing values

# Rename columns to match desired output format
colnames(go_term2gene) <- c("term", "gene")

# Save to file
write.table(go_term2gene, file = "Axolotl_GO_term2gene.txt", sep = "\t", quote = FALSE, row.names = FALSE, col.names = TRUE)

# Number of rows: 
cat("Number of GO term2gene rows:", nrow(go_term2gene), "\n")
## Number of GO term2gene rows: 2839193
# Check first few rows
head(go_term2gene)

3.1.2 KEGG TERM2GENE

Here we use the same process as we did above for GO (column name GOs), selecting a different column name for KEGG (KEGG_Pathway).

kegg_term2gene <- eggnog_anno %>%
    dplyr::select(KEGG_Pathway, `#query`) %>%  # Select the relevant columns
    dplyr::filter(grepl("map", KEGG_Pathway)) %>%  # Keep only rows where KEGG_Pathway contains 'map'
    separate_rows(KEGG_Pathway, sep = ",") %>%  # Split multiple pathways into separate rows
    dplyr::mutate(term = gsub("map:", "", KEGG_Pathway)) %>%  # Remove the "map:" prefix
    dplyr::filter(grepl("^map", term)) %>%  # Filter again to make sure we only have map pathways (after removing "map:")
    dplyr::select(term, `#query`) %>%  # Select the pathway (term) and gene columns
    distinct() %>%  # Remove duplicate rows
    drop_na()  # Remove rows with missing values


# Rename columns to match desired output format
colnames(kegg_term2gene) <- c("term", "gene")

# Save to file
write.table(kegg_term2gene, file = "Axolotl_KEGG-Pathways_term2gene.txt", sep = "\t", quote = FALSE, row.names = FALSE, col.names = TRUE)

cat("Number of KEGG term2gene rows:", nrow(kegg_term2gene), "\n")
## Number of KEGG term2gene rows: 59305
# View result to check
head(kegg_term2gene)
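
A quick way to convince yourself that only map IDs survive the filtering is to test them against the expected pattern. This is a minimal base R sketch on toy strings, assuming the KEGG_Pathway column mixes ko and map IDs as described in 1.3.3; the strings themselves are invented for illustration.

```r
# Toy KEGG_Pathway strings (ko and map IDs mixed, "-" = no annotation)
toy_kegg <- c("ko00010,map00010", "-", "ko04530,map04530,ko05414,map05414")

# Split on commas and keep only IDs of the form 'map' followed by five digits
ids     <- unlist(strsplit(toy_kegg, ",", fixed = TRUE))
map_ids <- ids[grepl("^map[0-9]{5}$", ids)]
map_ids
```

The same grepl pattern could be applied to kegg_term2gene$term as a sanity check before running enrichment.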

3.2 TERM2NAME

3.2.1 GO

Now we will assign term descriptions to term IDs and create our TERM2NAME files.

This may take a few moments to run. It will use the ontology object we created earlier from the go.obo file.

# Create term to name table, removing duplicates, missing values and obsolete terms 
go_term2name <- go_term2gene %>% # only keep terms that are in our term2gene object (ie, mapped to axolotl)
    mutate(name = ontology$name[term]) %>% 
    dplyr::select(term, name) %>%
    distinct() %>%
    drop_na() %>%
    filter(!grepl("obsolete", name))

# Save to file
write.table(go_term2name, file = "Axolotl_GO_term2name.txt", sep = "\t", quote = FALSE, row.names = FALSE, col.names = TRUE)

# Show the first few lines
head(go_term2name)
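
The lookup ontology$name[term] relies on R's indexing by name: terms that are absent from the ontology come back as NA and are then removed by drop_na(). A minimal sketch with a toy named vector, assumed here to mimic the structure of ontology$name (the last term ID is deliberately fictitious):

```r
# Toy stand-in for ontology$name: a character vector named by GO ID
toy_names <- c("GO:0000001" = "mitochondrion inheritance",
               "GO:0000002" = "mitochondrial genome maintenance")

# Terms from a toy term2gene table; "GO:9999999" is hypothetical and has no entry
toy_terms <- c("GO:0000001", "GO:0000002", "GO:9999999")

# Indexing by name returns NA where a term has no entry
looked_up     <- unname(toy_names[toy_terms])
missing_terms <- toy_terms[is.na(looked_up)]
missing_terms
```

Counting such missing terms in the real data gives a feel for how many annotated IDs fall outside the current go.obo release.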

3.2.2 KEGG

The KEGG Pathways file was available for download in the correct format for TERM2NAME.

head(kegg_pathways)

Let’s restrict it to include the terms relevant to our analysis, and then print that to a file for reproducibility.

kegg_term2name <- kegg_pathways %>%
  dplyr::filter(term %in% kegg_term2gene$term) %>%  # Only keep terms that are in kegg_term2gene
  distinct() %>%  # Remove duplicate entries
  drop_na()  # Remove rows with missing values

# Save the result to a file
write.table(kegg_term2name, file = "Axolotl_KEGG_term2name.txt", sep = "\t", quote = FALSE, row.names = FALSE, col.names = TRUE)

# Check first few rows
head(kegg_term2name)

3.3 Count annotations

How much of our proteome was annotated? What about our DEGs and background?

Genes that do not have any annotation are excluded from enrichment analysis, so having an understanding of the extent of annotation for your novel species is very important when interpreting results!

Count the number of GO terms found within the genome, and the number of genes with GO annotations:

go_total_terms<-nrow(go_term2gene)
print(paste("Total annotations to GO:", go_total_terms))
## [1] "Total annotations to GO: 2839193"
go_unique_genes <- length(unique(go_term2gene$gene))
print(paste("Number of unique genes with 1 or more annotation terms:", go_unique_genes))
## [1] "Number of unique genes with 1 or more annotation terms: 21373"

And for KEGG:

kegg_total_terms<-nrow(kegg_term2gene)
print(paste("Total annotations to KEGG Pathways:", kegg_total_terms))
## [1] "Total annotations to KEGG Pathways: 59305"
kegg_unique_genes <- length(unique(kegg_term2gene$gene))
print(paste("Number of unique genes with 1 or more annotation terms:", kegg_unique_genes))
## [1] "Number of unique genes with 1 or more annotation terms: 12226"

47,196 putative axolotl proteins were annotated. That’s around 1/4 of our predicted proteins mapped to KEGG Pathways, and less than half of our genes mapped to GO! Ouch. As much as we expect this with uncurated novel species genomes, it’s still unpleasant to face :-)

What of the genes in our gene list specifically? We have an uncurated proteome, yet the genes in our input matrix were expressed at a meaningful level within axolotl, so these may actually have a higher annotation percentage than all genes in the proteome.

# Filter the term2gene table to only include genes in the background gene list
go_filtered_term2gene <- go_term2gene %>% filter(gene %in% background)

# Count the number of unique background genes with at least one GO term
unique_genes_with_go <- go_filtered_term2gene %>% distinct(gene) %>% nrow()

# Calculate the percentage of background genes that have GO annotations
percent_go_unique <- (unique_genes_with_go / length(background)) * 100

# Print results
cat("Number of input genes with GO annotations:", unique_genes_with_go, "(",percent_go_unique,"%)\n")
## Number of input genes with GO annotations: 15072 ( 61.72243 %)
# Filter the term2gene table to only include genes in the background gene list
kegg_filtered_term2gene <- kegg_term2gene %>% filter(gene %in% background)

# Count the number of unique background genes with at least one KEGG Pathways term
unique_genes_with_kegg <- kegg_filtered_term2gene %>% distinct(gene) %>% nrow()

# Calculate the percentage of background genes that have KEGG Pathways annotations
percent_kegg_unique <- (unique_genes_with_kegg / length(background)) * 100

# Print results
cat("Number of input genes with KEGG Pathways annotations:", unique_genes_with_kegg, "(",percent_kegg_unique,"%)\n")
## Number of input genes with KEGG Pathways annotations: 8000 ( 32.76137 %)

As expected, the annotation % is higher for expressed genes than all predicted genes, and very much higher than the GTF of 99,088 predicted gene models (!!!) with an annotation rate of 21.6%.

This highlights a major caveat when performing FEA on non-model species: the results are only as good as the annotations behind them. Therefore, all results must be interpreted with caution. For many novel (and under-funded) species, there are few opportunities (at present) to improve the annotation. Some in-silico predicted genes appear to be highly expressed and significantly regulated yet have no significant similarity to anything in the non-redundant nucleotide or protein databases. When working with datasets like this, it is critical to explore those individual genes through other methods, in addition to trying to garner some higher level overview such as we aim to obtain from FEA. Hopefully, recent advances in AI protein modelling can help provide insights into the functions of these novel genes.

For the axolotl with only 22% of predicted genes annotated, it’s clear that the in-silico gene predictions within the GTF file require much curation!
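
The counting steps in this section could be wrapped into one small function for re-use with the DEGs, the background or any other gene list. The helper name annotation_rate and the toy data below are hypothetical; this is a sketch in base R rather than part of the workshop code.

```r
# Hypothetical helper: percentage of 'genes' with at least one row in a term2gene table
annotation_rate <- function(term2gene, genes) {
  genes <- unique(genes)
  100 * sum(genes %in% term2gene$gene) / length(genes)
}

# Toy data: 4 background genes, 2 of which carry annotations
toy_t2g <- data.frame(term = c("T1", "T2", "T1"),
                      gene = c("g1", "g1", "g3"),
                      stringsAsFactors = FALSE)
toy_background <- c("g1", "g2", "g3", "g4")

annotation_rate(toy_t2g, toy_background)  # 50
```

With the real objects, the calls would look like annotation_rate(go_term2gene, background) or annotation_rate(kegg_term2gene, degs).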

4. Run clusterProfiler universal FEA functions enricher and GSEA

In the interest of time, and to try and cover as many options as possible, let’s do ORA with GO and GSEA with KEGG for both tools.

4.1 clusterProfiler ORA of GO terms

The enricher function is the ‘universal’ ORA option that accepts the TERM2GENE and TERM2NAME files we have just created.

Let’s review the help page:

?clusterProfiler::enricher

There are parameters for both adjusted P value and q value. Terms must pass all thresholds (unadjusted P, adjusted P, and q value) so the important filter will be the most stringent test applied. Let’s go with BH and 0.05 which we have used regularly within this workshop and are fairly common choices in the field.
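
For a feel of what the BH adjustment does to a set of P values, here is a small sketch with base R's p.adjust. The raw P values are invented for illustration and have nothing to do with the axolotl results.

```r
# Hypothetical raw P values for five terms
raw_p <- c(0.001, 0.01, 0.02, 0.06, 0.2)

# Benjamini-Hochberg adjustment, the method requested via pAdjustMethod = "BH"
adj_p <- p.adjust(raw_p, method = "BH")
round(adj_p, 4)

# Number of terms that would pass a 0.05 cutoff on the adjusted P value
sum(adj_p < 0.05)
```

Adjusted values are never smaller than the raw ones, so a term that fails the raw cutoff cannot be rescued by the adjustment.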

We need to provide term2gene and term2name, and don’t specify an organism.

cp_go_ora <- enricher(
  gene = degs,
  pvalueCutoff = 0.05,
  pAdjustMethod = "BH",
  universe = background,
  minGSSize = 10,
  maxGSSize = 500,
  TERM2GENE = go_term2gene,
  TERM2NAME = go_term2name
)
cp_go_ora
## #
## # over-representation test
## #
## #...@organism     UNKNOWN 
## #...@ontology     UNKNOWN 
## #...@gene     chr [1:247] "AMEX60DD000080" "AMEX60DD000147" "AMEX60DD001144" ...
## #...pvalues adjusted by 'BH' with cutoff <0.05 
## #...91 enriched terms found
## 'data.frame':    91 obs. of  12 variables:
##  $ ID            : chr  "GO:0048821" "GO:0070268" "GO:0031424" "GO:0019317" ...
##  $ Description   : chr  "erythrocyte development" "cornification" "keratinization" "fucose catabolic process" ...
##  $ GeneRatio     : chr  "7/145" "7/145" "7/145" "5/145" ...
##  $ BgRatio       : chr  "45/15072" "46/15072" "48/15072" "18/15072" ...
##  $ RichFactor    : num  0.156 0.152 0.146 0.278 0.278 ...
##  $ FoldEnrichment: num  16.2 15.8 15.2 28.9 28.9 ...
##  $ zScore        : num  10.04 9.92 9.68 11.66 11.66 ...
##  $ pvalue        : num  2.20e-07 2.58e-07 3.49e-07 5.96e-07 5.96e-07 ...
##  $ p.adjust      : num  0.000266 0.000266 0.000266 0.000266 0.000266 ...
##  $ qvalue        : num  0.000242 0.000242 0.000242 0.000242 0.000242 ...
##  $ geneID        : chr  "AMEX60DD002868/AMEX60DD025537/AMEX60DD026264/AMEX60DD026267/AMEX60DD032898/AMEX60DD032958/AMEX60DD032960" "AMEX60DD004554/AMEX60DD010118/AMEX60DD016775/AMEX60DD038318/AMEX60DD039599/AMEX60DD039603/AMEX60DD039606" "AMEX60DD004554/AMEX60DD010118/AMEX60DD016775/AMEX60DD038318/AMEX60DD039599/AMEX60DD039603/AMEX60DD039606" "AMEX60DD004640/AMEX60DD004644/AMEX60DD004647/AMEX60DD004651/AMEX60DD017241" ...
##  $ Count         : int  7 7 7 5 5 5 16 5 5 7 ...
## #...Citation
## S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang, W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize multiomics data. Nature Protocols. 2024, doi:10.1038/s41596-024-01020-z

91 significantly enriched terms at P.adj < 0.05.

Look at the GeneRatio column: our gene list object degs has 247 genes, but the tool has applied the input size as 145 - this is because it is automatically discarding any that do not have annotations.

Results would be the same if we instead supplied only the annotated DEGs (for example, degs[degs %in% go_term2gene$gene]).

Likewise, the background size is being reported as 15072 (the number annotated) not 24,419 (the total in background list).
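
These ratios feed a one-sided hypergeometric test. As a rough check (a sketch using base R's phyper, not a call into clusterProfiler itself), the counts reported for the first term in the output above can be plugged in directly. The result should be of the same order as the pvalue column, and the fold enrichment should match the FoldEnrichment column after rounding.

```r
# Counts read from the first row of the ORA output above
k <- 7       # DEGs annotated to the term (GeneRatio numerator)
n <- 145     # annotated DEGs (GeneRatio denominator)
M <- 45      # annotated background genes in the term (BgRatio numerator)
N <- 15072   # annotated background genes (BgRatio denominator)

# P(X >= k) when drawing n genes from N, of which M belong to the term
p_val <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)
p_val

# Fold enrichment = (k / n) / (M / N)
fold_enrichment <- (k / n) / (M / N)
fold_enrichment
```

Swapping N for 24419 shows how much the P value would shift if unannotated background genes were counted.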

Save the results to a text file:

file <- "Axolotl_clusterProfiler_GO_ORA_results.tsv"
write.table(cp_go_ora, file, sep = "\t", quote = FALSE, row.names = FALSE)  

Let’s visualise with one of my favourite enrichplot plots, the treeplot! Another advantage of this plot is that it can be used for both ORA and GSEA results, so we can compare more easily. We will add a custom subtitle that reports the number of DEGs that were actually annotated and included in the FEA, so anyone reviewing the plot will understand that caution must be exercised when interpreting the results.

# calculate pairwise similarities between the enriched terms
cp_go_ora <- enrichplot::pairwise_termsim(cp_go_ora)
p<- enrichplot::treeplot(cp_go_ora, 
  showCategory = 15, 
  color = "p.adjust", 
  cluster.params = list(label_words_n = 5)
)

# Add annotations (number of input genes and number of input genes with GO terms)
num_genes <- length(degs)
genes_with_GO_terms <- sum(degs %in% go_term2gene$gene)

# Print the plot with custom sub-title
p <- p + ggtitle("clusterProfiler ORA of GO terms") + labs(subtitle = paste("Input genes:", num_genes, "| Input genes with GO terms:", genes_with_GO_terms))
print(p)

There’s a lot of skin and muscle stuff, which we expect to be expressed in the blastema. As for why they are dysregulated? This is a dummy experiment from public RNAseq, with poor replication, and may not even be the right experiment type for this question, so let’s not hope for too many clear answers :-)

4.2 clusterProfiler GSEA of KEGG terms

The GSEA function is the ‘universal’ GSEA option that accepts the TERM2GENE and TERM2NAME files we have just created.

Let’s review the help page:

?clusterProfiler::GSEA

Recall from our clusterProfiler session with human data that we needed to add nPermSimple = 10000 to avoid a warning about “unbalanced (positive and negative) gene-level statistic value” and reduce eps to zero to avoid a warning about obtaining better P value estimates. Let’s do this from the start.

cp_kegg_gsea <- GSEA(
  geneList = ranked_vector, 
  exponent = 1, 
  minGSSize = 10, 
  maxGSSize = 500, 
  eps = 0,
  pvalueCutoff = 0.05,
  pAdjustMethod = "BH",
  TERM2GENE = kegg_term2gene, 
  TERM2NAME = kegg_term2name, 
  seed = 123, 
  by = "fgsea",
  nPermSimple = 10000
)
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.16% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## leading edge analysis...
## done...
cp_kegg_gsea
## #
## # Gene Set Enrichment Analysis
## #
## #...@organism     UNKNOWN 
## #...@setType      UNKNOWN 
## #...@geneList     Named num [1:24419] 10.6 10.58 10.28 10.07 9.79 ...
##  - attr(*, "names")= chr [1:24419] "AMEX60DD020778" "AMEX60DD020772" "AMEX60DD020773" "AMEX60DD005693" ...
## #...nPerm     
## #...pvalues adjusted by 'BH' with cutoff <0.05 
## #...29 enriched terms found
## 'data.frame':    29 obs. of  11 variables:
##  $ ID             : chr  "map04260" "map04530" "map00601" "map05414" ...
##  $ Description    : chr  "Cardiac muscle contraction" "Tight junction" "Glycosphingolipid biosynthesis - lacto and neolacto series" "Dilated cardiomyopathy" ...
##  $ setSize        : int  94 262 52 123 37 32 91 180 163 112 ...
##  $ enrichmentScore: num  0.752 0.639 0.807 0.685 0.826 ...
##  $ NES            : num  1.82 1.6 1.86 1.68 1.84 ...
##  $ pvalue         : num  1.50e-08 1.27e-08 2.26e-07 1.62e-06 4.94e-06 ...
##  $ p.adjust       : num  2.60e-06 2.60e-06 2.60e-05 1.40e-04 3.41e-04 ...
##  $ qvalue         : num  2.08e-06 2.08e-06 2.09e-05 1.12e-04 2.74e-04 ...
##  $ rank           : num  2313 3341 2502 3172 727 ...
##  $ leading_edge   : chr  "tags=29%, list=9%, signal=26%" "tags=23%, list=14%, signal=20%" "tags=44%, list=10%, signal=40%" "tags=35%, list=13%, signal=31%" ...
##  $ core_enrichment: chr  "AMEX60DD021112/AMEX60DD012164/AMEX60DD008613/AMEX60DD021111/AMEX60DD025304/AMEX60DD054502/AMEX60DD004521/AMEX60"| __truncated__ "AMEX60DD021112/AMEX60DD055382/AMEX60DD012164/AMEX60DD021111/AMEX60DD025304/AMEX60DD048585/AMEX60DD054502/AMEX60"| __truncated__ "AMEX60DD004640/AMEX60DD013837/AMEX60DD004647/AMEX60DD004644/AMEX60DD041117/AMEX60DD004651/AMEX60DD012715/AMEX60"| __truncated__ "AMEX60DD012164/AMEX60DD009362/AMEX60DD008613/AMEX60DD009360/AMEX60DD025304/AMEX60DD009431/AMEX60DD054502/AMEX60"| __truncated__ ...
## #...Citation
## S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang, W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize multiomics data. Nature Protocols. 2024, doi:10.1038/s41596-024-01020-z

29 enriched terms.
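
The fgsea warning in the run log mentions ties in the preranked statistics. If you want to see which genes are involved, a check along these lines could be run on the ranked vector. This is a base R sketch on a toy vector with hypothetical values, and fgsea may count ties slightly differently.

```r
# Toy ranked vector with one tied pair (values are hypothetical)
toy_ranked <- c(g1 = 2.5, g2 = 1.2, g3 = 1.2, g4 = -0.4, g5 = -3.1)

# Flag every value that occurs more than once
is_tied <- duplicated(toy_ranked) | duplicated(toy_ranked, fromLast = TRUE)

names(toy_ranked)[is_tied]      # genes sharing a statistic
round(100 * mean(is_tied), 1)   # percentage of the list involved in ties
```

Applied to ranked_vector, the same two lines would list the tied axolotl genes so you can judge whether their arbitrary order is likely to matter.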

Let’s treeplot!

# calculate pairwise similarities between the enriched terms
cp_kegg_gsea <- enrichplot::pairwise_termsim(cp_kegg_gsea)
p<- enrichplot::treeplot(cp_kegg_gsea, 
  showCategory = 15, 
  color = "p.adjust", 
  cluster.params = list(label_words_n = 5)
)

# Add annotations (number of input genes and number of input genes with KEGG pathway terms)
# Use background since all genes for ranked are in background
num_genes <- length(background)
genes_with_kegg_terms <- sum(background %in% kegg_term2gene$gene)

# Print the plot with custom sub-title
p <- p + ggtitle("clusterProfiler GSEA of KEGG Pathways") + labs(subtitle = paste("Input genes:", num_genes, "| Input genes with KEGG pathway terms:", genes_with_kegg_terms))
print(p)

Some muscle stuff, some cell junction stuff, and some infection-related terms. This can be common in FEA: many genes involved in infection responses are also part of broader stress response pathways. These genes may be activated under different conditions, such as environmental stress, tissue injury, or other disruptions to homeostasis, which are common in various types of experiments. Pathways related to immune responses can also be interconnected with pathways controlling inflammation, wound healing, and metabolic processes. As a result, infection-related pathways can appear in enrichment analysis even when the experimental conditions don’t directly involve infection. This does not mean the result is spurious - it just requires that you exercise pragmatism, employ a basic understanding of the statistical approach, and commit to interpreting the results in the context of your experiment. Remember that the FEA results are there to bring a large list of genes down to a high level overview that helps guide further investigation, rather than to give a clear answer to your experiment.

I favour a volcano plot for GSEA, so we can see positive vs negative NES. This is built with ggplot2 rather than enrichplot, whose volplot is only for ORA.

p<- ggplot(cp_kegg_gsea@result, aes(x = NES, y = -log10(p.adjust), color = p.adjust)) + 
  geom_point(alpha = 0.7, size = 2) +  # Adjust point size
  scale_color_gradient(low = "blue", high = "red") +  # Color by p.adjust values
  theme_minimal() + 
  labs(title = "clusterProfiler GSEA of KEGG Pathways", 
       x = "Enrichment Score (NES)", 
       y = "-log10(Adjusted P-value)",
       color = "Adjusted P-value") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +  # Rotate x-axis labels for readability
  geom_vline(xintercept = 0, linetype = "dashed", color = "black") +  # Add vertical line at x=0
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "black") +  # Add horizontal line at p=0.05 cutoff
  geom_text(aes(label = Description), 
            hjust = 0.5, 
            vjust = -0.5,  # Move labels higher off the points
            size = 3, 
            check_overlap = TRUE, 
            alpha = 0.7)  # Add labels for each pathway term

print(p)

Interesting that all terms except one have leading edge genes that are upregulated in distal compared to proximal (the reference level)!
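
Rather than reading this off the plot, the direction of each term can be tallied from the NES column. The sketch below uses a toy data frame as a stand-in for cp_kegg_gsea@result; the pathway IDs and NES values in it are hypothetical.

```r
# Toy stand-in for cp_kegg_gsea@result (IDs and values are hypothetical)
toy_gsea <- data.frame(
  ID  = c("map00001", "map00002", "map00003", "map00004"),
  NES = c(1.82, 1.60, -1.75, 1.68),
  stringsAsFactors = FALSE
)

# Label each term by the sign of its NES and count the two groups
direction <- ifelse(toy_gsea$NES > 0, "positive NES", "negative NES")
table(direction)
```

On the real result object, replacing toy_gsea with cp_kegg_gsea@result gives the split between terms enriched at the top and at the bottom of the ranked list.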

5. Reformat annotation files for WebGestaltR GO and KEGG analysis

GMT files must have .gmt suffix and description files must have .des suffix.

5.1 Create GMT objects

The GMT files need links for all of the terms, so that we can have that handy link-out to enriched terms from the HTML report we experienced in the last activity. This is actually pretty simple to do thanks to consistent URLs.

For GO, we just need to paste the term ID to the end of this link https://www.ebi.ac.uk/QuickGO/term/

And for KEGG, we need to paste the map ID to the end of this link: https://www.genome.jp/dbget-bin/www_bget?
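
To make the target layout concrete, here is a minimal base R sketch that turns a toy term2gene table into GMT lines (term ID, link, then genes, all tab-separated). The gene IDs are hypothetical, and the notebook code that follows uses dplyr for the same job.

```r
# Toy term2gene table (gene IDs are hypothetical)
toy_t2g <- data.frame(term = c("GO:0000001", "GO:0000001", "GO:0000002"),
                      gene = c("geneA", "geneB", "geneC"),
                      stringsAsFactors = FALSE)

# Collapse the genes of each term into one tab-separated string
genes_by_term <- tapply(toy_t2g$gene, toy_t2g$term, paste, collapse = "\t")

# GMT line = term ID, external link, then the genes, all tab-separated
gmt_lines <- paste(names(genes_by_term),
                   paste0("https://www.ebi.ac.uk/QuickGO/term/", names(genes_by_term)),
                   genes_by_term,
                   sep = "\t")
cat(gmt_lines, sep = "\n")
```

Each term ends up on exactly one line, however many genes it has, which is why real GMT lines can get very long.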

5.1.1 GO GMT

Note that in the code below, the first command is identical to the one that created the go_term2gene object earlier in the notebook. We could just use the go_term2gene object and skip step 1, using go_term2gene as input for step 2 rather than go_data. This code duplication is intentional, so that this code chunk is standalone for re-use and re-purposing.

# Step 1: Extract relevant columns (GO terms and gene IDs) from eggnog_anno
go_data <- eggnog_anno %>% # use the emapper annotations for axolotl 
  dplyr::select(GOs, `#query`) %>%  # Select the GO terms and the gene IDs
  dplyr::filter(GOs != "-") %>%  # Filter out rows where the GO ID is missing ("-")
  separate_rows(GOs, sep = ",") %>%  # Split comma-delimited list of GO terms into separate rows
  dplyr::select(GOs, `#query`) %>%  # Keep GO terms and gene IDs columns
  distinct() %>%  # Remove duplicates
  drop_na()  # Drop any rows with missing values

# Rename columns to match the format (term, gene)
colnames(go_data) <- c("term", "gene")

# Step 2: Create external links for each GO term (link to QuickGO)
go_data <- go_data %>%
  dplyr::mutate(external_link = paste0("https://www.ebi.ac.uk/QuickGO/term/", term))

# Step 3: Group genes by GO term and concatenate gene list by tab so all genes per term are on the same row
go_term_grouped <- go_data %>%
  dplyr::group_by(term) %>%
  dplyr::summarize(genes = paste(gene, collapse = "\t"), .groups = "drop")

# Step 4: Add the external link for each GO term
go_term_grouped <- go_term_grouped %>%
  dplyr::left_join(go_data %>% dplyr::select(term, external_link) %>% distinct(), by = "term")

# Step 5: Create the final GMT format entry (term ID, external link, and gene list)
go_gmt <- go_term_grouped %>%
  dplyr::mutate(gmt_entry = paste(term, external_link, genes, sep = "\t")) %>%
  dplyr::select(gmt_entry)

# Save to file
write.table(go_gmt, file = "Axolotl_GO.gmt", sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)

# Check only the first line (lines can be long, as they hold all genes for a term!)
cat(go_gmt$gmt_entry[1], sep = "\n")
## GO:0000001   https://www.ebi.ac.uk/QuickGO/term/GO:0000001   AMEX60DD030064  AMEX60DD044815  AMEX60DD044816  AMEX60DD047254  AMEX60DDU001033143

5.1.2 KEGG GMT

As above, for clarity we have avoided using the TERM2GENE object to ensure this code chunk can be standalone.

# Step 1: Extract relevant columns (KEGG Pathway and gene IDs) from eggnog_anno
kegg_data <- eggnog_anno %>%
  dplyr::select(KEGG_Pathway, `#query`) %>%  # Select the KEGG Pathway and gene ID columns
  dplyr::filter(grepl("map", KEGG_Pathway)) %>%  # Keep only rows where KEGG_Pathway contains 'map'
  separate_rows(KEGG_Pathway, sep = ",") %>%  # Split multiple pathways into separate rows
  dplyr::mutate(term = gsub("map:", "", KEGG_Pathway)) %>%  # Remove the "map:" prefix
  dplyr::filter(grepl("^map", term)) %>%  # Filter again to keep only 'map' pathways (after removing "map:")
  dplyr::select(term, `#query`) %>%  # Select the KEGG Pathway and gene ID columns
  distinct() %>%  # Remove duplicate rows
  drop_na()  # Remove rows with missing values

# Ensure the column is properly named
colnames(kegg_data)[colnames(kegg_data) == "#query"] <- "gene"

# Step 2: Create external links for each KEGG pathway
kegg_data <- kegg_data %>%
  dplyr::mutate(external_link = paste0("https://www.genome.jp/dbget-bin/www_bget?", term))

# Step 3: Group by KEGG pathway term and concatenate the gene list
kegg_term_grouped <- kegg_data %>%
  dplyr::group_by(term) %>%
  dplyr::summarize(genes = paste(gene, collapse = "\t"), .groups = "drop")

# Step 4: Add the external link for each KEGG pathway
kegg_term_grouped <- kegg_term_grouped %>%
  dplyr::left_join(kegg_data %>% dplyr::select(term, external_link) %>% distinct(), by = "term")

# Step 5: Create the final GMT format entry (Pathway, External Link, Genes)
kegg_gmt <- kegg_term_grouped %>%
  dplyr::mutate(gmt_entry = paste(term, external_link, genes, sep = "\t")) %>%
  dplyr::select(gmt_entry)

# Save to file
write.table(kegg_gmt, file = "Axolotl_KEGG-pathways.gmt", sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)

# Check only the first line (lines can be long, as they list all genes per term):
cat(kegg_gmt$gmt_entry[1], sep = "\n")
## map00010 https://www.genome.jp/dbget-bin/www_bget?map00010   AMEX60DD012631  AMEX60DD014039  AMEX60DD014838  AMEX60DD016386  AMEX60DD016751  AMEX60DD017614  AMEX60DD018164  AMEX60DD019376  AMEX60DD019952  AMEX60DD021378  AMEX60DD021709  AMEX60DD023861  AMEX60DD025157  AMEX60DD026361  AMEX60DD026817  AMEX60DD026823  AMEX60DD026859  AMEX60DD027319  AMEX60DD027320  AMEX60DD027323  AMEX60DD027845  AMEX60DD027849  AMEX60DD028369  AMEX60DD028426  AMEX60DD029681  AMEX60DD029699  AMEX60DD030349  AMEX60DD034077  AMEX60DD034482  AMEX60DD034483  AMEX60DD034486  AMEX60DD034489  AMEX60DD034490  AMEX60DD035055  AMEX60DD035238  AMEX60DD037093  AMEX60DD040517  AMEX60DD041522  AMEX60DD042903  AMEX60DD043063  AMEX60DD043064  AMEX60DD043133  AMEX60DD043180  AMEX60DD043840  AMEX60DD043995  AMEX60DD043997  AMEX60DD043999  AMEX60DD044033  AMEX60DD044195  AMEX60DD044196  AMEX60DD044197  AMEX60DD044200  AMEX60DD044203  AMEX60DD044204  AMEX60DD044205  AMEX60DD044207  AMEX60DD044208  AMEX60DD044211  AMEX60DD044356  AMEX60DD044531  AMEX60DD045685  AMEX60DD047033  AMEX60DD048142  AMEX60DD048145  AMEX60DD049834  AMEX60DD051188  AMEX60DD051785  AMEX60DD051787  AMEX60DD052311  AMEX60DD052532  AMEX60DD052851  AMEX60DD052935  AMEX60DD052945  AMEX60DD053248  AMEX60DD053741  AMEX60DD054634  AMEX60DD054731  AMEX60DD054868  AMEX60DD055296  AMEX60DD000689  AMEX60DD000771  AMEX60DD000779  AMEX60DD001324  AMEX60DD001326  AMEX60DD001357  AMEX60DD001360  AMEX60DD002915  AMEX60DD003526  AMEX60DD003880  AMEX60DD004101  AMEX60DD004681  AMEX60DD005105  AMEX60DD006557  AMEX60DD006636  AMEX60DD007076  AMEX60DD007285  AMEX60DD007334  AMEX60DD007477  AMEX60DD008417  AMEX60DD009295  AMEX60DD009574  AMEX60DD009798  AMEX60DD009801  AMEX60DD009804  AMEX60DD009919  AMEX60DD011472  AMEX60DDU001000655  AMEX60DDU001003245  AMEX60DDU001003877  AMEX60DDU001005940  AMEX60DDU001006311  AMEX60DDU001008806  AMEX60DDU001009862  AMEX60DDU001010046  AMEX60DDU001010870  AMEX60DDU001012183  AMEX60DDU001014216  AMEX60DDU001014514  
AMEX60DDU001017139  AMEX60DDU001018885  AMEX60DDU001019261  AMEX60DDU001019409  AMEX60DDU001020872  AMEX60DDU001021039  AMEX60DDU001021439  AMEX60DDU001024501  AMEX60DDU001024997  AMEX60DDU001025643  AMEX60DDU001026913  AMEX60DDU001027220  AMEX60DDU001028117  AMEX60DDU001031417  AMEX60DDU001032002  AMEX60DDU001034358  AMEX60DDU001035591  AMEX60DDU001036459  AMEX60DDU001037000  AMEX60DDU001039985  AMEX60DDU001040851  AMEX60DDU001041028
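
The GO and KEGG chunks above follow the same steps: deduplicate, group genes by term, add a link, and paste the fields with tabs. As a sketch only, those steps could be wrapped in one function. Here make_gmt_lines is a hypothetical helper written in base R so that it runs without other packages, and the toy table uses made-up gene IDs.

```r
# Hypothetical helper: build GMT lines from a two-column term/gene table
make_gmt_lines <- function(term2gene, link_prefix) {
  # Remove duplicate term/gene pairs
  term2gene <- unique(term2gene[, c("term", "gene")])
  # Collect the genes belonging to each term
  genes_by_term <- split(term2gene$gene, term2gene$term)
  # One line per term: term ID, external link, then the genes
  vapply(names(genes_by_term), function(term) {
    paste(c(term, paste0(link_prefix, term), genes_by_term[[term]]),
          collapse = "\t")
  }, character(1), USE.NAMES = FALSE)
}

# Toy input with made-up gene IDs; the last row duplicates the first
toy <- data.frame(
  term = c("map00010", "map00010", "map00020", "map00010"),
  gene = c("geneA", "geneB", "geneA", "geneA"),
  stringsAsFactors = FALSE
)

gmt_lines <- make_gmt_lines(toy, "https://www.genome.jp/dbget-bin/www_bget?")
cat(gmt_lines, sep = "\n")
```

The resulting character vector can be saved with writeLines(gmt_lines, "my_file.gmt"), where the file name is a placeholder.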

5.2 Create description objects

5.2.1 GO description

The description file is identical to the clusterProfiler TERM2NAME object. Again, we avoid reusing any code from the creation of the clusterProfiler objects, so that this part can be used alone.

# Step 1: Get terms from ontology object created from go.obo with ontologyIndex function: 
ontology_term_names <- ontology$name 

# Step 2: Filter and separate GO terms from the annotation file
# We filter out rows where no GO terms are assigned and separate comma-delimited GO terms
go_terms <- eggnog_anno %>%
  dplyr::select(GOs, `#query`) %>%
  dplyr::filter(GOs != "-") %>%  # Keep only rows with GO terms
  separate_rows(GOs, sep = ",") %>%
  dplyr::mutate(term = GOs) %>%
  dplyr::select(term, `#query`) %>%
  distinct() %>%
  drop_na()  # Drop rows with missing values

# Step 3: Create the description by matching GO terms to their names in the ontology
go_des <- go_terms %>%
  dplyr::mutate(name = ontology_term_names[term]) %>%  # Map term to its name from the ontology
  dplyr::select(term, name) %>%  # Keep only the term and name
  distinct() %>%  # Remove duplicates
  drop_na() %>%  # Remove rows with missing values
  dplyr::filter(!grepl("obsolete", name))  # Remove obsolete terms if present

# Save to file
write.table(go_des, file = "Axolotl_GO.des", sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)

# Check
head(go_des)
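
Step 3 relies on indexing a named character vector by term ID: IDs that are absent from the vector return NA, which drop_na() then removes. The following toy sketch shows this behaviour in base R. The IDs and names are made up, and toy_names simply stands in for ontology$name.

```r
# Toy stand-in for ontology$name: a character vector named by term ID
toy_names <- c(
  "GO:9000001" = "toy process one",
  "GO:9000002" = "toy process two",
  "GO:9000003" = "obsolete toy process three"
)

# Terms found in a (made-up) annotation; the last ID is not in the ontology
annotated_terms <- c("GO:9000001", "GO:9000003", "GO:9999999")

# Indexing by name returns NA for IDs that are not present
looked_up <- toy_names[annotated_terms]

toy_des <- data.frame(
  term = annotated_terms,
  name = unname(looked_up),
  stringsAsFactors = FALSE
)

# Drop unmatched terms and terms flagged as obsolete
toy_des <- toy_des[!is.na(toy_des$name) & !grepl("obsolete", toy_des$name), ]
print(toy_des)
```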

5.2.2 KEGG description file

More extra work for the sake of portability :-)

# Get columns from the KEGG GMT object (only the term column is needed below)
kegg_gmt_columns <- kegg_gmt %>%
  separate(gmt_entry, into = c("term", "external_link", "genes"), sep = "\t",
           extra = "drop")  # discard the additional gene fields without a warning

# Create the kegg_des table by joining the pathways file with the species-specific terms from kegg_gmt
kegg_des <- kegg_pathways %>%
  dplyr::filter(term %in% kegg_gmt_columns$term) %>%
  dplyr::select(term, name) %>%
  distinct() %>%
  drop_na()

# Save to file
write.table(kegg_des, file = "Axolotl_KEGG-pathways.des", sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)

# Check the first few rows
head(kegg_des)
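
Since the GMT and description files are built separately, a quick cross-check can show which terms in the GMT would have no description. This is a minimal sketch using two toy files written to a temporary location; the links and gene IDs are placeholders. Swap in "Axolotl_KEGG-pathways.gmt" and "Axolotl_KEGG-pathways.des" to check the real files.

```r
# Toy GMT and description files (placeholder links and gene IDs)
gmt_file <- tempfile(fileext = ".gmt")
des_file <- tempfile(fileext = ".des")
writeLines(c("map00010\tlink1\tgeneA\tgeneB",
             "map00020\tlink2\tgeneA"), gmt_file)
writeLines("map00010\tGlycolysis / Gluconeogenesis", des_file)

# The term ID is everything before the first tab on each GMT line
gmt_terms <- sub("\t.*$", "", readLines(gmt_file))

# The description file has two tab-separated columns and no header
des <- read.delim(des_file, header = FALSE, col.names = c("term", "name"),
                  stringsAsFactors = FALSE, quote = "")

# Terms present in the GMT but lacking a description
missing_des <- setdiff(gmt_terms, des$term)
print(missing_des)
```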

6. Run WebGestaltR ORA and GSEA with custom database files

We will use the same approach as with clusterProfiler and run ORA with GO and GSEA with KEGG for both tools.

6.1 WebGestaltR ORA of GO terms

Parameters to note for novel species:

  • organism = "others"
  • enrichDatabaseFile = "Axolotl_GO.gmt"
  • enrichDatabaseDescriptionFile = "Axolotl_GO.des"

outputDirectory <- "WebGestaltR_results"
project <- "Axolotl_ORA_GO"

# WebGestaltR does not create the output directory, so create it first if it does not exist
dir.create(outputDirectory, showWarnings = FALSE)

WebGestaltR(
    organism = "others",                          # must specify 'others' when using custom db files
    enrichMethod = "ORA",                         # Perform ORA, GSEA or NTA
    interestGene = degs,                          # gene list of interest
    referenceGene = background,                   # background genes
    enrichDatabaseFile = "Axolotl_GO.gmt",        # the custom gmt file
    enrichDatabaseDescriptionFile = "Axolotl_GO.des",   # the custom description file
    isOutput = TRUE,                              # Set to FALSE if you don't want files saved to disk
    fdrMethod = "BH",                             # Benjamini-Hochberg multiple testing correction
    sigMethod = "fdr",                            # Significance method ('fdr' or 'top')
    fdrThr = 0.05,                                # FDR significance threshold
    minNum = 10,                                  # Minimum number of genes per category
    maxNum = 500,                                 # Maximum number of genes per category
    outputDirectory = outputDirectory,
    projectName = project
)

Open the HTML report WebGestaltR_results/Project_Axolotl_ORA_GO/Report_Axolotl_ORA_GO.html from the files pane in a browser.

Note some term similarity to what we have seen with the past 2 analyses (that’s reassuring!)

We no longer have GO Slim, as this needs to call the actual GO database, which we haven’t used.

Change the ‘Enrichment Results’ view from table to ‘Bar chart’, then try the ‘Affinity propagation’ and ‘Weighted set cover’ term clustering algorithms. ‘All’ shows more terms with higher specificity, whereas the two redundancy-reduction algorithms cluster similar terms to give fewer terms and a more concise overview. It’s up to you as the researcher to decide which approach is best suited to your dataset!

Confirm that our GMT file correctly included the term link by selecting a term and clicking the hyperlink at Analyte set. Pretty neat huh :-)

6.2 WebGestaltR GSEA of KEGG Pathways

This will take slightly longer than ORA. We will set threads to 7 to speed it up as much as we can.

There is no seed parameter for WebGestaltR GSEA as there is for clusterProfiler. We can set it in R instead with set.seed().

set.seed(123)
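
As a small illustration of what set.seed() does, the base R sketch below draws random numbers twice from the same seed and once more without resetting it. The first two draws are identical, while the third should differ because the random number generator has moved on. Whether the seed fully fixes GSEA results when several threads are used is worth confirming by re-running the analysis.

```r
# Same seed, same random draws
set.seed(123)
first_run <- sample(1:100, 5)

set.seed(123)
second_run <- sample(1:100, 5)

# No reset here, so the generator continues from its current state
third_run <- sample(1:100, 5)

identical(first_run, second_run)
identical(first_run, third_run)
```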

Again we are specifying organism = "others" and providing our GMT and description file:

outputDirectory <- "WebGestaltR_results"
project <- "Axolotl_GSEA_KEGG"

# WebGestaltR does not create the output directory, so create it first if it does not exist
dir.create(outputDirectory, showWarnings = FALSE)

suppressWarnings({ WebGestaltR(
    organism = "others",                          # must specify 'others' when using custom db files
    enrichMethod = "GSEA",                        # Perform ORA, GSEA or NTA
    interestGene = ranked_df,                     # ranked dataframe
    enrichDatabaseFile = "Axolotl_KEGG-pathways.gmt",        # the custom gmt file
    enrichDatabaseDescriptionFile = "Axolotl_KEGG-pathways.des",   # the custom description file
    isOutput = TRUE,                              # Set to FALSE if you don't want files saved to disk
    fdrMethod = "BH",                             # Benjamini-Hochberg multiple testing correction
    sigMethod = "fdr",                            # Significance method ('fdr' or 'top')
    fdrThr = 0.05,                                # FDR significance threshold
    minNum = 10,                                  # Minimum number of genes per category
    maxNum = 500,                                 # Maximum number of genes per category
    outputDirectory = outputDirectory,
    projectName = project,
    nThreads = 7
) })

Open the HTML report WebGestaltR_results/Project_Axolotl_GSEA_KEGG/Report_Axolotl_GSEA_KEGG.html from the files pane in a browser.

Expand ‘Job summary’ to read that “22 positive related categories and no negative related categories” are significant in this analysis. This is in contrast to the one negative category we observed when running KEGG GSEA with clusterProfiler. We expect some differences between these tools.

Compare the tabular results in this report to the treeplot we produced under code chunk treeplot CP KEGG GSEA. There are a lot of shared terms, and this is reassuring.

7. Save versions and session details

GO database version

Print the database version of GO Core Ontology used:

# Read go.obo lines
lines <- readLines("go.obo")

# Use grep to pull the "data-version" line
# (named go_version to avoid masking R's built-in 'version' object)
go_version <- grep("data-version", lines, value = TRUE)

# Print version
cat("GO version from go.obo file:", go_version, "\n")
## GO version from go.obo file: data-version: releases/2024-06-17

KEGG Pathways database version

The KEGG pathways file does not contain any version details within the file contents, but does have the date saved in the name of the file that was imported into this workbook. Adding the date of download was done manually, and is always recommended practice for files and databases that do not contain any date or version details.
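
When the date is stored in the file name, it can also be pulled out programmatically and printed alongside the other version details. The sketch below assumes an ISO-style date (YYYY-MM-DD) in the name; the file name shown is hypothetical and is not the file used in this workbook.

```r
# Hypothetical file name with the download date added by hand
kegg_file <- "kegg_pathways_2025-01-31.txt"

# Extract the first YYYY-MM-DD pattern from the file name
date_match <- regmatches(kegg_file,
                         regexpr("[0-9]{4}-[0-9]{2}-[0-9]{2}", kegg_file))
download_date <- as.Date(date_match)

cat("KEGG pathways file downloaded:", format(download_date), "\n")
```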

R version and R package versions

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
##  [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
##  [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] enrichplot_1.26.2      WebGestaltR_0.4.6      clusterProfiler_4.14.1
##  [4] lubridate_1.9.3        forcats_1.0.0          stringr_1.5.1         
##  [7] purrr_1.0.2            tidyr_1.3.1            tibble_3.2.1          
## [10] ggplot2_3.5.1          tidyverse_2.0.0        ontologyIndex_2.12    
## [13] dplyr_1.1.4            readr_2.1.5           
## 
## loaded via a namespace (and not attached):
##   [1] DBI_1.2.2               gson_0.1.0              rlang_1.1.3            
##   [4] magrittr_2.0.3          DOSE_4.0.0              compiler_4.4.2         
##   [7] RSQLite_2.3.7           systemfonts_1.0.6       png_0.1-8              
##  [10] vctrs_0.6.5             reshape2_1.4.4          pkgconfig_2.0.3        
##  [13] crayon_1.5.2            fastmap_1.1.1           XVector_0.46.0         
##  [16] labeling_0.4.3          utf8_1.2.4              rmarkdown_2.26         
##  [19] tzdb_0.4.0              UCSC.utils_1.2.0        bit_4.0.5              
##  [22] xfun_0.43               zlibbioc_1.52.0         cachem_1.0.8           
##  [25] aplot_0.2.3             GenomeInfoDb_1.42.0     jsonlite_1.8.8         
##  [28] blob_1.2.4              highr_0.10              BiocParallel_1.40.0    
##  [31] parallel_4.4.2          R6_2.5.1                bslib_0.7.0            
##  [34] stringi_1.8.4           RColorBrewer_1.1-3      jquerylib_0.1.4        
##  [37] GOSemSim_2.32.0         iterators_1.0.14        Rcpp_1.0.12            
##  [40] knitr_1.46              ggtangle_0.0.4          R.utils_2.12.3         
##  [43] IRanges_2.40.0          Matrix_1.7-1            splines_4.4.2          
##  [46] igraph_2.0.3            timechange_0.3.0        tidyselect_1.2.1       
##  [49] qvalue_2.38.0           rstudioapi_0.16.0       yaml_2.3.8             
##  [52] doParallel_1.0.17       codetools_0.2-19        curl_5.2.1             
##  [55] doRNG_1.8.6             lattice_0.22-5          plyr_1.8.9             
##  [58] treeio_1.30.0           Biobase_2.66.0          withr_3.0.0            
##  [61] KEGGREST_1.46.0         evaluate_0.23           gridGraphics_0.5-1     
##  [64] Biostrings_2.74.0       ggtree_3.14.0           pillar_1.9.0           
##  [67] rngtools_1.5.2          whisker_0.4.1           foreach_1.5.2          
##  [70] stats4_4.4.2            ggfun_0.1.7             generics_0.1.3         
##  [73] vroom_1.6.5             S4Vectors_0.44.0        hms_1.1.3              
##  [76] tidytree_0.4.6          munsell_0.5.1           scales_1.3.0           
##  [79] apcluster_1.4.13        glue_1.7.0              lazyeval_0.2.2         
##  [82] tools_4.4.2             ggnewscale_0.5.0        data.table_1.15.4      
##  [85] fgsea_1.32.0            fs_1.6.4                fastmatch_1.1-4        
##  [88] cowplot_1.1.3           grid_4.4.2              ape_5.8                
##  [91] AnnotationDbi_1.68.0    colorspace_2.1-0        nlme_3.1-165           
##  [94] GenomeInfoDbData_1.2.13 patchwork_1.3.0         cli_3.6.2              
##  [97] fansi_1.0.6             svglite_2.1.3           gtable_0.3.5           
## [100] R.methodsS3_1.8.2       yulab.utils_0.1.8       sass_0.4.9             
## [103] digest_0.6.35           BiocGenerics_0.52.0     ggrepel_0.9.6          
## [106] ggplotify_0.1.2         farver_2.1.2            memoise_2.0.1          
## [109] htmltools_0.5.8.1       R.oo_1.26.0             lifecycle_1.0.4        
## [112] httr_1.4.7              GO.db_3.20.0            bit64_4.0.5

RStudio version

Finally, we record the RStudio version. Typically, we would simply run RStudio.Version() to print the version details. However, when we knit this document to HTML, the RStudio.Version() function is not available and will cause an error. To make sure the version details are saved to our static record of the work, we save them to a file, then print the file contents back into the notebook.

# Get RStudio version information
rstudio_info <- RStudio.Version()

# Convert the version information to a string
rstudio_version_str <- paste(
  "RStudio Version Information:\n",
  "Version: ", rstudio_info$version, "\n",
  "Release Name: ", rstudio_info$release_name, "\n",
  "Long Version: ", rstudio_info$long_version, "\n",
  "Mode: ", rstudio_info$mode, "\n",
  "Citation: ", rstudio_info$citation,
  sep = ""
)

# Write the output to a text file
writeLines(rstudio_version_str, "rstudio_version.txt")
# Read the saved version information from the file
rstudio_version_text <- readLines("rstudio_version.txt")

# Print the version information to the document
rstudio_version_text
## [1] "RStudio Version Information:"                                                                                                                                                                                                                                                                           
## [2] "Version: 2023.6.1.524"                                                                                                                                                                                                                                                                                  
## [3] "Release Name: Mountain Hydrangea"                                                                                                                                                                                                                                                                       
## [4] "Long Version: 2023.06.1+524"                                                                                                                                                                                                                                                                            
## [5] "Mode: server"                                                                                                                                                                                                                                                                                           
## [6] "Citation: list(title = \"RStudio: Integrated Development Environment for R\", author = list(list(given = \"Posit team\", family = NULL, role = NULL, email = NULL, comment = NULL)), organization = \"Posit Software, PBC\", address = \"Boston, MA\", year = \"2023\", url = \"http://www.posit.co/\")"
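
One way to let the same chunk run both interactively and during knitting is to guard the call. This is a sketch under the assumption that RStudio.Version only exists as a function inside an interactive RStudio session; the demo file name and the fallback message are illustrative.

```r
# Illustrative file name, written to a temporary directory for this sketch
version_file <- file.path(tempdir(), "rstudio_version_demo.txt")

# Only call RStudio.Version() where it is available (interactive RStudio)
if (exists("RStudio.Version", mode = "function")) {
  info <- RStudio.Version()
  writeLines(paste0("Long Version: ", info$long_version), version_file)
}

# When knitting, fall back to whatever was saved earlier
rstudio_lines <- if (file.exists(version_file)) {
  readLines(version_file)
} else {
  "RStudio version file not found; run this chunk interactively in RStudio first"
}
print(rstudio_lines)
```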

8. Knit workbook to HTML

The last task is to knit the notebook. Our notebook is editable, and deleting code also deletes its output, so we could lose valuable details. Knitting the notebook to HTML gives us a permanent static copy of the work.

On the editor pane toolbar, under Preview, select Knit to HTML.

If you have already run Preview, you will see Knit instead of Preview.

The HTML file will be saved in the same directory as the notebook, and with the same filename, but the .Rmd extension will be replaced by .html. The knitted HTML will typically open automatically once complete. If you receive a popup blocker error, click cancel, and in the Files pane of RStudio, single-click the .html file for this notebook and select View in Web Browser.

Note that the notebook will only knit successfully if there are no errors in the code. ‘Preview’, by contrast, will still produce HTML when the code contains errors.