Ensure the ‘workshop’ directory is your current working directory:
getwd()
## [1] "/home/user2/workshop"
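Before going further, it can help to confirm that the input files read later in this notebook are present. A minimal sketch, assuming all four files sit directly in the working directory:

```r
# Input files read later in this notebook (assumed to sit in the working directory)
required_files <- c(
  "AmexG_v6.0-DD.emapper.annotations.txt",
  "go.obo",
  "kegg_pathways_2024-11-13.txt",
  "axolotl_DE_results.txt"
)

# Report any that cannot be found
missing_files <- required_files[!file.exists(required_files)]
if (length(missing_files) > 0) {
  message("Missing input files: ", paste(missing_files, collapse = ", "))
} else {
  message("All input files found")
}
```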
emapper axolotl annotation file, GO ontology file and KEGG Pathways file
The raw reads were processed to counts matrix and DE results broadly
following https://github.com/Sydney-Informatics-Hub/RNASeq-DE
using the reference genome from www.axolotl-omics.org. Differential gene
expression analysis was performed in R with DESeq2 v
1.46.0, filtering for genes with at least a count of 10 in at least 2
samples. The data comprises 2 groups (proximal blastema and distal
blastema) and 2 replicates per group.
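The low-count filter described above can be illustrated on a toy matrix. This is only a sketch of the rule "a count of at least 10 in at least 2 samples", not the workshop's actual filtering code; the gene names, sample names and counts are hypothetical.

```r
# Toy count matrix (hypothetical values): 4 genes x 4 samples
counts <- matrix(
  c(25, 30,  0,  2,   # geneA: >= 10 in two samples, kept
     0,  3,  1,  0,   # geneB: never reaches 10, dropped
    12,  4,  6,  9,   # geneC: >= 10 in one sample only, dropped
    50, 61, 48, 55),  # geneD: >= 10 in all samples, kept
  nrow = 4, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                  c("prox_1", "prox_2", "dist_1", "dist_2"))
)

# Keep genes with a count of at least 10 in at least 2 samples
keep <- rowSums(counts >= 10) >= 2
filtered_counts <- counts[keep, , drop = FALSE]
rownames(filtered_counts)
```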
A predicted proteome was created by extracting the predicted peptide
sequences from the GTF file from www.axolotl-omics.org then filtering
for longest isoform per gene with AGAT v 1.4.0. The
predicted proteome was annotated against GO and KEGG with
eggNOG emapper v 2.1.12.
The emapper annotation output contains results against a
number of databases including GO and KEGG, which we will focus on
today.
eggnog_anno <- read_tsv("AmexG_v6.0-DD.emapper.annotations.txt", show_col_types = FALSE)
head(eggnog_anno)
The raw annotation file provides us with ‘term ID to gene ID’ mappings for our species. We also need ‘term ID to term description’ mappings. These files are not organism-specific: we will extract only the terms that are found within our custom species annotation, to make our organism-specific version.
For GO, we will use the GO ‘core’ ontology file, downloaded from https://purl.obolibrary.org/obo/go.obo and included in
the data files you downloaded to your workshop directory
earlier.
We will use the ontologyIndex package to retrieve
ontology info and save to an object named ontology for
later use creating the required custom database files for
clusterProfiler and WebGestaltR.
ontology <- ontologyIndex::get_ontology(file = "go.obo",
propagate_relationships = "is_a", #propagates relationships from parent terms to children
extract_tags = "everything", # retrieve all available details for each term
merge_equivalent_terms = TRUE) # avoid unnecessary redundancy
For KEGG, we have both map and ko IDs in
our emapper annotation.
ko terms (https://www.genome.jp/kegg/ko.html) represent
orthologous groups of genes, which are assigned based on evolutionary
relationships and functional similarity, so can provide more precise
functional categorisation which can be particularly useful when working
with novel species which lack curated pathway information.
map terms (https://www.genome.jp/kegg/pathway.html) are manually
drawn pathway maps representing KEGG database of molecular interaction,
reaction and relation networks for: Metabolism, Genetic Information
Processing, Environmental Information Processing, Cellular Processes,
Organismal Systems, Human Diseases, and Drug Development.
Today we will be working with the map terms due to
database download restrictions.
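To make the distinction concrete, here is a small base R sketch of pulling only the map IDs out of a comma-delimited annotation cell. The example string is hypothetical; it assumes ko and map IDs are listed together in one cell, as described above.

```r
# Hypothetical KEGG_Pathway cell for one gene: ko and map IDs listed together
kegg_pathway_cell <- "ko00010,ko01100,map00010,map01100"

# Split the comma-delimited string into individual IDs
ids <- strsplit(kegg_pathway_cell, ",", fixed = TRUE)[[1]]

# Keep only the map IDs
map_ids <- ids[startsWith(ids, "map")]
map_ids
```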
Free access to the KEGG FTP downloads requires an academic subscription, for which you must confirm that you are the “only user of the KEGG FTP Data”. The pathway list, by contrast, was freely available. As a single user, you can request academic access here: https://www.pathway.jp/en/academic.html.
There is an alternate method for using ko IDs, which
uses the KEGG ontology information available through the
clusterProfiler functions enrichKEGG and
gseKEGG. An example of the R code can be found here https://github.com/dadrasarmin/enrichment_analysis_for_non_model_organism.
However, this poses a problem: as the novel species gene IDs are
assigned to KEGG terms, gene:name duplicate records are identified, and
duplicates must be removed in order to avoid errors running the
enrichment. This loss of data will have a real impact on the results,
with the importance of some terms being underestimated.
Given these considerations, we will proceed with map
pathway terms :-)
The KEGG map pathway list was downloaded from https://rest.kegg.jp/list/pathway and saved to your
workshop directory.
kegg_pathways <- read.table("kegg_pathways_2024-11-13.txt", header = FALSE, sep = "\t", col.names = c("term", "name"))
head(kegg_pathways)
Load the DE results file for axolotl:
de_matrix <- read_tsv("axolotl_DE_results.txt", col_names = TRUE, show_col_types = FALSE)
head(de_matrix)
Recall from the last 2 activities that clusterProfiler
requires a vector object for GSEA, while WebGestaltR
requires a 2-column dataframe. Since we intend to use both tools, let’s
create both now:
# Create ranked vector for clusterProfiler GSEA
ranked_vector <- setNames(de_matrix$log2FoldChange, de_matrix$geneID) %>% sort(decreasing = TRUE) # Named vector
# check
head(ranked_vector)
## AMEX60DD020778 AMEX60DD020772 AMEX60DD020773 AMEX60DD005693 AMEX60DD020780
## 10.596655 10.583791 10.281349 10.067803 9.789798
## AMEX60DD030633
## 9.691689
tail(ranked_vector)
## AMEX60DD051201 AMEX60DD028124 AMEX60DD044496 AMEX60DD020182 AMEX60DD053589
## -7.886591 -7.939796 -8.070068 -8.255430 -9.226586
## AMEX60DD007432
## -9.918767
# Create ranked dataframe for WebGestaltR GSEA
# extract ranked dataframe
ranked_df <- de_matrix %>%
arrange(desc(log2FoldChange)) %>%
dplyr::select(geneID, log2FoldChange)
# check
head(ranked_df)
tail(ranked_df)
For ORA, both tools require vector class gene lists. We will filter for an adjusted P value of at most 0.01 and an absolute log2 fold change of at least 1.5.
The matrix has already filtered out genes with very low counts so we take all genes present as the background.
# Filter for DEGs and save gene IDs as vector
degs <- de_matrix %>%
filter(padj <= 0.01 & abs(log2FoldChange) >= 1.5) %>%
pull(geneID) # Extract
# Extract the background gene list vector
background <- de_matrix %>%
pull(geneID)
# Check number of genes:
cat("Number of DEGs:", length(degs), "\n") # Number of DEGs
## Number of DEGs: 247
cat("Number of background genes:", length(background), "\n") # Number of background genes
## Number of background genes: 24419
# Check format:
head(degs)
## [1] "AMEX60DD000080" "AMEX60DD000147" "AMEX60DD001144" "AMEX60DD001307"
## [5] "AMEX60DD001377" "AMEX60DD001828"
head(background)
## [1] "AMEX60DD000001" "AMEX60DD000002" "AMEX60DD000003" "AMEX60DD000004"
## [5] "AMEX60DD000005" "AMEX60DD000006"
Note the large drop in gene numbers: 100K in the GTF, 48K in the predicted proteome, and 24K expressed in the blastema! By restricting the background to genes that are expressed in the studied tissue, we avoid artificially small P values and reduce false positives within our list of enriched terms.
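The effect of background size can be seen with a one-sided hypergeometric test, the kind of test that underlies ORA. The sketch below uses hypothetical numbers for a single term and simply changes the size of the universe; it is an illustration of the principle, not a reproduction of either tool's internals.

```r
# Hypothetical ORA inputs for a single term
overlap   <- 7    # DEGs annotated to the term
term_size <- 45   # genes annotated to the term
n_degs    <- 145  # size of the DEG list

# One-sided hypergeometric P value: chance of seeing >= overlap DEGs in the term
ora_p <- function(universe_size) {
  phyper(overlap - 1, term_size, universe_size - term_size, n_degs, lower.tail = FALSE)
}

p_expressed <- ora_p(24000)   # background restricted to expressed genes
p_all_genes <- ora_p(100000)  # background of every predicted gene model

c(expressed = p_expressed, all_genes = p_all_genes)
```

With the same overlap, the larger universe gives the smaller P value, which is why an overly broad background can make terms look more significant than they are.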
Saving any outputs generated from R code is vital to reproducibility! You should include all analysed gene lists within the supplementary materials of your manuscript.
# Save DEGs
write.table(degs, file = "Axolotl_DEGs.txt", quote = FALSE, col.names = FALSE, row.names = FALSE, sep = "\t")
# Save background
write.table(background, file = "Axolotl_background.txt", quote = FALSE, col.names = FALSE, row.names = FALSE, sep = "\t")
# Save ranked
write.table(ranked_df, file = "Axolotl_rankedFC.txt", sep = "\t", quote = FALSE, row.names = FALSE, col.names = TRUE)
clusterProfiler GO and KEGG analysis
Now we have annotation files and gene lists, we will bring those together to create the custom database files required for R FEA!
These are 2 column text files with the term ID (one per line) alongside the ID of the gene that maps to the term. A gene can map to many terms and thus be present on multiple lines. A term can be mapped to more than one gene and thus be present on many lines.
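A toy version of this long format may help. The term and gene IDs below are hypothetical; the point is only that terms and genes can each repeat across rows.

```r
# Toy TERM2GENE table (hypothetical IDs), one term-gene pair per row
toy_term2gene <- data.frame(
  term = c("term_1", "term_1", "term_2", "term_3", "term_3"),
  gene = c("geneA",  "geneB",  "geneA",  "geneB",  "geneC")
)

# Genes per term: a term appears on one row per gene mapped to it
genes_per_term <- table(toy_term2gene$term)
genes_per_term

# Terms per gene: a gene appears on one row per term it maps to
terms_per_gene <- table(toy_term2gene$gene)
terms_per_gene
```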
Check the column names of the emapper annotation file so
we know which are the GO and KEGG column names:
colnames(eggnog_anno)
## [1] "#query" "seed_ortholog" "evalue" "score"
## [5] "eggNOG_OGs" "max_annot_lvl" "COG_category" "Description"
## [9] "Preferred_name" "GOs" "EC" "KEGG_ko"
## [13] "KEGG_Pathway" "KEGG_Module" "KEGG_Reaction" "KEGG_rclass"
## [17] "BRITE" "KEGG_TC" "CAZy" "BiGG_Reaction"
## [21] "PFAMs"
We need GOs and KEGG_Pathway columns.
Next, we will extract the GO IDs from the emapper
annotation file, and wrangle into the correct format for
clusterProfiler TERM2GENE.
There are several steps to this - comments have been included to outline what each step is doing.
go_term2gene <- eggnog_anno %>%
dplyr::select(GOs, `#query`) %>% # select the GO column and the query column (axolotl gene ID)
dplyr::filter(GOs != "-") %>% # filter out rows where the GO ID is "-" ie no GO annotation for this gene
separate_rows(GOs, sep = ",") %>% # split comma-delimited list of many GO terms for a gene into separate rows
dplyr::select(GOs, `#query`) %>% # keep the GO and query columns
distinct() %>% # remove any duplicate rows
drop_na() # remove rows with missing values
# Rename columns to match desired output format
colnames(go_term2gene) <- c("term", "gene")
# Save to file
write.table(go_term2gene, file = "Axolotl_GO_term2gene.txt", sep = "\t", quote = FALSE, row.names = FALSE, col.names = TRUE)
# Number of rows:
cat("Number of GO term2gene rows:", nrow(go_term2gene), "\n")
## Number of GO term2gene rows: 2839193
# Check first few rows
head(go_term2gene)
Here we use the same process as we did above for GO (column name
GOs), selecting a different column name for KEGG
(KEGG_Pathway).
kegg_term2gene <- eggnog_anno %>%
dplyr::select(KEGG_Pathway, `#query`) %>% # Select the relevant columns
dplyr::filter(grepl("map", KEGG_Pathway)) %>% # Keep only rows where KEGG_Pathway contains 'map'
separate_rows(KEGG_Pathway, sep = ",") %>% # Split multiple pathways into separate rows
dplyr::mutate(term = gsub("map:", "", KEGG_Pathway)) %>% # Remove the "map:" prefix
dplyr::filter(grepl("^map", term)) %>% # Filter again to make sure we only have map pathways (after removing "map:")
dplyr::select(term, `#query`) %>% # Select the pathway (term) and gene columns
distinct() %>% # Remove duplicate rows
drop_na() # Remove rows with missing values
# Rename columns to match desired output format
colnames(kegg_term2gene) <- c("term", "gene")
# Save to file
write.table(kegg_term2gene, file = "Axolotl_KEGG-Pathways_term2gene.txt", sep = "\t", quote = FALSE, row.names = FALSE, col.names = TRUE)
cat("Number of KEGG term2gene rows:", nrow(kegg_term2gene), "\n")
## Number of KEGG term2gene rows: 59305
# View result to check
head(kegg_term2gene)
Now we will assign term descriptions to term IDs and create our
TERM2NAME files.
This may take a few moments to run. It will use the
ontology object we created earlier from the
go.obo file.
# Create term to name table, removing duplicates, missing values and obsolete terms
go_term2name <- go_term2gene %>% # only keep terms that are in our term2gene object (ie, mapped to axolotl)
mutate(name = ontology$name[term]) %>%
dplyr::select(term, name) %>%
distinct() %>%
drop_na() %>%
filter(!grepl("obsolete", name))
# Save to file
write.table(go_term2name, file = "Axolotl_GO_term2name.txt", sep = "\t", quote = FALSE, row.names = FALSE, col.names = TRUE)
# Show the first few lines
head(go_term2name)
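The lookup in the chunk above relies on indexing a named character vector by term ID. A base R sketch of that idea, using a toy stand-in for ontology$name with hypothetical entries:

```r
# Toy stand-in for ontology$name: a character vector named by term ID (hypothetical entries)
term_names <- c(
  "GO:A" = "toy process",
  "GO:B" = "obsolete toy function"
)

# Terms as they might appear in a TERM2GENE table; "GO:C" has no entry above
terms <- c("GO:A", "GO:B", "GO:C", "GO:A")

# Indexing by name returns one value per term, with NA for unknown IDs
looked_up <- unname(term_names[terms])
looked_up

# Mirror the clean-up steps: drop NA, duplicates and obsolete names
keep <- !is.na(looked_up) & !duplicated(terms) & !grepl("obsolete", looked_up)
toy_term2name <- data.frame(term = terms[keep], name = looked_up[keep])
toy_term2name
```

Only "GO:A" survives here: "GO:B" is dropped as obsolete, "GO:C" has no name, and the second "GO:A" is a duplicate.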
The KEGG Pathways file was available for download in the correct
format for TERM2NAME.
head(kegg_pathways)
Let’s restrict it to include the terms relevant to our analysis, and then print that to a file for reproducibility.
kegg_term2name <- kegg_pathways %>%
dplyr::filter(term %in% kegg_term2gene$term) %>% # Only keep terms that are in kegg_term2gene
distinct() %>% # Remove duplicate entries
drop_na() # Remove rows with missing values
# Save the result to a file
write.table(kegg_term2name, file = "Axolotl_KEGG_term2name.txt", sep = "\t", quote = FALSE, row.names = FALSE, col.names = TRUE)
# Check first few rows
head(kegg_term2name)
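It may also be worth checking whether any annotated terms are missing from the pathway list, since such terms would have no description. A sketch on toy stand-ins for the two tables; the gene IDs and the term map99999 are hypothetical.

```r
# Toy stand-ins (hypothetical rows) for kegg_term2gene and kegg_pathways
toy_kegg_term2gene <- data.frame(
  term = c("map00010", "map00010", "map04260", "map99999"),
  gene = c("geneA", "geneB", "geneC", "geneD")
)
toy_pathways <- data.frame(
  term = c("map00010", "map04260", "map04530"),
  name = c("Glycolysis / Gluconeogenesis", "Cardiac muscle contraction", "Tight junction")
)

# Annotated terms with no description in the pathway list
unnamed_terms <- setdiff(toy_kegg_term2gene$term, toy_pathways$term)
unnamed_terms

# Number of annotation rows that rely on those unnamed terms
n_unnamed_rows <- sum(toy_kegg_term2gene$term %in% unnamed_terms)
n_unnamed_rows
```

The same two lines could be run on kegg_term2gene and kegg_pathways to see how many annotations would lack a name.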
How much of our proteome was annotated? What about our DEGs and background?
Genes that do not have any annotation are excluded from enrichment analysis, so having an understanding of the extent of annotation for your novel species is very important when interpreting results!
Count the number of GO terms found within the genome, and the number of genes with GO annotations:
go_total_terms<-nrow(go_term2gene)
print(paste("Total annotations to GO:", go_total_terms))
## [1] "Total annotations to GO: 2839193"
go_unique_genes <- length(unique(go_term2gene$gene))
print(paste("Number of unique genes with 1 or more annotation terms:", go_unique_genes))
## [1] "Number of unique genes with 1 or more annotation terms: 21373"
And for KEGG:
kegg_total_terms<-nrow(kegg_term2gene)
print(paste("Total annotations to KEGG Pathways:", kegg_total_terms))
## [1] "Total annotations to KEGG Pathways: 59305"
kegg_unique_genes <- length(unique(kegg_term2gene$gene))
print(paste("Number of unique genes with 1 or more annotation terms:", kegg_unique_genes))
## [1] "Number of unique genes with 1 or more annotation terms: 12226"
47,196 putative axolotl proteins were annotated. That’s around 1/4 of our predicted proteins mapped to KEGG Pathways, and less than half of our genes mapped to GO! Ouch. As much as we expect this with uncurated novel species genomes, it’s still unpleasant to face :-)
What of the genes in our gene list specifically? We have an uncurated proteome, yet the genes in our input matrix were expressed at a meaningful level within axolotl, so these may actually have a higher annotation percentage than all genes in the proteome.
# Filter the term2gene table to only include genes in the background gene list
go_filtered_term2gene <- go_term2gene %>% filter(gene %in% background)
# Count the number of unique background genes with at least one GO term
unique_genes_with_go <- go_filtered_term2gene %>% distinct(gene) %>% nrow()
# Calculate the percentage of background genes that have GO annotations
percent_go_unique <- (unique_genes_with_go / length(background)) * 100
# Print results
cat("Number of input genes with GO annotations:", unique_genes_with_go, "(",percent_go_unique,"%)\n")
## Number of input genes with GO annotations: 15072 ( 61.72243 %)
# Filter the term2gene table to only include genes in the background gene list
kegg_filtered_term2gene <- kegg_term2gene %>% filter(gene %in% background)
# Count the number of unique background genes with at least one KEGG Pathways term
unique_genes_with_kegg <- kegg_filtered_term2gene %>% distinct(gene) %>% nrow()
# Calculate the percentage of background genes that have KEGG Pathways annotations
percent_kegg_unique <- (unique_genes_with_kegg / length(background)) * 100
# Print results
cat("Number of input genes with KEGG Pathways annotations:", unique_genes_with_kegg, "(",percent_kegg_unique,"%)\n")
## Number of input genes with KEGG Pathways annotations: 8000 ( 32.76137 %)
As expected, the annotation % is higher for expressed genes than all predicted genes, and very much higher than the GTF of 99,088 predicted gene models (!!!) with an annotation rate of 21.6%.
This highlights a major caveat when performing FEA on non-model species: the results are only as good as the annotations behind them. Therefore, all results must be interpreted with caution. For many novel (and under-funded) species, there are few opportunities (at present) to improve the annotation. Some in-silico predicted genes appear to be highly expressed and significantly regulated yet have no significant similarity to anything in the non-redundant nucleotide or protein databases. When working with datasets like this, it is critical to explore those individual genes through other methods, in addition to trying to garner some higher level overview such as we aim to obtain from FEA. Hopefully, recent advances in AI protein modelling can help provide insights into the functions of these novel genes.
For the axolotl, with only 22% of predicted genes annotated, it’s clear that the in-silico gene predictions within the GTF file require much curation!
clusterProfiler universal FEA functions enricher and GSEA
In the interest of time, and to try and cover as many options as possible, let’s do ORA with GO and GSEA with KEGG for both tools.
The enricher function is the ‘universal’ ORA option that
accepts the TERM2GENE and TERM2NAME files we
have just created.
Let’s review the help page:
?clusterProfiler::enricher
There are parameters for both adjusted P value and q value. Terms must pass all thresholds (unadjusted P, adjusted P, and q value) so the important filter will be the most stringent test applied. Let’s go with BH and 0.05 which we have used regularly within this workshop and are fairly common choices in the field.
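As a reminder of what BH adjustment does to a set of P values, here is a small base R sketch using p.adjust on hypothetical values:

```r
# Hypothetical unadjusted P values for five terms
pvals <- c(0.001, 0.01, 0.02, 0.045, 0.5)

# Benjamini-Hochberg adjustment
padj <- p.adjust(pvals, method = "BH")
padj

# Terms below a 0.05 cutoff before and after adjustment
n_raw <- sum(pvals < 0.05)
n_adj <- sum(padj < 0.05)
c(unadjusted = n_raw, adjusted = n_adj)
```

Four terms pass on the unadjusted P value but only three pass after adjustment, so the adjusted threshold is the more stringent filter in this toy case.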
We need to provide term2gene and term2name, and we don’t specify an organism.
cp_go_ora <- enricher(
gene = degs,
pvalueCutoff = 0.05,
pAdjustMethod = "BH",
universe = background,
minGSSize = 10,
maxGSSize = 500,
TERM2GENE = go_term2gene,
TERM2NAME = go_term2name
)
cp_go_ora
## #
## # over-representation test
## #
## #...@organism UNKNOWN
## #...@ontology UNKNOWN
## #...@gene chr [1:247] "AMEX60DD000080" "AMEX60DD000147" "AMEX60DD001144" ...
## #...pvalues adjusted by 'BH' with cutoff <0.05
## #...91 enriched terms found
## 'data.frame': 91 obs. of 12 variables:
## $ ID : chr "GO:0048821" "GO:0070268" "GO:0031424" "GO:0019317" ...
## $ Description : chr "erythrocyte development" "cornification" "keratinization" "fucose catabolic process" ...
## $ GeneRatio : chr "7/145" "7/145" "7/145" "5/145" ...
## $ BgRatio : chr "45/15072" "46/15072" "48/15072" "18/15072" ...
## $ RichFactor : num 0.156 0.152 0.146 0.278 0.278 ...
## $ FoldEnrichment: num 16.2 15.8 15.2 28.9 28.9 ...
## $ zScore : num 10.04 9.92 9.68 11.66 11.66 ...
## $ pvalue : num 2.20e-07 2.58e-07 3.49e-07 5.96e-07 5.96e-07 ...
## $ p.adjust : num 0.000266 0.000266 0.000266 0.000266 0.000266 ...
## $ qvalue : num 0.000242 0.000242 0.000242 0.000242 0.000242 ...
## $ geneID : chr "AMEX60DD002868/AMEX60DD025537/AMEX60DD026264/AMEX60DD026267/AMEX60DD032898/AMEX60DD032958/AMEX60DD032960" "AMEX60DD004554/AMEX60DD010118/AMEX60DD016775/AMEX60DD038318/AMEX60DD039599/AMEX60DD039603/AMEX60DD039606" "AMEX60DD004554/AMEX60DD010118/AMEX60DD016775/AMEX60DD038318/AMEX60DD039599/AMEX60DD039603/AMEX60DD039606" "AMEX60DD004640/AMEX60DD004644/AMEX60DD004647/AMEX60DD004651/AMEX60DD017241" ...
## $ Count : int 7 7 7 5 5 5 16 5 5 7 ...
## #...Citation
## S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang, W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize multiomics data. Nature Protocols. 2024, doi:10.1038/s41596-024-01020-z
91 terms are significantly enriched at adjusted P < 0.05.
Look at the GeneRatio column: our gene list object degs
has 247 genes, but the tool has used an input size of 145. This is
because it automatically discards any genes that do not have
annotations.
Results would be the same if we instead supplied only the
annotated DEGs (for example, as an annotated_degs object).
Likewise, the background size is reported as 15,072 (the number annotated), not 24,419 (the total in the background list).
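The ratio columns can be checked by hand. The sketch below takes the values from the first row of the enricher output above and recomputes FoldEnrichment and RichFactor; the formulas are an assumption on my part, but they reproduce the reported 16.2 and 0.156.

```r
# Values copied from the first row of the enricher output above
gene_ratio <- 7 / 145     # annotated DEGs in the term / annotated DEGs
bg_ratio   <- 45 / 15072  # annotated background genes in the term / annotated background genes

# Fold enrichment: how much more frequent the term is among the DEGs than in the background
fold_enrichment <- gene_ratio / bg_ratio
round(fold_enrichment, 1)

# Rich factor: share of the term's background genes that are DEGs
rich_factor <- 7 / 45
round(rich_factor, 3)
```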
Save the results to a text file:
file <- "Axolotl_clusterProfiler_GO_ORA_results.tsv"
write.table(cp_go_ora, file, sep = "\t", quote = FALSE, row.names = FALSE)
Let’s visualise with one of my favourite enrichplot
plots, the treeplot! Another advantage of this plot is that it can be
used for both ORA and GSEA results, so we can compare more easily. We
will add a custom subtitle that informs the number of DEGs that were
actually annotated and included in the FEA, so anyone reviewing the plot
will understand that caution must be exercised when interpreting the
results.
# calculate pairwise similarities between the enriched terms
cp_go_ora <- enrichplot::pairwise_termsim(cp_go_ora)
p<- enrichplot::treeplot(cp_go_ora,
showCategory = 15,
color = "p.adjust",
cluster.params = list(label_words_n = 5)
)
# Add annotations (number of input genes and number of input genes with GO terms)
num_genes <- length(degs)
genes_with_GO_terms <- sum(degs %in% go_term2gene$gene)
# Print the plot with custom sub-title
p <- p + ggtitle("clusterProfiler ORA of GO terms") + labs(subtitle = paste("Input genes:", num_genes, "| Input genes with GO terms:", genes_with_GO_terms))
print(p)
There’s a lot of skin and muscle stuff, which we expect to be expressed in the blastema. As for why they are dysregulated? This is a dummy experiment from public RNAseq, with poor replication, and may not even be the right experiment type for this question, so let’s not hope for too many clear answers :-)
The GSEA function is the ‘universal’ GSEA option that
accepts the TERM2GENE and TERM2NAME files we
have just created.
Let’s review the help page:
?clusterProfiler::GSEA
Recall from our clusterProfiler session with human data
that we needed to add nPermSimple = 10000 to avoid a
warning about “unbalanced (positive and negative) gene-level statistic
value” and reduce eps to zero to avoid a warning about
obtaining better P value estimates. Let’s do this from the start.
cp_kegg_gsea <- GSEA(
geneList = ranked_vector,
exponent = 1,
minGSSize = 10,
maxGSSize = 500,
eps = 0,
pvalueCutoff = 0.05,
pAdjustMethod = "BH",
TERM2GENE = kegg_term2gene,
TERM2NAME = kegg_term2name,
seed = 123,
by = "fgsea",
nPermSimple = 10000
)
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.16% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## leading edge analysis...
## done...
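The warning about ties refers to genes that share exactly the same ranking value. A rough way to find them is sketched below on a toy ranked vector with hypothetical values; fgsea may count ties differently, so the percentage here need not match the one in the warning.

```r
# Toy ranked vector (hypothetical log2 fold changes) containing tied values
ranked_toy <- c(g1 = 2.5, g2 = 1.2, g3 = 1.2, g4 = 0, g5 = -0.7, g6 = -0.7, g7 = -3.1)

# Flag every gene whose value is shared with at least one other gene
tied <- duplicated(ranked_toy) | duplicated(ranked_toy, fromLast = TRUE)

# Which genes are tied, and what share of the list they make up
tied_genes <- names(ranked_toy)[tied]
tied_genes
round(100 * mean(tied), 1)
```

The same two lines could be applied to ranked_vector to see which axolotl genes are affected.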
cp_kegg_gsea
## #
## # Gene Set Enrichment Analysis
## #
## #...@organism UNKNOWN
## #...@setType UNKNOWN
## #...@geneList Named num [1:24419] 10.6 10.58 10.28 10.07 9.79 ...
## - attr(*, "names")= chr [1:24419] "AMEX60DD020778" "AMEX60DD020772" "AMEX60DD020773" "AMEX60DD005693" ...
## #...nPerm
## #...pvalues adjusted by 'BH' with cutoff <0.05
## #...29 enriched terms found
## 'data.frame': 29 obs. of 11 variables:
## $ ID : chr "map04260" "map04530" "map00601" "map05414" ...
## $ Description : chr "Cardiac muscle contraction" "Tight junction" "Glycosphingolipid biosynthesis - lacto and neolacto series" "Dilated cardiomyopathy" ...
## $ setSize : int 94 262 52 123 37 32 91 180 163 112 ...
## $ enrichmentScore: num 0.752 0.639 0.807 0.685 0.826 ...
## $ NES : num 1.82 1.6 1.86 1.68 1.84 ...
## $ pvalue : num 1.50e-08 1.27e-08 2.26e-07 1.62e-06 4.94e-06 ...
## $ p.adjust : num 2.60e-06 2.60e-06 2.60e-05 1.40e-04 3.41e-04 ...
## $ qvalue : num 2.08e-06 2.08e-06 2.09e-05 1.12e-04 2.74e-04 ...
## $ rank : num 2313 3341 2502 3172 727 ...
## $ leading_edge : chr "tags=29%, list=9%, signal=26%" "tags=23%, list=14%, signal=20%" "tags=44%, list=10%, signal=40%" "tags=35%, list=13%, signal=31%" ...
## $ core_enrichment: chr "AMEX60DD021112/AMEX60DD012164/AMEX60DD008613/AMEX60DD021111/AMEX60DD025304/AMEX60DD054502/AMEX60DD004521/AMEX60"| __truncated__ "AMEX60DD021112/AMEX60DD055382/AMEX60DD012164/AMEX60DD021111/AMEX60DD025304/AMEX60DD048585/AMEX60DD054502/AMEX60"| __truncated__ "AMEX60DD004640/AMEX60DD013837/AMEX60DD004647/AMEX60DD004644/AMEX60DD041117/AMEX60DD004651/AMEX60DD012715/AMEX60"| __truncated__ "AMEX60DD012164/AMEX60DD009362/AMEX60DD008613/AMEX60DD009360/AMEX60DD025304/AMEX60DD009431/AMEX60DD054502/AMEX60"| __truncated__ ...
## #...Citation
## S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang, W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize multiomics data. Nature Protocols. 2024, doi:10.1038/s41596-024-01020-z
29 enriched terms.
Let’s treeplot!
# calculate pairwise similarities between the enriched terms
cp_kegg_gsea <- enrichplot::pairwise_termsim(cp_kegg_gsea)
p<- enrichplot::treeplot(cp_kegg_gsea,
showCategory = 15,
color = "p.adjust",
cluster.params = list(label_words_n = 5)
)
# Add annotations (number of input genes and number of input genes with KEGG pathway terms)
# Use background since all genes for ranked are in background
num_genes <- length(background)
genes_with_kegg_terms <- sum(background %in% kegg_term2gene$gene)
# Print the plot with custom sub-title
p <- p + ggtitle("clusterProfiler GSEA of KEGG Pathways") + labs(subtitle = paste("Input genes:", num_genes, "| Input genes with KEGG pathway terms:", genes_with_kegg_terms))
print(p)
Some muscle stuff, some cell junction stuff, and some infection-related terms. This is common in FEA: many genes involved in infection responses are also part of broader stress response pathways. These genes may be activated under different conditions, such as environmental stress, tissue injury, or other disruptions to homeostasis, which are common in various types of experiments. Pathways related to immune responses can also be interconnected with pathways controlling inflammation, wound healing, and metabolic processes. As a result, infection-related pathways can appear in enrichment analysis even when the experimental conditions don’t directly involve infection. This does not mean the result is spurious - it just requires that you exercise pragmatism, employ a basic understanding of the statistical approach, and commit to interpreting the results in the context of your experiment. Remember that FEA results are meant to bring a large list of genes down to a high level overview that helps guide further investigation, rather than to give a clear answer to your experiment.
I favour a volcano plot for GSEA, so we can see positive vs negative
NES. This is part of ggplot, not enrichplot,
where the volplot is only for ORA.
p<- ggplot(cp_kegg_gsea@result, aes(x = NES, y = -log10(p.adjust), color = p.adjust)) + # plot NES so the axis matches its label
geom_point(alpha = 0.7, size = 2) + # Adjust point size
scale_color_gradient(low = "blue", high = "red") + # Color by p.adjust values
theme_minimal() +
labs(title = "clusterProfiler GSEA of KEGG Pathways",
x = "Normalised Enrichment Score (NES)",
y = "-log10(Adjusted P-value)",
color = "Adjusted P-value") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) + # Rotate x-axis labels for readability
geom_vline(xintercept = 0, linetype = "dashed", color = "black") + # Add vertical line at x=0
geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "black") + # Add horizontal line at p=0.05 cutoff
geom_text(aes(label = Description),
hjust = 0.5,
vjust = -0.5, # Move labels higher off the points
size = 3,
check_overlap = TRUE,
alpha = 0.7) # Add labels for each pathway term
print(p)
Interesting that all terms except one have leading edge genes that are upregulated in distal compared to proximal (the reference level)!
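A quick tally of terms by the sign of their NES is one way to check an observation like this. The sketch uses a toy data frame with hypothetical IDs and values; on the real data the same steps would be applied to cp_kegg_gsea@result.

```r
# Toy GSEA result rows (hypothetical values), shaped like a GSEA result table
toy_gsea <- data.frame(
  ID  = c("term_1", "term_2", "term_3", "term_4"),
  NES = c(1.82, 1.60, -1.75, 1.84)
)

# Label each term by the sign of its NES
toy_gsea$direction <- ifelse(toy_gsea$NES > 0, "positive NES", "negative NES")

# Count terms in each direction
direction_counts <- table(toy_gsea$direction)
direction_counts
```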
WebGestaltR GO and KEGG analysis
GMT files must have .gmt suffix and description files
must have .des suffix.
The GMT files need links for all of the terms, so that we can have that handy link-out to enriched terms from the HTML report we experienced in the last activity. This is actually pretty simple to do thanks to consistent URLs.
For GO, we just need to paste the term ID to the end of this link https://www.ebi.ac.uk/QuickGO/term/
And for KEGG, we need to paste the map ID to the end of this link: https://www.genome.jp/dbget-bin/www_bget?
Note that in the code below, the first command is identical to the one
that created the go_term2gene object earlier in the
notebook. We could just use the go_term2gene object and
skip step 1, using go_term2gene as input
for step 2 rather than go_data. This code duplication is
intentional, so that this code chunk is standalone for re-use and
re-purpose.
# Step 1: Extract relevant columns (GO terms and gene IDs) from eggnog_anno
go_data <- eggnog_anno %>% # use the emapper annotations for axolotl
dplyr::select(GOs, `#query`) %>% # Select the GO terms and the gene IDs
dplyr::filter(GOs != "-") %>% # Filter out rows where the GO ID is missing ("-")
separate_rows(GOs, sep = ",") %>% # Split comma-delimited list of GO terms into separate rows
dplyr::select(GOs, `#query`) %>% # Keep GO terms and gene IDs columns
distinct() %>% # Remove duplicates
drop_na() # Drop any rows with missing values
# Rename columns to match the format (term, gene)
colnames(go_data) <- c("term", "gene")
# Step 2: Create external links for each GO term (link to QuickGO)
go_data <- go_data %>%
dplyr::mutate(external_link = paste0("https://www.ebi.ac.uk/QuickGO/term/", term))
# Step 3: Group genes by GO term and concatenate gene list by tab so all genes per term are on the same row
go_term_grouped <- go_data %>%
dplyr::group_by(term) %>%
dplyr::summarize(genes = paste(gene, collapse = "\t"), .groups = "drop")
# Step 4: Add the external link for each GO term
go_term_grouped <- go_term_grouped %>%
dplyr::left_join(go_data %>% dplyr::select(term, external_link) %>% distinct(), by = "term")
# Step 5: Create the final GMT format entry (term ID, external link, and gene list)
go_gmt <- go_term_grouped %>%
dplyr::mutate(gmt_entry = paste(term, external_link, genes, sep = "\t")) %>%
dplyr::select(gmt_entry)
# Save to file
write.table(go_gmt, file = "Axolotl_GO.gmt", sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
# check only the first line (lines can be long, as all genes for a term are on one line!)
cat(go_gmt$gmt_entry[1:1], sep = "\n")
## GO:0000001 https://www.ebi.ac.uk/QuickGO/term/GO:0000001 AMEX60DD030064 AMEX60DD044815 AMEX60DD044816 AMEX60DD047254 AMEX60DDU001033143
As above, for clarity we have avoided using the
TERM2GENE object to ensure this code chunk can be
standalone.
# Step 1: Extract relevant columns (KEGG Pathway and gene IDs) from eggnog_anno
kegg_data <- eggnog_anno %>%
dplyr::select(KEGG_Pathway, `#query`) %>% # Select the KEGG Pathway and gene ID columns
dplyr::filter(grepl("map", KEGG_Pathway)) %>% # Keep only rows where KEGG_Pathway contains 'map'
separate_rows(KEGG_Pathway, sep = ",") %>% # Split multiple pathways into separate rows
dplyr::mutate(term = gsub("map:", "", KEGG_Pathway)) %>% # Remove the "map:" prefix
dplyr::filter(grepl("^map", term)) %>% # Filter again to keep only 'map' pathways (after removing "map:")
dplyr::select(term, `#query`) %>% # Select the KEGG Pathway and gene ID columns
distinct() %>% # Remove duplicate rows
drop_na() # Remove rows with missing values
# Ensure the column is properly named
colnames(kegg_data)[colnames(kegg_data) == "#query"] <- "gene"
# Step 2: Create external links for each KEGG pathway
kegg_data <- kegg_data %>%
dplyr::mutate(external_link = paste0("https://www.genome.jp/dbget-bin/www_bget?", term))
# Step 3: Group by KEGG pathway term and concatenate the gene list
kegg_term_grouped <- kegg_data %>%
dplyr::group_by(term) %>%
dplyr::summarize(genes = paste(gene, collapse = "\t"), .groups = "drop")
# Step 4: Add the external link for each KEGG pathway
kegg_term_grouped <- kegg_term_grouped %>%
dplyr::left_join(kegg_data %>% dplyr::select(term, external_link) %>% distinct(), by = "term")
# Step 5: Create the final GMT format entry (Pathway, External Link, Genes)
kegg_gmt <- kegg_term_grouped %>%
dplyr::mutate(gmt_entry = paste(term, external_link, genes, sep = "\t")) %>%
dplyr::select(gmt_entry)
# Save to file
write.table(kegg_gmt, file = "Axolotl_KEGG-pathways.gmt", sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
# check only the first line (lines can be long, as all genes for a term are on one line!)
cat(kegg_gmt$gmt_entry[1:1], sep = "\n")
## map00010 https://www.genome.jp/dbget-bin/www_bget?map00010 AMEX60DD012631 AMEX60DD014039 AMEX60DD014838 AMEX60DD016386 AMEX60DD016751 AMEX60DD017614 AMEX60DD018164 AMEX60DD019376 AMEX60DD019952 AMEX60DD021378 AMEX60DD021709 AMEX60DD023861 AMEX60DD025157 AMEX60DD026361 AMEX60DD026817 AMEX60DD026823 AMEX60DD026859 AMEX60DD027319 AMEX60DD027320 AMEX60DD027323 AMEX60DD027845 AMEX60DD027849 AMEX60DD028369 AMEX60DD028426 AMEX60DD029681 AMEX60DD029699 AMEX60DD030349 AMEX60DD034077 AMEX60DD034482 AMEX60DD034483 AMEX60DD034486 AMEX60DD034489 AMEX60DD034490 AMEX60DD035055 AMEX60DD035238 AMEX60DD037093 AMEX60DD040517 AMEX60DD041522 AMEX60DD042903 AMEX60DD043063 AMEX60DD043064 AMEX60DD043133 AMEX60DD043180 AMEX60DD043840 AMEX60DD043995 AMEX60DD043997 AMEX60DD043999 AMEX60DD044033 AMEX60DD044195 AMEX60DD044196 AMEX60DD044197 AMEX60DD044200 AMEX60DD044203 AMEX60DD044204 AMEX60DD044205 AMEX60DD044207 AMEX60DD044208 AMEX60DD044211 AMEX60DD044356 AMEX60DD044531 AMEX60DD045685 AMEX60DD047033 AMEX60DD048142 AMEX60DD048145 AMEX60DD049834 AMEX60DD051188 AMEX60DD051785 AMEX60DD051787 AMEX60DD052311 AMEX60DD052532 AMEX60DD052851 AMEX60DD052935 AMEX60DD052945 AMEX60DD053248 AMEX60DD053741 AMEX60DD054634 AMEX60DD054731 AMEX60DD054868 AMEX60DD055296 AMEX60DD000689 AMEX60DD000771 AMEX60DD000779 AMEX60DD001324 AMEX60DD001326 AMEX60DD001357 AMEX60DD001360 AMEX60DD002915 AMEX60DD003526 AMEX60DD003880 AMEX60DD004101 AMEX60DD004681 AMEX60DD005105 AMEX60DD006557 AMEX60DD006636 AMEX60DD007076 AMEX60DD007285 AMEX60DD007334 AMEX60DD007477 AMEX60DD008417 AMEX60DD009295 AMEX60DD009574 AMEX60DD009798 AMEX60DD009801 AMEX60DD009804 AMEX60DD009919 AMEX60DD011472 AMEX60DDU001000655 AMEX60DDU001003245 AMEX60DDU001003877 AMEX60DDU001005940 AMEX60DDU001006311 AMEX60DDU001008806 AMEX60DDU001009862 AMEX60DDU001010046 AMEX60DDU001010870 AMEX60DDU001012183 AMEX60DDU001014216 AMEX60DDU001014514 AMEX60DDU001017139 AMEX60DDU001018885 AMEX60DDU001019261 AMEX60DDU001019409 AMEX60DDU001020872 AMEX60DDU001021039 
AMEX60DDU001021439 AMEX60DDU001024501 AMEX60DDU001024997 AMEX60DDU001025643 AMEX60DDU001026913 AMEX60DDU001027220 AMEX60DDU001028117 AMEX60DDU001031417 AMEX60DDU001032002 AMEX60DDU001034358 AMEX60DDU001035591 AMEX60DDU001036459 AMEX60DDU001037000 AMEX60DDU001039985 AMEX60DDU001040851 AMEX60DDU001041028
The description file has the same content as the clusterProfiler
TERM2NAME table. Again, we avoid reusing any code from the
creation of the clusterProfiler objects so that this part can
be run on its own.
# Step 1: Get terms from ontology object created from go.obo with ontologyIndex function:
ontology_term_names <- ontology$name
# Step 2: Filter and separate GO terms from the annotation file
# We filter out rows where no GO terms are assigned and separate comma-delimited GO terms
go_terms <- eggnog_anno %>%
dplyr::select(GOs, `#query`) %>%
dplyr::filter(GOs != "-") %>% # Keep only rows with GO terms
separate_rows(GOs, sep = ",") %>%
dplyr::mutate(term = GOs) %>%
dplyr::select(term, `#query`) %>%
distinct() %>%
drop_na() # Drop rows with missing values
# Step 3: Create the description by matching GO terms to their names in the ontology
go_des <- go_terms %>%
dplyr::mutate(name = ontology_term_names[term]) %>% # Map term to its name from the ontology
dplyr::select(term, name) %>% # Keep only the term and name
distinct() %>% # Remove duplicates
drop_na() %>% # Remove rows with missing values
dplyr::filter(!grepl("obsolete", name)) # Remove obsolete terms if present
# Save to file
write.table(go_des, file = "Axolotl_GO.des", sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
# Check
head(go_des)
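The name lookup in Step 3 relies on indexing a named character vector by term ID. A minimal sketch of that behaviour in base R, using a toy stand-in for ontology$name (the objects toy_term_names, toy_terms and toy_des are hypothetical, and GO:9999999 is a made-up ID used only to show what happens when a term is missing from the ontology):

```r
# Toy stand-in for ontology$name: a character vector named by term ID
toy_term_names <- c(
  "GO:0008150" = "biological_process",
  "GO:0003674" = "molecular_function"
)

# Term IDs as they might appear in an annotation file.
# The last ID is made up and is deliberately absent from the lookup.
toy_terms <- c("GO:0008150", "GO:0003674", "GO:9999999")

# Indexing by name returns NA for any ID that is not in the lookup
toy_des <- data.frame(
  term = toy_terms,
  name = unname(toy_term_names[toy_terms]),
  stringsAsFactors = FALSE
)

# Base R equivalent of drop_na(): keep only terms that have a description
toy_des <- toy_des[!is.na(toy_des$name), ]
print(toy_des)
```

This is why drop_na() follows the lookup in the real code: any annotated term that is not present in the loaded go.obo would otherwise end up in the description file without a name.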
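The KEGG description file needs the same extra work for the sake of portability :-)
# Get columns from kegg gmt object. Each entry has one tab-separated field per gene, so
# separate() warns that extra pieces are discarded; only the term column is needed here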
kegg_gmt_columns <- kegg_gmt %>%
separate(gmt_entry, into = c("term", "external_link", "genes"), sep = "\t")
## Warning: Expected 3 pieces. Additional pieces discarded in 381 rows [1, 2, 3, 4, 5, 6,
## 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
# Create the kegg_des table by joining the pathways file with the species-specific terms from kegg_gmt
kegg_des <- kegg_pathways %>%
dplyr::filter(term %in% kegg_gmt_columns$term) %>%
dplyr::select(term, name) %>%
distinct() %>%
drop_na()
# Save to file
write.table(kegg_des, file = "Axolotl_KEGG-pathways.des", sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
# Check the first few rows
head(kegg_des)
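Before handing the two files to WebGestaltR, it can help to confirm that the GMT and description tables agree. The sketch below is one possible check in base R, run on toy data rather than the workshop objects: toy_gmt and toy_des are hypothetical, and the gene IDs are made up. It reads only the first tab-separated field of each GMT entry, then counts genes per term and keeps only descriptions that have a matching gene set.

```r
# Toy GMT entries: term, link, then one tab-separated field per gene (gene IDs are made up)
toy_gmt <- c(
  "map00010\thttps://www.genome.jp/dbget-bin/www_bget?map00010\tgeneA\tgeneB\tgeneC",
  "map00020\thttps://www.genome.jp/dbget-bin/www_bget?map00020\tgeneB\tgeneD"
)

# Toy description table; map00030 has no gene set in toy_gmt
toy_des <- data.frame(
  term = c("map00010", "map00020", "map00030"),
  name = c("Glycolysis / Gluconeogenesis",
           "Citrate cycle (TCA cycle)",
           "Pentose phosphate pathway"),
  stringsAsFactors = FALSE
)

# Split each entry on tabs and take the first field as the term ID
fields <- strsplit(toy_gmt, "\t", fixed = TRUE)
gmt_terms <- vapply(fields, function(x) x[1], character(1))

# Every field after the term and the link is a gene
genes_per_term <- setNames(lengths(fields) - 2L, gmt_terms)
print(genes_per_term)

# Keep only descriptions for terms that have a gene set
toy_des_filtered <- toy_des[toy_des$term %in% gmt_terms, ]
print(toy_des_filtered)
```

The same idea applied to the real files would flag terms that are described but have no genes, or the reverse, before an enrichment run.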
WebGestaltR ORA and GSEA with custom database files
We will use the same approach as with clusterProfiler
and run ORA with GO and GSEA with KEGG for both tools.
Parameters to note for novel species:
organism = "others"
enrichDatabaseFile = "Axolotl_GO.gmt"
enrichDatabaseDescriptionFile = "Axolotl_GO.des"
outputDirectory <- "WebGestaltR_results"
project <- "Axolotl_ORA_GO"
# WebGestaltR stops with an error if the output directory does not exist, so create it first
dir.create(outputDirectory, showWarnings = FALSE)
WebGestaltR(
organism = "others", # must specify 'others' when using custom db files
enrichMethod = "ORA", # Perform ORA, GSEA or NTA
interestGene = degs, # gene list of interest
referenceGene = background, # background genes
enrichDatabaseFile = "Axolotl_GO.gmt", # the custom gmt file
enrichDatabaseDescriptionFile = "Axolotl_GO.des", # the custom description file
isOutput = TRUE, # Set to FALSE if you don't want files saved to disk
fdrMethod = "BH", # Benjamini-Hochberg multiple testing correction
sigMethod = "fdr", # Significance method ('fdr' or 'top')
fdrThr = 0.05, # FDR significance threshold
minNum = 10, # Minimum number of genes per category
maxNum = 500, # Maximum number of genes per category
outputDirectory = outputDirectory,
projectName = project
)
Open the HTML report
WebGestaltR_results/Project_Axolotl_ORA_GO/Report_Axolotl_ORA_GO.html
from the files pane in a browser.
Note some term similarity to what we have seen with the past 2 analyses (that’s reassuring!)
We no longer have GO Slim, as this needs to call the actual GO database, which we haven’t used.
Change the ‘Enrichment Results’ view from table to ‘Bar chart’, then try the ‘Affinity propagation’ and ‘Weighted set cover’ term clustering algorithms. ‘All’ shows more terms with higher specificity, while the redundancy-reduction algorithms cluster similar terms to give fewer terms and a more concise overview. It’s up to you as the researcher to decide which approach is best suited to your dataset!
Confirm that our GMT file correctly included the term link by
selecting a term and clicking the hyperlink at Analyte set.
Pretty neat huh :-)
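As an aside on what the ORA run reports for each gene set: over-representation analysis is commonly framed as a hypergeometric test, followed by multiple testing correction. The sketch below works through one gene set in base R. All counts and the extra p-values are made up for illustration, and this is a simplified picture rather than a reproduction of WebGestaltR's internals.

```r
# Made-up counts for a single gene set
N <- 10000  # background genes
K <- 200    # background genes annotated to the term
n <- 500    # genes of interest (for example, DEGs)
k <- 25     # genes of interest annotated to the term

# Number of annotated genes expected in the list of interest by chance
expected <- n * K / N

# P(X >= k) under the hypergeometric distribution
p_value <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# Benjamini-Hochberg adjustment across several (made-up) gene set p-values,
# mirroring fdrMethod = "BH"
p_values <- c(p_value, 0.01, 0.2, 0.8)
fdr <- p.adjust(p_values, method = "BH")

print(expected)
print(p_value)
print(fdr)
```

Terms whose adjusted value falls below fdrThr are the ones that would be reported as significant under sigMethod = "fdr".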
GSEA will take slightly longer than ORA. We will set nThreads to 7 to speed it up as much as we can.
There is no seed parameter for WebGestaltR
GSEA as there is for clusterProfiler. We can set it in R
instead with set.seed().
set.seed(123)
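A minimal sketch of what set.seed() does, using base R sampling rather than GSEA itself (first_draw and second_draw are hypothetical names). Resetting the seed before each random step makes that step repeat exactly. This sketch does not check whether a multi-threaded run is reproducible in every detail.

```r
# Same seed, same random draw
set.seed(123)
first_draw <- sample(1:100, 5)

set.seed(123)
second_draw <- sample(1:100, 5)

# TRUE when both draws are the same values in the same order
print(identical(first_draw, second_draw))
```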
Again we are specifying organism = "others" and
providing our GMT and description file:
outputDirectory <- "WebGestaltR_results"
project <- "Axolotl_GSEA_KEGG"
# WebGestaltR stops with an error if the output directory does not exist, so create it first
dir.create(outputDirectory, showWarnings = FALSE)
suppressWarnings({ WebGestaltR(
organism = "others", # must specify 'others' when using custom db files
enrichMethod = "GSEA", # Perform ORA, GSEA or NTA
interestGene = ranked_df, # ranked dataframe
enrichDatabaseFile = "Axolotl_KEGG-pathways.gmt", # the custom gmt file
enrichDatabaseDescriptionFile = "Axolotl_KEGG-pathways.des", # the custom description file
isOutput = TRUE, # Set to FALSE if you don't want files saved to disk
fdrMethod = "BH", # Benjamini-Hochberg multiple testing correction
sigMethod = "fdr", # Significance method ('fdr' or 'top')
fdrThr = 0.05, # FDR significance threshold
minNum = 10, # Minimum number of genes per category
maxNum = 500, # Maximum number of genes per category
outputDirectory = outputDirectory,
projectName = project,
nThreads = 7
) })
Open the HTML report
WebGestaltR_results/Project_Axolotl_GSEA_KEGG/Report_Axolotl_GSEA_KEGG.html
from the files pane in a browser.
Expand ‘Job summary’ to read that “22 positive related categories and
no negative related categories” are significant in this analysis. This
is in contrast to the one negative category we observed when running
KEGG GSEA with clusterProfiler. We expect some differences
between these tools.
Compare the tabular results in this report to the treeplot we
produced under code chunk treeplot CP KEGG GSEA. There are
a lot of shared terms, and this is reassuring.
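To put a number on "a lot of shared terms", the significant term IDs from the two tools can be compared with set operations. A minimal sketch in base R follows. The vectors cp_terms and wg_terms are hypothetical and filled with example pathway IDs, not the workshop results; in practice they would be pulled from the result tables of each tool.

```r
# Hypothetical significant KEGG terms from each tool (example IDs only)
cp_terms <- c("map00010", "map00020", "map04110")
wg_terms <- c("map00010", "map04110", "map03030")

shared  <- intersect(cp_terms, wg_terms)  # significant in both tools
only_cp <- setdiff(cp_terms, wg_terms)    # significant in clusterProfiler only
only_wg <- setdiff(wg_terms, cp_terms)    # significant in WebGestaltR only

# Jaccard index: shared terms as a fraction of all significant terms
jaccard <- length(shared) / length(union(cp_terms, wg_terms))

print(shared)
print(only_cp)
print(only_wg)
print(jaccard)
```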
Print the database version of GO Core Ontology used:
# Read go.obo lines
lines <- readLines("go.obo")
# Use grep to pull "data-version"
version <- grep("data-version", lines, value = TRUE)
# Print version
cat("GO version from go.obo file:", version, "\n")
## GO version from go.obo file: data-version: releases/2024-06-17
The KEGG pathways file does not contain any version details within its contents, but the download date is recorded in the name of the file that was imported into this notebook. The date was added manually at download time, which is always recommended practice for files and databases that carry no date or version details.
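One way to make date-stamping routine is to build the file name in R at download time. A minimal sketch, where the file name stem kegg_pathways_ is hypothetical and is not the name used in this workshop:

```r
# Stamp today's date into the file name at download time
download_date <- format(Sys.Date(), "%Y-%m-%d")

# Hypothetical file name stem; substitute the name of the file being saved
dated_name <- paste0("kegg_pathways_", download_date, ".txt")
print(dated_name)
```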
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
## [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
## [7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] enrichplot_1.26.2 WebGestaltR_0.4.6 clusterProfiler_4.14.1
## [4] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
## [7] purrr_1.0.2 tidyr_1.3.1 tibble_3.2.1
## [10] ggplot2_3.5.1 tidyverse_2.0.0 ontologyIndex_2.12
## [13] dplyr_1.1.4 readr_2.1.5
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.2 gson_0.1.0 rlang_1.1.3
## [4] magrittr_2.0.3 DOSE_4.0.0 compiler_4.4.2
## [7] RSQLite_2.3.7 systemfonts_1.0.6 png_0.1-8
## [10] vctrs_0.6.5 reshape2_1.4.4 pkgconfig_2.0.3
## [13] crayon_1.5.2 fastmap_1.1.1 XVector_0.46.0
## [16] labeling_0.4.3 utf8_1.2.4 rmarkdown_2.26
## [19] tzdb_0.4.0 UCSC.utils_1.2.0 bit_4.0.5
## [22] xfun_0.43 zlibbioc_1.52.0 cachem_1.0.8
## [25] aplot_0.2.3 GenomeInfoDb_1.42.0 jsonlite_1.8.8
## [28] blob_1.2.4 highr_0.10 BiocParallel_1.40.0
## [31] parallel_4.4.2 R6_2.5.1 bslib_0.7.0
## [34] stringi_1.8.4 RColorBrewer_1.1-3 jquerylib_0.1.4
## [37] GOSemSim_2.32.0 iterators_1.0.14 Rcpp_1.0.12
## [40] knitr_1.46 ggtangle_0.0.4 R.utils_2.12.3
## [43] IRanges_2.40.0 Matrix_1.7-1 splines_4.4.2
## [46] igraph_2.0.3 timechange_0.3.0 tidyselect_1.2.1
## [49] qvalue_2.38.0 rstudioapi_0.16.0 yaml_2.3.8
## [52] doParallel_1.0.17 codetools_0.2-19 curl_5.2.1
## [55] doRNG_1.8.6 lattice_0.22-5 plyr_1.8.9
## [58] treeio_1.30.0 Biobase_2.66.0 withr_3.0.0
## [61] KEGGREST_1.46.0 evaluate_0.23 gridGraphics_0.5-1
## [64] Biostrings_2.74.0 ggtree_3.14.0 pillar_1.9.0
## [67] rngtools_1.5.2 whisker_0.4.1 foreach_1.5.2
## [70] stats4_4.4.2 ggfun_0.1.7 generics_0.1.3
## [73] vroom_1.6.5 S4Vectors_0.44.0 hms_1.1.3
## [76] tidytree_0.4.6 munsell_0.5.1 scales_1.3.0
## [79] apcluster_1.4.13 glue_1.7.0 lazyeval_0.2.2
## [82] tools_4.4.2 ggnewscale_0.5.0 data.table_1.15.4
## [85] fgsea_1.32.0 fs_1.6.4 fastmatch_1.1-4
## [88] cowplot_1.1.3 grid_4.4.2 ape_5.8
## [91] AnnotationDbi_1.68.0 colorspace_2.1-0 nlme_3.1-165
## [94] GenomeInfoDbData_1.2.13 patchwork_1.3.0 cli_3.6.2
## [97] fansi_1.0.6 svglite_2.1.3 gtable_0.3.5
## [100] R.methodsS3_1.8.2 yulab.utils_0.1.8 sass_0.4.9
## [103] digest_0.6.35 BiocGenerics_0.52.0 ggrepel_0.9.6
## [106] ggplotify_0.1.2 farver_2.1.2 memoise_2.0.1
## [109] htmltools_0.5.8.1 R.oo_1.26.0 lifecycle_1.0.4
## [112] httr_1.4.7 GO.db_3.20.0 bit64_4.0.5
And RStudio version. Typically, we would simply run
RStudio.Version() to print the version details. However,
when we knit this document to HTML, the RStudio.Version()
function is not available and will cause an error. So to make sure our
version details are saved to our static record of the work, we will save
to a file, then print the file contents back into the notebook.
# Get RStudio version information
rstudio_info <- RStudio.Version()
# Convert the version information to a string
rstudio_version_str <- paste(
"RStudio Version Information:\n",
"Version: ", rstudio_info$version, "\n",
"Release Name: ", rstudio_info$release_name, "\n",
"Long Version: ", rstudio_info$long_version, "\n",
"Mode: ", rstudio_info$mode, "\n",
"Citation: ", rstudio_info$citation,
sep = ""
)
# Write the output to a text file
writeLines(rstudio_version_str, "rstudio_version.txt")
# Read the saved version information from the file
rstudio_version_text <- readLines("rstudio_version.txt")
# Print the version information to the document
rstudio_version_text
## [1] "RStudio Version Information:"
## [2] "Version: 2023.6.1.524"
## [3] "Release Name: Mountain Hydrangea"
## [4] "Long Version: 2023.06.1+524"
## [5] "Mode: server"
## [6] "Citation: list(title = \"RStudio: Integrated Development Environment for R\", author = list(list(given = \"Posit team\", family = NULL, role = NULL, email = NULL, comment = NULL)), organization = \"Posit Software, PBC\", address = \"Boston, MA\", year = \"2023\", url = \"http://www.posit.co/\")"
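An alternative sketch, assuming the rstudioapi package may or may not be installed: guard the version lookup so the same chunk runs both interactively and when knitting. It writes a shorter record than the code above, and it reuses the rstudio_version.txt file name, so it would overwrite that file when run inside RStudio. The in_rstudio flag and the one-line record format are choices made for this sketch.

```r
version_file <- "rstudio_version.txt"

# TRUE only when rstudioapi is installed and the code is running inside RStudio
in_rstudio <- requireNamespace("rstudioapi", quietly = TRUE) &&
  rstudioapi::isAvailable()

if (in_rstudio) {
  # Refresh the saved record while the RStudio API is reachable
  info <- rstudioapi::versionInfo()
  writeLines(paste("RStudio version:", as.character(info$version)), version_file)
}

# When knitting, fall back to whatever was saved in an earlier interactive run
if (file.exists(version_file)) {
  print(readLines(version_file))
} else {
  message("No saved RStudio version file found")
}
```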
The last task is to knit the notebook. Our notebook is editable, and can be changed. Deleting code deletes the output, so we could lose valuable details. If we knit the notebook to HTML, we have a permanent static copy of the work.
On the editor pane toolbar, under Preview, select Knit to HTML.
If you have already run Preview, you will see Knit instead of Preview.
The HTML file will be saved in the same directory as the notebook, and with the same filename, but the .Rmd extension will be replaced by .html. The knit HTML will typically open automatically once complete. If you receive a popup blocker error, click cancel, and in the Files pane of RStudio, single click the knitted .html file and select View in Web Browser.
Note that the notebook will only successfully knit if there are no errors in the code. You can ‘preview’ HTML with code errors.