library(gnomeR)
library(dplyr)

Setting up the API

In order to download the data from CbioPortal, one must first require a token from the website CbioPortal wich will prompt a login page with your MSKCC credentials. Then navigate to “Web API” in the top bar menu, following this simply download a token and copy it after running the following command in R:

usethis::edit_r_environ()

And pasting the token you were given in the .Renviron file that was created and saving after pasting your token.

CBIOPORTAL_TOKEN = 'YOUR_TOKEN'

You can test your connection using:

get_cbioportal_token()

Datasets

There exist multiple datasets available in cbioPortal, and are available through our API:

  • IMPACT (integrated mutation profiling of actionable cancer targets) dataset, which consist of multiple targeted panels on which all new MSKCC patients are sequencing.
  • TCGA (the cancer genomic atlas) dataset, which consist of a completed pan-cancer study of full exon sequencing.

The get_genetics() function let’s the user download the mutation, fusion and copy-number alterations data in a Mutation Annotation Format (MAF) file for either the sample DMPID provided or a specific study. It takes the following arguments:

  • sample_ids: A character vector of sample ids from either IMPACT or TCGA
  • sample_list_id: A character vector naming a pre-specified list of samples (e.g. "mskimpact_Colorectal_Cancer")
  • genes: A list of genes to query. Default is all impact genes (recommended). A complete list of IDs can be found in the impact_gene_info dataset.
  • database: A character string of the database to be used. Options are “msk_impact” or “tcga”, default is “msk_impact”.
  • mutations: Boolean specifying if mutation data should be fected. Default is TRUE.
  • fusions: Boolean specifying if fusion data should be fected. Default is TRUE.
  • cna: Boolean specifying if cna data should be fected. Default is TRUE.
  • seg: Boolean specifying if the segmentatio data should be fetched. Default is FALSE.

This function returns either a MAF file or a copy-number alterations summary file, or both (mut and/or cna).

TCGA database

Even though cBioPortal mainly focuses on IMPACT data, other genomic studies are also available on it. Particurlarly the The Cancer Genome Atlas (TCGA) database, which whole-exome sequenced large cohorts of 33 different cancer sites. This dataset is a public ressource and does not require a token to be accessed through the gnomeR API, we will thus use it as an example for the functionalities of the API. Additionally to having access to mutational and copy number alteration data, cBioPortal also grants us access to other data types such as RNA-seq or RPPMs. In this section we will show how the user can use the API to retrieve this data. Note that the list of samples, cancer sites and genes available in tcga_samples and tcga_genes respectively.

Mutations

To retrieve the TCGA mutational data the user should set the arguments mutations to TRUE and database to “tcga”:

# df <- get_genetics(sample_ids =  c("TCGA-17-Z023-01","TCGA-02-0003-01","TCGA-02-0055-01"),
#              mutations = TRUE,fusions = FALSE, cna = FALSE,
#              database = "tcga")
# df$mut

Fusions

# df <- get_genetics(sample_ids = as.character(tcga_samples$patient_id[!is.na(tcga_samples$Cancer_Code)][1:100]),
#                    mutations = TRUE,fusions = TRUE, cna = FALSE,
#                    database = "tcga")
# df$mut %>%
#   filter(Variant_Classification == "Fusion")

CNA

# df <- get_genetics(sample_ids =  c("TCGA-17-Z023-01","TCGA-02-0003-01","TCGA-02-0055-01"),
#              mutations = FALSE,fusions = FALSE, cna = TRUE,
#              database = "tcga")
# df$cna

Segmentation file

The copy-number alterations data we have covered up to now is a discrete estimation of the alterations that occured. There however exist more nuanced and accurate data for copy-number alterations observed in a tumor. In gnomeR we include an example of segmentation file and relevant functions from the facets package that provides an allele-specific copy number and clonal heterogeneity analysis tool for high-throughput DNA sequencing. We show below how to download it from the API in gnomeR.

# df <- get_genetics(sample_ids =  c("TCGA-17-Z023-01","TCGA-02-0003-01","TCGA-02-0055-01"),
#              mutations = FALSE,fusions = FALSE, cna = FALSE, seg = TRUE,
#              database = "tcga")
# df$seg

IMPACT database

As mentioned previously IMPACT genomic data is protected and requires a token to be accessed. The ger_genetics() functions in the same way to the examples shown above for TCGA datasets.

Mutations

# df.mut <- get_genetics("P-0000062-T01-IM3",database = "msk_impact",
#                        mutations = TRUE, fusions = FALSE, cna = FALSE)
# df.mut

Adding Fusions

# df.fus <- get_genetics("P-0000062-T01-IM3",database = "msk_impact",
#                        mutations = TRUE, fusions = FALSE, cna = FALSE)
# df.fus

Copy-number alterations

# df.cna <- get_genetics("P-0000062-T01-IM3",database = "msk_impact",
#                        mutations = FALSE, fusions = FALSE, cna = TRUE)
# df.cna

All together

# df.gen <- get_genetics("P-0000062-T01-IM3",database = "msk_impact",
#                        mutations = TRUE, fusions = TRUE, cna = TRUE)
# df.gen

By study

We show here an example to retrieve all the samples in a study. A complete list of these studies can be found on the CbioPortal website. Not working yet.

# df.gen <- get_genetics(sample_list_id = "mskimpact_Colorectal_Cancer",database = "msk_impact",
#                        mutations = TRUE, fusions = TRUE, cna = TRUE)