Introduction

The main purpose of gnomeR is to streamline the processing of genomic files provided by CbioPortal. If you wish to learn how to use the integrated API please read the API-tutorial article first. The core function of the processing of these files is performed by the binmat() function. It takes the following arguments:

  • patients: A character vector that let’s the user specify the patients to be used to create the matrix. Default is NULL, in which case all samples found in the provided genetic files will be used.
  • maf: A MAF file.
  • mut.type: The mutation type to be used. Options are “SOMATIC”, “GERMLINE” or “ALL”. Note “ALL” will keep all mutations regardless of status (not recommended). Default is SOMATIC.
  • SNP.only Boolean to rather the genetics events to be kept only to be SNPs (insertions and deletions will be removed). Default is FALSE.
  • include.silent: Boolean to keep or remove all silent mutations. TRUE keeps, FALSE removes. Default is FALSE.
  • fusion: An optional MAF file for fusions. If inputed the outcome will be added to the matrix with columns ending in “.fus”. Default is NULL. Note if fusions are found in the MAF file in the previous input then these will be added automatically.
  • cna: An optional CNA files. If inputed the outcome will be added to the matrix with columns ending in “.cna”, .del" and “.amp” (depending on subsequent arguments).
  • cna.binary: A boolean argument specifying if the cna events should be enforced as binary. In which case separate columns for amplifications and deletions will be created. Default is FALSE in which case columns ending in “.cna” will be added to the output.
  • cna.relax: If cna.binary is TRUE, cna data only enables to count both gains and shallow deletions as amplifications and deletions respectively.
  • specify.plat: boolean specifying if specific IMPACT platforms should be considered. When TRUE NAs will fill the cells for genes of patients that were not sequenced on that plaform. Default is TRUE.
  • set.plat: character argument specifying which IMPACT platform the data should be reduced to if specify.plat is set to TRUE. Options are “341” and “410”. Default is NULL.
  • rm.empty: boolean specifying if columns with no events founds should be removed. Default is TRUE.
  • pathway: boolean specifying if pathway annotation should be applied. If TRUE, the function will return a supplementary binary dataframe with columns being each pathway and each row being a sample. Default is FALSE.
  • col.names: character vector of the necessary columns to be used. By default: col.names = c(Tumor_Sample_Barcode = NULL, Hugo_Symbol = NULL, Variant_Classification = NULL, Mutation_Status = NULL, Variant_Type = NULL)
  • oncokb boolean specfiying if maf file should be oncokb annotated. Variants found to be ‘Oncogenic’ and ‘Likely Oncogenic’ will be kept. Default is FALSE.
  • oncokb: boolean specfiying if maf file should be oncokb annotated. Default is FALSE.
  • keep_onco A character vector specifying which oncoKB annotated variants to keep. Options are ‘Oncogenic’, ‘Likely Oncogenic’, ‘Predicted Oncogenic’, ‘Likely Neutral’ and ‘Inconclusive’. By default ‘Oncogenic’, ‘Likely Oncogenic’ and ‘Predicted Oncogenic’ variants will be kept (recommended).
  • token: the token affiliated to your oncoKB account.

This function returns a matrix containing all the genetic information with rows as samples and columns as features. A warning will be thrown if some samples were found to have no mutations in the MAF file.

In the follwing sections we will present examples to process each of the datatypes in cbioportal.

Processing genetic data

Mutations

The most commmon type of genetic features used in genomic studies at MSKCC. The IMPACT sequencing panel consist of a curated list of genes that are known to have cancer related properties when altered. You can find a complete list of these genes and which platform they were added on in the impact_genes datafile.

We included in gnomeR an example of raw downloaded MAF file directly from the website in the mut dataset. We show here an example selecting a random subset of 100 samples in the mut dataset:

as_tibble(mut)
#> # A tibble: 3,179 x 45
#>    Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position
#>    <fct>       <lgl>          <lgl>  <fct>      <fct>               <int>
#>  1 TP53        NA             NA     GRCh37     17                7577539
#>  2 EZH2        NA             NA     GRCh37     7               148523605
#>  3 MDM2        NA             NA     GRCh37     12               69222656
#>  4 IGF1R       NA             NA     GRCh37     15               99486153
#>  5 KEAP1       NA             NA     GRCh37     19               10602303
#>  6 KDM5C       NA             NA     GRCh37     X                53223386
#>  7 KRAS        NA             NA     GRCh37     12               25398284
#>  8 TERT        NA             NA     GRCh37     5                 1295228
#>  9 MAP2K1      NA             NA     GRCh37     15               66729153
#> 10 NCOR1       NA             NA     GRCh37     17               16046949
#> # … with 3,169 more rows, and 39 more variables: End_Position <int>,
#> #   Strand <fct>, Consequence <fct>, Variant_Classification <fct>,
#> #   Variant_Type <fct>, Reference_Allele <fct>, Tumor_Seq_Allele1 <fct>,
#> #   Tumor_Seq_Allele2 <fct>, dbSNP_RS <lgl>, dbSNP_Val_Status <lgl>,
#> #   Tumor_Sample_Barcode <fct>, Matched_Norm_Sample_Barcode <lgl>,
#> #   Match_Norm_Seq_Allele1 <lgl>, Match_Norm_Seq_Allele2 <lgl>,
#> #   Tumor_Validation_Allele1 <lgl>, Tumor_Validation_Allele2 <lgl>,
#> #   Match_Norm_Validation_Allele1 <lgl>, Match_Norm_Validation_Allele2 <lgl>,
#> #   Verification_Status <lgl>, Validation_Status <lgl>, Mutation_Status <chr>,
#> #   Sequencing_Phase <lgl>, Sequence_Source <lgl>, Validation_Method <lgl>,
#> #   Score <lgl>, BAM_File <lgl>, Sequencer <lgl>, t_ref_count <int>,
#> #   t_alt_count <int>, n_ref_count <lgl>, n_alt_count <lgl>, HGVSc <fct>,
#> #   HGVSp <fct>, HGVSp_Short <fct>, Transcript_ID <fct>, RefSeq <fct>,
#> #   Protein_position <int>, Codons <fct>, Hotspot <int>
samples <- as.character(unique(mut$Tumor_Sample_Barcode))[sample(1:length(unique(mut$Tumor_Sample_Barcode)), 100, replace=FALSE)]
df <- binmat(patients = samples ,maf = mut)
kable(df[1:10, 1:10])
TP53 IGF1R KEAP1 KDM5C KRAS TERT MAP2K1 NCOR1 DDR2 FIP1L1
P-0010604-T01-IM5 1 0 0 0 0 0 0 0 0 0
P-0002651-T01-IM3 1 0 0 0 0 0 0 0 0 0
P-0000270-T01-IM3 1 0 0 0 0 0 0 0 0 0
P-0002915-T01-IM3 0 0 0 0 0 0 0 0 0 0
P-0011099-T01-IM5 0 0 0 0 0 0 0 0 0 0
P-0000080-T01-IM3 1 0 0 0 0 0 0 0 0 0
P-0001741-T01-IM3 1 0 1 0 1 0 0 0 0 0
P-0003964-T01-IM3 0 0 1 0 1 0 0 0 0 0
P-0003842-T01-IM5 0 0 0 0 0 0 0 0 0 0
P-0002597-T02-IM5 0 0 0 0 0 0 0 0 0 0

Note that by default in the situation above the outputted dataframe is a binary matrix made from all types of mutations and adjusting the features for the platform they were added on. Thus all samples that were sequenced on the original platform have NA’s in the cells of for features that were added on subsequent platforms. In the case where the user plans on using methods that do not accept missing values, the specify.plat argument can be changed to FALSE to replace all the NA’s mentioned above to 0. We show below such an example, we moreover make this example including only SNPs (including silent mutations):

df <- binmat(patients = samples ,maf = mut, SNP.only = TRUE, include.silent = TRUE, specify.plat = FALSE)
kable(df[1:10, 1:10])
TP53 IGF1R KEAP1 KDM5C KRAS TERT MAP2K1 NCOR1 DDR2 FIP1L1
P-0010604-T01-IM5 1 0 0 0 0 0 0 0 0 0
P-0002651-T01-IM3 1 0 0 0 0 0 0 0 0 0
P-0000270-T01-IM3 0 0 0 0 0 0 0 0 0 0
P-0002915-T01-IM3 0 0 0 0 0 0 0 0 0 0
P-0011099-T01-IM5 0 0 0 0 0 0 0 0 0 0
P-0000080-T01-IM3 1 0 0 0 0 0 0 0 0 0
P-0001741-T01-IM3 0 0 1 0 1 0 0 0 0 0
P-0003964-T01-IM3 0 0 1 0 1 0 0 0 0 0
P-0003842-T01-IM5 0 0 0 0 0 0 0 0 0 0
P-0002597-T02-IM5 0 0 0 0 0 0 0 0 0 0

Fusions

Fusions are a particular genetic event where two genes merge to create a fusion gene which is a hybrid gene formed from the two previously independent genes. It can occur as a result of translocation, interstitial deletion, or chromosomal inversion. In IMPACT datasets these can be found either in their own file or aggregated in the MAF file for mutations. In general the file containing the fusions will be very similar to a MAF file, explaining why they may be found together. We show here how to process these alterations in both cases listed above. Note that fusions are particularly rare events and thus the resulting data is very sparse.

We included in gnomeR an example of raw downloaded MAF file directly from the website in the fusion dataset. We show here an example selecting the same random subset of 100 samples as in the previous section:

as_tibble(fusion)
#> # A tibble: 127 x 10
#>    Hugo_Symbol Entrez_Gene_Id Center Tumor_Sample_Ba… Fusion DNA_support
#>    <fct>                <int> <fct>  <fct>            <fct>  <fct>      
#>  1 PAX8                    NA MSKCC… P-0010011-T01-I… PAX8-… yes        
#>  2 TFE3                    NA MSKCC… P-0010977-T01-I… ASPSC… yes        
#>  3 ASPSCR1                 NA MSKCC… P-0010977-T01-I… ASPSC… yes        
#>  4 BRAF                    NA MSKCC… P-0010398-T01-I… OSBPL… yes        
#>  5 OSBPL9                  NA MSKCC… P-0010398-T01-I… OSBPL… yes        
#>  6 ALK                     NA MSKCC… P-0010177-T01-I… EML4-… yes        
#>  7 EML4                    NA MSKCC… P-0010177-T01-I… EML4-… yes        
#>  8 MLL3                    NA MSKCC… P-0010604-T01-I… MLL3-… yes        
#>  9 ERG                     NA MSKCC… P-0010794-T01-I… TMPRS… yes        
#> 10 TMPRSS2                 NA MSKCC… P-0010794-T01-I… TMPRS… yes        
#> # … with 117 more rows, and 4 more variables: RNA_support <fct>, Method <lgl>,
#> #   Frame <fct>, Comments <fct>
df <- binmat(patients = samples ,fusion = fusion)
kable(df[1:10, 1:10])
BRAF.fus OSBPL9.fus ALK.fus EML4.fus MLL3.fus BRCA2.fus ERG.fus TMPRSS2.fus ATM.fus ELOVL4.fus
P-0010604-T01-IM5 0 0 0 0 1 0 0 0 0 0
P-0002651-T01-IM3 0 0 0 0 0 0 0 0 0 0
P-0000270-T01-IM3 0 0 0 0 0 0 0 0 0 0
P-0002915-T01-IM3 0 0 0 0 0 0 0 0 0 0
P-0011099-T01-IM5 0 0 0 0 0 0 0 0 0 0
P-0000080-T01-IM3 0 0 0 0 0 0 0 0 0 0
P-0001741-T01-IM3 0 0 0 0 0 0 0 0 0 0
P-0003964-T01-IM3 0 0 0 0 0 0 0 0 0 0
P-0003842-T01-IM5 0 0 0 0 0 0 0 0 0 0
P-0002597-T02-IM5 0 0 0 0 0 0 0 0 0 0

Similarly to the mutation data the fusions are affected by the specify.plat and set.plat arguments as well.

Copy-number alterations (CNA)

The final type of data we have left to cover are CNAs. This a phenomenon in which sections of the genome are repeated and the number of repeats in the genome varies between individuals. Copy number variation is a type of structural variation: specifically, it is a type of duplication or deletion event that affects a considerable number of base pairs. We will show in this section how to process CNA from IMPACT data. Once again we include an example dataset, cna in gnomeR.

The processing function for CNA is affected by two additional arguments:

  • cna.binary: boolean declaring if the CNA data should be segregated between amplification and deletions or kept as factor variable with its original levels
  • cna.relax: a boolean declaring if only deep deletions and full amplifications should be annotated in the case where cna.binary is set to FALSE.

Note that the specify.plat and set.plat also affect CNA.

Processing raw CNA data

By default amplifications and deletions will be separated and only deep deletions/full amplifications will accounted as shown below.

df <- binmat(patients = samples, cna = cna)
kable(df[1:10, 1:10])
AKT2.Amp AR.Amp ARID5B.Del AURKA.Amp AXIN2.Amp BRIP1.Amp CCND1.Amp CCNE1.Amp CD79B.Amp CDK12.Amp
P-0010604-T01-IM5 0 0 0 0 0 0 0 0 0 0
P-0002651-T01-IM3 0 0 0 0 0 0 1 0 0 0
P-0000270-T01-IM3 0 0 0 0 0 0 0 0 0 0
P-0002915-T01-IM3 0 0 0 0 0 0 0 0 0 0
P-0011099-T01-IM5 0 0 0 0 0 0 0 0 0 0
P-0000080-T01-IM3 0 0 0 0 0 0 0 0 0 0
P-0001741-T01-IM3 0 0 0 0 0 0 0 0 0 0
P-0003964-T01-IM3 0 0 0 0 0 0 0 0 0 0
P-0003842-T01-IM5 0 0 0 0 0 0 0 0 0 0
P-0002597-T02-IM5 0 0 0 0 0 0 0 0 0 0

Setting cna.binary argument to FALSE yields the following events coded in a single column with their original levels:

df <- binmat(patients = samples, cna = cna, cna.binary = FALSE)
kable(df[1:10, 1:10])
AKT2.cna AKT3.cna AR.cna ARID5B.cna AURKA.cna AXIN1.cna AXIN2.cna BRCA2.cna BRIP1.cna CCND1.cna
P-0010604-T01-IM5 0 0 0 0 0 0 0 0 0 0
P-0002651-T01-IM3 0 0 0 0 0 0 0 0 0 2
P-0000270-T01-IM3 0 0 0 0 0 0 0 0 2 0
P-0002915-T01-IM3 0 0 0 0 0 0 0 0 0 0
P-0011099-T01-IM5 0 0 0 0 0 0 0 0 0 0
P-0000080-T01-IM3 0 0 0 0 0 0 0 0 0 0
P-0001741-T01-IM3 0 0 0 0 0 0 0 0 0 0
P-0003964-T01-IM3 0 0 0 0 0 0 0 0 0 0
P-0003842-T01-IM5 0 0 0 0 0 0 0 0 0 0
P-0002597-T02-IM5 0 0 0 0 0 0 0 0 0 0

Setting cna.binary argument to FALSE yields the following events coded in a single column with their original levels:

df <- binmat(patients = samples, cna = cna,cna.binary = FALSE)
kable(df[1:10, 1:10])
AKT2.cna AKT3.cna AR.cna ARID5B.cna AURKA.cna AXIN1.cna AXIN2.cna BRCA2.cna BRIP1.cna CCND1.cna
P-0010604-T01-IM5 0 0 0 0 0 0 0 0 0 0
P-0002651-T01-IM3 0 0 0 0 0 0 0 0 0 2
P-0000270-T01-IM3 0 0 0 0 0 0 0 0 2 0
P-0002915-T01-IM3 0 0 0 0 0 0 0 0 0 0
P-0011099-T01-IM5 0 0 0 0 0 0 0 0 0 0
P-0000080-T01-IM3 0 0 0 0 0 0 0 0 0 0
P-0001741-T01-IM3 0 0 0 0 0 0 0 0 0 0
P-0003964-T01-IM3 0 0 0 0 0 0 0 0 0 0
P-0003842-T01-IM5 0 0 0 0 0 0 0 0 0 0
P-0002597-T02-IM5 0 0 0 0 0 0 0 0 0 0

Processing full genetic profiles for samples

By combining all the types of data presented above, binmat() will provide a complete genomic profile for the specified samples. This can be done with any combination of the arguments presented above.

Using raw data files

Once again we show how to perform this using the files directly downloaded from cBioPortal, combining the example sets mut, fusion and cna:

df <- binmat(patients = samples,maf = mut, fusion = fusion, cna = cna, cna.binary = FALSE)
kable(df[1:10, c(1:3,243,244,300:305)])
TP53 IGF1R KEAP1 BRAF.fus OSBPL9.fus FANCA.cna FGF19.cna FGF3.cna FGF4.cna FGFR1.cna FGFR2.cna
P-0010604-T01-IM5 1 0 0 0 0 0 0 0 0 0 0
P-0002651-T01-IM3 1 0 0 0 0 0 2 2 2 0 0
P-0000270-T01-IM3 1 0 0 0 0 0 0 0 0 0 0
P-0002915-T01-IM3 0 0 0 0 0 0 0 0 0 0 0
P-0011099-T01-IM5 0 0 0 0 0 0 0 0 0 0 0
P-0000080-T01-IM3 1 0 0 0 0 0 0 0 0 0 0
P-0001741-T01-IM3 1 0 1 0 0 0 0 0 0 0 0
P-0003964-T01-IM3 0 0 1 0 0 0 0 0 0 0 0
P-0003842-T01-IM5 0 0 0 0 0 0 0 0 0 0 0
P-0002597-T02-IM5 0 0 0 0 0 0 0 0 0 0 0

Pathway level alterations

Pathway level alteration analysis is commonly encountered in genomic studies. We therefore implemented in binmat() the pathway argument that enables the user to generate a second binary matrix with pathways as new features (columns) for each of the samples (rows). These pathways were created following Dr. Schultz’s group in their paper Oncogenic Signaling Pathways in The Cancer Genome Atlas. It consist of 10 well defined pathways that have biological impacts in cancer patients. We show below an example of this matrix.

df <- binmat(patients = samples,maf = mut, cna = cna,pathway = T)
kable(df$pathway_dat[1:5,])
RTK-RAS TGF-B c-MYC chromosomal_Instability p53 WNT Min1 PI3K cell_cycle hippo notch
P-0010604-T01-IM5 0 0 0 1 1 0 1 0 0 0 1
P-0002651-T01-IM3 0 0 0 1 1 0 1 0 1 1 1
P-0000270-T01-IM3 0 0 0 1 1 0 1 0 0 0 0
P-0002915-T01-IM3 0 0 0 0 0 0 0 1 0 0 0
P-0011099-T01-IM5 0 0 0 0 0 0 0 0 0 0 0

Note that this is returned as an additional element of binmat() and the original binary matrix for all genes is still returned.

Custom pathways

In particular problems it may be of interest of creating pathways customized to the study conducted. To enable the user to use these specfic pathways we created the custom_pathway() function, this function takes as argument a binary output from binmat() with a dataframe containing the name of the genes and their corresponding pathways:

df <- binmat(patients = samples,maf = mut, fusion = fusion, cna = cna)
pathway <- as.data.frame(cbind(c("path1","path1","path2","path3"),
                               c("PIK3CA","KRAS, NRAS","TERT","TP53")))
pathway_dat <- custom_pathway(mat = df, pathway = pathway)
kable(pathway_dat[1:10,],row.names = T)
path1 path2 path3
P-0010604-T01-IM5 0 0 1
P-0002651-T01-IM3 0 1 1
P-0000270-T01-IM3 0 0 1
P-0002915-T01-IM3 1 0 0
P-0011099-T01-IM5 0 0 0
P-0000080-T01-IM3 0 0 1
P-0001741-T01-IM3 0 0 1
P-0003964-T01-IM3 0 0 0
P-0003842-T01-IM5 0 0 0
P-0002597-T02-IM5 0 0 0

Note that the different genetic events are considered separetely here. As an example say one wants to include a fourth pathway for TP53 deletions and fusions, these must be specified as such, see example below:

pathway <- as.data.frame(cbind(c("path1","path1","path2","path3", "path4"),
                               c("PIK3CA","KRAS, NRAS","TERT","TP53","TP53.Del, TP53.fus")))
pathway_dat <- custom_pathway(mat = df, pathway = pathway)
kable(pathway_dat[1:10,],row.names = T)
path1 path2 path3 path4
P-0010604-T01-IM5 0 0 1 0
P-0002651-T01-IM3 0 1 1 0
P-0000270-T01-IM3 0 0 1 0
P-0002915-T01-IM3 1 0 0 0
P-0011099-T01-IM5 0 0 0 0
P-0000080-T01-IM3 0 0 1 0
P-0001741-T01-IM3 0 0 1 0
P-0003964-T01-IM3 0 0 0 0
P-0003842-T01-IM5 0 0 0 0
P-0002597-T02-IM5 0 0 0 0

OncoKB annotation

OncoKB annotates the biological and oncogenic effect and the prognostic and predictive significance of somatic molecular alterations. Potential treatment implications are stratified by the level of evidence that a specific molecular alteration is predictive of drug response based on US Food and Drug Administration (FDA) labeling, National Comprehensive Cancer Network (NCCN) guidelines, disease-focused expert group recommendations and the scientific literature. For more information see the manuscript OncoKB: A Precision Oncology Knowledge Base or the oncoKB website. In gnomeR we include a simple wrapper of the [oncokb-annotator](https://github.com/oncokb/oncokb-annotator) package’s functions to allow users to annotate mutations, fusions and copy number events through the oncokb() function. Note that one of the required inputs is a token which allows users access to the oncoKB annotator. Users are thus required to request a token on oncoKB. If you are an MSKCC employee simply request a token for academic use and one will be provided to you automatically.

Once the user has a token, the annotating function oncokb() can be used on standardized MAF, fusion and CNA files:

gen_oncokb <- oncokb(maf = mut, fusion = fusion, cna = cna, token = "your_token")

We moreover include an argument in the binmat() function to oncoKB annotate files while creating a binary matrix by setting oncokb to TRUE. Note that only ‘Oncogenic’ and ‘Likely Oncogenic’ variants will be kept.

df <- binmat(maf = mut, fusion = fusion, cna = cna, oncokb = TRUE, token = "your_token")

FACETs

The copy-number alterations data we have covered up to now is a discrete estimation of the alterations that occured. There however exist more nuanced and accurate data for copy-number alterations observed in a tumor. In gnomeR we include an example of segmentation file and relevant functions from the facets package that provides an allele-specific copy number and clonal heterogeneity analysis tool for high-throughput DNA sequencing. We show below an example of a segmentation file included in gnomeR (seg dataset) and how to process it:

kable(seg[1:10,])
ID chrom loc.start loc.end num.mark seg.mean
P-0001669-T01-IM3 1 2488138 16265857 110 -0.0100
P-0001669-T01-IM3 1 17345415 22587878 10 -0.4776
P-0001669-T01-IM3 1 27023463 120199189 190 0.0070
P-0001669-T01-IM3 1 120458623 120465330 8 2.3998
P-0001669-T01-IM3 1 120466434 120539788 23 1.2126
P-0001669-T01-IM3 1 120548082 150552561 9 0.4508
P-0001669-T01-IM3 1 152330945 190734051 85 -0.0338
P-0001669-T01-IM3 1 193091396 245977996 113 0.4273
P-0001669-T01-IM3 2 4717089 242800953 419 -0.0127
P-0001669-T01-IM3 3 1449872 194238390 458 0.0051

We see that this files include segments of all chromosome for each patient with the number of marks and mean intensity in that segments. We can process this data into a format that can be used for visualization and analysis using the facets.dat() function that takes the following arguments:

  • seg: a segmentation file
  • filenames: the names of the segment files to be loaded and processed (Note must end in “.Rdata”).
  • path: the relative path to the files folder from your current directory
  • patients: the names of the patients of the respective filenames. Default is using all samples available.
  • min.purity: the minimum purity of the sample required to be kept in the final dataset. Default is 0.3.
  • epsilon: level of unions when aggregating segments between. Default is 0.005.
  • adaptive: CNregions option to create adaptive segments. Default is FALSE.
facet <- facets.dat(seg = seg, patients = samples, epsilon = 0.005)

This function returns a dataframe that is ready for visualization and analysis with samples as rows and processed segments as columns:

kable(facet$out.cn[1:5,1:3])
chr1.2488138-11167550 chr1.11167550-15296019 chr1.15296019-17345415
P-0000080-T01-IM3 0.1680 0.1680 0.1680
P-0000140-T01-IM3 -0.1550 -0.1550 0.0092
P-0000185-T01-IM3 0.1126 0.1126 0.1126
P-0000244-T01-IM3 -0.3711 -0.0041 -0.0041
P-0000270-T01-IM3 -0.0857 -0.0857 -0.0857