The main purpose of `gnomeR` is to streamline the processing of genomic files provided by cBioPortal. If you wish to learn how to use the integrated API, please read the API-tutorial article first. The core processing of these files is performed by the `binmat()` function, whose main arguments are introduced in the sections below.
This function returns a matrix containing all the genetic information, with samples as rows and features as columns. A warning is thrown if some samples are found to have no mutations in the MAF file.
In the following sections we present examples of how to process each of the data types in cBioPortal.
Mutations are the most common type of genetic feature used in genomic studies at MSKCC. The IMPACT sequencing panel consists of a curated list of genes that are known to have cancer-related properties when altered. You can find a complete list of these genes, and the platform each was added on, in the `impact_genes` data file.
We included in `gnomeR` an example of a raw MAF file downloaded directly from the website, in the `mut` dataset. We show here an example that selects a random subset of 100 samples from the `mut` dataset:
as_tibble(mut)
#> # A tibble: 3,179 x 45
#> Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position
#> <fct> <lgl> <lgl> <fct> <fct> <int>
#> 1 TP53 NA NA GRCh37 17 7577539
#> 2 EZH2 NA NA GRCh37 7 148523605
#> 3 MDM2 NA NA GRCh37 12 69222656
#> 4 IGF1R NA NA GRCh37 15 99486153
#> 5 KEAP1 NA NA GRCh37 19 10602303
#> 6 KDM5C NA NA GRCh37 X 53223386
#> 7 KRAS NA NA GRCh37 12 25398284
#> 8 TERT NA NA GRCh37 5 1295228
#> 9 MAP2K1 NA NA GRCh37 15 66729153
#> 10 NCOR1 NA NA GRCh37 17 16046949
#> # … with 3,169 more rows, and 39 more variables: End_Position <int>,
#> # Strand <fct>, Consequence <fct>, Variant_Classification <fct>,
#> # Variant_Type <fct>, Reference_Allele <fct>, Tumor_Seq_Allele1 <fct>,
#> # Tumor_Seq_Allele2 <fct>, dbSNP_RS <lgl>, dbSNP_Val_Status <lgl>,
#> # Tumor_Sample_Barcode <fct>, Matched_Norm_Sample_Barcode <lgl>,
#> # Match_Norm_Seq_Allele1 <lgl>, Match_Norm_Seq_Allele2 <lgl>,
#> # Tumor_Validation_Allele1 <lgl>, Tumor_Validation_Allele2 <lgl>,
#> # Match_Norm_Validation_Allele1 <lgl>, Match_Norm_Validation_Allele2 <lgl>,
#> # Verification_Status <lgl>, Validation_Status <lgl>, Mutation_Status <chr>,
#> # Sequencing_Phase <lgl>, Sequence_Source <lgl>, Validation_Method <lgl>,
#> # Score <lgl>, BAM_File <lgl>, Sequencer <lgl>, t_ref_count <int>,
#> # t_alt_count <int>, n_ref_count <lgl>, n_alt_count <lgl>, HGVSc <fct>,
#> # HGVSp <fct>, HGVSp_Short <fct>, Transcript_ID <fct>, RefSeq <fct>,
#> # Protein_position <int>, Codons <fct>, Hotspot <int>
samples <- sample(as.character(unique(mut$Tumor_Sample_Barcode)), 100, replace = FALSE)
df <- binmat(patients = samples, maf = mut)
kable(df[1:10, 1:10])
| | TP53 | IGF1R | KEAP1 | KDM5C | KRAS | TERT | MAP2K1 | NCOR1 | DDR2 | FIP1L1 |
---|---|---|---|---|---|---|---|---|---|---|
P-0010604-T01-IM5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0002651-T01-IM3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0000270-T01-IM3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0002915-T01-IM3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0011099-T01-IM5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0000080-T01-IM3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0001741-T01-IM3 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
P-0003964-T01-IM3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
P-0003842-T01-IM5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0002597-T02-IM5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
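To make the shape of this output concrete, here is a minimal base-R sketch of how a sample-by-gene binary matrix can be derived from a MAF-like table. It is not the `gnomeR` implementation: the toy data and the object names (`toy_maf`, `toy_samples`, `toy_bin`) are made up for illustration, and only the two MAF columns shown above are used.

```r
# Toy MAF-like table (hypothetical data); column names follow the MAF columns above.
toy_maf <- data.frame(
  Tumor_Sample_Barcode = c("S1", "S1", "S2", "S3"),
  Hugo_Symbol          = c("TP53", "KRAS", "TP53", "KEAP1"),
  stringsAsFactors     = FALSE
)
toy_samples <- c("S1", "S2", "S3", "S4")  # S4 has no mutation in toy_maf

# Fix the sample levels so that samples without any mutation still get a row.
toy_maf$Tumor_Sample_Barcode <- factor(toy_maf$Tumor_Sample_Barcode,
                                       levels = toy_samples)

# Cross-tabulate samples against genes, then collapse counts to 0/1.
counts  <- table(toy_maf$Tumor_Sample_Barcode, toy_maf$Hugo_Symbol)
toy_bin <- (unclass(counts) > 0) * 1
toy_bin
```

Sample `S4` ends up as a row of zeros, which corresponds to the situation in which `binmat()` warns that a sample has no mutations in the MAF file.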
Note that, by default, the output above is a binary matrix built from all types of mutations, with features adjusted for the platform they were added on. Thus all samples that were sequenced on the original platform have NAs in the cells for features that were added on subsequent platforms. If you plan to use methods that do not accept missing values, the `specify.plat` argument can be set to FALSE to replace all of these NAs with 0. We show such an example below, this time including only SNPs (including silent mutations):
df <- binmat(patients = samples, maf = mut, SNP.only = TRUE, include.silent = TRUE, specify.plat = FALSE)
kable(df[1:10, 1:10])
| | TP53 | IGF1R | KEAP1 | KDM5C | KRAS | TERT | MAP2K1 | NCOR1 | DDR2 | FIP1L1 |
---|---|---|---|---|---|---|---|---|---|---|
P-0010604-T01-IM5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0002651-T01-IM3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0000270-T01-IM3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0002915-T01-IM3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0011099-T01-IM5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0000080-T01-IM3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0001741-T01-IM3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
P-0003964-T01-IM3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
P-0003842-T01-IM5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0002597-T02-IM5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
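The effect of replacing platform-related NAs with 0 can be sketched on a toy matrix in base R. This only illustrates the idea and is not how `binmat()` is implemented; the sample IDs and the gene name `NEWGENE` are hypothetical.

```r
# Toy binary matrix (hypothetical data): NEWGENE is assumed to be absent
# from the earlier platform, so the IM3 sample has NA in that column.
toy <- matrix(
  c(1, 0, NA,
    0, 1, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("P-1-IM3", "P-2-IM5"), c("TP53", "KRAS", "NEWGENE"))
)

# Count missing cells per feature before deciding how to handle them.
na_per_gene <- colSums(is.na(toy))

# Replace every NA with 0, i.e. treat "not sequenced" as "not altered".
toy_complete <- toy
toy_complete[is.na(toy_complete)] <- 0
```

Keep in mind that this replacement discards the distinction between a gene that was not altered and a gene that was not sequenced, so it is best reserved for methods that cannot handle missing values.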
Fusions are a particular genetic event in which two previously independent genes merge to form a hybrid gene. They can occur as a result of translocation, interstitial deletion, or chromosomal inversion. In IMPACT datasets, fusions can be found either in their own file or aggregated in the MAF file with the mutations. The fusion file is generally very similar to a MAF file, which explains why the two may be found together. We show here how to process these alterations in both cases listed above. Note that fusions are rare events, so the resulting data is very sparse.
We included in `gnomeR` an example of a raw fusion file downloaded directly from the website, in the `fusion` dataset. We show here an example using the same random subset of 100 samples as in the previous section:
as_tibble(fusion)
#> # A tibble: 127 x 10
#> Hugo_Symbol Entrez_Gene_Id Center Tumor_Sample_Ba… Fusion DNA_support
#> <fct> <int> <fct> <fct> <fct> <fct>
#> 1 PAX8 NA MSKCC… P-0010011-T01-I… PAX8-… yes
#> 2 TFE3 NA MSKCC… P-0010977-T01-I… ASPSC… yes
#> 3 ASPSCR1 NA MSKCC… P-0010977-T01-I… ASPSC… yes
#> 4 BRAF NA MSKCC… P-0010398-T01-I… OSBPL… yes
#> 5 OSBPL9 NA MSKCC… P-0010398-T01-I… OSBPL… yes
#> 6 ALK NA MSKCC… P-0010177-T01-I… EML4-… yes
#> 7 EML4 NA MSKCC… P-0010177-T01-I… EML4-… yes
#> 8 MLL3 NA MSKCC… P-0010604-T01-I… MLL3-… yes
#> 9 ERG NA MSKCC… P-0010794-T01-I… TMPRS… yes
#> 10 TMPRSS2 NA MSKCC… P-0010794-T01-I… TMPRS… yes
#> # … with 117 more rows, and 4 more variables: RNA_support <fct>, Method <lgl>,
#> # Frame <fct>, Comments <fct>
df <- binmat(patients = samples, fusion = fusion)
kable(df[1:10, 1:10])
| | BRAF.fus | OSBPL9.fus | ALK.fus | EML4.fus | MLL3.fus | BRCA2.fus | ERG.fus | TMPRSS2.fus | ATM.fus | ELOVL4.fus |
---|---|---|---|---|---|---|---|---|---|---|
P-0010604-T01-IM5 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
P-0002651-T01-IM3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0000270-T01-IM3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0002915-T01-IM3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0011099-T01-IM5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0000080-T01-IM3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0001741-T01-IM3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0003964-T01-IM3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0003842-T01-IM5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0002597-T02-IM5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
As with the mutation data, the fusions are also affected by the `specify.plat` and `set.plat` arguments.
The final type of data left to cover is copy-number alterations (CNAs). This is a phenomenon in which sections of the genome are repeated and the number of repeats in the genome varies between individuals. Copy number variation is a type of structural variation: specifically, it is a type of duplication or deletion event that affects a considerable number of base pairs. We show in this section how to process CNAs from IMPACT data. Once again we include an example dataset, `cna`, in `gnomeR`.
The processing function for CNA is affected by two additional arguments:
- `cna.binary`: a boolean declaring whether the CNA data should be split into amplifications and deletions or kept as a factor variable with its original levels.
- `cna.relax`: a boolean declaring if only deep deletions and full amplifications should be annotated in the case where `cna.binary` is set to FALSE.

Note that the `specify.plat` and `set.plat` arguments also affect CNA.
By default, amplifications and deletions are separated and only deep deletions and full amplifications are counted, as shown below.
df <- binmat(patients = samples, cna = cna)
kable(df[1:10, 1:10])
| | AKT2.Amp | AR.Amp | ARID5B.Del | AURKA.Amp | AXIN2.Amp | BRIP1.Amp | CCND1.Amp | CCNE1.Amp | CD79B.Amp | CDK12.Amp |
---|---|---|---|---|---|---|---|---|---|---|
P-0010604-T01-IM5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0002651-T01-IM3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
P-0000270-T01-IM3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0002915-T01-IM3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0011099-T01-IM5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0000080-T01-IM3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0001741-T01-IM3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0003964-T01-IM3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0003842-T01-IM5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0002597-T02-IM5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Setting the `cna.binary` argument to FALSE yields the following events, coded in a single column per gene with their original levels:
df <- binmat(patients = samples, cna = cna, cna.binary = FALSE)
kable(df[1:10, 1:10])
| | AKT2.cna | AKT3.cna | AR.cna | ARID5B.cna | AURKA.cna | AXIN1.cna | AXIN2.cna | BRCA2.cna | BRIP1.cna | CCND1.cna |
---|---|---|---|---|---|---|---|---|---|---|
P-0010604-T01-IM5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0002651-T01-IM3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
P-0000270-T01-IM3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 |
P-0002915-T01-IM3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0011099-T01-IM5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0000080-T01-IM3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0001741-T01-IM3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0003964-T01-IM3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0003842-T01-IM5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0002597-T02-IM5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
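The relationship between the single-column coding and the separate `.Amp`/`.Del` indicators can be sketched in base R. The sketch assumes the usual integer coding of discrete CNA calls (-2 deep deletion, -1 shallow deletion, 0 neutral, 1 gain, 2 amplification) and counts only the extreme levels. The toy matrix is made up, and this is not the code `binmat()` runs internally.

```r
# Toy discrete CNA matrix (hypothetical data), samples as rows, genes as columns.
toy_cna <- matrix(
  c(2,  0,
    0, -2,
    1,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("S1", "S2", "S3"), c("CCND1", "CDKN2A"))
)

# Strict rule: only level 2 counts as amplification, only -2 as deletion.
amp <- (toy_cna == 2) * 1
del <- (toy_cna == -2) * 1
colnames(amp) <- paste0(colnames(toy_cna), ".Amp")
colnames(del) <- paste0(colnames(toy_cna), ".Del")

# One indicator column per gene and event type.
toy_cna_bin <- cbind(amp, del)
toy_cna_bin
```

Under this strict rule the gain in sample `S3` (level 1) is not counted as an amplification.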
By combining all the types of data presented above, `binmat()` provides a complete genomic profile for the specified samples. This can be done with any combination of the arguments presented above.
Once again we show how to perform this using the files downloaded directly from cBioPortal, combining the example sets `mut`, `fusion` and `cna`:
df <- binmat(patients = samples, maf = mut, fusion = fusion, cna = cna, cna.binary = FALSE)
kable(df[1:10, c(1:3,243,244,300:305)])
| | TP53 | IGF1R | KEAP1 | BRAF.fus | OSBPL9.fus | FANCA.cna | FGF19.cna | FGF3.cna | FGF4.cna | FGFR1.cna | FGFR2.cna |
---|---|---|---|---|---|---|---|---|---|---|---|
P-0010604-T01-IM5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0002651-T01-IM3 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 2 | 0 | 0 |
P-0000270-T01-IM3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0002915-T01-IM3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0011099-T01-IM5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0000080-T01-IM3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0001741-T01-IM3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0003964-T01-IM3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0003842-T01-IM5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P-0002597-T02-IM5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Pathway-level alteration analysis is commonly encountered in genomic studies. We therefore implemented in `binmat()` the `pathway` argument, which lets the user generate a second binary matrix with pathways as new features (columns) for each of the samples (rows). These pathways were created following Dr. Schultz’s group in their paper Oncogenic Signaling Pathways in The Cancer Genome Atlas. They consist of 10 well-defined pathways that have biological impacts in cancer patients. We show below an example of this matrix.
df <- binmat(patients = samples, maf = mut, cna = cna, pathway = TRUE)
kable(df$pathway_dat[1:5,])
| | RTK-RAS | TGF-B | c-MYC | chromosomal_Instability | p53 | WNT | Min1 | PI3K | cell_cycle | hippo | notch |
---|---|---|---|---|---|---|---|---|---|---|---|
P-0010604-T01-IM5 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
P-0002651-T01-IM3 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |
P-0000270-T01-IM3 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
P-0002915-T01-IM3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
P-0011099-T01-IM5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Note that this is returned as an additional element of the `binmat()` output, and the original binary matrix for all genes is still returned.
For some problems it may be of interest to create pathways customized to the study being conducted. To let the user define these specific pathways we created the `custom_pathway()` function. It takes as arguments a binary output from `binmat()` and a data frame containing the names of the genes and their corresponding pathways:
df <- binmat(patients = samples, maf = mut, fusion = fusion, cna = cna)
pathway <- as.data.frame(cbind(c("path1","path1","path2","path3"),
c("PIK3CA","KRAS, NRAS","TERT","TP53")))
pathway_dat <- custom_pathway(mat = df, pathway = pathway)
kable(pathway_dat[1:10, ], row.names = TRUE)
| | path1 | path2 | path3 |
---|---|---|---|
P-0010604-T01-IM5 | 0 | 0 | 1 |
P-0002651-T01-IM3 | 0 | 1 | 1 |
P-0000270-T01-IM3 | 0 | 0 | 1 |
P-0002915-T01-IM3 | 1 | 0 | 0 |
P-0011099-T01-IM5 | 0 | 0 | 0 |
P-0000080-T01-IM3 | 0 | 0 | 1 |
P-0001741-T01-IM3 | 0 | 0 | 1 |
P-0003964-T01-IM3 | 0 | 0 | 0 |
P-0003842-T01-IM5 | 0 | 0 | 0 |
P-0002597-T02-IM5 | 0 | 0 | 0 |
Note that the different genetic events are considered separately here. For example, to include a fourth pathway for TP53 deletions and fusions, these must be specified explicitly, as in the example below:
pathway <- as.data.frame(cbind(c("path1","path1","path2","path3", "path4"),
c("PIK3CA","KRAS, NRAS","TERT","TP53","TP53.Del, TP53.fus")))
pathway_dat <- custom_pathway(mat = df, pathway = pathway)
kable(pathway_dat[1:10, ], row.names = TRUE)
| | path1 | path2 | path3 | path4 |
---|---|---|---|---|
P-0010604-T01-IM5 | 0 | 0 | 1 | 0 |
P-0002651-T01-IM3 | 0 | 1 | 1 | 0 |
P-0000270-T01-IM3 | 0 | 0 | 1 | 0 |
P-0002915-T01-IM3 | 1 | 0 | 0 | 0 |
P-0011099-T01-IM5 | 0 | 0 | 0 | 0 |
P-0000080-T01-IM3 | 0 | 0 | 1 | 0 |
P-0001741-T01-IM3 | 0 | 0 | 1 | 0 |
P-0003964-T01-IM3 | 0 | 0 | 0 | 0 |
P-0003842-T01-IM5 | 0 | 0 | 0 | 0 |
P-0002597-T02-IM5 | 0 | 0 | 0 | 0 |
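The idea behind a pathway-level matrix, where a sample is flagged for a pathway if any feature assigned to that pathway is altered, can be sketched in base R. This is a simplified illustration and not the `custom_pathway()` implementation. It represents pathways as a named list rather than the two-column data frame used above, and all data and object names are hypothetical.

```r
# Toy binary matrix (hypothetical data), samples as rows, features as columns.
toy_mat <- data.frame(
  PIK3CA = c(0, 1, 0),
  KRAS   = c(1, 0, 0),
  TP53   = c(0, 0, 1),
  row.names = c("S1", "S2", "S3")
)

# Pathway definitions as a named list (an assumed representation for this sketch).
toy_pathways <- list(
  path1 = c("PIK3CA", "KRAS", "NRAS"),  # NRAS is absent from toy_mat on purpose
  path3 = c("TP53")
)

# A sample is altered in a pathway if any of the pathway's available features is 1.
toy_path <- sapply(toy_pathways, function(genes) {
  present <- intersect(genes, colnames(toy_mat))
  as.integer(rowSums(toy_mat[, present, drop = FALSE], na.rm = TRUE) > 0)
})
rownames(toy_path) <- rownames(toy_mat)
toy_path
```

Features that are listed in a pathway but missing from the matrix are simply ignored here. A stricter version could warn about them instead.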
OncoKB annotates the biological and oncogenic effect and the prognostic and predictive significance of somatic molecular alterations. Potential treatment implications are stratified by the level of evidence that a specific molecular alteration is predictive of drug response, based on US Food and Drug Administration (FDA) labeling, National Comprehensive Cancer Network (NCCN) guidelines, disease-focused expert group recommendations and the scientific literature. For more information, see the manuscript OncoKB: A Precision Oncology Knowledge Base or the OncoKB website. In `gnomeR` we include a simple wrapper of the [oncokb-annotator](https://github.com/oncokb/oncokb-annotator) package’s functions to allow users to annotate mutations, fusions and copy number events through the `oncokb()` function. Note that one of the required inputs is a token, which gives users access to the OncoKB annotator. Users are thus required to request a token from OncoKB. If you are an MSKCC employee, simply request a token for academic use and one will be provided to you automatically.
Once the user has a token, the annotating function `oncokb()` can be used on standardized MAF, fusion and CNA files:
gen_oncokb <- oncokb(maf = mut, fusion = fusion, cna = cna, token = "your_token")
We also include an argument in the `binmat()` function to annotate files with OncoKB while creating a binary matrix, by setting `oncokb` to TRUE. Note that only ‘Oncogenic’ and ‘Likely Oncogenic’ variants will be kept.
df <- binmat(maf = mut, fusion = fusion, cna = cna, oncokb = TRUE, token = "your_token")
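Rather than writing the token directly in a script, one option is to read it from an environment variable. The sketch below uses base R only. The variable name `ONCOKB_TOKEN` is an assumption made for this example and is not something `gnomeR` or OncoKB defines.

```r
# Read the token from an environment variable (the name is hypothetical).
token <- Sys.getenv("ONCOKB_TOKEN", unset = "")
has_token <- nzchar(token)

if (!has_token) {
  message("No OncoKB token found; set ONCOKB_TOKEN before annotating.")
}

# With a valid token, it could then be passed on as in the calls above, e.g.:
# df <- binmat(maf = mut, fusion = fusion, cna = cna, oncokb = TRUE, token = token)
```

This keeps the token out of version-controlled code; the variable can be set in the shell or in a user-level `.Renviron` file.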
The copy-number alteration data we have covered up to now is a discrete estimate of the alterations that occurred. There exist, however, more nuanced and accurate data for the copy-number alterations observed in a tumor. In `gnomeR` we include an example segmentation file and relevant functions from the `facets` package, which provides an allele-specific copy number and clonal heterogeneity analysis tool for high-throughput DNA sequencing. We show below an example of a segmentation file included in `gnomeR` (the `seg` dataset) and how to process it:
kable(seg[1:10,])
ID | chrom | loc.start | loc.end | num.mark | seg.mean |
---|---|---|---|---|---|
P-0001669-T01-IM3 | 1 | 2488138 | 16265857 | 110 | -0.0100 |
P-0001669-T01-IM3 | 1 | 17345415 | 22587878 | 10 | -0.4776 |
P-0001669-T01-IM3 | 1 | 27023463 | 120199189 | 190 | 0.0070 |
P-0001669-T01-IM3 | 1 | 120458623 | 120465330 | 8 | 2.3998 |
P-0001669-T01-IM3 | 1 | 120466434 | 120539788 | 23 | 1.2126 |
P-0001669-T01-IM3 | 1 | 120548082 | 150552561 | 9 | 0.4508 |
P-0001669-T01-IM3 | 1 | 152330945 | 190734051 | 85 | -0.0338 |
P-0001669-T01-IM3 | 1 | 193091396 | 245977996 | 113 | 0.4273 |
P-0001669-T01-IM3 | 2 | 4717089 | 242800953 | 419 | -0.0127 |
P-0001669-T01-IM3 | 3 | 1449872 | 194238390 | 458 | 0.0051 |
We see that this file includes segments from all chromosomes for each patient, with the number of markers and the mean intensity in each segment. We can process this data into a format that can be used for visualization and analysis using the `facets.dat()` function, which takes the following arguments:
- `seg`: a segmentation file
- `filenames`: the names of the segment files to be loaded and processed (note: must end in “.Rdata”)
- `path`: the relative path to the files folder from your current directory
- `patients`: the names of the patients of the respective filenames. Default is using all samples available.
- `min.purity`: the minimum purity of the sample required to be kept in the final dataset. Default is 0.3.
- `epsilon`: level of unions when aggregating segments between. Default is 0.005.
- `adaptive`: CNregions option to create adaptive segments. Default is FALSE.

facet <- facets.dat(seg = seg, patients = samples, epsilon = 0.005)
This function returns a dataframe that is ready for visualization and analysis with samples as rows and processed segments as columns:
kable(facet$out.cn[1:5,1:3])
| | chr1.2488138-11167550 | chr1.11167550-15296019 | chr1.15296019-17345415 |
---|---|---|---|
P-0000080-T01-IM3 | 0.1680 | 0.1680 | 0.1680 |
P-0000140-T01-IM3 | -0.1550 | -0.1550 | 0.0092 |
P-0000185-T01-IM3 | 0.1126 | 0.1126 | 0.1126 |
P-0000244-T01-IM3 | -0.3711 | -0.0041 | -0.0041 |
P-0000270-T01-IM3 | -0.0857 | -0.0857 | -0.0857 |
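Before processing, it can help to look at a raw segmentation table directly. The base-R sketch below rebuilds the first three rows of the `seg` excerpt shown earlier as a toy data frame, then derives each segment's length and a marker-weighted mean of `seg.mean`. The object names and the `seg.length` column are introduced only for this illustration, and the summary is not part of `facets.dat()`.

```r
# First three segments of the seg excerpt above, retyped as a toy data frame.
toy_seg <- data.frame(
  ID        = "P-0001669-T01-IM3",
  chrom     = 1,
  loc.start = c(2488138, 17345415, 27023463),
  loc.end   = c(16265857, 22587878, 120199189),
  num.mark  = c(110, 10, 190),
  seg.mean  = c(-0.0100, -0.4776, 0.0070)
)

# Segment length in base pairs (a helper column added for this sketch).
toy_seg$seg.length <- toy_seg$loc.end - toy_seg$loc.start

# Mean intensity across segments, weighting each segment by its marker count.
weighted_mean <- weighted.mean(toy_seg$seg.mean, w = toy_seg$num.mark)
```

Weighting by `num.mark` keeps short segments supported by few markers from dominating the summary.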