PICTographPlus

1. Introduction

PICTographPlus is a computational tool that integrates bulk DNA and RNA sequencing data to:

  1. Reconstruct Clone-Specific Transcriptomic Profiles
  2. Infer Tumor Evolution
  3. Identify Transcriptional Transitions Between Clones

The tool infers tumor clonal evolution from single or multi-region sequencing data by modeling the uncertainty of mutation cellular fraction (MCF) in small somatic mutations (SSMs) and copy number alterations (CNAs). Using a Bayesian hierarchical model, it assigns SSMs and CNAs to subclones, reconstructing tumor evolutionary trees that adhere to principles of lineage precedence, sum condition, and optional constraints based on sample presence. For deconvolution, PICTographPlus integrates tumor clonal tree structures with clone proportions across samples to resolve bulk gene expression data. It optimizes an objective function that minimizes discrepancies between observed and predicted sample-level gene expression while imposing a smoothness penalty, ensuring that closely related clones display greater gene expression similarity. Lastly, the tool conducts pathway enrichment analysis to identify statistically significant alterations in pathways connecting tumor clones.


2. Input data file format

2.1 Input data (DNA) for tumor evolution reconstruction

PICTographPlus takes input data in multiple formats for flexible user inputs:

  1. Three CSV files, one each for SSMs, CNAs, and germline heterozygous SNVs (section 2.1.1).
  2. Two csv files, one for SSM and one for CNA (section 2.1.2).
  3. A single csv file that contains SSM and CNA information (section 2.1.3).

2.1.1 Three CSV files, one each for SSMs, CNAs, and germline heterozygous SNVs.

When to use: if you are using a copy number caller that only outputs log2 ratio or total copy number without BAF information. e.g. cnvkit.

The SSM file should contain at least sample, mutation, total_reads, alt_reads, chrom, start, and end columns, with the purity column being optional.

head(read.csv(system.file('extdata/examples/example4_snv_with_purity.csv', 
                          package = 'pictographPlus')))
#>    sample mutation total_reads alt_reads chrom start  end purity
#> 1 sample1     mut1         100        67  chr1    57   57    0.8
#> 2 sample1     mut2         100        67  chr1   110  110    0.8
#> 3 sample1     mut3         100        67  chr1   167  167    0.8
#> 4 sample1     mut4         100        40  chr1   386  386    0.8
#> 5 sample1     mut5         100        40  chr1   441  441    0.8
#> 6 sample1     mut6         100        40  chr1  1276 1276    0.8

The CNA file should contain sample, chrom, start, end, and tcn columns. The total copy number (tcn) can be inferred from many copy number callers. In case a copy number caller outputs the \(log2\) ratio of a segment, the tcn can be calculate using \(tcn = 2 * 2^{\log_2 R}\).

head(read.csv(system.file('extdata/examples/example4_cna.csv', 
                          package = 'pictographPlus')))
#>    sample chrom start  end tcn
#> 1 sample1  chr1     1  200 3.6
#> 2 sample2  chr1     1  200 3.4
#> 3 sample3  chr1     1  200 3.4
#> 4 sample3  chr1   401 1500 1.4
#> 5 sample3  chr1  7001 7200 2.6

The SNV file contains the count of the heterozygous germline SNVs that has the information about the “chrom”, the “position” of the germline heterozygous SNV, “ref” and “alt” allele, and the reference and altenative reads counts in germline (normal) sample as well as all other samples. Note: the sample name (sample1, sample2, … etc.) should matched the sample name used in the SSM and CNA file.

head(read.csv(system.file('extdata/examples/example4_SNP.csv', 
                          package = 'pictographPlus')))
#>   chroms position ref alt germline_ref germline_alt sample1_ref sample1_alt
#> 1   chr1      178   A   C           50           50          72          28
#> 2   chr1      107   A   C           50           50          72          28
#> 3   chr1       84   A   C           50           50          28          72
#> 4   chr1       31   A   C           50           50          28          72
#> 5   chr1       18   A   C           50           50          72          28
#> 6   chr1      165   A   C           50           50          28          72
#>   sample2_ref sample2_alt sample3_ref sample3_alt
#> 1          71          29          71          29
#> 2          71          29          71          29
#> 3          29          71          29          71
#> 4          29          71          29          71
#> 5          71          29          71          29
#> 6          29          71          29          71

Obtaining Germline Heterozygous Positions

The heterozygous germline positions can be obtained using tools such as GATK HaplotypeCaller. The reference and alternative read counts for each tumor sample can be retrieved using samtools mpileup.

We provide a Python script, getPileUp.py, bundled with the package, to help generate the desired SNV file. Locate the installed copy with:

system.file("scripts", "getPileUp.py", package = "pictographPlus")

and invoke it from the shell (substitute the path printed above for <path>):

python <path>/getPileUp.py -v haplotype.vcf -b sample1.bam sample2.bam ... \
                           -o outputDir -f hg38.fa [--minreads] [--vaf]
Parameter Description Option
-v VCF output from HaplotypeCaller Required
-b Tumor BAM files for tumor samples Required
-o Output directory Required
-f Human reference genome Required
–minreads Minimum read count for both ref and alt to keep a site Default: 3
–vaf Minimum and maximum VAF for normal heterozygous sites Default: 0.1 0.9

This script creates a folder outputDir/pileup, which contains a germline_het.txt file and all pileup_summary.txt files. These files are used to extract the germline heterozygous positions and convert them to the SNV file format described above. The resulting germline_SNV.csv can be used as input for PICTographPlus.

Note: Ensure that the BAM file names (e.g., sample1.bam) match the sample names (e.g., sample1) used in the SSM and CNA files. If needed, rename the columns in germline_SNV.csv so they match the sample names in your SSM and CNA files.

2.1.2 Two csv files, one for SSM and one for CNA

When to use: if you are using a copy number caller that also includes BAF information. e.g. facets.

The SSM file should contain columns “sample”, “mutation”, “total_reads”, “alt_reads”, “chrom”, “start”, and “end”. The “purity” column with be optional.

head(read.csv(system.file('extdata/examples/example3_snv_with_purity.csv', 
                          package = 'pictographPlus')))
#>    sample mutation total_reads alt_reads chrom start  end purity
#> 1 sample1     mut1         100        67  chr1    57   57    0.8
#> 2 sample1     mut2         100        67  chr1   110  110    0.8
#> 3 sample1     mut3         100        67  chr1   167  167    0.8
#> 4 sample1     mut4         100        40  chr1   386  386    0.8
#> 5 sample1     mut5         100        40  chr1   441  441    0.8
#> 6 sample1     mut6         100        40  chr1  1276 1276    0.8

The CNA file should contain columns “sample”, “chrom”, “start”, “end”, “tcn” and “baf”. Depending on the copy number caller used, the users may need to convert the numbers to the desired format.

For example, if using facets for copy number calls, users can get tcn by using the formula \(tcn = ploidy * 2^{cnlr.median.clust}\), and baf using the formula \(baf = 1 / (1 + 1 / 2 ^ {mafR.clust})\)

head(read.csv(system.file('extdata/examples/example3_cna.csv', 
                          package = 'pictographPlus')))
#>    sample chrom start  end tcn       baf
#> 1 sample1  chr1     1  200 3.6 0.2777778
#> 2 sample2  chr1     1  200 3.4 0.2941176
#> 3 sample3  chr1     1  200 3.4 0.2941176
#> 4 sample3  chr1   401 1500 1.4 0.2857143
#> 5 sample3  chr1  7001 7200 2.6 0.3846154

2.1.3 A single csv file that contains SSM and CNA information

The last option is to provide a single csv file that contains at least columns named “sample”, “mutation”, “total_reads”, “alt_reads”, “tumor_integer_copy_number”, and “cncf”. Set cncf to 0 if a mutation has no copy number alteration. Users can also provide an optional column “major_integer_copy_number” that provides the information of the integer copy number of the major allele. If “major_integer_copy_number” is not provided, it will be estimated using an internal function built in the package. Another optional column is “purity” column that provides the information of normal contamination of a sample.

NOTE: using this option will generate trees with SSMs only, CNA will not be assigned to clusters but only used for VAF correction.

head(read.csv(system.file('extdata/examples/example2_snv_with_purity.csv', 
                          package = 'pictographPlus')))
#>    sample mutation total_reads alt_reads tumor_integer_copy_number
#> 1 sample1     mut1         100        67                         4
#> 2 sample1     mut2         100        67                         4
#> 3 sample1     mut3         100        67                         4
#> 4 sample1     mut4         100        40                         2
#> 5 sample1     mut5         100        40                         2
#> 6 sample1     mut6         100        40                         2
#>   major_integer_copy_number cncf purity
#> 1                         3  0.8    0.8
#> 2                         3  0.8    0.8
#> 3                         3  0.8    0.8
#> 4                         1  0.0    0.8
#> 5                         1  0.0    0.8
#> 6                         1  0.0    0.8

2.2 Input data for bulk RNA expression

The RNA file should be a csv file of columns Gene, followed by the tumor samples (tumor sample name should match that of the genomic input), and lastly the read counts of a matched normal sample (optional). The read counts should be raw counts.

head(read.csv(system.file('extdata/examples/rna_example.csv', 
                          package = 'pictographPlus')))
#>      Gene sample1 sample2 sampleN
#> 1    A1BG       0       5       0
#> 2    A1CF       0       0       0
#> 3     A2M      21      50       9
#> 4   A2ML1       0       0       0
#> 5 A3GALT2       0       0       0
#> 6  A4GALT       1       1       0

3. Run PICTographPlus

3.1 Running PICTographPlus in one step

PICTographPlus can be run using the function runPICTographPlus, which runs both tumor evolution reconstruction and clone-specific transcriptomic profile deconvolution. The required files include files for genomic data and RNA expression data.

runPICTographPlus(mutation_file, copy_number_file, SNV_file, rna_file, outputDir) # if using input format 2.1.1
runPICTographPlus(mutation_file, copy_number_file, rna_file, outputDir) # if using input format 2.1.2
runPICTographPlus(mutation_file, rna_file, outputDir) # if using input format 2.1.3

The default deconvolution model is tree_delta (λ = 0.05), recommended for most use cases when a matched normal sample is available. To use a different model:

runPICTographPlus(mutation_file, rna_file, outputDir, model = "plain", lambda = 0.10)

3.2 Running PICTographPlus in multiple steps

Alternatively, the user can run tumor evolution reconstruction and bulk RNA deconvolution in separate steps.

Tumor evolution reconstruction can be run using:

runPictograph(mutation_file, copy_number_file, SNV_file, outputDir) # if using input format 2.1.1
runPictograph(mutation_file, copy_number_file, outputDir) # if using input format 2.1.2
runPictograph(mutation_file, outputDir) # if using input format 2.1.3

Bulk RNA expression deconvolution can be run using:

runDeconvolution(rna_file, treeFile, proportionFile, outputDir)

where the treeFile, and proportionFile are outputs of runPictograph. Users may also choose other tumor evolution reconstruction tools. File formats can be found in section 6.

The deconvolution model and regularisation strength λ can be specified:

# Default: tree_delta, lambda = 0.05
runDeconvolution(rna_file, treeFile, proportionFile, outputDir,
                 model = "tree_delta", lambda = 0.05)

# Plain Laplacian, optimised for Pearson recovery
runDeconvolution(rna_file, treeFile, proportionFile, outputDir,
                 model = "plain", lambda = 0.10)

GSEA analysis using fgsea can be run using:

X_optimal <- read.csv(paste0(outputDir, "/clonal_expression.csv"),
                      row.names=1, check.names=FALSE)

runGSEA(X_optimal, outputDir, treeFile, GSEA_file)

where treeFile is the DNA-derived clonal tree from runPictograph (tree.csv) and GSEA_file is a text file of pathways of interest from MSigDB.

Note: GSEA always uses the DNA-derived clonal tree (treeFile / tree.csv) to define edges for pathway enrichment, even when deconvolution was run with use_star_tree = TRUE. The star topology is used only during deconvolution to avoid over-fitting to a potentially mis-inferred tree; once clone expressions are estimated, the true phylogeny should guide the edge-wise differential expression.


4. Parameters

The parameters are for function bash runPICTographPlus and the indicated module.

Parameter Description default module
mutation_file a csv file for SSMs see section 2.1 runPictograph
copy_number_file a csv file for CNAs see section 2.1 runPictograph
SNV_file a csv file for germline heterozygous SNVs see section 2.1 runPictograph
outputDir output directory to save all outputs NULL runPictograph runDeconvolution runGSEA
sample_presence separate mutations based on pattern of presence in all samples TRUE runPictograph
score scoring function to estimate number of clusters. silhoutte or BIC silhuette runPictograph
max_K maximum number of clusters 10 runPictograph
min_mutation_per_cluster minimum number of mutations in each cluster 5 runPictograph
min_cluster_thresh minimum MCF for each cluster 0.05 runPictograph
cluster_diff_thresh threshold for MCF difference to merge two clusters 0.05 runPictograph
n.iter number of iterations by JAGS 5000 runPictograph
n.burn number of burns by JAGS 1000 runPictograph
thin number of thin by JAGS 10 runPictograph
mc.cores number of cores used for parallel computing; not applicable to Windows 8 runPictograph
inits additional parameters for JAGS list(“.RNG.name” = “base::Wichmann-Hill”,“.RNG.seed” = 123) runPictograph
LOH whether to include copy-neutral LOH event in CNA FALSE runPictograph
driverFile list of driver genes used for visualization NULL runPictograph
cnv_min_length minimum length of CNA for it to be included in the analysis 1000000 runPictograph
tcn_normal_range range of total copy number considered as copy-neutral c(1.75,2.3) runPictograph
filter_cnv whether or not to filter CNAs TRUE runPictograph
smooth_cnv whether or not to process copy number alterations across samples to unify the segment start and end postions TRUE runPictograph
autosome to only include autosome TRUE runPictograph
rna_file bulk RNA file in integer read counts; rows are samples and columns are genes Required runDeconvolution
treeFile tumor clone tree file NULL runDeconvolution runGSEA
proportionFile subclone proportion file NULL runDeconvolution
normalize normalize the raw count using DESeq2 TRUE runDeconvolution
purityFile tumor purity file NULL runDeconvolution
lambda regularisation strength λ for deconvolution 0.05 runDeconvolution
model deconvolution model (see section 3.3) “tree_delta” runDeconvolution
lambda_l2 fixed L2 Laplacian weight for elastic_net model 0.01 runDeconvolution
n_iter IRLS iterations for adaptive/adaptive_v2 models 5 runDeconvolution
verbose print solver convergence progress FALSE runDeconvolution
GSEA_file geneset file in MSigDB .gmt format NULL runGSEA
top_K top_K significant pathways to be plotted as GSEA results 5 runGSEA
n_permutations number of permutations in fgsea 10000 runGSEA

3.3 Choosing a deconvolution model

Seven deconvolution models are available. All models minimise the reconstruction loss ‖Y − ΠX‖² with different regularisation strategies. Benchmarking was performed on 320 synthetic pseudo-bulk replicates (P7 tumour, 4 purity levels × 4 sample counts × 20 seeds, max tree depth = 5, K = 8 clones).

Model selection quick-reference:

Scenario Recommended model Why
Default — matched normal sample available elastic_net (λ=0.01) Best synthetic F1 (0.347) and Sensitivity (0.368).
With-normal, if interpretability favoured tree_delta (λ=0.05) Nearly-tied F1 (0.339) with an explicit tree-structured prior; also strongest on with_extnorm.
With-normal, prioritise precision / low-FDR adaptive (λ=0.50) Highest MCC in with_normal (0.248).
Tumor-only (no normal reference) adaptive_v2 (λ=0.50) Best F1 (0.348), Sensitivity (0.360), MCC (0.256).
External (population-average) normal only tree_delta (λ=0.05) Best F1 (0.293) and Sensitivity (0.276).
Highest absolute expression recovery (Pearson r) plain (λ=0.10) Top star-topology Pearson r (0.942) among 7 models.

λ selection guidance: - The star-best λ values above are derived from the benchmarking study and are topology-agnostic (selected on a star topology to avoid over-fitting to a known tree). - If you have a matched normal sample, use model = "elastic_net", lambda = 0.01; for interpretability prefer model = "tree_delta", lambda = 0.05. - If no normal is available, use model = "adaptive_v2", lambda = 0.50. - fused_ew and elastic_net require lambda > 0; they will error at lambda = 0. - The cost of using star-best λ versus true-tree-best λ is at most ≈0.008 in Pearson r across all models.

5. Output files

File name Description
mcf.csv estimated MCF for each cluster in each sample
clusterAssign.csv cluster assignment of each SSM/CNA to each cluster
CN_results.csv estimation of the integer and major copy number of CNAs
tree.csv the tree with the highest score; all trees with tied highest score is available under all_trees directory.
subclone_proportion.csv estimated proportion of each cluster in each sample
purity.csv estimated purity for each sample, based on the tree structure
tree.png the image of a tree with the best score
upsetR.png the mutation profiles between samples; only available if number of samples is bigger than 1.
mcf.png the MCF chain trace from JAGS
mutationClusterAssign.csv table of all mutations information in all samples
clone_expression.csv clone level gene expression for each clone
GSEA directory contains all files from GSEA analysis

6. Additional file format

6.1 treeFile

A CSV file describing the tumor evolution tree used by the functions runDeconvolution() and runGSEA().

edge parent child
root->1 root 1
1->2 1 2

6.2 proportionFile

A CSV file of subclone proportions used by runDeconvolution(). Each row corresponds to a clone, and each column to a sample.

sample1 sample2
1 0.12 0.30
2 0.82 0.70

6.3 purityFile

A CSV file of tumor purity used by runDeconvolution().

sample1 sample2
0.70 0.50

6.4 driverFile

A CSV file that contains driver mutation information used for plotting by runPictograph().
If you use this file, ensure that the corresponding mutation file follows the format gene_extraInformation, where gene is the actual gene name. The gene_type column can be oncogene or tumor_suppressor (leave blank if unknown).

gene chrom start end gene_type
KRAS chr12 25205246 25250936 oncogene
SMAD4 chr18 51028528 51085045 tumor_suppressor

6.5 cytobandFile

A TSV file of cytoband information used for plotting by runPictograph().

chr start end band stain
chr1 0 2300000 p36.33 gneg
chr1 2300000 5300000 p36.32 gpos25