PICTographPlus is a computational tool that integrates bulk DNA and RNA sequencing data to:
The tool infers tumor clonal evolution from single or multi-region sequencing data by modeling the uncertainty of mutation cellular fraction (MCF) in small somatic mutations (SSMs) and copy number alterations (CNAs). Using a Bayesian hierarchical model, it assigns SSMs and CNAs to subclones, reconstructing tumor evolutionary trees that adhere to principles of lineage precedence, sum condition, and optional constraints based on sample presence. For deconvolution, PICTographPlus integrates tumor clonal tree structures with clone proportions across samples to resolve bulk gene expression data. It optimizes an objective function that minimizes discrepancies between observed and predicted sample-level gene expression while imposing a smoothness penalty, ensuring that closely related clones display greater gene expression similarity. Lastly, the tool conducts pathway enrichment analysis to identify statistically significant alterations in pathways connecting tumor clones.
PICTographPlus takes input data in multiple formats for flexible user inputs:
When to use: if you are using a copy number caller that only outputs log2 ratio or total copy number without BAF information. e.g. cnvkit.
The SSM file should contain at least sample, mutation, total_reads, alt_reads, chrom, start, and end columns, with the purity column being optional.
head(read.csv(system.file('extdata/examples/example4_snv_with_purity.csv',
package = 'pictographPlus')))
#> sample mutation total_reads alt_reads chrom start end purity
#> 1 sample1 mut1 100 67 chr1 57 57 0.8
#> 2 sample1 mut2 100 67 chr1 110 110 0.8
#> 3 sample1 mut3 100 67 chr1 167 167 0.8
#> 4 sample1 mut4 100 40 chr1 386 386 0.8
#> 5 sample1 mut5 100 40 chr1 441 441 0.8
#> 6 sample1 mut6 100 40 chr1 1276 1276 0.8The CNA file should contain sample, chrom, start, end, and tcn columns. The total copy number (tcn) can be inferred from many copy number callers. In case a copy number caller outputs the \(log2\) ratio of a segment, the tcn can be calculate using \(tcn = 2 * 2^{\log_2 R}\).
head(read.csv(system.file('extdata/examples/example4_cna.csv',
package = 'pictographPlus')))
#> sample chrom start end tcn
#> 1 sample1 chr1 1 200 3.6
#> 2 sample2 chr1 1 200 3.4
#> 3 sample3 chr1 1 200 3.4
#> 4 sample3 chr1 401 1500 1.4
#> 5 sample3 chr1 7001 7200 2.6The SNV file contains the count of the heterozygous germline SNVs that has the information about the “chrom”, the “position” of the germline heterozygous SNV, “ref” and “alt” allele, and the reference and altenative reads counts in germline (normal) sample as well as all other samples. Note: the sample name (sample1, sample2, … etc.) should matched the sample name used in the SSM and CNA file.
head(read.csv(system.file('extdata/examples/example4_SNP.csv',
package = 'pictographPlus')))
#> chroms position ref alt germline_ref germline_alt sample1_ref sample1_alt
#> 1 chr1 178 A C 50 50 72 28
#> 2 chr1 107 A C 50 50 72 28
#> 3 chr1 84 A C 50 50 28 72
#> 4 chr1 31 A C 50 50 28 72
#> 5 chr1 18 A C 50 50 72 28
#> 6 chr1 165 A C 50 50 28 72
#> sample2_ref sample2_alt sample3_ref sample3_alt
#> 1 71 29 71 29
#> 2 71 29 71 29
#> 3 29 71 29 71
#> 4 29 71 29 71
#> 5 71 29 71 29
#> 6 29 71 29 71The heterozygous germline positions can be obtained using tools such as GATK HaplotypeCaller. The reference and alternative read counts for each tumor sample can be retrieved using samtools mpileup.
We provide a Python script, getPileUp.py, bundled with
the package, to help generate the desired SNV file. Locate the installed
copy with:
and invoke it from the shell (substitute the path printed above for
<path>):
python <path>/getPileUp.py -v haplotype.vcf -b sample1.bam sample2.bam ... \
-o outputDir -f hg38.fa [--minreads] [--vaf]| Parameter | Description | Option |
|---|---|---|
| -v | VCF output from HaplotypeCaller | Required |
| -b | Tumor BAM files for tumor samples | Required |
| -o | Output directory | Required |
| -f | Human reference genome | Required |
| –minreads | Minimum read count for both ref and alt to keep a site | Default: 3 |
| –vaf | Minimum and maximum VAF for normal heterozygous sites | Default: 0.1 0.9 |
This script creates a folder outputDir/pileup, which
contains a germline_het.txt file and all
pileup_summary.txt files. These files are used to extract
the germline heterozygous positions and convert them to the SNV file
format described above. The resulting germline_SNV.csv can
be used as input for PICTographPlus.
Note: Ensure that the BAM file names (e.g.,
sample1.bam) match the sample names (e.g.,sample1) used in the SSM and CNA files. If needed, rename the columns ingermline_SNV.csvso they match the sample names in your SSM and CNA files.
When to use: if you are using a copy number caller that also includes BAF information. e.g. facets.
The SSM file should contain columns “sample”, “mutation”, “total_reads”, “alt_reads”, “chrom”, “start”, and “end”. The “purity” column with be optional.
head(read.csv(system.file('extdata/examples/example3_snv_with_purity.csv',
package = 'pictographPlus')))
#> sample mutation total_reads alt_reads chrom start end purity
#> 1 sample1 mut1 100 67 chr1 57 57 0.8
#> 2 sample1 mut2 100 67 chr1 110 110 0.8
#> 3 sample1 mut3 100 67 chr1 167 167 0.8
#> 4 sample1 mut4 100 40 chr1 386 386 0.8
#> 5 sample1 mut5 100 40 chr1 441 441 0.8
#> 6 sample1 mut6 100 40 chr1 1276 1276 0.8The CNA file should contain columns “sample”, “chrom”, “start”, “end”, “tcn” and “baf”. Depending on the copy number caller used, the users may need to convert the numbers to the desired format.
For example, if using facets for copy number calls, users can get tcn by using the formula \(tcn = ploidy * 2^{cnlr.median.clust}\), and baf using the formula \(baf = 1 / (1 + 1 / 2 ^ {mafR.clust})\)
head(read.csv(system.file('extdata/examples/example3_cna.csv',
package = 'pictographPlus')))
#> sample chrom start end tcn baf
#> 1 sample1 chr1 1 200 3.6 0.2777778
#> 2 sample2 chr1 1 200 3.4 0.2941176
#> 3 sample3 chr1 1 200 3.4 0.2941176
#> 4 sample3 chr1 401 1500 1.4 0.2857143
#> 5 sample3 chr1 7001 7200 2.6 0.3846154The last option is to provide a single csv file that contains at least columns named “sample”, “mutation”, “total_reads”, “alt_reads”, “tumor_integer_copy_number”, and “cncf”. Set cncf to 0 if a mutation has no copy number alteration. Users can also provide an optional column “major_integer_copy_number” that provides the information of the integer copy number of the major allele. If “major_integer_copy_number” is not provided, it will be estimated using an internal function built in the package. Another optional column is “purity” column that provides the information of normal contamination of a sample.
NOTE: using this option will generate trees with SSMs only, CNA will not be assigned to clusters but only used for VAF correction.
head(read.csv(system.file('extdata/examples/example2_snv_with_purity.csv',
package = 'pictographPlus')))
#> sample mutation total_reads alt_reads tumor_integer_copy_number
#> 1 sample1 mut1 100 67 4
#> 2 sample1 mut2 100 67 4
#> 3 sample1 mut3 100 67 4
#> 4 sample1 mut4 100 40 2
#> 5 sample1 mut5 100 40 2
#> 6 sample1 mut6 100 40 2
#> major_integer_copy_number cncf purity
#> 1 3 0.8 0.8
#> 2 3 0.8 0.8
#> 3 3 0.8 0.8
#> 4 1 0.0 0.8
#> 5 1 0.0 0.8
#> 6 1 0.0 0.8The RNA file should be a csv file of columns Gene, followed by the tumor samples (tumor sample name should match that of the genomic input), and lastly the read counts of a matched normal sample (optional). The read counts should be raw counts.
head(read.csv(system.file('extdata/examples/rna_example.csv',
package = 'pictographPlus')))
#> Gene sample1 sample2 sampleN
#> 1 A1BG 0 5 0
#> 2 A1CF 0 0 0
#> 3 A2M 21 50 9
#> 4 A2ML1 0 0 0
#> 5 A3GALT2 0 0 0
#> 6 A4GALT 1 1 0PICTographPlus can be run using the function
runPICTographPlus, which runs both tumor evolution
reconstruction and clone-specific transcriptomic profile deconvolution.
The required files include files for genomic data and RNA expression
data.
runPICTographPlus(mutation_file, copy_number_file, SNV_file, rna_file, outputDir) # if using input format 2.1.1
runPICTographPlus(mutation_file, copy_number_file, rna_file, outputDir) # if using input format 2.1.2
runPICTographPlus(mutation_file, rna_file, outputDir) # if using input format 2.1.3The default deconvolution model is tree_delta (λ =
0.05), recommended for most use cases when a matched normal sample is
available. To use a different model:
Alternatively, the user can run tumor evolution reconstruction and bulk RNA deconvolution in separate steps.
Tumor evolution reconstruction can be run using:
runPictograph(mutation_file, copy_number_file, SNV_file, outputDir) # if using input format 2.1.1
runPictograph(mutation_file, copy_number_file, outputDir) # if using input format 2.1.2
runPictograph(mutation_file, outputDir) # if using input format 2.1.3Bulk RNA expression deconvolution can be run using:
where the treeFile, and proportionFile are outputs of
runPictograph. Users may also choose other tumor evolution
reconstruction tools. File formats can be found in section 6.
The deconvolution model and regularisation strength λ can be specified:
# Default: tree_delta, lambda = 0.05
runDeconvolution(rna_file, treeFile, proportionFile, outputDir,
model = "tree_delta", lambda = 0.05)
# Plain Laplacian, optimised for Pearson recovery
runDeconvolution(rna_file, treeFile, proportionFile, outputDir,
model = "plain", lambda = 0.10)GSEA analysis using fgsea can be run using:
X_optimal <- read.csv(paste0(outputDir, "/clonal_expression.csv"),
row.names=1, check.names=FALSE)
runGSEA(X_optimal, outputDir, treeFile, GSEA_file)where treeFile is the DNA-derived clonal tree from
runPictograph (tree.csv) and
GSEA_file is a text file of pathways of interest from MSigDB.
Note: GSEA always uses the DNA-derived clonal tree (
treeFile / tree.csv) to define edges for pathway enrichment, even when deconvolution was run withuse_star_tree = TRUE. The star topology is used only during deconvolution to avoid over-fitting to a potentially mis-inferred tree; once clone expressions are estimated, the true phylogeny should guide the edge-wise differential expression.
The parameters are for function bash runPICTographPlus
and the indicated module.
| Parameter | Description | default | module |
|---|---|---|---|
| mutation_file | a csv file for SSMs | see section 2.1 | runPictograph |
| copy_number_file | a csv file for CNAs | see section 2.1 | runPictograph |
| SNV_file | a csv file for germline heterozygous SNVs | see section 2.1 | runPictograph |
| outputDir | output directory to save all outputs | NULL | runPictograph runDeconvolution runGSEA |
| sample_presence | separate mutations based on pattern of presence in all samples | TRUE | runPictograph |
| score | scoring function to estimate number of clusters. silhoutte or BIC | silhuette | runPictograph |
| max_K | maximum number of clusters | 10 | runPictograph |
| min_mutation_per_cluster | minimum number of mutations in each cluster | 5 | runPictograph |
| min_cluster_thresh | minimum MCF for each cluster | 0.05 | runPictograph |
| cluster_diff_thresh | threshold for MCF difference to merge two clusters | 0.05 | runPictograph |
| n.iter | number of iterations by JAGS | 5000 | runPictograph |
| n.burn | number of burns by JAGS | 1000 | runPictograph |
| thin | number of thin by JAGS | 10 | runPictograph |
| mc.cores | number of cores used for parallel computing; not applicable to Windows | 8 | runPictograph |
| inits | additional parameters for JAGS | list(“.RNG.name” = “base::Wichmann-Hill”,“.RNG.seed” = 123) | runPictograph |
| LOH | whether to include copy-neutral LOH event in CNA | FALSE | runPictograph |
| driverFile | list of driver genes used for visualization | NULL | runPictograph |
| cnv_min_length | minimum length of CNA for it to be included in the analysis | 1000000 | runPictograph |
| tcn_normal_range | range of total copy number considered as copy-neutral | c(1.75,2.3) | runPictograph |
| filter_cnv | whether or not to filter CNAs | TRUE | runPictograph |
| smooth_cnv | whether or not to process copy number alterations across samples to unify the segment start and end postions | TRUE | runPictograph |
| autosome | to only include autosome | TRUE | runPictograph |
| rna_file | bulk RNA file in integer read counts; rows are samples and columns are genes | Required | runDeconvolution |
| treeFile | tumor clone tree file | NULL | runDeconvolution runGSEA |
| proportionFile | subclone proportion file | NULL | runDeconvolution |
| normalize | normalize the raw count using DESeq2 | TRUE | runDeconvolution |
| purityFile | tumor purity file | NULL | runDeconvolution |
| lambda | regularisation strength λ for deconvolution | 0.05 | runDeconvolution |
| model | deconvolution model (see section 3.3) | “tree_delta” | runDeconvolution |
| lambda_l2 | fixed L2 Laplacian weight for elastic_net model | 0.01 | runDeconvolution |
| n_iter | IRLS iterations for adaptive/adaptive_v2 models | 5 | runDeconvolution |
| verbose | print solver convergence progress | FALSE | runDeconvolution |
| GSEA_file | geneset file in MSigDB .gmt format | NULL | runGSEA |
| top_K | top_K significant pathways to be plotted as GSEA results | 5 | runGSEA |
| n_permutations | number of permutations in fgsea | 10000 | runGSEA |
Seven deconvolution models are available. All models minimise the reconstruction loss ‖Y − ΠX‖² with different regularisation strategies. Benchmarking was performed on 320 synthetic pseudo-bulk replicates (P7 tumour, 4 purity levels × 4 sample counts × 20 seeds, max tree depth = 5, K = 8 clones).
Model selection quick-reference:
| Scenario | Recommended model | Why |
|---|---|---|
| Default — matched normal sample available | elastic_net (λ=0.01) |
Best synthetic F1 (0.347) and Sensitivity (0.368). |
| With-normal, if interpretability favoured | tree_delta (λ=0.05) |
Nearly-tied F1 (0.339) with an explicit tree-structured prior; also strongest on with_extnorm. |
| With-normal, prioritise precision / low-FDR | adaptive (λ=0.50) |
Highest MCC in with_normal (0.248). |
| Tumor-only (no normal reference) | adaptive_v2 (λ=0.50) |
Best F1 (0.348), Sensitivity (0.360), MCC (0.256). |
| External (population-average) normal only | tree_delta (λ=0.05) |
Best F1 (0.293) and Sensitivity (0.276). |
| Highest absolute expression recovery (Pearson r) | plain (λ=0.10) |
Top star-topology Pearson r (0.942) among 7 models. |
λ selection guidance: - The star-best λ values above
are derived from the benchmarking study and are topology-agnostic
(selected on a star topology to avoid over-fitting to a known tree). -
If you have a matched normal sample, use
model = "elastic_net", lambda = 0.01; for
interpretability prefer model = "tree_delta",
lambda = 0.05. - If no normal is available, use
model = "adaptive_v2", lambda = 0.50. -
fused_ew and elastic_net require
lambda > 0; they will error at lambda = 0.
- The cost of using star-best λ versus true-tree-best λ is at most
≈0.008 in Pearson r across all models.
| File name | Description |
|---|---|
| mcf.csv | estimated MCF for each cluster in each sample |
| clusterAssign.csv | cluster assignment of each SSM/CNA to each cluster |
| CN_results.csv | estimation of the integer and major copy number of CNAs |
| tree.csv | the tree with the highest score; all trees with tied highest score is available under all_trees directory. |
| subclone_proportion.csv | estimated proportion of each cluster in each sample |
| purity.csv | estimated purity for each sample, based on the tree structure |
| tree.png | the image of a tree with the best score |
| upsetR.png | the mutation profiles between samples; only available if number of samples is bigger than 1. |
| mcf.png | the MCF chain trace from JAGS |
| mutationClusterAssign.csv | table of all mutations information in all samples |
| clone_expression.csv | clone level gene expression for each clone |
| GSEA | directory contains all files from GSEA analysis |
A CSV file describing the tumor evolution tree used by the functions
runDeconvolution() and runGSEA().
| edge | parent | child |
|---|---|---|
| root->1 | root | 1 |
| 1->2 | 1 | 2 |
A CSV file of subclone proportions used by
runDeconvolution(). Each row corresponds to a clone, and
each column to a sample.
| sample1 | sample2 | |
|---|---|---|
| 1 | 0.12 | 0.30 |
| 2 | 0.82 | 0.70 |
A CSV file of tumor purity used by
runDeconvolution().
| sample1 | sample2 |
|---|---|
| 0.70 | 0.50 |
A CSV file that contains driver mutation information used for
plotting by runPictograph().
If you use this file, ensure that the corresponding mutation file
follows the format gene_extraInformation, where
gene is the actual gene name. The
gene_type column can be oncogene or
tumor_suppressor (leave blank if unknown).
| gene | chrom | start | end | gene_type |
|---|---|---|---|---|
| KRAS | chr12 | 25205246 | 25250936 | oncogene |
| SMAD4 | chr18 | 51028528 | 51085045 | tumor_suppressor |
A TSV file of cytoband information used for plotting by
runPictograph().
| chr | start | end | band | stain |
|---|---|---|---|---|
| chr1 | 0 | 2300000 | p36.33 | gneg |
| chr1 | 2300000 | 5300000 | p36.32 | gpos25 |