TMBur: a distributable tumor mutation burden approach for whole genome sequencing


	Methods



	Sequencing

Tumor specimens were collected from biopsies (needle core, endobronchial ultrasound) or tissue resections (Additional file2: Table S2). Solid tumor specimens were snap frozen and liquid biopsies were spun down. Pathology was reviewed and nucleic acids were extracted as described in Pleasance et al. [1]. Sequencing was performed on either HiSeq 2500 or HiSeq X instruments to target 80X coverage for the tumor samples and 40X coverage for the matched normal samples. The reference cell lines COLO829 and COLO829BL were obtained from American Type Culture Collection (ATCC), Manassas, VA, and sequenced to 80X and 40X coverage, respectively.


	TMBur pipeline

The TMBur pipeline, implemented in Nextflow [14], performs all analysis steps to generate TMB estimates from fastq files, including adapter trimming with fastp [15], alignment with BWA mem 0.7.17 [16], and alignment sorting and duplicate marking with Samtools 1.9 [17]. Somatic variants are identified using Manta 1.6.0 [18], Strelka 2.6.2 [19], and Mutect2 from GATK 4.0.10.0 [20]. Variants from these tools are intersected using RTGTools [21] to generate the calls used for further analysis. Microsatellite instability (MSI) is estimated using MSIsensor2 0.1 [14], while all annotation of variants is done against Ens75 [22] using SNPEff 4.3t [23]. Intersections of coordinate ranges are performed using bedtools 2.29.2 [24]. These tools are provided in a singularity 3.5.2–1.1.el7 [25] container downloaded by the workflow at runtime.
Calculation of TMB for somatic variant genomic subsets is as follows: ** Genome counts of variants that overlap with chromosomes 1-22XY are divided by the alignable space (non-N bases) in 1-22XY (n = 2667837836 bases).
** Coding counts of variants that intersect with CDS bases from Ens75 are divided by the total CDS bases in 1-22XY (n = 31990128 bases).
** Protein counts of variants with SnpEff impact rating of “HIGH” or “MODERATE” using Ens75 are divided by the total CDS bases in 1-22XY.
** Panel using methods outlined in [26], the count of variants is limited to the CDS of a set of cancer genes (Additional file3: Table S3) minus: (a) any variants intersecting with COSMIC [27] (b) nonsense SNVs and SNPEff impact ratings of “HIGH”, “MODERATE”, or “LOW” in tumor suppressor genes (Additional file4: Table S4).

Variant calling using TMBur is stable down to a normal depth of 30X, and a tumor depth of 50X with bioinformatics estimated tumor content values above 30%, based on tests performed using Mutect2 and Strelka2, components of TMBur (Additional file8: Fig. S1).


	Reproducibility at multiple sites

Analysis infrastructure at Canada’s Michael Smith Genome Sciences Centre at BC Cancer in Vancouver, Canada and The Hospital for Sick Children in Toronto, Canada, used CentOS 7 on Intel-based CPUs. Vancouver’s cluster scheduling was done with slurm, and Toronto’s was performed with Moab/Torque (Additional file5: Table S5).


	Clinical data collection and survival outcomes

The ICI-treated patient cohort’s complete treatment histories, response, and survival data were collected retrospectively using the BC Cancer Pharmacy database and chart review, and as described in Pender et al. [2]. Patients received either a single-agent ICI, combination ICIs, or a combination of ICI and chemotherapy. Follow-up was censored on March 1, 2019. Time to progression (TTP) was defined as the time from ICI initiation to the date of discontinuation due to progression.
Kaplan–Meier survival analysis was performed for TTP using the R packages survival (v3.1-8) and survminer (v0.4.7). For all survival analyses, patients were split into high and low groups based on a threshold of 10/Mb. Log-rank tests were used to calculate p values. Cox proportional hazards models were performed on 78 samples using the R packages survival (v2.42.3) and forest model (v0.5.0) individually for each TMB estimate combined with tumor type. Tumor types were only included if at least three patients with that tumor type were available. Counts from indel-only groups and MSI scores were excluded from these analyses, as the 10/Mb threshold was not appropriate.
