TOSCA: an automated Tumor Only Somatic CAlling workflow for somatic mutation detection without matched normal samples


	2 Software description



	2.1 Implementation details

TOSCA is implemented using the Snakemake workflow management system (Köster and Rahmann, 2012) and conda environments. Snakemake ensures well-controlled and scalable execution of an end-to-end analysis of whole-exome sequencing (WES) and targeted panel sequencing (TS) data, starting from one or more FASTQ files with raw sequencing reads, either single-end or paired-end, representing distinct samples. A full description of the software integrated in TOSCA is given in Supplementary File S1, while a schematic representation of the whole workflow is given in Figure 1.
In the core step of TOSCA, variants are annotated and classified based on their status. The workflow can be executed in the absence of any kind of normal control (‘pure’ tumor-only mode) or with some normal specimens (‘hybrid’ mode). In the first case, TOSCA activates a custom tumor-only filtration strategy inspired by the decision-tree filtration algorithm developed by Sukhai et al. (2019). Our algorithm works in a similar way and is organized in three phases. The first phase applies quality filtration based on quality-pass and variant-type (non-synonymous) criteria. In the second phase, variants that passed the first phase are marked as germline if they are present in any of the four germline variant databases at a minor allele frequency of 1%, and are labeled as somatic if they are found in the COSMIC database, even if they are also found in a germline population database. Finally, in the third phase, variants are labeled as germline if they are reported as benign or likely benign in the ClinVar database. More details about the filtration process can be found in Supplementary File S1. Our workflow can also be run in ‘hybrid’ mode by providing unmatched normal samples in the configuration file. In this case, TOSCA runs all the steps, including the custom tumor-only filtration strategy, and produces the same outputs; in addition, for each sample, it estimates tumor purity and ploidy with the R package PureCN (Riester et al., 2016). This in turn yields a parallel assessment of somatic status that may outperform the custom filtration in terms of accuracy. PureCN is also able to deal with tumor-only WES data, as shown by Oh et al. (2020). The results obtained by these authors were highly concordant with matched normal WES data, demonstrating that it is possible to obtain reliable somatic mutational signatures from whole-exome data even in the absence of normal tissue controls.
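The three-phase tumor-only filtration described above can be illustrated with a minimal Python sketch. This is not TOSCA's actual code: the `Variant` class, its field names, the `classify` function and the output labels are hypothetical, and several details are assumptions made for illustration (a variant must both pass quality and be non-synonymous to survive phase 1, the 1% cutoff is applied as "greater than or equal to", the ClinVar check of phase 3 is applied to every variant not already labeled germline, and variants matching no rule are reported as somatic).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

MAF_THRESHOLD = 0.01  # 1% cutoff from the text; the ">=" comparison is an assumption
BENIGN_TERMS = {"benign", "likely benign", "benign/likely benign"}


@dataclass
class Variant:
    """Hypothetical container for the annotations used by the filter."""
    passed_quality: bool
    non_synonymous: bool
    # Population database name -> minor allele frequency (e.g. {"ExAC": 0.02})
    population_maf: Dict[str, float] = field(default_factory=dict)
    in_cosmic: bool = False
    clinvar_significance: Optional[str] = None


def classify(v: Variant) -> str:
    """Return 'filtered', 'germline' or 'somatic' for one variant."""
    # Phase 1: quality and variant-type filtration (assumed: both required).
    if not (v.passed_quality and v.non_synonymous):
        return "filtered"

    # Phase 2: common population variants are germline unless found in COSMIC.
    common = any(maf >= MAF_THRESHOLD for maf in v.population_maf.values())
    if common and not v.in_cosmic:
        return "germline"

    # Phase 3: benign or likely benign ClinVar entries are germline.
    significance = (v.clinvar_significance or "").replace("_", " ").lower()
    if significance in BENIGN_TERMS:
        return "germline"

    # Assumption: anything left is reported as a somatic candidate.
    return "somatic"
```

For example, under these assumptions `classify(Variant(True, True, {"ExAC": 0.05}, in_cosmic=True))` returns `"somatic"`, reflecting the rule that a COSMIC hit takes precedence over presence in a germline population database.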
Other tools can be easily exchanged for those listed above by modifying the Snakefile and/or the template analysis code.


	2.2 Input and output files

In order to run TOSCA, only two files must be prepared: a metadata file with the paths to the raw FASTQ data and a configuration file with all the parameters required for the computation. First, the user is prompted to choose which version of the human genome to use as reference, hg19 or hg38. It is possible to set additional thresholds and parameters, including variant allele fraction and sequencing depth. Default and suggested cutoffs for both WES and TS data are provided in Supplementary File S1. Tumor-only filtering also requires queries against multiple databases. TOSCA relies on the following four germline population databases: 1000 Genomes phase 3, Exome Sequencing Project (ESP; ESP6500SI-V2 dataset of Exome Variant Server, National Heart, Lung, and Blood Institute Grand Opportunity Exome Sequencing Project, Seattle, WA, USA), Exome Aggregation Consortium version r1 (ExAC) and Single-Nucleotide Polymorphism Database (dbSNP). Variants are also tested against COSMIC, a database of somatically acquired mutations, and against the germline/somatic database ClinVar (National Center for Biotechnology Information ClinVar). Overall, the information in these databases is of variable accuracy. For example, dbSNP, despite being considered a source of germline variants, also contains pathogenic variants, whereas COSMIC also contains some germline variants. Because dbSNP is relatively poor at correctly classifying germline variants in comparison with, for example, ExAC, these databases are assigned different priorities within the tumor-only filtration strategy.
When the workflow starts, TOSCA automatically downloads all the necessary data from the Ensembl FTP server, including the reference genome, annotations and databases. This operation may require some time, mainly depending on connection speed, but it is usually done only once, at the first use of the workflow, and is not repeated until the user decides to update their environment. The check inputs rule in the Snakefile can also be executed to make sure that all the input data and the parameters in the configuration file have been correctly specified.
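The kind of validation performed before a run can be sketched in Python as follows. This is only an illustration of the idea, not the check inputs rule shipped with TOSCA: the function name `check_inputs`, the configuration keys `genome`, `min_vaf` and `min_depth`, and the error messages are all hypothetical.

```python
import os
from typing import Dict, List

SUPPORTED_GENOMES = {"hg19", "hg38"}  # the two reference builds named in the text


def check_inputs(config: Dict, fastq_paths: List[str]) -> List[str]:
    """Return a list of human-readable problems (empty if everything looks fine)."""
    problems = []

    # Reference genome must be one of the supported builds.
    genome = config.get("genome")  # hypothetical key name
    if genome not in SUPPORTED_GENOMES:
        problems.append(f"unsupported genome: {genome!r}")

    # Variant allele fraction cutoff must be a fraction between 0 and 1.
    vaf = config.get("min_vaf")  # hypothetical key name
    if not isinstance(vaf, (int, float)) or not 0 <= vaf <= 1:
        problems.append(f"min_vaf must be between 0 and 1, got {vaf!r}")

    # Sequencing depth cutoff must be a non-negative integer.
    depth = config.get("min_depth")  # hypothetical key name
    if not isinstance(depth, int) or depth < 0:
        problems.append(f"min_depth must be a non-negative integer, got {depth!r}")

    # Every FASTQ path listed in the metadata must exist on disk.
    for path in fastq_paths:
        if not os.path.isfile(path):
            problems.append(f"missing FASTQ file: {path}")

    return problems
```

Collecting all problems in a list, rather than stopping at the first one, lets the user correct the metadata and configuration files in a single pass before launching the workflow.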
The output files, together with the benchmark and log files for each block, are stored in a directory specified by the user in the configuration file. The final result of somatic calling is combined with all the information retrieved from the databases and annotation files and saved as a text file, together with an HTML report produced by MultiQC (Ewels et al., 2016).


	Supplementary Material

