iCOMIC: a graphical interface-driven bioinformatics pipeline for analyzing cancer omics data


	MATERIALS AND METHODS

The first step towards developing iCOMIC was to compile a set of tools following best practices for DNA-Seq and RNA-Seq analysis. We identified the most widely used tools for alignment of reads, variant calling, variant annotation, expression modeling and differential expression analysis (Figure1, Table1) (17). Two of the most cited and widely used tools, FastQC ( https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and Cutadapt (18), were chosen for quality control.
Aligner tools incorporated in iCOMIC for DNA-Seq data include BWA-MEM (19), GEM-Mapper v3 (20) and Bowtie 2 (21). Apart from the three major steps involved in analyzing whole-genome sequence data, several preprocessing steps are carried out intermediately following the GATK framework. This included sorting and indexing alignment files generated by the aligner, marking duplicates using Picard markduplicates ( https://github.com/broadinstitute/picard), and base quality score recalibration followed by filtration of the variants called. Variant callers include GATK Mutect2 (22), samtools mpileup (23), FreeBayes (24) and GATK Haplotype-Caller (25). Mutect2 identifies variants in normal-tumor pairs while the rest of the tools perform variant calling in a given sample analogous to the reference genome. SnpEff (26) and ANNOVAR (27) were the tools selected for variant annotation. The tool MultiQC (28) was used to aggregate the results obtained from the numerous tools in the workflow. In the case of whole genome/exome data, the MultiQC report was generated based on the results from tools such as FastQC and SnpEff. Two cancer-related data analysis tools, NBDriver (29), which uses a machine learning approach to identify the context of mutations, and cTaG (30), a tool to predict whether a given gene is a tumor suppressor (TSG) or an oncogene (OG), were also incorporated in the final pipeline (Figure2).
For RNA-seq data, HISAT2 (31) and STAR (32) were selected to perform alignment following quality control. Expression modellers such as StringTie (33) and HTSeq (34) were included to count the number of reads aligned to the genome. For differential expression analysis, ballgown (35) and DESeq2 (36) were incorporated. The MultiQC report for RNA-Seq data includes results from tools such as FastQC, Cutadapt and STAR. Normalization is inherently carried out by the tool used for differential expression (Figure3).
iCOMIC is embedded in the popular workflow management system, Snakemake (37). The analysis workflow has ‘rules’ as the building blocks, which describe the connection between the input and output (38) (Figure4). It enables easy connectivity between the different tools/software within the workflow. The ‘rules’ specify input files, output files, log files and wrapper/shell commands. The Snakemake wrapper scripts together with the conda environment manage the automated installation of software and their dependencies. Tools without a wrapper script are configured separately, and shell commands are used for its execution. Parallelization of workflows is managed by Snakemake. iCOMIC users have the privilege to choose the number of cores they need to run iCOMIC with, thereby allowing multiple jobs to be executed together (this option is given in the input tab). The number of cores provided by the user is passed as an input to each tool. The list of dependencies and the versions can be exported from the conda environment file ( https://github.com/RamanLab/iCOMIC/blob/main/icomic_env.yml).
The details of individual tools incorporated in iCOMIC are available here ( https://icomic-doc.readthedocs.io/en/latest/walkthrough.html).
According to the user's selection, appropriate rules are combined in a ‘Snakefile’ to generate the target output. The input file paths and parameters set by the user are automatically fed into a configuration file, referred to as the ‘config file.’ Multiple config files and Snakefiles are auto-generated for the quality check of the input data, generating genome index, and executing the main workflow.
We implemented the iCOMIC pipeline in the form of a Graphical User Interface (GUI). The GUI has been developed using PyQt5, a Python binding of the cross-platform GUI toolkit Qt. The GUI retrieves the user input files and the parameters and communicates with the Snakemake rules to set up the analysis. A Python wrapper binds together the PyQt5 GUI and the Snakemake workflow of iCOMIC. After completing the initial requirements and creating the conda environment, the iCOMIC GUI can be accessed using a single command, ‘icomic’. Additionally, running iCOMIC in Linux platform is supporte by docker.


	RESULTS AND DISCUSSION



	Major features of the pipeline

NGS data has become an indispensable tool for biological research, although the data analysis can be daunting for non-bioinformaticians. iCOMIC has been introduced to overcome this concern to a certain extent. It serves as a stand-alone end-to-end analysis toolkit for DNA-Seq and RNA-Seq data. The DNA-Seq component of iCOMIC supports both germline and somatic variant calling. In conventional analysis pipelines like Galaxy, workflows need to be built, whereas iCOMIC has various inbuilt pipelines that automatically transfer output from one tool to the next. iCOMIC provides an interactive and user-friendly GUI, specifically created to accommodate users with minimal programming expertise. On another note, iCOMIC allows expert bioinformaticians to perform analysis incorporating additional tools and advanced parameters, saving time building the pipeline. The steps to be followed for writing new rules for integrating additional tools are detailed in the documentation at https://icomic-doc.readthedocs.io/en/latest/. The GUI is embedded in a Python wrapper script that connects the Snakemake workflows. Users can select an array of tools from the predesigned combinations best suited for their requirements. The best connectivity between the tools has been taken into account to design these individual workflows. iCOMIC provides the user with the necessary means to replace modules or alter the pipeline. Furthermore, the conda environment ensures easy installation of tools and dependencies. A detailed description of the structure of GUI together with step-by-step screenshots of analyzing an example pipeline is provided in the Supplementary Methods [Sections 1–3].


	Prediction of tumour suppressor genes and oncogenes using the cTaG algorithm

The cTaG (classify TSG and OG) model identifies driver genes by classifying them as tumor suppressor genes (TSGs) and oncogenes (OGs). Given a cohort of samples, the pan-cancer model calculates ratio-metric features from somatic mutation data, capturing mutations' functional impact. Unlike other computational methods that use background mutation rate (BMR) to identify genes with a higher mutation rate as driver genes, cTaG captures the effect of a mutation on the gene's functionality. Methods using BMR are biased towards genes with high mutation rates (39), and we know that while few driver genes have a high mutation rate, most do not. The mutations in TSG and OG differ; we found nonsense mutations more commonly found in TSGs than OGs. The cTaG method contains binary classifiers that classify genes as TSG or OG. The genes are labeled as TSGs or OGs based on consensus across various models.
To build a pan-cancer model, the model was trained on somatic mutation data from COSMIC (v79) (40) from different cancer types. We used ratio-metric and entropy features to classify genes as TSG or OG. The cTaG model uses the random forest classification algorithm to generate the pan-cancer model. A filtered list of genes from the Cancer Gene Census (CGC) (41) is used to label genes as TSG or OG, used to train the pan-cancer model. The pan-cancer model successfully identified tissue-specific driver genes when employed on a cohort from single tissue of origin. The method takes a single maf file with annotated mutations for the cohort of samples as input and generates each gene's ratio-metric and entropy features. Additional arguments such as the percentile and threshold to define highly-mutated samples can be specified. The percentile argument defines the top percentile genes to be considered from predictions made by each model for the final consensus. The default value is 5. Highly mutated samples are skipped from the analysis. By default, samples with >2000 mutations are omitted during analysis. The cTaG model labels these genes as TSG and OG based on the consensus across the random forest models. The model returns the list of all genes and their labels predicted by each model, along with its presence in the top percentile. Our method identifies genes with high as well as low mutation rates. The pan-cancer model also predicts tissue-specific driver genes.


	Prediction of driver and passenger mutations using NBDriver

Differentiating between driver and passenger mutations from sequenced cancer genomes is essential to targeted therapy and precision medicine. Despite the dramatic advances in developing predictive algorithms to differentiate between driver and passenger mutations, very few have concentrated on utilizing the local sequence context as potential features for further analysis. To capture this information, we built a robust machine learning model called NBDriver, which uses raw nucleotide sequences surrounding cancer-causing mutations as features to build machine learning models.
Our training data consisted of missense mutations from 58 genes containing experimentally validated functional impacts from several studies. To obtain a numerical representation of the sequence features surrounding the mutations, we used commonly used natural language processing tools such as the TF-IDF Vectorizer, the Count Vectorizer and the One-Hot Encoder. Using kernel density estimation techniques, we showed that the underlying distributions of the neighbourhood sequences surrounding driver and passenger mutations are significantly different from one another. We utilized this information to build robust machine learning models using a repeated cross-validation strategy and report the median values of the performance metrics for each feature-classifier pair. To increase the prediction performance, we integrated sequence features derived from raw nucleotide sequences with other genomic, structural, and evolutionary features, resulting in the development of a pan-cancer mutation effect prediction tool, NBDriver, which was highly efficient in identifying pathogenic variants from five independent validation datasets. An ensemble predictor containing NBDriver, CONDEL (42) and MutationTaster (43) outperformed existing pan-cancer models in prioritizing a literature-curated list of driver and passenger mutations. Considering only the true positive mutation predictions from NBDriver, we identified a list of 138 driver genes with known functional evidence from multiple sources. Overall, our study underpinned the efficacy of utilizing raw nucleotide sequences as features for building robust machine learning models to distinguish between driver and passenger mutations.


	Benchmarking of tools



	Evaluating the performance of germline variant calling pipelines

Performance validation of the germline variant calling workflows available with iCOMIC was done using Genome In A Bottle (GIAB) benchmark sets, based on Zook et al. (44). To this end, we ran 18 different combinations of aligners, germline variant callers, and annotators integrated with iCOMIC on the widely used benchmark dataset NA12878/HG001 to obtain separate vcfs (Table2). Generated vcfs were then compared with GIAB/NIST HG001 v2.19 truth data, restricting the comparison to the GIAB v2.19 BED file coordinates. We adopted the benchmark framework developed by the Global Alliance for Genomics and Health (GA4GH) benchmarking team (45) for vcf comparison. The method involved the generation of an intermediate vcf with a standardized variant representation using Vcfeval by Real-Time Genomics (RTG) tools (46) followed by quantification of performance metrics using qfy.py, which is a part of hap.py benchmarking toolkit. The summary of accuracy measures is highlighted in Table2. The analysis was performed, specifying ten threads for running each tool for all the workflow combinations. iCOMIC provided the highest F1 score of 0.971 and 0.988 for indels and SNPs, respectively, for the BWA MEM - GATK HC pipeline among all the combinations (Table2). Time consumed for analyzing the GIAB benchmark set for BWA MEM—GATK HC—SnpEff pipeline is provided in Section 4a of the Supplementary Methods.


	Evaluating the performance of RNA-Seq pipelines

The validation of the performance of RNA-Seq workflows available with iCOMIC was done using the Human monocyte RNA-Seq dataset from NCBI-SRA (SRP082682). We ran four different combinations of aligners, expression modellers and differential expression tools integrated with iCOMIC on the RNA-Seq benchmark dataset to obtain differentially expressed genes between classical and non-classical monocytes. Comparison of differentially expressed genes between the genes identified and the non-classical and classical monocytes from the reference microarray dataset (GSE34515) obtained from the NCBI GEO database was performed. The fold change correlation(47) was calculated between reference microarray and RNA-Seq data. Based on the fold change correlation computed for the four different pipelines, it is evident that HISAT2-StringTie-ballgown and STAR-StringTie-ballgown pipelines performed the best with a correlation coefficient of r = 0.85, higher than the values obtained for HISAT2-HTSeq-DESeq2 and STAR-HTSeq-DESeq2 pipelines (Figure5). Time consumed for analyzing the benchmark dataset using STAR-HTSeq-DESeq2 is available in Section 4b of the Supplementary Methods.


	Feature comparison between iCOMIC and other bioinformatics pipelines

There exist many pipelines that integrate different bioinformatics tools for genomic data analysis. A comprehensive feature comparison with previously developed workflows (snakePipes, Sequanix, Omics Pipe, Galaxy (48), GenPipes (49), CANEapp, ARMOR (50), Galaxy, VIPER (51), systemPipeR (52), CLC Genomics Workbench and nf-core (53)) was performed to highlight the significance of iCOMIC (Table4). The accessibility of the pipeline through GUI, availability to the public, cloud support, the ability of automated execution of an entire pipeline, user freedom to choose tools of interest, and programming skills required for the analysis include some of the features used for comparison. Only those tools that share common features with iCOMIC were considered for comparison. In addition, a comparison with the most popular bioinformatics tools like Galaxy and CLC genomics workbench was performed. Galaxy, one of the customary scientific platforms for bioinformatics data analysis, does not afford inbuilt pipelines, whereas CLC genomics workbench is not open source. The comparative analysis highlights the open-source GUI with end-to-end automated analysis integrating a plethora of tools as the strongest aspects of iCOMIC.


	Performance comparisons between iCOMIC and GALAXY

Galaxy is a popular publicly available web-based interface consisting of many tool combinations and workflows to perform various studies. It offers ease of access to complex computational analyses to users with minimal programming expertise and the functionality to examine large datasets in a multi-step process. Considering the similarities in features between iCOMIC and Galaxy, pipeline validation was conducted using the same methods to establish a comparison scale for performance.
Pipelines are fully automated in iCOMIC and the most predominantly used workflows are pre-built. While workflows need to be specified in Galaxy in order to automate, iCOMIC comes with a set of preinstalled pipelines for both DNA and RNA-Seq analysis. In iCOMIC, the user has the option to specify a large number of files in the form of a table, while no such provision is available on Galaxy. Although extensive resources and tutorials are provided for the use of Galaxy, we believe it has a steeper learning curve than iCOMIC, which is mostly a point-and-click type application. Certain tools are unavailable on the Galaxy server and need to be installed from the tool shed, while iCOMIC installations are readily done via the conda environment, with minimal user input/interference.
While analysing multiple samples on a workflow in Galaxy, it is required to rename the files prior to passing it as input to the next tool in the pipeline which can be tedious when there is a large number of samples. On the other hand with iCOMIC, the entire process is automated. Undoubtedly, Galaxy has its advantages, and is a very popular pipeline for genomic data analysis. Yet, iCOMIC excels in its simplicity, and we believe it will be more inviting for biologists and clinical researchers to quickly analyse their data.


	Comparison in terms of germline variant calling

DNA-Seq analysis was performed using BWA-MEM, Freebayes, and SnpEff for each step in the analysis workflow using GIAB dataset. Performance validation for the same was done using a process similar to that of iCOMIC. The variant files obtained as output from this pipeline were compared with GIAB/NIST HG001 v2.19 truth data restricting the comparison to the GIAB v2.19 BED file coordinates. Vcfeval method was used for vcf comparison, and the quantification of performance metrics was computed using the hap.py algorithm. Comparison of the performance metrics such as F1 and precision scores between iCOMIC and Galaxy indicates that the values are pretty similar. Considering the SNP F1 score, iCOMIC has 0.988 and Galaxy has 0.980, proving that the performance of iCOMIC is on par with that of Galaxy (Table3).


	Comparison in terms of differential expression analysis

RNA-Seq analysis was performed using the tools STAR, HTSeq, and DESeq2 for each step in the analysis workflow using the Human monocyte RNA-Seq dataset from NCBI-SRA (SRP082682). The fold change correlation was calculated between the reference microarray dataset (GSE34515) obtained from the NCBI GEO database and RNA-Seq data for iCOMIC and Galaxy. The values were found to be 0.8 and 0.66, respectively. Comparison of the fold change correlation between iCOMIC and Galaxy indicates that the value of iCOMIC is higher than that of GALAXY in the particular pipeline (Figure6).


	Supplementary Material

