Reproducible Bioinformatics Analysis Workflows for Detecting IGH Gene Fusions in B-Cell Acute Lymphoblastic Leukaemia Patients


	2. Materials and Methods

The RIGHT (Recovering IGH fusion Transcripts) workflow (Figure 2) was developed using the workflow manager Nextflow, version 21.10.6 [28]. Nextflow runs the implemented algorithms in parallel, namely STAR-Fusion, version 1.10.0 [24], Arriba, version 2.1.0 [25], and FusionCatcher, version 1.33 [29], thus reducing processing time and enabling reproducibility across distributed computing infrastructures. STAR-Fusion and Arriba are often used together owing to their high accuracy and short execution times on both simulated and real data [24,25]. Recommended best practice for mRNA-seq data analysis is to employ at least two distinct gene fusion calling algorithms and to intersect their results for improved accuracy [25]. The GRCh37 human reference genome [30,31] was used with the Ensembl reference transcript annotation [32].
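The practice of intersecting results from several fusion callers can be illustrated with a minimal Python sketch. It is not part of the RIGHT workflow itself: the function names, the orientation-independent matching of gene pairs, and the example calls are all assumptions made for illustration, and parsing of each caller's real output format is deliberately left out.

```python
from collections import Counter


def normalise(gene_a, gene_b):
    """Return an orientation-independent key for a gene pair (an assumption:
    IGH::CRLF2 and CRLF2::IGH are treated as the same event)."""
    return frozenset((gene_a.upper(), gene_b.upper()))


def consensus_fusions(calls_by_caller, min_callers=2):
    """Return the gene pairs reported by at least `min_callers` callers."""
    support = Counter()
    for pairs in calls_by_caller.values():
        # Count each pair once per caller, however many times it was reported.
        for key in {normalise(a, b) for a, b in pairs}:
            support[key] += 1
    return {key for key, n in support.items() if n >= min_callers}


# Hypothetical calls for a single sample, already parsed into gene pairs.
calls = {
    "STAR-Fusion": [("CRLF2", "IGH"), ("ETV6", "RUNX1")],
    "Arriba": [("IGH", "CRLF2"), ("IGH", "DUX4")],
    "FusionCatcher": [("IGH", "DUX4"), ("IGH", "CRLF2")],
}

consensus = consensus_fusions(calls)          # supported by two or more callers
strict = consensus_fusions(calls, min_callers=3)  # supported by all three
```

Raising `min_callers` trades sensitivity for specificity: in this invented example the two-caller consensus keeps both IGH events, whereas requiring all three callers keeps only the IGH and CRLF2 pair.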
To evaluate the ability of the workflow to efficiently analyse IGH fusions, a subset of 35 Australian B-ALL patient samples referred to the South Australian Health and Medical Research Institute (SAHMRI) (Adelaide, Australia) for genomic testing was selected (Supplementary Table S1), all of which were confirmed to harbour IGH fusions (IGH::CRLF2 n = 17, IGH::DUX4 n = 15, IGH::EPOR n = 3). These three gene fusions were chosen because they are the most common IGH fusions seen at SAHMRI. IGH::CRLF2 fusions were confirmed via fluorescence in situ hybridisation (FISH) using two separate break-apart probes, a Vysis IGH dual colour probe (Abbott) and a CRLF2 dual colour probe (Cytocell). Gene expression profiling was used to confirm IGH::DUX4 fusions, and visual inspection of BAM files in the Integrative Genomics Viewer (IGV) was needed to confirm the presence of IGH::EPOR (Supplementary Data S1). As a control, we also analysed a second subset of 25 B-ALL patient samples, comprising samples with non-IGH gene fusion events and non-translocated samples (Supplementary Table S9). These 60 samples were chosen from a previously published cohort of 180 B-ALL patient samples [33] that represented a range of common genomic lesions associated with ALL subtypes (EGA Accession: EGAS00001006460) [34]. Libraries for mRNA sequencing were prepared from 400 ng of total RNA using either the TruSeq Stranded mRNA LT Kit (Illumina, CA, USA) or the Universal Plus mRNA-Seq with NuQuant (Tecan, CA, USA), as per the manufacturers' instructions. The samples were then sequenced on either the Illumina HiSeq 2000 or NextSeq 500 platform, yielding paired-end (PE) reads with a length of 75 bases and an average read depth of 70 million reads.
To assess the sensitivity of the developed workflow, we processed the patient samples using the default parameters for each algorithm (Supplementary Data S2). From the initial output files, we compared the IGH fusion reporting rates of the callers and determined that STAR-Fusion had the lowest reporting rate. We then examined the output files that STAR-Fusion produced after each stage of filtering to determine why it reported fewer IGH fusions than the other callers. During filtering, STAR-Fusion produces intermediate output files, including pre- and post-blast-filter files, that list each gene fusion detected and whether it passed filtering. For fusions that did not pass, the files list the filtering step at which failure occurred and the reason. This allowed us to trace each IGH fusion and determine the filtering step at which it failed. We then adjusted the appropriate parameters in STAR-Fusion and reprocessed the samples to reassess the reporting rates (Supplementary Data S3). We used the GRCh37 reference genome because it is currently the primary reference used at SAHMRI when analysing B-ALL patient samples; the pipeline is also compatible with the newer reference genome, GRCh38.
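The reporting-rate comparison and the tracing of fusions lost during filtering can be sketched in Python as follows. This is an illustrative sketch only, not the code used in the study: the sample identifiers, caller labels, and fusion lists are invented, and the sketch assumes that each caller's output (and each pre- and post-filter file) has already been parsed into plain fusion names, since the exact file layouts are not described here.

```python
def reporting_rate(expected, reported):
    """Fraction of samples whose confirmed fusion was reported by a caller.

    expected: dict mapping sample ID -> confirmed fusion name
    reported: dict mapping sample ID -> set of fusion names the caller reported
    """
    if not expected:
        raise ValueError("no expected fusions supplied")
    hits = sum(
        1 for sample, fusion in expected.items()
        if fusion in reported.get(sample, set())
    )
    return hits / len(expected)


def lost_in_filtering(pre_filter, post_filter):
    """Fusions present before a filtering step but absent after it."""
    return sorted(set(pre_filter) - set(post_filter))


# Hypothetical confirmed fusions for four samples.
expected = {
    "S1": "IGH::CRLF2",
    "S2": "IGH::DUX4",
    "S3": "IGH::EPOR",
    "S4": "IGH::CRLF2",
}

# Hypothetical parsed output of two unnamed callers.
caller_a = {"S1": {"IGH::CRLF2"}, "S2": {"IGH::DUX4"}, "S3": {"IGH::EPOR"}}
caller_b = {"S1": {"IGH::CRLF2"}, "S2": {"ETV6::RUNX1"}}

rate_a = reporting_rate(expected, caller_a)  # 3 of 4 samples
rate_b = reporting_rate(expected, caller_b)  # 1 of 4 samples

# Hypothetical fusion lists for one sample before and after one filter step.
dropped = lost_in_filtering(
    pre_filter=["IGH::DUX4", "IGH::CRLF2", "ETV6::RUNX1"],
    post_filter=["ETV6::RUNX1"],
)
```

Applying `lost_in_filtering` to each consecutive pair of intermediate files would show the step at which a given IGH fusion disappears, which is the kind of information needed to decide which parameter to adjust.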
