HemoMIPs—Automated analysis and result reporting pipeline for targeted sequencing data


	Design and implementation

The hemoMIPs pipeline enables hemophilia screening of (typically) 384 patients on a single Illumina NextSeq run.Fig 1 outlines the general workflow and the following sections describe data processing and analysis in more detail. All steps are implemented in the workflow management software Snakemake [7] and rely on Conda predefined environments to manage software dependencies and easy deployment.


	Primary sequence processing

The primary inputs are raw FastQ files from the sequencing run as well as a sample-to-barcode assignment (seeS1 Text). In primary processing, reads are converted to BAM format, demultiplexed (storing sample information as read group information), and overlapping paired-end reads are merged and consensus called [11].


	Alignment and MIP arm trimming

Processed reads are aligned to the reference genome (here GRCh37 build from the 1000 Genomes Project Phase II release) using Burrows-Wheeler Alignment (BWA) 0.7.5 mem [12]. As MIP arm sequence can result in incorrect variant identification (by hiding existing variation below primer sequence), MIP arm sequences are trimmed based on alignment coordinates and new BAM files are created. In this step, we are using MIP design files from MIPgen [5] by default. Alternative input formats are described inS1 Text. MIP representation statistics (text output file) are calculated from the aligned files. Further, reads aligning to the Y-chromosome-unique probes (SRY) are counted for each sample and reported (text output file).
In a separate alignment step, all reads are aligned to a reference sequence file describing only the structural sequence variants as mutant and reference sequences. Results are summarized over all samples with the number of reads aligning to each sequence contig in a text report.


	Coverage analysis and variant calling

Coverage differences between MIPs are handled by down sampling regions of excessive coverage. Variants are genotyped using GATK [13] UnifiedGenotyper (v3.4–46) in combination with IndelRealigner (v3.2–2). Alternatively, GATK v4.0.4.0 HaplotypeCaller is used in gVCF mode in combination with CombineGVCFs and GenotypeGVCFs.
The hemophilia datasets perform similar when run either with the GATK3 or GATK4 workflow. However, in low quality genotype calls the performance might vary and a different call set might be obtained. In a reanalysis performed on one of the hemophilia sequencing experiments, the sample specific genotype agreement is above 0.99 (36 different out of 64,308 genotype calls) between the two GATK versions, with high agreement in associated genotype qualities (S1 Fig,S1 Data). We therefore choose GATK4 as the standard setting for the workflow as this versions maintains support, is 50x faster and can be more easily upgraded.
Variant annotations of the called variants, including variant effect predictions and HGVS variant descriptions are obtained from Ensembl Variant Effect Predictor [14].


	Report generation

Different HTML reports are generated for visualization, interpretation and better access to all information collected in previous steps. There are two entry points to this information, organized as two different HTML reports–one summarizing all variant calls and MIP performance across samples and the other summarizing per-sample results in an overview table. The first report (summary.html,S2 Fig) provides a more technical sample and variant summary, per region coverage and MIP performance statistics. This report across samples can be used to assess assay performance (e.g. underperforming MIPs could be redesigned in future assays) and allows identification of suspiciously frequent variants (common variants or systematic errors).
The second report (report.html) provides an overview of results for each sample (Fig 2), highlighting putative deleterious variants and taking previously defined common/known benign variants out of focus (gray font). Additional information is provided about potential structural variants and incompletely covered regions. This table also provides an overall sample status field with information about passing and failing samples, as well as flags indicating outlier MIP performances.
Both reports provide links to individual report pages of each sample. The individual reports (ind_SAMPLENAME.html,S3 Fig), provide quality measures like overall coverage, target region coverage, read counts underlying the inferred sample sex and MIP performance statistics (over- or underperforming MIPs in this sample), but most importantly provide detailed information on the identified variants, structural variant call results and regions without coverage (potential deletions).


	Report tables in CSV format

In additional to the HTML output files for visualization, results are also presented in computer readable CSV format files. These CSV files can be joined by either the variant or sample specific identifier columns. The following results are summarized in the respective table files:
ind_status.csv outputs the sample sex inferred from SRY counts, reports outlier MIP performance, number of genotype (GT) calls, covered sites within the MIP design regions, average coverage, heterozygous sites, incompletely covered regions, deletions as well as a textual summary in a sample quality flag (e.g. OK, Failed Inversions, Check MIPs).
variant_calls.csv and variant_calls_benign.csv contain all or just benign variants, respectively, with location, genotype, quality scores, allelic depth, coverage and status information.
variant_annotation.csv provides additional annotations to called variants based on reference and alternative allele information. These annotations include gene name, exonic location, cDNA and CDS position, HGVS Transcript and Protein information, variant rsID, and 1000G allele frequency.
inversion_calls.csv contains count results for MIPs targeting predefined structural variants.
