ILIAD: a suite of automated Snakemake workflows for processing genomic data for downstream applications


	Implementation



	Pipeline architecture and configuration file

Genomic data processing poses a challenge for genetic research studies because it involves multiple program dependency installations, vast numbers of samples with raw data from various next-generation sequencing (NGS) platforms, and inconsistent genetic variant ID and/or positions among datasets. The Iliad suite of genomic data workflows automates the central steps in genomic data processing for several NGS data types with implementation through the Singularity [20] container system and Snakemake workflow management system. These systems form the basis of Iliad and account for its ease of distribution, reproducibility, and scalability to ultimately accommodate users with a simplified and standardized suite of workflows that are easy to implement.
All but one of the required program dependencies for Iliad are contained within a pre-built singularity image file available on Sylabs cloud (library://ncherric/iliad/igdp-container:latest) that is automatically pulled down into the workflow by Snakemake. A Docker container solution is also provided ( https://hub.docker.com/repository/docker/ncherric/iliad/general) with options for AMD64 or ARM64 architectures. The Illumina IAAP-CLI is not permitted for distribution, so the download link is accessible to Iliad and users via the configuration file. The Singularity definition file and Dockerfile used to build the containers are also provided with Iliad on Github and ReadTheDocs in case there are any user-specific modifications to build a custom version. The container base system is Ubuntu 20.04 [21] Linux and includes BWA [22] for read mapping, SAMtools [19] and PicardTools [23] for user-choice of sorting and compressing the alignment files, BCFtools for variant calling, +gtc2vcf BCFtools plug-in [24] for converting SNP array files, and miniconda [25] for creating rule-based conda environments as needed. The latest Iliad container contains up-to-date versions of SAMtools and BCFtools, however, users can download previous versions if preferred by using a tag that corresponds to the version of software (e.g. library://ncherric/iliad/igdp-container:v1.16).
Iliad functions under Snakemake best practices and takes advantage of several useful features. Iliad works as an inference pipeline where the user can specify the desired endpoint, and Snakemake will then infer which rules are to be run based on the following user input: the final invocation of the Snakemake command, existing input/output files in the workflow, and a primary configuration file. The desired endpoint is declared as the input to the “rule all” found in each of the independent workflows’ main Snakefile. This is already pre-set based on the goal of each workflow. There are certain Snakemake flags that will affect the starting point, such as ‘--forceall’ which triggers all rules to be executed whether there are existing files or not. In a scenario where Iliad has been invoked and progress was interrupted, the workflow will begin where it left off saving both time and resources. This offers users a flexible entry point into the workflow, meaning researchers can begin using Iliad even if they possess data files corresponding to input that is mid-stream of a workflow.
All Iliad workflows refer to the primary configuration ‘config.yaml’ file which has numerous controller variables with clearly denoted purposes. This file is the central point of customizing user needs. Many of the adjustments are binary conditions that are dependent on files a user may already possess. The most important configurations can be found at the top of this primary file. For example, adding the working directory path location of the Iliad directory is required. Once that single change is made in the configuration file, Iliad will be able to execute a demo based on tutorial data. This provides users with an instant glimpse into the workflow and how it works, making it easier to begin working with new data. A thorough how-to guide for each workflow and a demonstration of the tool is provided on “Read the Docs” ( https://iliad.readthedocs.io/en/latest/).


	Modularization

Iliad was developed as a suite of workflows using the modularization capabilities of Snakemake (Fig.1). It includes data-specific pipeline modules that are designed for raw sequence data (FASTQ), Illumina SNP array data (IDAT), and a common storage format for sequence alignment data (CRAM). Each of these data types are common sequence data files. The user is free to choose which of these data processing modules best suit their needs by executing the corresponding Snakefile(s). Each of these modules provide flexible start and end points depending on any pre-existing data files and the Snakemake flags included in the command line invocation. Additionally, Iliad features independent submodules for lifting over reference assembly genomic positions (GRCh37 to GRCh38 and vice versa) and merging multiple VCF files at once. These submodules can be used independently or combined within the raw sequence processing modules.
 


	Computational specifications

Development and benchmarking took place on Carbonate [26] and Ulysses, respectively. Carbonate is Indiana University’s large-memory computer cluster designed for data-intensive tasks. It is home to a cluster of 72 Lenovo NeXtScale nx360 M5 server compute nodes each with 12-core Intel Xeon E5–2680 v3 CPUs and 256 GB of RAM. There are additionally 8 large-memory compute nodes containing 512 GB of RAM, but those were not necessary for the purpose of Iliad. Ulysses is a remote server with more immediate access for the Department of Biology at Indiana University Indianapolis (IUI). Ulysses is comprised of two 16-core Intel Xeon Gold CPUs. Both Carbonate and Ulysses surpass the hardware necessary to run Iliad. It is also possible to execute Iliad workflows on a local machine such as a desktop or laptop equipped with Linux, MacOS, or Windows operating systems (OS) possessing at least 1 core, 16 GB of RAM, and enough disk space storage to sustain the amount of data needed. We recommend having as much computational storage as possible for sample processing, but this will vary depending on the nature of the research. We provide the minimum system requirements necessary to execute the demo tutorial. RAM and CPU usage metrics were collected using the collectl utility ( http://collectl.sourceforge.net/) and the built-in ‘benchmark’ declarative in Snakemake.


	Raw sequence read data workflow

The raw sequence read data module is comprised of 24 rules (Additional file 1: TableS1; Additional file 1: FigureS1), not including the rule ‘all’, which is designated as the driver of the workflow to provide the final desired output VCF. Each of these rules has been optimized using the “resource” directive, allotting for a specific time and memory request specific to each rule. Some rules may be branched into hundreds of jobs based on the number of samples. Job scheduling systems are more likely to run a queued job with smaller dedicated time and memory requests for smaller tasks. Some of the rules, such as downloading annotations files, are only necessary to be run once. Iliad will re-use those general annotation files for later runs and cache them so that the other Iliad modules have access and can skip redundant downloads. There are instances when input data, such as a reference genome assembly FASTA file, is already locally available on a system. If so, adding ‘true’ to the ‘IhaveReference’ variable in the configuration file will trigger its automatic use. Similarly with input FASTQ reads or CRAM alignment files, placing the data into the appropriate directory and declaring the ‘--ignore-incomplete’ flag in command line invocation is all that is required if a user does not need the automatic downloading feature.
The main workflow handles raw sequence data which can be used for reference or target data. The first advantage is a download checkpoint that uses ‘wget’ to acquire the user’s FASTQ data from the specified URL in the config file. Since many studies include multiplexed sequencing runs across many lanes, all the files associated with a particular sample name will be downloaded into a temporary ‘downloads’ folder where they will be accessed for a concatenation rule and output as one set of unzipped paired-end reads into the ‘fastq’ folder. If the user already has FASTQ data placed in the ‘fastq’ directory, the checkpoint will be satisfied. Quality control of the FASTQ data is performed via FASTQC [27], and reports are generated in HTML format which is a valuable step for users to check the raw sequence quality of downloaded data. Iliad will proceed with completing other rules based on the input data. Since read mapping is next for FASTQ data, the rule for obtaining the reference genome indicated in the configuration file will begin to produce the remaining and necessary input files that the read mapping rule requires. The reference genome is retrieved using a script modified from the reference wrapper, ‘0.74.0/bio/reference/’, in the “dna-seq-gatk-variant-calling” workflow found on the Snakemake workflow catalog [28]. With a specific reference assembly and corresponding index file, Iliad will then begin read mapping using the burrows-wheeler alignment (BWA) package (v0.7.17). The main workflow, then, pipes the BWA output (SAM file) to Samtools ‘sort’ and creates a sorted BAM file. Sorting of the file through a pipe eliminates the need for a higher storage capacity by reducing intermediate files. After this step, the main workflow implements the BCFtools variant caller to produce VCFs that contain the genotypic information. Due to user configuration included in this workflow, variant calling parameters set for BCFtools can be adjusted if the user wishes to set specific thresholds or flags for the ‘mpileup’ and ‘call’ algorithms. The same is true if application of the ‘norm’ flag is desired. This allows Iliad to keep up to date with any new developments in BCFtools variant calling scripts.
Rather than performing variant discovery and calling all possible variants found in the alignment file, Iliad utilizes a curated list of variants (n = 120,046,375) from the New York Genome Center [29] to perform genotyping as this list comprises of stable variants observed across the 1000 Genomes Project dataset. A rule in the main workflow automatically downloads these files from the associated FTP site [30]. There is one file per human chromosome, 23 in total. The next rule splits each chromosome into equally divided chromosomal regions to further mitigate the computational reading and writing time observed when using BCFtools ‘view’ on one chromosome file in its entirety. The number of chunked region files can be customized in the configuration file by the user to fit any system-specific requirements. Any ambiguity in chromosome naming conventions is also handled within this rule. The chunking methodology is used to drastically increase speed either in series or in parallel, however, using workers in parallel is the faster approach. We chose BCFtools as our variant calling software because it provides a wide array of filtering options and useful plugins that maximize user customization and data flexibility. Within the Iliad workflow this gives users the ability to modify BCFtools commands as needed, particularly when new versions of the software become available. It has consistently been one of the preferred performance-evaluated variant calling tools [31–34] for sequencing data whilst including a capability for analyzing other data types (i.e., microarray data). Furthermore, the implementation of BCFtools concatenation and merging features complement the chunking methodology to optimize VCF generation of multiple samples from whole genome sequence (WGS) data.


	Stored sequence read data workflow

Sequence alignment files (SAM or BAM), especially for the human genome, create impending hard disk storage challenges that can become quite costly. Therefore, compressed sequence alignment (CRAM) files are a very popular storage file format commonly found on publicly available project FTP sites such as 1000 Genomes Project [5], Human Genome Diversity Project [6], and Simons Genome Diversity Project [7]. The CRAM file format is continually undergoing modifications and updates to improve speed and accuracy [35] therefore this workflow is particularly up to date with developments in data compression. Iliad incorporates a stored sequence read data module that downloads desired open-source CRAM data from a server and performs the above-mentioned steps for variant calling on the retrieved files, just as it would perform variant calling on sorted BAM files in the raw sequence module. This is a critical module that supports in-house development of WGS reference panels and enables a fast and efficient addition of standard reference data sets that are publicly available. Iliad is one of the first Snakemake workflows that specifically manages the automation of CRAM to VCF data processing using BCFtools and user-controlled software flags. It is important to note, especially for new users, the exact same genome reference assembly that was used by the research group that generated the CRAM data is required. Iliad’s configuration file provides a binary variable to declare which reference genome assembly must be used and the reference genome file path if it must be supplied by the user. For example, variant calling CRAM files from HGDP [6] require the ‘GRCh38_full_analysis_set_plus_decoy_hla.fa’ reference genome.


	SNP array data workflow

An important feature of this pipeline is the ability to integrate the processing of SNP array data from its raw IDAT form. Typically for Illumina-specific SNP array files, the data must either be uploaded to a Windows only software (GenomeStudio) or utilize numerous command line applications. To simplify this entire process and seamlessly integrate Illumina SNP array data processing into our workflows, we containerized the open-source programs and included download steps for programs with end user license agreements, such as IAAP-CLI.
With the proper tools in the workflow environment, the procedure to obtain a VCF is facilitated by passing the raw data through the appropriate conversion steps. Initially, the data must be physically located in the “./Iliad/data/snp_array/idat/” directory, so that the IDAT files can be converted to GTC files using IAAP-CLI ‘gencall’. There are product and support files that assist this conversion which include manifest and cluster files. It is imperative that the user knows which reference assembly their study will need and to indicate as such in the configuration file for their automatic retrieval. The Infinium Multi-Ethnic Global-8 v1.0 microarray web links for product and support files corresponding to Homo sapiens GRCh37 and GRCh38 reference assemblies are included and may provide assistance to users locating their specific array support documentation. The correct product and support files will be used as input for follow-up third party tools to continue the conversion.
A BCFtools plugin, ‘+gtc2vcf,’ is included in the singularity container and is called to convert all sample GTC files into one VCF file located in the same directory (“./Iliad/data/snp_array/gtc/”). Illumina support files, acquired automatically, update obscure loci names to widely accepted rsIDs. Although this may be a sufficient endpoint for some analyses, we included a filtering step that will find overlapping rsIDs from the dbSNP annotation VCF from NCBI [36]. Users can configure which dbSNP file corresponds with their desired genome reference assembly to produce a quality VCF that can be easily joined with other genomic data using standard rsIDs. Finally, raw IDAT files provide metadata for GenTrain and ClusterSep scores for every variant and can be used to filter out calls of poor quality. At present, there are default upper and lower thresholds for GenTrain (0.7 and 0.67) and ClusterSep (0.45 and 0.4) built into the workflow, but these can be adjusted according to user preferences. Final outputs include a summary graph and a quality controlled VCF file.


	Submodules for additional VCF optimization

To best serve additional data processing applications, Iliad features several submodules that assist with build conversions and the merging of VCF files with other datasets available to the user e.g., reference data. Variant genomic positions largely depend on the reference assembly used for alignment, therefore datasets from different sources may have varying VCF ‘POS’ fields that inevitably represent the same SNP but cannot be merged correctly. The most comprehensive submodule functions as an automatic Lift-over and Merge task (Fig.2). Users simply “drag and drop” their datasets into the directory (‘./Iliad/data/vcf_Lift-and-Merge’), provide a project name and reference assembly preference in the configuration file, and list the files to merge in the text file (‘./Iliad/config/mergeTheseVCFs.txt’). Doing so results in a tidy project space dedicated to a specific project and build conversion. The submodule enables users to perform multi-VCF merges on compressed or decompressed data represented by Homo sapiens GRCh37 or GRCh38 positions. Once configuration variables have been set and the ‘Lift-and-Merge_Snakefile’ executed, the pipeline detects and updates genomic positions and naming conventions, then initiates the merge of autosomes and the X chromosome using BCFtools. The submodule also automatically detects which version (GRCh37 or GRCh38) each VCF file is before passing it to the correct processing channel for generation of the final build expressed by the user. Processes to filter the independent VCFs and quality check the final merged VCF then occurs, also based on user configuration (maximum SNP and individual missingness).
The Lift-over and Merge options also have split submodules for ease of use. There are two independent lift-over options that convert Homo sapiens GRCh37 positions to Homo sapiens GRCh38 positions and vice versa. The ‘liftoverTo37_Snakefile’ or ‘liftoverTo38_Snakefile’ must be passed to your Snakemake command and VCF files migrated to the ‘./Iliad/data/liftover’ directory. These perform a lift-over on VCF(s) but do not merge them. The data merging submodule can be performed on its own or with integration in the main sequence and SNP array modules with the execution of the ‘targetRefMerge_Snakefile’. This is especially useful when processing both reference and target data for a particular analysis.
 


	Availability and implementation

The Iliad genomic data pipeline is open source and can be found on GitHub ( https://www.github.com/ncherric/Iliad). It is easy and straightforward to setup on several operating systems using containers and is considered an ‘out-of-the-box’ suite of workflows thanks to the thorough documentation and visual how-to guides that complement Iliad ( https://iliad.readthedocs.io/). Program dependencies and external downloads of supplementary files are automatically facilitated by Iliad. This suite of genomic data processing pipelines was tested using Windows, MacOS, and HPC Linux systems using both Singularity and Docker containers.
 
