SNAPPy: A snakemake pipeline for scalable HIV-1 subtyping by phylogenetic pairing


	2. Implementation



	2.1 SNAPPy architecture

The SNAPPy pipeline was built on the Snakemake workflow management system (Koster and Rahmann 2012). Several tools/software were used to perform different tasks within this pipeline: MAFFT v7.245 (Katoh and Standley 2013) for multiple sequence alignment (MSA); the Biopython v1.72 (Cock et al. 2009) module SeqIO and the ETE Toolkit (ete3) v3.1.1 (Huerta-Cepas, Serra and Bork 2016) for data parsing and manipulation; BLAST v2.7.1 (Camacho et al. 2009) for local database search; IQ-TREE v1.6.9 (Nguyen et al. 2015) for phylogenetic inference. Other Python v3.6 (Van Rossum and Drake 2009) packages were also used to create tests, pytest (Krekel et al. 2019), and data manipulation, numpy (Oliphant 2006) and pandas (McKinney 2010). For package management and to create a contained environment for SNAPPy, we recommend Conda (Anaconda Software Distribution 2019), and provided a ready to use file to this end (‘environment.yaml’) as well as instructions on how to install and utilize it in SNAPPy’s documentation page (Araújo, Martins and Osório 2019).
The term ‘target’ used in this manuscript refers to the file currently being processed by SNAPPy. When used for subtyping, SNAPPy performs the following tasks to a given target sequence: 1) split the input in multiple single FASTA files; 2) alignment to the reference genome; 3) BLAST against a set of HIV-1 reference sequences; 4) perform phylogenetic inferences using the BLAST top hits, the target, and an outgroup sequence; 5) sliding window BLAST against a database of HIV-1 reference sequences; 6) concatenation and analysis of the results obtained in the previous tasks and creation of the output results.Fig. 1 is a schematic representation of this pipeline. At the end of the subtyping task, SNAPPy produces two files the ‘subtype_results.csv’ and the ‘report_subtype_results.csv’, corresponding to a simplified version of the subtyping result and a more extensive report of all the outputs created by SNAPPy.

Alternatively, as any other Snakemake (Koster and Rahmann 2012) pipeline, intermediate tasks can be performed without the execution of the entire pipeline, making SNAPPy extremely useful for HIV-1 MSA (Fig. 1, Point 7). To match the names of the intermediate files created by SNAPPy and the header of the target sequences, a file named ‘keys_and_ids.csv’ is created. An in-depth description of each of these general steps can be seen in the following sections.


	2.1.1 Reference sequences

In all instances of SNAPPy, the HXB2 (GenBank: K03455) reference genome was used as a genomic position reference. The outgroup sequence used in the phylogenetic analysis corresponds to the CONSENSUS_CPZ sequence from the HIV sequence database (Theoretical Biology and Biophysics Group 2019) (Alignment type: Consensus/Ancestral, Year: 2002, Organism: Other SIV, Alignment ID: S02CG1). The creation of a comprehensive set of HIV-1 reference sequences proved to be a challenge. We based our dataset creation on the previously curated ‘HIV sequence database 2010 subtype reference genomes sequences compendium’ (Kuiken et al. 2010) and the ‘HIV Drug Resistance Database reference sequences’ (Shafer 2006). The final subtype reference dataset for SNAPPy consisted of 491 genomic sequences. It included references for groups N, O, P, and within group M for Subtypes B, C, D, G, H, J, and K, Sub-subtypes A1, A2, F1, and F2 and CRFs until the number 99. Please notice that some CRFs are not represented due to lack of at least two high-quality genomes available (CRFs numbered 30, 66, 75, 76, 84, 89, 91, 95, 97, and 98). The full reference dataset (491 sequences) is used for the BLAST task (see Section 2.1.3) and a subset only containing groups, subtypes, Sub-subtypes and CRFs 1 and 2 references (56 sequences) is used in the sliding window BLAST (see Section 2.1.5). For more information on these reference datasets please consult theSupplementary Table S1.


	2.1.2 Alignment to reference

After splitting the MSA into several single sequence FASTA files, each of them is aligned to the HIV-1 reference genome (HXB2). The module SeqIO from Biopython (Cock et al. 2009) is used to parse and manipulate the FASTA files in SNAPPy. The alignment is done using MAFFT (Katoh and Standley 2013). The alignment method used does not allow the insertion of gaps in the reference sequence. After the alignment is performed, the target sequence is trimmed to only contain the genomic region specified by the user in the ‘config.yaml’ file. Being the currently available options ‘GAG’, ‘PR’, ‘RT’, ‘PR-RT’, ‘INT’, ‘POL’, ‘ENV’, and ‘GAG-POL-ENV’, which correspond to the HIV-1 genomic regions with the same names in the HXB2 reference genome. The resulting files are them written to the ‘aligned’ folder.


	2.1.3 BLAST

The obtained alignments are them BLASTed against a local database of 491 HIV-1 reference sequences (see Section 2.1.1). For this task, BLAST (Camacho et al. 2009) is used. The results were sorted by bitscore (considering higher is better) and the best scoring result is outputted in the ‘report_subtype_results.csv’ file in the column ‘closser_ref’. The BLAST results are also used to make three groups of references sequences: containing the first forty-eight results; containing the first forty-eight results of only subtype references; containing the first forty-eight results of only CRF references. These three groups of reference sequences are then used in the phylogenetic analysis. The selected number of sequences (forty-eight) showed good compromise between analysis time and reproducibility, as discussed in Section 2.1.4. Since this BLAST task is done as preparation step to the phylogenetic analysis, it must have high sensitivity, so no related reference sequences are missed. To achieve this, the word size parameter was set to 10. We also restricted the cutoff E-value to 1.0e-10 to avoid the creation of large output files, without restricting the results. The intermediate files of the BLAST analysis are outputted to the ‘blast’ folder, being available for further consulting. For the split in subtype and CRF references in this step of SNAPPy, CRFs 1 and 2 are treated as subtypes, not CRFs. This decision was made based on the high prevalence of these CRFs and their ambiguous origin (Gao et al. 1996;Abecasis et al. 2007).


	2.1.4 Phylogenetic inference

To the three previously selected groups of forty-eight references (see Section 2.1.3), the target sequence and a non-HIV-1 sequence for rooting (see Section 2.1.1) are added. Obtaining three sets of fifty sequences that will serve as inputs for the phylogenetic analysis. Groups of fifty sequences showed to be contained and yet a comprehensive set of sequences to perform the phylogenetic inference. To perform the phylogenetic analysis, IQ-TREE (Nguyen et al. 2015) was used with the general time reversible (GTR) nucleotide substitution model, empirical base frequencies (+F), and a discrete Gamma model with four rate categories (+G4), with the fast tree search mode, and zero as seed number. The ETE toolkit (Huerta-Cepas, Serra and Bork 2016) was applied to parse and manipulate the phylogenetic trees created within SNAPPy. After rooting on the outgroup; it is inferred if the target sequence belongs to a monophyletic clade with sequences of one, and only one, subtype or CRF. If this happened, we consider that there is phylogenetic evidence of the relationship between the target sequence and a reference subtype/CRF. The result of this inference, together with the support values for that node (Shimodaira-Hasegawa like approximate likelihood ratio test with 1,000 replicates test, as implemented in IQ-TREE), are then outputted to the ‘report_subtype_results.csv’ file. Resulting in six output columns: ‘node_all_refs’, ‘s_node_all_refs’, ‘node_pure_refs’, ‘s_node_pure_refs’, ‘node_recomb_refs’, and ‘s_node_recomb_refs’. The intermediate files of the phylogenetic analysis are outputted to the ‘trees’ folder. The notation ‘all’, ‘pure’, and ‘recomb’ refers to the set of references used for that phylogenetic reconstruction.


	2.1.5 Sliding window BLAST

The sliding window BLAST can be performed in parallel with the tasks described in Sections 2.1.3 and 2.1.4, being only dependent on the outputs from the Alignment to the reference task (Section 2.1.2). Initially, the positions in the target sequence corresponding to gaps (‘-’) are excluded. For this task, BLAST (Camacho et al. 2009) is used. The length of the sliding window was set to 400 nucleotides, a size previously reported to allow phylogenetic inference in HIV-1 (Pineda-Peña et al. 2013). Smaller fragments/sequences are not processed by this method. The step size used is fifty nucleotides, creating eight bins for each window. The result for each BLAST window, and consequently its eight bins, is the subtype of the top result (bitscore) reference sequence. If more than one sequence of different subtypes has the same top score, the output for all bins of that window is null (‘-’). If the method fails to produce an output, the result for all bins of that window is null. After all possible sliding windows have been BLASTed, several bins will have multiple outputs. Then, a majority rule is applied to decide the final subtype for that bin. In case of a tie, the result for that bin is null.Fig. 2 contains a schematic representation of this process. The database used to BLAST against in this task only contains the group, subtype, sub-subtype and CRF1 1 and 2 references, as described in the references section (fifty-six sequences). The word size parameter applied was 30, with the purpose of obtaining high specificity. Values higher than this showed to cause instability in our tests (low reproducibility). An E-value cutoff of 1.e-50 was used to ensure the generated outputs were not too large, without losing real BLAST hits. The outputs of this inference are written to the ‘report_subtype_results.csv’ file in the column ‘recomb_result’ and the resulting files from these tasks are outputted to the ‘blast’ folder.



	2.1.6 Decision rules

The results generated by the full sequence BLAST, phylogenetic analysis, and the sliding window BLAST are then processed using a set of rules to produce the final output. These rules are executed in order, meaning that the second rule is only applied if the first rule criteria were not met, and so forth. The list of these rules can be consulted inSupplementary Table S2. At the end of the process, two final outputs are created in the snappy folder: ‘subtype_results.csv’ and ‘report_subtype_results.csv’, as mentioned earlier.


	2.1.7 System management

SNAPPy is a pipeline that performs several tasks, some of them generate a relatively large number of outputs. Therefore, in order to avoid wasting unnecessary disk space and simplifying the user experience, some of the intermediate files produced by SNAPPy are deleted before the end of the process. However, all the relevant files for the subtyping decision are kept and available for consulting after the pipeline finishes. At the end of each SNAPPy run, a snakemake hidden folder named ‘.snakemake’ is deleted because it occupies a substantial amount of space. However, this folder contains all the logs about the tasks performed and may be useful for debugging.
SNAPPy is distributed with a series of built-in tests created using pytest (Krekel et al. 2019). After installing SNAPPy or after making alterations in SNAPPy’s folder is recommended to run the tests to infer if the pipeline is behaving as expected.
SNAPPy’s documentation (Araújo, Martins and Osório 2019) includes detailed instructions on: pipeline installation; tutorials on how to use it; a list of available commands; an in-depth description of each pipeline step; FAQ; and how to cite and user license information.


	2.2 Pipeline evaluation



	2.2.1 Reproducibility

One of the bases of SNAPPy is phylogenetic inference, which has some stochasticity involved. As implemented in IQ-TREE (Nguyen et al. 2015), the initial topologies are constructed based on heuristic methods, which afterwards is optimized with maximum-likelihood rearrangements. In theory, this could lead to variance in the output of the subtyping pipeline. Furthermore, for the evaluation of the branch support, we were faced with the possibility of using statistic-based (e.g. Shimodaira-Hasegawa test) or sampling-based (e.g. bootstrapping) methods. In our tests, we found that the usage of bootstrapping approaches leads to some lack of reproducibility in the pipeline outputs, even at one thousand replicates, as previously reported (Pineda-Peña et al. 2013). Which may be explained by the low percentage of informative sites found in the tested HIV-1 sequences, which are extremely similar among each other. The increment of the bootstrapping replicates would lead to an exponential increment of the phylogenetic inference step computational time, making the pipeline much slower. Therefore, we decided to use a statistic-based branch support inference method, the Shimodaira-Hasegawa test, for phylogenetic inference in SNAPPy. As stated in the Sections 2.1.3 and 2.1.5, when using BLAST, the word size and cutoff parameters were selected to achieve the desired objectives (sensitivity or specificity) and ensure the stability of the analysis. After the pipeline was constructed, we performed 6 sets of 3 independent SNAPPy runs with a test set of 5,285 sequences (see Section 2.2.2) and compared the outputs of each independent run in terms of reproducibility. The obtained result was 100% reproducibility, meaning that the output file ‘subtyping_results.csv’ for each of the independent runs were exactly the same.


	2.2.2 Scalability

SNAPPy was built for large-scale analysis, taking advantage of modern multi core/thread CPUs. The usage of Snakemake (Koster and Rahmann 2012) as a workflow manager allows the construction of a directed acyclic graph of jobs, inferring which tasks need to be performed sequential and which can run in parallel. To infer the overall scalability of SNAPPy regarding multithreading, we performed the subtyping of a test set of 5,285 sequences with the following number of CPU threads: 1, 2, 4, 8, 16, and 32. The selection of these numbers was made having as objective the comparison of the computation time reduction by half (halving) when doubling the amount of computational resources. InFig. 3, there is a comparison of the real time that SNAPPy took to subtype the test set versus the expected halving time. This expected time reduction is purely theoretical and constructed based on the time SNAPPy took to subtype the test set with one core and subsequence duplication of the number of computational resources used. The performance regarding smaller test sets and for specific genomic regions was also evaluated and can be consulted inSupplementary Table S3. These tests were performed in a server with double Xeon E5-2680 2.50 GHz CPU (twelve cores/twenty-four threads), 128 GB of ram 2,133 MHz, in a SATA III SSD hard drive.

SNAPPy manages the generated files in order to give the user all the information needed to understand the results and decisions made. This feature is a tradeoff purposely made to give users the maximum amount of information without wasting disk space. Nevertheless, when used for large-scale analysis (tens of thousands of sequences), SNAPPy will create a large number of small files that together occupy a considerable amount of disk space. As an indicator, a SNAPPy run of 50,000 HIV-1 sequences occupied at the peak 59 GB and <4 GB after the depletion of the snakemake hidden logs folder.


	2.2.3 Subtyping methods comparison

The division of HIV-1 in groups, subtypes, and sub-subtypes is extremely valuable for epidemiological inferences (Abecasis et al. 2013;Yebra et al. 2015;Araujo et al. 2019). However, this division is a man-made construction that only makes sense in the eyes of a phylogenetic or epidemy reconstruction (Hemelaar 2013;Araujo et al. 2019). Therefore, it makes sense to argue that phylogenetic reconstruction is the gold standard for HIV-1 subtyping (Pineda-Peña et al. 2013;Fabeni et al. 2017), with the exception of recombination events that represent a deviation from the coalescent assumption (Pérez-Losada et al. 2015). Since SNAPPy is based on phylogenetic inference, using phylogeny to evaluate SNAPPy’s performance would be poorly informative. Given this limitation, we decided to test SNAPPy against a set of other HIV-1 subtyping methods, evaluating their convergence and divergence. The selected HIV-1 subtyping methods were REGA v3.0 (Pineda-Peña et al. 2013), COMET v2.3 (Struck et al. 2014), and SCUEAL (Kosakovsky Pond et al. 2009). This selection was made based on our best knowledge of available tools including statistical and phylogeny-based methods.
For this comparison, a test set of 5,285 sequences was created (the ids of the test set sequences can be consulted inSupplementary Table S4). From all the available complete HIV-1 genomes (>9,100) in the HIV-1 sequence database (Theoretical Biology and Biophysics Group 2019) 10% of sequences for each of the subtypes present (according to the database) were selected at random, comprising a total of 1,057 genomes. Those genomes were then trimmed for four genomic regions: ENV, GAG, POL, and PR-RT. Together, the full genome sequences and the trimmed replicates compose the full test set (5,285 sequences). This test set was designed to explore the capabilities of each subtyping method in an extensive variety of subtypes while using different HIV-1 genomic regions as input. However, there are differences in the methods implementations; SCUEAL only subtypes sequences of the POL region, therefore the comparison dataset for this tool is smaller (1,057 sequences × 3 regions = 3,171 sequences), and is capable of recognizing CRFs until the number 43; REGA is able to recognize CRFs until the number 47; COMET is cable of recognizing CRFs until the number 96. The outputs for each subtyping tool required some manipulation in order to achieve a ‘common language’, making it possible to compare all the tools outputs. Regarding REGA outputs, the terms ‘like’ and ‘potential recombinant’ were excluded, maintaining the assigned subtype; the outputs with ‘recombination of’ were named URFs of the described subtypes or URF_CPX if there was an indication of more than two recombining subtypes. REGA results marked only with the information of ‘Recombinant’ were transformed into URF_CPX; and when the outputs were ‘Check the report’ no subtyping result was assigned. For the COMET outputs the only transformation was to change ‘unassigned’ to URFs of the subtypes present or URF_CPX if more than two subtypes were reported. Concerning the SCUEAL outputs, the words ‘ancestral’ and ‘like’ were excluded; ‘Complex’ was converted to URF_CPX and ‘AE’ to CRF_01. The SCUEAL results with ‘recombinant’ and more than two subtypes were converted to URF_CPX and those with less than two subtypes converted to URFs; for outputs with ‘U’ and ‘FAILED’ no subtyping result was assigned. The comparison between the four subtyping methods results can be seen inFig. 4. The highest level of agreement was observed between SNAPPy and COMET (83%), whereas SCUEAL and REGA had the lowest concordance (61%). The remaining pairs showed results in the range between 72 and 78%.

We also calculated the precision, recall, and F1 scores (balance of precision and recall) for the three subtyping methods tested (REGA, SCUEAL, and COMET) versus SNAPPy (Supplementary Tables S5–S7). This analysis was performed to give an indication for users comparing results obtained with other tools and SNAPPy, for each HIV-1 group, subtype, sub-subtype, CRF, and URF. The precision and recall metrics can be seen as indicators if SNAPPy is classifying a given subtype in the test set more or less often, respectively, than the subtyping method it is being compared with. Without surprise, in the test set the results for Subtypes B and C (the most abundant) and non-M HIV-1 groups (N, O, and P) showed the highest F1 scores. The results for URFs and CRFs showed great variability.


	Supplementary Material

