Systematic benchmarking of tools for CpG methylation detection from nanopore sequencing


	Methods



	DNA methylation of controlled mixtures

We used E. coli (K12 ER2925) control reads[8], for which DNA was amplified by PCR (unmethylated, negative control) and half of the PCR-amplified DNA was subjected to CpG methyltransferase (M.SssI) treatment to methylate cytosines at CpG sites (methylated, positive control). It was reported that M.SssI has an efficiency of 95%[8]. This would not affect the benchmarking with the fully unmethylated set (0% methylated) but it might affect the other mixture datasets. Accordingly, to avoid potential biases in our benchmarking analysis, we considered bins of percentage methylation ranges, in steps of 10%. For instance, the bin of the fully methylated sites included sites of 90–100% methylation. After alignment of all PASS reads for both controls using Minimap2[21] and removing secondary and supplementary reads, 110,795 reads of unmethylated control and 69,453 reads of methylated control remained. From the 346,793 CG sites in the E. coli reference genome (NC_000913.3), we selected 100 random sites with a single CpG in a 20 nt window (NNNNNNNNNCGNNNNNNNNN, with no CG in the region with Ns) that had aligned sequencing reads from both positive and negative control datasets. Using the filtered PASS reads covering these 100 selected CpG sites from the control datasets (Supplementary Data1), we created 11 benchmarking datasets with specific mixtures of methylated and unmethylated reads, namely, containing 0%, 10%,…, 90%, 100% of methylated reads, which we used for the initial benchmarking of the tools to predict different methylation mixtures per genomic site (mixture dataset 1) (Supplementary Data3). For independent validation, we built an independent mixture dataset 2. We selected a different set of 50 CG sites (single CG in a 20 nt window) (Supplementary Data2), and further selected 6803 different reads from the remaining fully methylated or fully methylated reads not used in the dataset above. We built again 11 benchmarking datasets with specific mixtures of methylated and unmethylated reads, namely, containing 0%, 10%, 20%, …, 90%, 100% of methylated reads (mixture dataset 2) (Supplementary Data4). We used the mixture dataset 2 to obtain the correlation of the different methods with the expected percentage methylations using the default thresholds suggested by each tool and alternative thresholds obtained from dataset 1. We also used this dataset to test our combined model (see below).


	CpG methylation detection with six different tools

We developed a standardized workflow for six tools: Nanopolish[8], Megalodon[12], DeepSignal[13], Guppy[14], Tombo[15], and DeepMod[16]. Snakemake pipelines to run these tools are available at GitHub[11]. When we selected the tools to be included in this study, we excluded NanoMod[22], as it only allowed detection of methylation differences using a control and a test sample, and SignalAlign[10], as its repository had not been updated for over four years; and mCaller[23], because it only had been trained for 6mA, but not for 5mC.
Methylation level was collected for individual CpG sites in both strands and considered per site and per read or summarized (methylation frequency) per site. For the comparison with bisulfite sequencing, we also considered the approach of merging the CpG methylation calls from both strands into a single strand. This was done in the following way: For the sites that had methylation frequencies from both strands, we combined the sites by averaging the methylation frequencies and adding up the coverage. Otherwise, we kept the methylation frequency the same if the site only had methylation prediction from one strand.
Nanopolish (v0.13.2) assigns a log-likelihood ratio to each individual CpG site or to a group of nearby CpG sites that share the same methylation level in each site within the group. Nanopolish uses reads with a minimum mapping quality score of 20. A positive log-likelihood ratio value indicates evidence of methylation. To include all the predictions per read and per site, such CpG groups were split up into the constituent sites with the same log-likelihood ratio using a Python script incorporated in our Snakemake pipeline ( https://github.com/comprna/METEORE)[11]. The output file was further processed in R (v3.6.3). As for default cutoffs, we considered those originally suggest by Simpson et al.[8], i.e., log-likelihood > 2.5 for methylated sites and <−2.5 for unmethylated sites. Although the default log-likelihood ratio threshold has changed to 2.0 in v0.12, we did not see major differences between both thresholds (Supplementary Fig.19). For our benchmarking analysis we used the log-likelihood ratio of 2.5 for sites in individual reads. The methylation frequency was then calculated for each site as the number of mapped reads predicted as methylated divided by the number of total mapped reads.
Tombo (v1.5.1) resquiggles all raw nanopore reads before modified base detection. It then implements three approaches to detect nucleotide modifications: (1) the alternative base detection approach, which computes a statistic by scaling log-likelihood ratios to identify targeted bases where the signals match the expected level for a non-canonical base; (2) the de novo approach, which performs a hypothesis test by statistically comparing signals to an in-silico reference; and (3) the sample comparison approach. This latter approach provides two different ways for modified base detection, one uses a canonical model adjusted by a control set of reads to identify deviations between expected and observed levels, while the other one compares signal levels from two sets of reads at each reference position. Of the three approaches, we used the alternative model specific for CpG to be able to predict CpG methylation in individual samples and reads. This model tests the signal levels against expected canonical and alternate 5mC in at CG motifs, producing the per-read binary statistics in HDF5 format, where positive values indicate canonical bases and negative values modified bases. For the initial analysis, we used the default cutoffs of −1.5 and 2.5 where scores below −1.5 were considered as methylated and above 2.5 unmethylated, and scores between these thresholds did not contribute to the per-site methylation. Tombo outputs four individual wiggle files (WIG format), one per strand, reporting the read coverage level and the methylation scores. These files were converted to the same TSV format used by other tools with a Python script included in our Snakemake pipeline, with the score and coverage for each mapped CpG site for downstream analyses. For the benchmarking analysis per site and per read, we used the per-read binary statistics given by Tombo.
DeepSignal (v0.1.7) uses the resquiggling algorithm from Tombo in the first step. It then predicts the methylation state of the targeted cytosine at CpG motifs and outputs the probabilities of being methylated, P(m), and unmethylated, P(u), for each cytosine in each read. For the initial analysis, a base was considered as methylated if the methylated probability was greater than the unmethylated probability, P(m) > P(u), and unmethylated otherwise. In addition, a Python script was added to our Snakemake pipeline to calculate the methylation frequency at each CpG site. For the benchmarking analyses per site and per read, we used a score calculated as the log2 ratio of the probabilities, i.e., log 2(P(m)/P(u)).
Guppy (v3.6.0) filters the nanopore reads based on the read alignment and implements a neural network model that basecalls 5mC at CG sites as a fifth base along with the four canonical DNA bases. It is only available to members of the nanopore community ( https://community.nanoporetech.com). We used the configuration file named dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg to run running Guppy’s modified basecalling model. This produced the reads supporting the modifications in fast5 format. We then used the scripts provided at https://github.com/kpalin/gcf52ref to convert the guppy methylation calls from fast5 files to reference anchored files similar to Nanopolish. These scripts were incorporated into our Snakemake pipeline ( https://github.com/comprna/METEORE)[11].
Megalodon (v2.2.4) implements a neural network model that uses Guppy (v4.0.11) to rebasecall all reads and then identifies 5mC by anchoring the basecalling output to the reference, assigning a score for the candidate modified base and performing a calibration for the conversion of the raw scores to estimated empirical probabilities. Megalodon requires Guppy to be installed and the path to Guppy basecalling executable server to be set. We used the most recent basecalling model in Rerio for Megalodon[24] (res_dna_r941_min_modbases_5mC_CpG_v001.cfg). Megalodon produces per-read modified base log probability and canonical base log probability at each mapped CpG site. A default threshold of 0.75 was used as a minimum score for both modified and canonical base probabilities to include a modified basecall in the final aggregated output in the bedMethyl format file containing per-site coverage and methylation percentage. For the benchmarking analysis per site and per read, we used a log-likelihood score calculated by subtracting the natural log probability of the modified base, log(M), and the natural log probability of the canonical base, log(C), resulting in log(M/C).
DeepMod takes single-read fast5 files as input and uses reads with a mapping quality score greater than 10. It then outputs a methylation prediction summary per site at genome level for each strand in a BED format containing the coverage, the number of methylated reads, and methylation percentage. As DeepMod did not provide any information for individual reads, we could not use it for the per-read benchmarking analyses.


	METEORE consensus models

METEORE was created to provide a consensus prediction using the scores from two or more methods. As the two main approaches for supervised learning are classification and regression, we implemented both types of methods in METEORE to test the combination of tools. For the classification model, we used an RF classifier[25] from the Python sklearn library[26]. We scaled the prediction scores from each individual method to the range of [0, 1] using min–max scaling[25]. The ROC and PR curves were built from a tenfold cross-validation on the prediction scores from the input methods to produce ROC and PR curves. The reads were randomly selected during cross-fold validation. For the METEORE implementation, we used the parameters max_dep = 3 and n_estimator = 10. We also tested the default parameters of RF from sklearn (n_estimator = 100 and max_dep = None). The METEORE RF model tested in our analyses was trained on the entire mixture dataset 1 with the scores from Megalodon and DeepSignal and using parameters max_dep = 3 and n_estimator = 10. The multiple linear regression-based approach used sklearn’s RidgeCV linear regression (REG) model, and min–max scaling as in the RF model. The initial ROC and PR curves were produced with the built-in fivefold cross-validation in RidgeCV. The METEORE REG model tested in our analyses was trained on the entire mixture dataset 1 with the scores from Megalodon and DeepSignal. After prediction, resulting scores were classified as unmethylated if they were less than 0.5 and methylated if greater. METEORE REG required reads with prediction scores from all provided tool inputs for using multiple linear regression to model expected methylation, leading to a slight (~10%) decrease in observed reads in each test. The scripts to train and run METEORE are available at https://github.com/comprna/METEORE[11].


	Processing of per-site methylation calls

The raw output of each of the tools contained methylation information for each aggregated CpG site on both strands. That is, each mapped CpG site on the positive strand of the human reference genome had a counterpart CpG site mapped on the negative strand. To perform the per-site benchmarking, we used the positive strand coordinate system, so the mapped sites that were on the negative strand were lifted to positive coordinates by subtracting 1 from the coordinate position. If there were predictions made on both strands for the same site, we obtained the mean methylation frequency for that site. If there was no per-site information only for one of the strands, we kept the same prediction from that strand and lifted it to the positive strand if necessary.


	Sequence motif analysis and visualization

We collected 7-mers of targeted CpG sites from E. coli mixture datasets and all sites detected in the CGIs from the nCATS datasets. Sequence logos were generated from 7-mer data in FASTA format using WebLogo[27]. We explored the k-mer contexts of the CpG sites using the high-coverage nCATS data from Gilpatrick et al.[18]. The sequence motifs were grouped by the absolute difference between the methylation frequencies produced by each tool and the WGBS values for each CpG site in an 8-mer context (NNNCGNNN), where a site with a score of 0 was labeled as “low discrepancy,” and with a score greater than 0.5 was labeled as “high discrepancy.” The pLogo web tool[28] was used to visualize the sequence motifs and assess the significance of the differences in the frequency of residues between the “low discrepancy” k-mers (foreground dataset) and the “high discrepancy” k-mers (background dataset).


	Guide RNA design and RNP complex assembly

We used the nCATS protocol[18] to target ten regions of the human genome. This PCR-free protocol uses Cas9 to cut double-stranded DNAs (dsDNAs) at specific sites and then preferentially ligates sequencing adapters to the cleaved ends for enrichment[18]. The cleaved target DNA strands with adapters attached are then sequenced. We designed ten pairs of RNA CRISPR guides (crRNAs) for ten forensically relevant regions (Supplementary Table5). An initial panel of candidate crRNAs was designed using the freely available tool CHOPCHOP[29] as recommended in the ONT protocol. For each target region, about three to five best crRNAs were selected based on the cleavage location, crRNA efficiency, and the number of predicted mismatches using CHOPCHOP. The crRNAs were then evaluated by the IDT’s design checker[30] to select for high on-target performance and low off-target activity. Ten pairs of guide RNA were used to enrich ten human regions ranging from 8 to 36 kb. Here, one gRNA was used on either side of each target region (Supplementary Table6). Each RNA oligo including crRNAs (IDT, custom-designed) and tracrRNA (IDT, 11-05-01-12) was resuspended in IDTE buffer pH 7.5 (IDT, 11-01-02-02) to a final concentration of 100 μM. crRNAs were then pooled to make an equimolar crRNA mix by combining equal volumes of each crRNA. In order to form gRNA duplexes, the crRNA pool and tracrRNA were then combined in equimolar concentrations with duplex buffer (IDT, 11-05-01-12), followed by denaturation for 5 min at 95 °C, then allowing to cool to room temperature. RNP complexes were created by assembling the following components: gRNA duplexes, CutSmart buffer (NEB, B7204), nuclease-free water, and Cas9 endonuclease (IDT, 1081058).


	Library preparation

Genomic DNA (gDNA) from the GM12878 human cell line was obtained from the Coriell Institute (coriell.org) (cat. no. NA12878). The purity of the purchased gDNA was measured with the Nanodrop spectrophotometer (Thermo Fisher) at the 260/280 and 260/230 nm values. Dephosphorylation of 5′ ends of DNA was performed as follows. A 5 μg amount of input DNA was dephosphorylated to prevent downstream adapter ligation using Quick Dephosphorylation Kit (NEB, M0508). DNAs were resuspended in the CutSmart buffer and dephosphorylated with Quick CIP enzyme in a PCR tube for 10 min at 37 °C, followed by heating for 2 min at 80 °C for CIP enzyme inactivation. Cas9 Cleavage and dA-tailing was performed as follows. a 100 mM aliquot of dATP (NEB, N0440S) was first diluted to 10 mM. After allowing the dephosphorylated DNA sample to return to room temperature, the preassembled RNP complexes, 10 mM dATP, and Taq DNA Polymerase with Standard Taq Buffer (NEB, M0273) were added to the PCR tube containing the sample for the in vitro digestion reaction and subsequent dA-tailing. The sample was then incubated at 37 °C for 15 min, and then at 72 °C for 5 min. By dephosphorylating preexisting DNA ends prior to Cas9 cleavage, sequencing adapters and ligation buffer from the Oxford Nanopore Ligation Sequencing Kit (ONT, LSK109) were preferentially ligated to the cleaved DNA ends at Cas9 cleavage sites using T4 Ligase from the NEBNext Quick Ligation Module (NEB, E6056) for 10 min at room temperature. The sample was cleaned up to remove excess adapters using the Agencourt AMPure XP beads (Beckman Coulter, A63881), washing twice on a magnetic rack with the long-fragment buffer (ONT, LSK109) before eluting in 14 µl of elution buffer (ONT, LSK109). A 1 µl aliquot of the final library was quantified using the Qubit dsDNA Broad Range Assay Kit (Thermo Fisher). Starting with 5 μg of input DNA, 1 μg of DNA was recovered after library preparation. The library was stored on ice until ready to load.


	Sequencing and data processing

Before loading, the flowcell was primed with a solution consisting of flush buffer (ONT, LSK109) and flush tether (ONT, LSK109). The sequencing library was prepared by adding sequencing buffer (ONT, LSK109) and loading beads (ONT, LSK109) into the DNA library, and then loaded into the flowcell. The sample was run on a MinION flowcell (FLO-MIN106, R9.4.1 pore) using the MinION sequencer for 19 h, operated using the MinKNOW software. Live basecalling was carried out during the experiment using Guppy’s fast basecalling model in MinKNOW. The resulted FASTQ files were immediately aligned to the human reference genome (GRCh38/hg38) using minimap2[21], followed by visualization with Integrative Genomics Viewer[31] to confirm the generation of on-target sequencing reads. Post-run basecalling was performed using Guppy (v.3.2.4) high-accuracy model to generate the final set of sequencing reads with higher read accuracy than the fast model and recognition of modified bases from the electrical signal data. Reads were aligned to the human reference genome (GRCh38/hg38) using minimap2. Using Samtools[32] (v1.9), we collected aligned reads within the enriched regions as on-target reads. Those outside the targeted regions were considered off-target reads and subsequently discarded. Coverage plots for target regions were generated using the Snakemake[17] workflow developed by Oxford Nanopore Technologies[33].


	Comparison with bisulfite sequencing data

We compared with the published WGBS data[19] for NA12878 (ENCODE accessions: ENCFF279HCL, ENCFF835NTC) using two different approaches. In the first approach, we compared every CpG site on both strands for nanopore and WGBS data, preserving the strand information. In the second approach, we used the positive strand coordinate system by lifting all sites from the negative strand to be on the positive strand, i.e., site position on the negative strand minus one. For the sites that had methylation evidence from both strands, we combined the sites by averaging the methylation frequencies and adding up the coverage. We preserved the information if the site only had methylation prediction from the positive strand or the negative strand.
To obtain high confidence methylation calls from WGBS data for validation, the resulting individual or combined CpG sites were processed in the following way. CpG sites with zero coverage from both WGBS replicates were discarded. Furthermore, we calculated the difference in methylation frequency between both WGBS replicates and considered the 0.1 and 0.9 quantiles of the distribution of differences. A CpG site was kept if the difference for that site was between those 0.1 and 0.9 quantiles, otherwise it was removed. We finally calculated the number of sites, Pearson and Spearman correlations, and coefficient of determination between methylation frequencies calculated from WGBS and those calculated from nanopore reads by each of the tested methods using cor.test() and lm() functions from R[34]. We also subsampled the reads for the CpG sites that were covered with at least 1, 5, 10, 15, 20, etc. reads and calculated Pearson correlation at different levels of read coverage for further evaluation.


	Reporting summary

Further information on research design is available in theNature Research Reporting Summary linked to this article.
