pgHMA: Application of the heteroduplex mobility assay analysis in phylogenetics and population genetics

The heteroduplex mobility assay (HMA) has proven to be a robust tool for the detection of genetic variation. Here, we describe a simple and rapid application of the HMA by microfluidic capillary electrophoresis, for phylogenetics and population genetic analyses (pgHMA). We show how commonly applied techniques in phylogenetics and population genetics have equivalents with pgHMA: phylogenetic reconstruction with bootstrapping, skyline plots, and mismatch distribution analysis. We assess the performance and accuracy of pgHMA by comparing the results obtained against those obtained using standard methods of analyses applied to sequencing data. The resulting comparisons demonstrate that: (a) there is a significant linear relationship (R2 = .992) between heteroduplex mobility and genetic distance, (b) phylogenetic trees obtained by HMA and nucleotide sequences present nearly identical topologies, (c) clades with high pgHMA parametric bootstrap support also have high bootstrap support on nucleotide phylogenies, (d) skyline plots estimated from the UPGMA trees of HMA and Bayesian trees of nucleotide data reveal similar trends, especially for the median trend estimate of effective population size, and (e) optimized mismatch distributions of HMA are closely fitted to the mismatch distributions of nucleotide sequences. In summary, pgHMA is an easily‐applied method for approximating phylogenetic diversity and population trends.


| INTRODUC TI ON
Molecular phylogenetics and population genetics within and among closely related species aim to elucidate the relationships of genetic lineages and the historical evolutionary and ecological processes that contribute to biodiversity (Yang & Rannala, 2012). Polymorphism within species and divergence between species offer complementary views on micro-and macroevolutionary processes (Wilson et al., 2011); and patterns of sequence variation allow us to estimate the genealogy of sampled individuals, reconstruct the demographic history of populations, and infer the ecological and biogeographic drivers of genetic diversification (Ho & Shapiro, 2011).
Typically, in molecular phylogenetics and population genetics, one or more fragments of DNA from loci of interest are amplified from a sample of individuals, or other evolutionary units. These amplified fragments, or amplicons, are sequenced, and used in downstream phylogenetic and population genetic analyses. In animals, for instance, mitochondrial (mt) DNA is one of the most extensively used genetic markers in phylogenetics and population genetics studies (Cameron, 2014;da Fonseca et al., 2008;Ma et al., 2012;Nabholz et al., 2008). In particular, the control region (also known as the D-loop region) is generally considered as the longest non-coding region and the most variable region of the mt genome (Bronstein et al., 2018;Stoneking et al., 1991). Consequently, this region has been widely exploited to assess the genetic diversity and population structure for species in both vertebrate and invertebrate studies (Delport et al., 2002;Firestone, 2000;Vila & Bjorklund, 2004;Yu et al., 2016). Although researchers are now confident with using amplified DNA sequences in phylogenetic and population genetics, sequencing a large number of samples for population studies can be labour intensive, costly and time consuming (Holt & Bureš, 2007).
Methods that can decrease the costs, effort, and time of these studies will be useful.
In this study, we detail the use of a molecular method for quantifying (and applying) estimates of genetic diversity to phylogenetic and population genetic studies, that does not require sequencing.
Whereas researchers may still want to use nucleotide sequences as a gold standard to verify their results, the method we propose to use -the heteroduplex mobility assay (HMA) (Delwart et al., 1993) -can be applied either (a) in a pilot study upon which subsequent sequencing sampling designs can be developed, or (b) when all that is required is a robust and inexpensive estimate of genetic diversity.
The HMA is a simple and rapid method for the detection of DNA variants, and was originally developed as a screening tool to determine genetic relationships of human immunodeficiency virus (Delwart et al., 1993). Heteroduplexes form when amplicons from two genetically dissimilar individuals are mixed, denatured and allowed to reanneal. Each DNA strand can reanneal to another strand from the same individual, and form a homoduplex, or DNA strands from two different individuals can reanneal to form a heteroduplex.
Heteroduplexes show a reduced mobility under electrophoresis in an appropriate medium (e.g., a polyacrylamide gel (Delwart et al., 1995)) that can discriminate between molecules with small differences in weight or conformation. The degree of heteroduplex retardation relative to the mobility of the homoduplexes is proportional to the sequence dissimilarity between the DNA strands constituting the heteroduplex (Delwart et al., 1993(Delwart et al., , 1994(Delwart et al., , 1995. Since its first application, the HMA has been used successfully in studies of DNA sequence variation and genetic polymorphisms in viruses, bacteria, and humans (Cai et al., 1991;Ellis & Zambon, 2001;Espejo et al., 1998;Turner et al., 2002). It is fair to say, however, that nucleotide sequences remain the preferred data for molecular phylogeneticists and population geneticists. Nonetheless, the rapidity and ease of a heteroduplex mobility analysis, and low costs of the assay both in terms of money and time, makes it eminently suitable for pilot projects or studies that need only a good approximation of genetic diversity.
In the following sections, we describe a simple, reliable, less time-consuming, and easily applied method using heteroduplex analysis by microfluidic capillary electrophoresis to simplify the detection of DNA variants. We also demonstrate how one would perform a phylogenetic analysis with bootstrapping and skyline plots to estimate the demographic history of a population using heteroduplex mobility distances. Finally, we show how to construct mismatch distributions (Rogers & Harpending, 1992) using the relative concentrations of heteroduplexes obtained experimentally. Collectively we refer to these analyses as pgHMA (phylogenetic and population genetic heteroduplex mobility analyses). To validate the performance of pgHMA, we also obtained nucleotide sequences of the amplicons used in the heteroduplex analyses to conduct the same analyses as pgHMA. Our results indicate that pgHMA provides good estimates of genetic diversity, and robust approximations of results obtained using other methods of analyses that rely on sequence data. We suggest, therefore, that pgHMA is appropriate when what is required are quick and reliable estimates of diversity.

| PCR amplification and sequencing
We extracted genomic DNA from approximately 30 mg of meat using the DNeasy Blood & Tissue Kits (Qiagen). The mt genome control region (~1430 bp in length) of the 16 kangaroo samples was amplified by the pair of primers (LtPro: 5′-CCCAAAGCTGANATTCTA -3′, HtPhe: 5′-CCATCTAAGCATTTTCAGT -3′), which was modified and selected from a previous study (Nilsson et al., 2003).
We performed PCR reactions using Takara PrimeSTAR GXL DNA polymerase (Kusatsu) under the following conditions: 1 min initial denaturation at 95°C, followed by 30 cycles of 10 s at 98°C, 15 s at 55°C, and 2 min at 68°C. The PCR products were electrophoresed in 1% agarose gel, purified, and then sequenced by ABI 3730XL capillary sequencer at the Australian National University.

| Heteroduplex formation and analysis
Heteroduplexes were prepared in a 384-well skirted PCR plate by mixing 5 µl of each of two samples PCR products (~150 ng of DNA) and 1.1 µl heteroduplex annealing buffer (including 1 M NaCl, 100 mM Tris at pH 7.8 and 20 mM EDTA), using a previously described protocol for the application of HMA in studies of human immunodeficiency virus (Delwart & Gordon, 1997;Delwart et al., 1995). For these 16 samples, multichannel pipettes were used to prepare all 120 possible pairwise heteroduplexes. With the mixture of more than two samples, 100 ng of each sample PCR product was mixed together with the same proportion of heteroduplex annealing buffer. To form the heteroduplexes, the mix was denatured at 95°C for 2 min, rapidly cooled to 4°C, and then placed on ice for 10 min.
For the LabChip GXII system (PerkinElmer, Waltham), the heteroduplex mixture in a 384-well skirted PCR plate was performed with HT DNA 5K LabChip and DNA 5K reagent kit (PerkinElmer).
Homoduplexes and heteroduplexes were visualized and evaluated by LabChip GX software.

| Heteroduplex mobility distance calculation
The number of possible pairwise heteroduplexes increases quadratically with the number of samples (N) according to the formula [N(N−1)]/2 (Delwart et al., 1995). Hence, 120 pairwise mixtures of mtDNA amplicons obtained from the 16 kangaroo samples were assayed using the HMA (Table S1). The heteroduplex mobility distance (d H ) of each pairwise combination was quantified from the LabChip results ( Figure S1) using the following formula: where d He is the average migration time of heteroduplex fragments peaks, and d Ho is the average migration time of homoduplex fragments peaks. d UM is the migration time of the upper marker peak and d LM is the migration time of the lower marker peak. We express heteroduplex migration in units of proportion or percentage. We developed a new program named GetPeaks to provide automated analysis on the raw data from LabChip GX software. For researchers without access to a LabChip GXII system, we also illustrate a simple example of calculating the heteroduplex mobility distance using polyacrylamide gel electrophoresis in Figure S2.

| Phylogenetic analysis
Phylogenetic trees and bootstrap supports were obtained using the HMA results and DNA sequences of 16 mt genome control regions of kangaroos in this study, as follows.
Based on the distance matrix of HMA mobility distances, we use r-package ape 5.0 (Paradis & Schliep, 2019) to reconstruct the NJ phylogenetic trees (Saitou & Nei, 1987). The nonparametric bootstrap method of generating pseudoreplicates by resampling with replacement cannot be used in heteroduplex analysis, as only a single distance matrix was obtained from HMA. An alternative approach is the parametric bootstrap (Efron, 1985), whereby independent pseudoreplicates are generated using simulated data from a fitted model (Felsenstein, 1988).
A parametric bootstrap analysis of HMA was performed in the following way: using the linear regression between genetic distances and heteroduplex distances, we obtained an expected (i.e., fitted) value for each genetic distance corresponding to a heteroduplex distance, as well as a 95% prediction interval symmetric around that expected value. To generate pseudoreplicate genetic distances for each observed heteroduplex distance, we sampled 10,000 values from a normal distribution with the mean equal to the expected genetic distance, and the approximate standard deviation equal to 0.5*(95% prediction interval)/1.96. A bootstrap pseudoreplicate distance matrix was generated with the same size as the original distance matrix. Each pseudoreplicate was analysed using the NJ method. The bootstrap support values were summarized from these pseudoreplicate trees, and subsequently placed on the phylogenetic tree constructed from heteroduplex distances. We also developed a new r package named pgHMAtools to perform the parametric bootstrap analysis for HMA data.

| Skyline plot analyses
Estimation of demographic history from HMA results was based on the original skyline plot algorithm developed by Pybus et al. (2000). A total of 10,000 ultrametric trees were inferred from HMA parametric bootstrap pseudoreplicate distance matrices using the UPGMA method (Sneath & Sokal, 1973) phangorn (Schliep, 2011). The classic skyline plot method was used to assess these ultrametric trees using r-package ape 5.0 (Paradis & Schliep, 2019), then pgHMAtools was developed and employed to extract the 95% highest density intervals and depict the HMA skyline plot. To reduce variation in skylines, we used a sliding window with window size of 100 and step size of 20 in r-package zoo (Zeileis & Grothendieck, 2005) to smooth the skyline.
Coalescent Bayesian skyline methods (Drummond et al., 2005) were applied to DNA sequence data using beast 2.5 (Bouckaert et (Table S3). Mismatch distributions for these 20 mixtures of nucleotide sequences were constructed using distances between all pairs of amplicon sequences in each random sample. The same 20 mixtures were used to obtain heteroduplex profiles that were visualized and analysed on the LabChip GXII system using the method mentioned above. A least-squares method was applied to minimize the differences between mismatch distributions obtained using nucleotide sequences and those obtained with HMA data, by using the "L-BFGS-B" numerical optimisation method (Byrd et al., 1995) in r (R Core Team, 2019) with the following formula: where n represents the total number of bins in the mismatch distributions (n = 300 in this study), f i represents the frequency of nucleotide distances in the ith bin, and h i represents the modified heteroduplex "frequency" (a function of the observed amplitude, f H , of a given heteroduplex distance measured on the LabChip). The goal is to find a function that modifies f H , of the original heteroduplex distance d H such that Q is minimised. To do this, we note that the "frequencies" (or amplitudes) of small values of d H tended to be higher relative to the frequencies of true pairwise distances, whereas "frequencies" of larger values of d H were amplified. For this reason, we tested a variety of modifying functions, g(d H ), and applied these to obtain h i as follows: The functions of g(d H ) tested were: (1) linear model (2) exponential model (3) log-normal model (4) gamma model The exponential model provided the best fit between the sequence and heteroduplex mismatch distributions:

| Heteroduplex mobility distances and nucleotide sequence analysis
The relative heteroduplex mobility distance measured using the LabChip GXII between mtDNA control region amplicons obtained from 16 individual kangaroo tissue samples varied from 0.9% to 49.3%, with the exception of four combinations for which only a single homoduplex product was observed (relative mobility = 0). These latter results indicate that these pairs of samples are very closely related (Table S1).
To verify the calculation and findings of pgHMA, the entire control region of the 16 kangaroo samples was sequenced and analysed.
These 16 control regions range in length from 1430 to 1436 bp with A + T content of 66.2% ± 0.26. The pairwise distances of nucleotide sequencing data were estimated by p-distance, the proportion (p) of nucleotide sites in an alignment with differing characters, a measure of genetic distance that is mechanistically related to the heteroduplex formation and mobility (Delwart et al., 1993;Tang & Unnasch, 1997). The sequence divergence between 16 kangaroo control regions varied from 0.28% to 19.5% (Table S2), which is much higher than the range observed from other kangaroo species including the eastern grey kangaroo Macropus giganteus (0.16%−6.88%) (Zenger et al., 2003) and the red kangaroo Macropus rufus (0.1%−6.2%) (Clegg et al., 1998). This result implies that there may be more than one species in these 16 kangaroo samples, which were purchased from local supermarkets. Based on the sequences of the mitochondrial control region, phylogenetic analysis was carried out with these 16 kangaroos and other five species commonly referred to as kan-

| Correlation analysis between pairwise heteroduplex distances and pairwise nucleotide sequence distances
To investigate the relationship between heteroduplex mobility distance and nucleotide distance, the relative heteroduplex mobilities where y is the pairwise distances of nucleotide sequences. The square of Pearson's correlation coefficient between d H and y was 0.992.

| Comparison of phylogenetic analyses using pgHMA and nucleotide sequences
The heteroduplex mobility distance matrix can be used to reconstruct a phylogenetic tree using any distance-based phylogenetic method, such as minimum evolution (Rzhetsky & Nei, 1992), UPGMA (unweighted pair group method with arithmetic mean) (Sneath & Sokal, 1973) and neighbour-joining (Saitou & Nei, 1987). In our study, phylogenetic analysis of the heteroduplexes mobility distances (Table S1) was performed using neighbour-joining. To make the results more directly comparable to the analysis of pgHMA, we also built a NJ tree of the nucleotide sequence data using pairwise p-distances (Table S2).
The results of our phylogenetic study based on these two datasets produced nearly identical topologies (Figure 2). Both phylogenies y = 0.372d H + 0.011  (Robinson & Foulds, 1981) between these heteroduplex and nucleotide phylogenies is 2 with the difference due to a shuffle of three closely related taxa (KS4, KS6 and KS8). We note some differences in branch-lengths between these two trees, with the tendency for the pgHMA tree to have longer branches than the nucleotide tree, although these differences do not appear to be substantial.

F I G U R E 1 Linear relationship between
Bootstraps generated using nucleotide alignments and pgHMA distances are obtained in different ways. Parametric bootstraps of pgHMA are obtained by sampling from the prediction interval along the linear regression, while the nonparametric bootstraps of nucleotide sequences are generated by resampling the sites from the sequence alignment (Yang & Rannala, 2012). Unsurprisingly, there are differences between the phylogenetic bootstrap values obtained with pgHMA and nucleotide data (Figure 2), although there is a pattern to these differences. In particular, if a clade is strongly supported with the pgHMA bootstrap (~70% or higher), it is also strongly supported with the nucleotide bootstrap. The converse is not true: for instance, the clades (KS5, KD4) and (KS3, KD6) are weakly supported by the pgHMA bootstrap but strongly supported by bootstraps using nucleotide data. We surmise that this pattern is a function of the length of the branch immediately internal to the clade in question: when this branch is short, the pgHMA bootstrap will be weak, for example, branches leading and internal to the clades (KD2, (KS2, (KS5, KD4))).

| Skyline plot analysis with HMA data
A number of skyline plot methods are available for estimating demographic history from an estimated genealogy (also known as single-tree skyline-plot methods, e.g. classic and generalized skyline plots (Pybus et al., 2000;Strimmer & Pybus, 2001)) or directly from nucleotide sequence data (e.g., Bayesian skyline (Drummond et al., 2005) and Bayesian skyride plots (Minin et al., 2008)). Among these methods, Bayesian skyline plots have been commonly used in population and phylogeographic studies (Heller et al., 2013), allowing the phylogeny and population history to be coestimated within a single analysis (Drummond et al., 2005). However, Bayesian skyline methods that utilise nucleotide data cannot presently be extended to HMA distances.
Here, we apply the classic skyline-plot method to HMA distances, using the bootstrap method described in the previous sec-  (Drummond et al., 2005), sampled trees from the latter were analysed and displayed in the same way as pgHMA bootstrap trees.
Skyline plots estimated from the UPGMA trees of HMA data and Bayesian trees of nucleotide data are shown in Figure 3a,b, respectively. These two plots show similar underlying trends with respect to median historical values of effective population size, particularly with the smooth skyline plots constructed using sliding windows ( Figure S4); however, the 95% credibility intervals of the Bayesian results are wider than the 95% density intervals of the HMA skyline plot except at the beginning of the plots.

| Mismatch distribution analysis of HMA
Mismatch distributions are constructed by binning pairwise nucleotide site differences and plotting a frequency histogram of these distances (Grant, 2015). A "ragged" multimodal distribution in which there are multiple peaks in the distribution characterizes the population in equilibrium or in decline, whereas a smooth unimodal distribution is considered as evidence of recent population expansion (Harpending, 1994).
Here, we use pgHMA distances to estimate the mismatch distributions, with the comparisons of related analyses from DNA sequences. To better assess the relationship of mismatch distribution between nucleotide sequences and HMA data based on the mitochondrial control regions, four more taxa from Ranjard et al. (2018) were added to the data set to give a total of 20 kangaroo mtDNA samples. From these 20 samples, 20 groups were randomly assembled, each group comprising the mtDNA amplicons of 5, 10, 15 or 20 individuals (Table S3). The heteroduplex analysis of the 20 mixtures were characterised using the LabChip GXII. The HMA results were transformed to the deduced distances by using the linear regression of p-distances against pgHMA distances, and then mapped to the corresponding mismatch distributions of mtDNA sequences ( Figure 4 and Figure S5).
Although a few mismatch distributions of heteroduplex distances closely fitted the related mismatch distributions based on nucleotide sequences (Figure 4c,d), most distributions resembled Samples A2 and B2 (Figure 4a,b). In particular, heteroduplex frequencies were generally higher than nucleotide distance frequencies when genetic distances were low (pairwise distance within 6.5%) and were lower than the corresponding nucleotide frequencies when genetic distances were high (Figure 4 and Figure S5).
A least-squares correction was subsequently employed to minimize the differences of mismatch distributions between nucleotide sequences and HMA data. The results showed that most optimized frequency distributions of HMA distances ( Figure S6) more closely fitted the mismatch distributions of nucleotide data than unoptimized HMA frequency distributions ( Figure S7). Moreover, the overall correlation between optimized heteroduplex frequency and nucleotide frequency (R 2 = .773; Figure S8B) was substantially higher than the correlation without optimization (R 2 = .521; Figure   S8A). Importantly, both pgHMA and nucleotide pairwise distances

| DISCUSS ION
The pgHMA with LabChip analysis revealed high sensitivity for detecting nucleotide mismatches as low as 0.8% (Figure 1), compared to the classic polyacrylamide gel method with at least 1%−2% (Delwart et al., 1993). We show that there is a significant correlation (linear regression, R 2 = .992) between heteroduplex mobility and genetic distance (Figure 1), and we expect that this linear relationship is likely to vary, depending on the particular amplicons used. Additionally, we hypothesise that for sequences with larger pairwise genetic distances, the relationship may be curvilinear rather than linear. This might constrain the taxonomic depth (i.e., within species, within genera, within families, etc) for which pgHMA is useful. We therefore recommend that researchers construct their own calibration functions before conducting subsequent downstream analyses. Since construction of a calibration function requires sequencing (to determine genetic distance), pgHMA would be useful for labs working with the same amplicon over the same taxonomic range and group. In this case, the calibration function would only need to be constructed once, and repeatedly applied across a wide range of projects. Conversely, pgHMA would not suit research groups where the amplicons or taxa are substantially different across projects.
Phylogenetic trees constructed from heteroduplex mobility distance and nucleotide sequences present nearly identical topologies This phenomenon may be caused by pgHMA bootstraps capturing the variation in pairwise heteroduplex distances that might be obtained for a given evolutionary distance. In contrast, nonparametric phylogenetic bootstraps of nucleotide sequences capture the variation that might be obtained with different datasets sampled from the same multinomial distribution of site patterns (Efron et al., 1996).
This suggests that we can reliably infer the existence of clades with strong pgHMA support.
Skyline plots are powerful coalescent-based visualisation methods for inferring historical patterns of population size from a genealogy (Ho & Shapiro, 2011). The skyline plots estimated from the UPGMA trees of HMA and Bayesian trees of nucleotide data present similar demographic trends of the effective population size, but the 95% credibility intervals of the Bayesian results are generally wider than those obtained using pgHMA method (Figure 3 and heteroduplex differences. Nonetheless, these results demonstrate that pgHMA median demographic patterns approximate those obtained using Bayesian skyline-plots. The analysis of mismatch distribution is often used in population studies to infer whether a population is growing, whether it stays at a constant size, or declining (the latter two are often indistinguishable with mismatch distributions) (Rogers & Harpending, 1992;Slatkin & Hudson, 1991). We demonstrate that the modifying function of an exponential model can be used on the amplitudes of heteroduplex concentrations to approximate the relative frequencies of nucleotide pairwise distances obtained from a sample of sequences (Figure 4 and Figure S5). Thus, the frequency distribution of pgHMA distances from mixtures of amplicons can be recovered directly from the distribution of heteroduplex concentrations observed on the LabChip GXII. This obviates the need to obtain separate pgHMA pairwise distances, or indeed, to sequence individual amplicons. If the use of mismatch distributions is the preferred tool for inferring population history, then pgHMA presents a rapid and robust alternative to sequencing.
In summary, we show that heteroduplex mobility distances may be used in place of nucleotide sequence data to deliver the same types of analyses: phylogenetic reconstruction with bootstrapping, skyline plots and mismatch distributions. As we noted above, the value of pgHMA is the ease of its application. We know that pgHMA will not replace the de facto gold standard of publication-quality nucleotide sequencing data for studies that require a high-level of phylogenetic or population genetic accuracy. Indeed, as sequencing becomes cheaper and expertise in, and access to, sequencing technologies become more available, sequencing will be even more widely used. We do think, however, that in studies where the objective is to recover an easily-obtained, less time-consuming, good approximation of phylogenetic relatedness and diversity, and demographic trends (e.g., pilot studies of population diversity in conservation, studies that involve frequent monitoring of microbial OTUs, initial comparisons of different geographic subpopulations within a species, the detection of mislabelled food products, identification of illegally-obtained exotic meat products, and so on), pgHMA would be an ideal method.

ACK N OWLED G EM ENTS
We thank Niccy Aitken (Australian National University) for the help with laboratory work. We also thank Carsten Külheim, Xia Hua, Bui Quang Minh, Imelda Forteza, Xianghan Li, Yiran Zou, and Tianshu Yang for participating in our group meetings where these results were discussed. This work was supported by the Australian Research Council (grant no. DP160103474).

O PE N R E S E A RCH BA D G E S
This article has earned an Open Data badge for making publicly available the digitally-shareable data necessary to reproduce the reported results. The data is available at https://doi.org/10.5061/ dryad.4tmpg 4fb6.

DNA sequences have been deposited in GenBank under Accession
Number: MT920704-MT920719. The data sets supporting the results, coupled with the program GetPeaks, R package of pgHMAtools, and scripts for all analyses used in this article are available in the GitHub repository (https://github.com/tengc hn/pgHMA).