Abstract
While genetic exchange between nonsister species was traditionally considered to be rare in mammals, analyses of molecular data in multiple systems suggest that it may be common. Interspecific gene flow, if present, is problematic for phylogenetic inference, particularly for analyses near the species level. Here, we explore how to detect and account for gene flow during phylogeny estimation using data from a clade of North American Myotis bats where previous results have led researchers to suspect that gene flow among lineages is present. Initial estimates of phylogenetic networks and species trees indicate that subspecies described within Myotis lucifugus are paraphyletic. In order to explore the extent to which gene flow is likely to interfere with phylogeny estimation, we use posterior predictive simulation and a novel Approximate Bayesian Computation approach based on gene tree distances. The former indicates that the species tree model is a poor fit to the data, and the latter provides evidence that a species tree with gene flow is a better fit. Taken together, we present evidence that the currently recognized M. lucifugus subspecies are paraphyletic, exchange alleles with other Myotis species in regions of secondary contact, and should be considered independent evolutionary lineages despite their morphological similarity.
Interspecific gene flow, hybridization, and horizontal gene transfer are processes that occur throughout the evolutionary histories of many taxa (e.g., Melo-Ferreira et al. 2014; Sullivan et al. 2014; Kumar et al. 2017; Morales et al. 2017) but are difficult to incorporate into phylogenetic and species delimitation analyses (but see Camargo et al. 2012 and Jackson et al. 2017a). Therefore, genetic exchange among species has been largely ignored in phylogenetic reconstructions, despite simulation studies that demonstrate that gene flow can interfere with phylogenetic signal (Eckert and Carstens 2008; Leache et al. 2014) and parameter estimation (Leache et al. 2014). Recent methodological advances based on phylogenetic networks and explicit incorporation of gene flow may provide an opportunity to address this shortcoming (e.g., Solís-Lemus and Ané 2016; Jackson et al. 2017b).
Several strategies have been developed in recent years to account for both gene flow and incomplete lineage sorting (ILS) in a phylogenetic context. One is to infer phylogeny and secondarily use an isolation with migration approach to estimate migration rates on branches of interest (e.g., Hey 2010). The implicit assumption of this strategy is that gene flow will not interfere with phylogenetic signal. A second approach is the framework developed by Joly et al. (2009) to test for hybrid speciation. It is based on the assumption that under a scenario of only ILS, gene divergence times must predate speciation times, whereas in a case of hybridization the gene divergence events may occur after speciation. A third strategy utilizes phylogenetic networks, which offer a complement to traditional species tree methods but are generally constrained by lack of scalability because there are limits to the number of migration (or hybridization) events that can be modeled and limits on the number of loci and individuals per species that can be analyzed (Than et al. 2008; Meng and Kubatko 2009; Solís-Lemus and Ané 2016). Finally, other researchers have assessed the extent to which gene flow should be taken into account in a particular system using simulations such as Approximate Bayesian Computation (ABC; Camargo et al. 2012) or posterior predictive simulation (PPS) (Gruenstaeudl et al. 2016). The diversity of approaches is a clear sign that researchers have recognized that gene flow is a process that should be considered in lower-level phylogenetic investigations, even if it is not yet clear exactly which strategies are the most appropriate in a given empirical system.
Here, we evaluate the extent to which gene flow could be problematic for phylogeny estimation and species delimitation in North American Myotis bats, a clade in which paraphyletic patterns have been attributed to gene flow among species (Dewey 2006; Carstens and Dewey 2010). We specifically focus on described subspecies within Myotis lucifugus, a widely distributed nominal species suspected to include 4–6 independent evolutionary lineages that potentially exchange alleles with a variety of other Myotis (Carstens and Dewey 2010). After collecting genomic data and conducting delimitation analyses that support these suspicions, we develop a novel ABC framework that uses gene tree distance metrics to summarize the data and to calculate the relative influence of gene flow on a locus-by-locus basis. Finally, we infer phylogenetic relationships using a network approach that accommodates genetic exchange among species.
The Myotis lucifugus Complex
Inferring the species boundaries and phylogenetic relationships within Myotis has challenged evolutionary biologists because ecomorphs associated with phenotypic characteristics in this genus appear to have evolved independently in different subclades (Ruedi and Mayer 2001; Stadelmann et al. 2007; Ghazali et al. 2016) to the extent that several Myotis species are phenotypically similar without necessarily being closely related. This phenotypic convergence has the potential to influence phylogeny inference, even in well-studied species, because it complicates the recognition of species boundaries and taxon sampling. At least compared to other bat genera, Myotis is well-studied. For example, M. lucifugus a member of the North American Nearctic subclade (Stadelmann et al. 2007; Ruedi et al. 2013), has been investigated from physiological, embryological, and ecological perspectives (e.g. Kwiecinski et al. 1987; Schutt et al. 1994; Norquay and Willis 2014; Czenze and Willis 2015) and has a full genome available (Ensembl release 59; Flicek et al. 2014). Despite these efforts, there is uncertainty regarding the species and subspecies status within M. lucifugus (Table 1). While five subspecies are currently recognized (Myotis lucifugus alascensis, Myotis lucifugus carissima, Myotis lucifugus lucifugus, Myotis lucifugus pernox, and Myotis lucifugus relictus; Simmons 2005), some of these descriptions are based on geographic ranges and subtle qualitative differences in morphological characters. For example, M. l. pernox was described as “… externally resembling M. l. lucifugus, but with larger foot …” (Hollister 1911). Furthermore, while Simmons (2005) treats Myotis occultus as a distinct species, earlier authors (Barbour and Davis 1969; Valdez et al. 1999) included occultus as a subspecies within M. lucifugus on the basis of genetic and morphological similarities.
Phylogenetic position of Myotis lucifugus in several studies
Mvol
Phylogenetic position of Myotis lucifugus in several studies
Mvol
The phylogenetic position of M. lucifugus remains uncertain (Table 1). For example, Ruedi and Mayer (2001) used two mitochondrial markers, cytochrome-B and ND-1, to estimate phylogenetic relationships among Myotis bats and their findings showed M. lucifugus and Myotis thysanodes as sister species. Stadelmann et al. (2007) used cytochrome-B and a nuclear gene (Rag2) and augmented the number of taxa sampled from North America. Their results described M. lucifugus as the sister taxon of M. occultus, this group being sister to the western long-eared bats (Myotis evotis, M. thysanodes, and Myotis keenii). Larsen et al. (2012) studied the phylogenetic relationships of Myotis bats in the Americas using cytochrome-B and an extensive sampling of South American species, many of which were lacking from previous studies. Their results showed M. lucifugus as a sister species to the western long-eared species, and M. occultus as the most divergent lineage within the clade, a result also supported by Ruedi et al. (2013). The aforementioned phylogenetic studies only included a few samples of one subspecies (i.e., M. l. carissima). While most authors assume that the Myotis lucifugus subspecies are monophyletic, genetic evidence from mitochondrial data in Dewey (2006) suggested that they may be paraphyletic.
Material and Methods
Sampling, Capture Probe Design, Library Preparation, and Sequencing
Data were collected from 131 individuals (
Map showing sampling localities of Myotis species in the Nearctic subclade. Shaded areas represent the distribution of M. lucifugus recognized subspecies. Circles show sampling for M. lucifugus lineages. Diamonds show sampling for the western long-eared (WLE) species M. evotis, M. thysanodes, and M. keenii. Squares show sampling for non-WLE species M. occultus, M. sodalis, and M. volans.
A set of capture probes for ultraconserved elements (UCEs) was previously designed to infer species relationships among the western long-eared Myotis bats (Morales et al. 2017), which used the Myotis lucifugus genome as a reference (Ensembl release 59; Flicek et al. 2014), and were derived from the set of probes designed for placental mammals (McCormack et al. 2012). These loci have been informative for phylogenetic and population level inferences going from deep to shallow timescales (e.g., Smith et al. 2014; Meiklejohn et al. 2016; Morales et al. 2017). The Myotis probes were purchased from MYcroarray, Inc. (Ann Arbor, MI, USA) and used to enrich all samples included in this study.
Genomic DNA was isolated using a DNeasy blood and tissue kit (QIAGEN). To increase the amount of DNA in some samples that were donated by museum collections, whole genome amplifications were performed using REPLI-g kits (QIAGEN). DNA fragmentation and TruSeq library preparation followed the methodology described by Morales et al. (2017). Enrichment of capture probes was performed following the protocol provided by MYcroarray, which is based on the workflow described by Gnirke et al. (2009). The final library was sequenced at the Georgia Genomics Facility using 1.5 lanes of an Illumina NextSeq 150 bp paired-end high-output flow cell run.
Contigs Assembly, Mapping, and Haplotype Reconstruction
Raw reads were demultiplexed using Casava (Illumina, Inc.). Quality of the reads was assessed and adapters were trimmed using illumiprocessor (Faircloth 2013) and Trimmomatic v.0.32 (Bolger et al. 2014). Consensus contigs were generated de novo using VelvetOptimizer v.2.2.5 (http://www.vicbioinformatics.com/software.velvetoptimiser.shtml last accessed 26 July 2016) and Velvet assembler v.1.2.09 (Zerbino and Birney 2008). Consensus contigs were aligned to target loci using LASTZ (Harris 2007). Any contigs that did not match the target loci or that matched more than one locus were removed. Retained contigs were mapped against an index of target loci, using BWA-MEM v. 0.7.8-r455 (Li and Durbin 2009). Individual consensus sequences were generated using SAMTools v.0.1.19 (Li et al. 2009). Files were imported from SAM to BAM format, sorted, and indexed. Files in VCF and FASTQ format (with hard-masked low quality bases <Q20) were then generated and seqtk was used to covert output-masked FASTQ files to FASTA format (https://github.com/lh3/seqtk last accessed 26 July 2016). Each locus was aligned using MAFFT v7.123b (Katoh and Standley 2013). To resolve gametic phase in multiple heterozygous sites, PHASE v.2.1.1 was used (Stephens et al. 2001; Stephens and Donnelly 2003). All bioinformatic analyses were performed through the Ohio Supercomputer Center, last accessed November 2017.
Summary Statistics
To assess the level of polymorphism of phased loci, summary statistics were calculated for each locus across all lineages using PopGenome v.2.2.3 (Pfeifer et al. 2014). Overall, 1370 UCEs were analyzed (unless otherwise noted) to estimate the total number of segregating sites (S), number of haplotypes (Hap), haplotype diversity (Hd), nucleotide diversity (
In order to select a model of sequence evolution, maximum likelihood scores under each model were calculated using PHYML (Guindon and Gascuel 2003), and MrAIC (Nylander 2004; last accessed January 2017.) was used to calculate the Akaike’s Information Criterion (AIC) and conduct model selection.
Species Delimitation and Species Tree Reconstruction of M. lucifugus Lineages
Previous work suggests both that M. lucifugus subspecies may be evolving independently and that gene flow may be occurring among closely related Myotis (e.g., Carstens and Dewey 2010; Morales et al. 2017). To evaluate if M. lucifugus subspecies are lineages evolving independently, we performed a Bayesian analysis to simultaneously delimit lineages and infer their relationships as implemented in BPP v.3.3 (Yang and Rannala 2014). Putative lineages (specified a priori) are collapsed sequentially to determine both the best delimitation scenario and a species tree. BPP uses a reversible jump Markov chain Monte Carlo approach (Yang and Rannala 2010) to calculate posterior probabilities for models of several numbers of potential lineages using gene trees and incorporating uncertainty of phylogenetic inference in a Bayesian framework (Yang and Rannala 2014). For this analysis, individuals were assigned to groups following the subspecies classification. Alignments of 125 UCE loci were used as input; this is the number of loci recovered for all lineages in at least one individual. The most compelling justification for this reduced data set was that only one sample (two alleles) from M. l. pernox, which has a restricted distribution (Fig. 1), was available and sequencing results were suboptimal from this sample. Starting priors for theta (
To compare the species relationships coestimated in BPP during the species delimitation analysis, a species tree was estimated in *BEAST v.1.8.1 (Heled and Drummond 2010; Drummond et al. 2012). This analysis was performed using the same set of loci used in the delimitation analysis with the following settings: base frequencies were set to empirically estimated values, a relaxed molecular clock, a Yule process on the species tree prior with a random starting tree, 100 million generations sampled every 10,000 steps, and a burnin of 10%. The same model of sequence evolution (HKY) was used for all loci as this model was found to be representative for all of them. Convergence was assessed in Tracer v.1.6 (Rambaut et al. 2014; last accessed January 2017.) based on effective sample size (ESS) values equal or greater than 200, and maximum Clade Credibility trees were built using Tree Annotator v.1.8.1 (Heled and Drummond 2010).
Posterior Predictive Evaluation of Species Tree Model
To assess the fit of the multispecies coalescent model (MSCM) used in *BEAST to estimate gene trees and the species tree, PPS analyses were performed as implemented in P2C2M v.0.7.6 (Gruenstaeudl et al. 2016). P2C2M compares the posterior and posterior predictive distributions of each gene tree and the species tree using PPS and uses summary statistics to calculate the differences between distributions. These summary statistics account for the number of deep coalescences (ndc), the coalescent likelihood (coal), and the likelihood of coalescent waiting times (lcwt). The expectation is no difference between distributions when the MSCM has a good fit to the data. This analysis was performed with 100 replicates.
Estimating the Phylogenetic Placement of M. lucifugus Lineages
Species trees were estimated using two coalescent-based approaches. The first was SVDquartets (Chifman and Kubatko 2014, 2015), as implemented in PAUP v.4.0a149 (Swofford 2002). SVDquartets is a single-site method that computes a quartet score based on singular value decomposition of a matrix of site pattern frequencies that correspond to a split on a phylogenetic tree. Quartet scores are used to select the best-supported topology for quartets of taxa, and these scores are then used to infer the species tree. Considering that our data set has 262 tips (alleles), there are nearly 1 billion of possible quartets, conducting an exhaustive analysis of all possible quartets may have consumed thousands of computational hours. We thus used a subsampling approach, where one allele per lineage was sampled at random with replacement such that 1000 independent data sets were built and analyzed. Each analysis was performed with 100 bootstrap replicates to calculate support values. In order to summarize these trees, a majority rule consensus tree was subsequently inferred in PAUP.
Species trees were also estimated using ASTRAL v. 5.0.3 (Mirarab et al. 2014). ASTRAL takes as input rooted or unrooted gene trees and maximizes the number of quartet trees shared between the gene trees and the species tree. Input gene trees were estimated in RAxML v. 8.1.16 (Stamatakis 2014). Each locus was analyzed using 10 independent runs, a GTR
These analyses were performed using the entire data set, including 1370 loci and 12 lineages, considering each lineage of M. lucifugus a species and including the closest related species to this group, M. evotis, M. keenii, M. occultus, Myotis sodalis, M. thysanodes, and Myotis volans. Myotis riparius was defined as an outgroup (Table 2).
Mean values and standard deviation (
. | Individuals [loci] . | Sequence length in bp . | Segregating sites . | Haplotypes . | Haplotype diversity . | Nucleotide diversity . | Watterson’s theta . | Tajima’s D . |
---|---|---|---|---|---|---|---|---|
All lineages | 131 [1370] | 132 | 38 | 12 | 0.33 | 0.02 | 9.02 | |
(43) | (26) | (7) | (0.16) | (0.02) | (5.95) | (0.37) | ||
Mluc_alas | 9 [1362] | 130 | 21 | 1.5 | 0.63 | 11.95 | 10 | 12 |
(47) | (22) | (0.7) | (0.11) | (13.05) | (10.71) | (13.05) | ||
Mluc_cari | 5 [1307] | 130 | 3 | 1.5 | 0.64 | 1.79 | 1 | 2 |
(47) | (4) | (0.6) | (0.09) | (2.53) | (2.08) | (2.53) | ||
Mluc_luci | 22 [1370] | 128 | 28 | 4.0 | 0.38 | 4.91 | 7 | 5 |
(47) | (21) | (2.4) | (0.15) | (4.23) | (5.73) | (4.23) | ||
Mluc_pern | 1 [320] | 135 | na | 1.0 | na | na | na | na |
(48) | na | |||||||
Mluc_reli | 2 [1179] | 130 | 22 | 1.2 | 0.67 | 14.85 | 12 | 15 |
(48) | (21) | (0.4) | (0.02) | (14.39) | (11.79) | (14.39) | ||
Mocc | 3 [812] | 132 | 9 | 1.2 | 0.64 | 0.77 | 5 | 6 |
(49) | (16) | (0.4) | (0.08) | (4.35) | (8.78) | (10.73) | ||
Mevo | 34 [1369] | 128 | 10 | 4.8 | 0.45 | 1.21 | 2 | 1 |
(47) | (9) | (3.1) | (0.16) | (1.32) | (2.16) | (1.32) | ||
Mkee | 12 [1346] | 129 | 5 | 2.2 | 0.37 | 1.68 | 2 | 2 |
(47) | (5.35) | (1.2) | (0.31) | (1.99) | (1.91) | (1.99) | ||
Mthy | 20 [1362] | 128 | 7 | 3.4 | 0.59 | 1.36 | 2 | 1 |
(47) | (7) | (2.2) | (0.17) | (1.89) | (2.13) | (1.89) | ||
Msod | 11 [968] | 127 | 22 | 1.6 | 0.58 | 4.45 | 10 | 12 |
(49) | (20) | (1.1) | (0.15) | (9.68) | (10.34) | (12.74) | ||
Mvol | 9 [1368] | 128 | 10 | 2.0 | 0.43 | 1.86 | 3 | 3 |
(47) | (14) | (1.0) | (0.15) | (4.03) | (4.93) | (4.83) | ||
Mrip | 3 [1342] | 129 | 16 | 1.5 | 0.64 | 4.19 | 8 | 9 |
(47) | (20) | (0.7) | (0.09) | (9.30) | (9.89) | (12.06) |
2.21 |
Mluc_alas
Mean values and standard deviation (
. | Individuals [loci] . | Sequence length in bp . | Segregating sites . | Haplotypes . | Haplotype diversity . | Nucleotide diversity . | Watterson’s theta . | Tajima’s D . |
---|---|---|---|---|---|---|---|---|
All lineages | 131 [1370] | 132 | 38 | 12 | 0.33 | 0.02 | 9.02 | |
(43) | (26) | (7) | (0.16) | (0.02) | (5.95) | (0.37) | ||
Mluc_alas | 9 [1362] | 130 | 21 | 1.5 | 0.63 | 11.95 | 10 | 12 |
(47) | (22) | (0.7) | (0.11) | (13.05) | (10.71) | (13.05) | ||
Mluc_cari | 5 [1307] | 130 | 3 | 1.5 | 0.64 | 1.79 | 1 | 2 |
(47) | (4) | (0.6) | (0.09) | (2.53) | (2.08) | (2.53) | ||
Mluc_luci | 22 [1370] | 128 | 28 | 4.0 | 0.38 | 4.91 | 7 | 5 |
(47) | (21) | (2.4) | (0.15) | (4.23) | (5.73) | (4.23) | ||
Mluc_pern | 1 [320] | 135 | na | 1.0 | na | na | na | na |
(48) | na | |||||||
Mluc_reli | 2 [1179] | 130 | 22 | 1.2 | 0.67 | 14.85 | 12 | 15 |
(48) | (21) | (0.4) | (0.02) | (14.39) | (11.79) | (14.39) | ||
Mocc | 3 [812] | 132 | 9 | 1.2 | 0.64 | 0.77 | 5 | 6 |
(49) | (16) | (0.4) | (0.08) | (4.35) | (8.78) | (10.73) | ||
Mevo | 34 [1369] | 128 | 10 | 4.8 | 0.45 | 1.21 | 2 | 1 |
(47) | (9) | (3.1) | (0.16) | (1.32) | (2.16) | (1.32) | ||
Mkee | 12 [1346] | 129 | 5 | 2.2 | 0.37 | 1.68 | 2 | 2 |
(47) | (5.35) | (1.2) | (0.31) | (1.99) | (1.91) | (1.99) | ||
Mthy | 20 [1362] | 128 | 7 | 3.4 | 0.59 | 1.36 | 2 | 1 |
(47) | (7) | (2.2) | (0.17) | (1.89) | (2.13) | (1.89) | ||
Msod | 11 [968] | 127 | 22 | 1.6 | 0.58 | 4.45 | 10 | 12 |
(49) | (20) | (1.1) | (0.15) | (9.68) | (10.34) | (12.74) | ||
Mvol | 9 [1368] | 128 | 10 | 2.0 | 0.43 | 1.86 | 3 | 3 |
(47) | (14) | (1.0) | (0.15) | (4.03) | (4.93) | (4.83) | ||
Mrip | 3 [1342] | 129 | 16 | 1.5 | 0.64 | 4.19 | 8 | 9 |
(47) | (20) | (0.7) | (0.09) | (9.30) | (9.89) | (12.06) |
2.21 |
Mluc_alas
Comparing Phylogenetic Models With and Without Gene Flow
Given the unexpected findings showing paraphyly in the M. lucifugus subspecies, we explored whether a topology that includes gene flow might be a better option to describe the species relationships of this group. Previous model-based frameworks have been developed to estimate species relationship while accounting for gene flow (e.g., PHRAPL; Jackson et al. 2017a,b). However, because the number of potential models increases faster than factorial, exhaustive model testing with scenarios that include more than four lineages are not computationally feasible with approaches like PHRAPL or any other tool developed so far. Therefore, we implemented a novel ABC approach that was designed to evaluate support for phylogenetic models that include and exclude intraspecific gene flow (Fig. 2), where underlying topologies are “known” (can be prespecified), therefore the number of models tested is considerably smaller. Because this framework relies on topologies, we used two gene tree metrics as summary statistics, the Robinson–Foulds (RF; Robinson and Foulds 1981) and Kuhner–Felsenstein (KF; Kuhner and Felsenstein 1994) distances. The main difference between these metrics is that RF measures differences in topology only while KF measures differences in both topology and branch lengths. RF measures the number of internal branches that exist in one tree but not in other. On the other hand, KF uses the branch tree information to calculate the sum of squares of the differences between each branch length in the two trees compared. For both metrics, a value of 0 means identical topologies and it increases as the match between trees worsens.
Workflow of the methodology followed to compare phylogenetic trees with and without gene flow among taxa and calculate the support of each model.
In summary, our framework (1a) simulates gene trees under specified models (phylogenies with and without gene flow) using ms (Hudson 2002). The underlying topology of each model can be estimated with traditional species tree methods that do not account for gene flow (e.g., ASTRAL, SVDQuartets), and prior distributions of parameters such as divergence times, population sizes, and migration rates are prespecified. As in any model-testing framework, the hypotheses tested and parameters used to simulate data will be based on previous knowledge of the system, and model misspecification could be a problem. (1b) Given that empirical gene trees may include multiple alleles per species and, for computational feasibility, simulated trees only include two alleles per taxa, a subsampling with replacement approach was thus used to account for disparities across loci in the number of alleles and simplify computations of distances between simulated and observed trees. In this way, all compared trees have the same number of tips, and because each gene tree is subsampled at random hundreds of times, all alleles have the same probability of being included in the analysis. (2) Then, the RF and KF distances are calculated between simulated and empirical trees (after subsampling), and (3) a posterior distribution of gene tree distances is built by retaining only those trees that are among the closest 0.02 trees from the empirical data. The rejection threshold can be adjusted as desired; here, we retained 2% of the trees as we consider this is a conservative value. (4–5) Finally, the posterior probability of phylogenies that do and do not include gene flow can be approximated by calculating their proportional representation in the posterior distribution. Our approach scales easily to the genomic data collected here because posterior probabilities are calculated in parallel on a locus-by-locus basis, then the results are summarized across loci. A pipeline including a set of custom R functions was written to conduct Steps 1–5 (https://github.com/ariadnamorales/ABC_gene_trees).
Simulated data
The performance of this approach to accurately differentiate between models with and without gene flow was assessed via simulation testing, emulating our Myotis data set with 11 lineages and models of interspecific gene flow. First, a species tree was simulated and used as underlying topology to design three models as follows: not allowing gene flow between lineages, allowing gene flow among sister species, and allowing gene flow among nonsister species. To simulate 100,000 trees under each model (Step 1a), prior distributions for divergence time (
Empirical data
Data collected from 1370 UCE in Myotis bats were analyzed as described above. To simulate trees under each model (Step 1a), prior distributions for divergence time (
Phylogenetic Networks
A maximum pseudolikelihood (pL) approach as implemented in SNaQ was used to estimate phylogenetic networks (Solís-Lemus and Ané 2016). Pseudolikelihood frameworks are useful to analyze genomic data sets because computationally they are easier to estimate and they can approximate faster the likelihood function of a set of observed data. The SNaQ framework uses gene trees and a species tree as a starting point to perform a search that accounts for ILS under the coalescent model, and horizontal transfer of genes under reticulations events in a network. Gene trees are summarized as quartet concordance factors, including all possible combinations of four taxa. However, given that our data set has 262 tips (i.e., two alleles per individual), there are nearly 1 billion possible quartets, thus preventing an exhaustive analysis, we used a subsampling approach, where one allele per lineage was sampled at random with replacement such that 100 subsampled trees per locus were analyzed. This analysis was performed using gene trees from 1370 loci. Therefore, after subsampling, 137,000 trees with one tip per lineage were used as input to calculate a table of quartet concordance factors. Gene trees were estimated in RAxML, and the starting species tree was estimated in ASTRAL as described above, where M. riparius was set as outgroup. Gene trees were subsampled and relabeled with a custom script. To determine whether a tree with ILS or a network fits the observed data, 11 estimations, each with 10 replicates, were performed allowing a maximum number of reticulation nodes or genetic exchange events (
Results
Summary Statistics
Tissue samples were acquired from 131 individuals including Myotis lucifugus recognized subspecies (alascensis, carissima, lucifugus, pernox, and relictus; Simmons 2005) and closely related members of the Nearctic subclade (Fig. 1; M. evotis, M. keenii, M. occultus, M. sodalis, M. thysanodes, and M. volans; Ruedi et al. 2013). Data were also collected from M. riparius and used as an outgroup. Using the custom sequence capture probe set developed by Morales et al. (2017), sequence data were collected from 1.5 Illumina NextSeq lanes which produced more than 800 million paired-end reads. Contigs were assembled and only those with coverage greater than 10
Species Delimitation and Species Tree Reconstruction of M. lucifugus Lineages
Genetic evidence has suggested that several lineages within M. lucifugus are evolving independently in the presence of gene flow (Carstens and Dewey 2010). To evaluate whether the five subspecies described should be considered independent lineages, we conducted species delimitation and estimated species relationships simultaneously using BPP (Yang and Rannala 2010). Results from this analysis indicate that there are five species (pp
Posterior Predictive Evaluation of Species Tree Model
While the *BEAST results were questionable (ESS <200 in most cases), PPS using P2C2M indicate that the species tree model utilized in *BEAST does not have a good fit to the data, as fewer than 20% of the gene trees show a good fit to the MSCM (loci with good fit according to coal
Estimating the Phylogenetic Placement of M. lucifugus Lineages
To infer the phylogenetic placement of M. lucifugus lineages with respect to other Myotis from the Nearctic clade, a species tree was estimated using two coalescent-based approaches. These results demonstrate that M. lucifugus subspecific lineages, suspected to be independent by Carstens and Dewey (2010) and supported by our BPP analysis, are paraphyletic. However, the phylogenetic placement these lineages in a species tree changes depending on the method used (Fig. 3). In one hand, the phylogeny estimated by SVDQuartets recovered two main clades. In the first clade, M. l. carissima is sister to M. occultus and M. thysanodes, M. evotis and M. keenii are sister species, and M. l. pernox and M. l. alascensis are the most divergent species within the group. In the second clade, M. l. lucifugus and M. sodalis are sister species, M. l. relictus is sister to the former, and M. volans is the most divergent species within this group. Bootstrap support values at each node were higher than 98 (Fig. 3a). On the other hand, in the phylogeny estimated by ASTRAL, M. l. pernox is sister to the western long-eared group (M. evotis, M. thysanodes, and M. keenii), M. l. carissima is sister to this clade with M. occultus as the most divergent species of this clade. Myotis l. alascensis and M. l. relictus are members of a different clade, where M. l. relictus is sister to M. volans, and M. l. alascensis is sister to them. Finally, M. l. lucifugus and M. sodalis are the most divergent lineages within this group. Bootstrap support values at each node were higher than 98 whereas ASTRAL support values ranged from 0.27 to 0.49, with effective numbers of gene trees per branch ranging from 311 to 1361 (Fig. 3b). For five lineages that were only described as subspecies on the basis of subtle morphological differences and have been studied by several decades as the same entity, these results are unexpected.
Species trees estimated for Myotis species in the Nearctic subclade using a) SVDQuartets and b) ASTRAL. Gene trees from 1370 UCE loci were analyzed and M. riparius was set as outgroup. Lineages described as M. lucifugus subspecies are labeled to be the lighter color, and other North American Myotis species are labeled in the darkest. Left circles between nodes represent support values higher than 98 calculated based on 100 multilocus bootstrap replicates for SVDQuartets and ASTRAL analyses. Right circles represent the local posterior probability (PP1) estimated in ASTRAL, the size and color increase from low (light and small circles) to high support (dark and big circles). In the ASTRAL tree, numbers between nodes show the effective number of gene trees per branch.
Comparing Phylogenetic Models With and Without Gene Flow
Given the paraphyly inherent to M. lucifugus subspecies, we explored whether a species tree that includes gene flow might be a better option to describe the evolutionary history of this group. We implemented a framework based on ABC using gene tree distances as summary statistics to compare phylogenetic models in the presence and absence of gene flow. This approach allowed us to test models of isolation and gene flow between species in sympatric and allopatric distributions. Simulation testing of the proposed ABC method demonstrates that it is likely to identify phylogenies biased by gene flow, as results indicate that the true model is identified in >98% of replicates using either KF or RF distances (Fig. 4).
Proportion of loci supporting phylogenetic models with and without gene flow. Results from data collected in North American Myotis species (a) and results for simulated gene trees under a model of gene flow among nonsister species (b), gene flow between sister species (c), and no gene flow (d). Kuhner–Felsenstein (KF) and Robinson–Foulds (RF) distances.
When data from Myotis bats were analyzed with a fixed topology and migration matrices designed based on sympatric distributions, models with gene flow achieved the highest support regardless of the distance metric used (Fig. 4). However, the posterior probability of the model given the data is dependent on the metric used. The model with gene flow between all the sampled North American Myotis showed the highest posterior probability (KF
Phylogenetic Networks
Given our results confirming gene flow between nonsister taxa in North American Myotis, we conducted network analyses to infer species relationships while accounting for genetic exchange (or reticulations). Networks estimated by SNaQ also demonstrate that M. lucifugus subspecies are not part of a monophyletic clade (Fig. 5). The species tree analysis (
a) Plot showing log-pseudolikelihood values obtained from the SNaQ analysis when given values were allowed as the maximum number of reticulation nodes
Discussion
Gene flow among mammals was once thought to be rare, however, recent findings have suggested that it may be a common process that can complicate species relationship inferences (e.g., Melo-Ferreira et al. 2014; Sullivan et al. 2014; Kumar et al. 2017). We document an example of interspecific gene flow among nonsister species that occur in sympatry in a genus of bats, where genetic exchange may be hindering phylogenetic signal. Species delimitation results presented here confirmed previous suspicions (i.e., Carstens and Dewey 2010) of multiple species within the currently recognized M. lucifugus, but may be influenced by the presence of gene flow in this system. As evident in the results from the P2C2M analysis, the MSCM implemented in BPP is not a good fit to our data, likely because alleles are shared among nonsister lineages due to gene flow. While the phylogenetic analyses demonstrate paraphyletic patterns among these lineages, in systems like Myotis, phylogenetic inferences are challenging. Our strategy involved using multiple methods with differing assumptions, with the hope that we could identify congruence across phylogeny estimates. However, while species tree methods like ASTRAL and SVDQuartets show the M. lucifugus lineages as paraphyletic groups, the recovered topologies are conspicuously different. We question the tree inferred using SVDQuartets because the western long-eared species (M. evotis, M. thysanodes, and M. keenii) do not form a clade as they do in the ASTRAL analysis, which matches previous phylogeny estimates using everything from mitochondrial (Dewey 2006) to UCE data (Morales et al. 2017; Platt et al. 2018). However, given that neither method incorporates gene flow, the pertinent question for interpretation is the extent to which gene flow leads to inaccuracy in the estimation of either the topology or the divergence times. Simulation studies have suggested that species relationships and divergence time inferences are possible under certain conditions that include gene flow (Eckert and Carstens 2008; Leache et al. 2014). The challenge for empiricists is to understand how estimates of phylogeny may have been biased in their system. For example, could gene flow between nonsister lineages in our system lead to the apparent paraphyly and unclear lineage relationships of the M. lucifugus subspecies? Our results indicate that geographic proximity appears to be predictive of clade relationships in the species tree. For instance, M. l. alascensis and M. l. relictus exhibit contiguous distributions that are partly sympatric, these are also in sympatry with M. volans, and these three lineages form a clade. Similarly, M. l. pernox and M. l. carissima form a clade with the western long-eared species, to which they are broadly sympatric. Our skepticism about the topology estimated in the ASTRAL analysis is largely based on these patterns.
It is clear that methods development is needed if we are to fully take advantage of the recent advances in sequencing technology that have enabled researchers to collect genomic data sets in nonmodel systems such as Myotis bats. Here, we applied two analyses designed to evaluate if a species tree is a reasonable model for North American Myotis. First, we developed an ABC approach based on gene tree metrics to compare phylogenetic hypotheses that can incorporate gene flow. Results from these estimations support the suggestion that gene flow in North American Myotis occurs at interspecific levels whenever species come into sympatry, regardless of whether they are sister species or not. However, it remains to be explored if genetic interchange occurred in the past during early stages of divergence (i.e., speciation with gene flow), or if it occurred after lineages diverged in isolation (i.e., secondary contact), or if it has occurred constantly (i.e., genomic regions are subject to strong divergent selection to maintain species barriers despite gene flow). Based on previous investigations including the western long-eared species showing that gene flow occurred only during early stages of divergence (Morales et al. 2017), we suspect that this could be the case for the North American clade. Secondarily, results from phylogenetic network estimations support that at least four events of genetic exchange have occurred and at least one these occurred deeper in the topology. Even though the species relationships are different from those inferred by species tree methods, these findings suggest that genetic exchange occurred at early stages of divergence among some ancestral taxa and current lineages might be at different stages of the speciation spectrum. Simulation studies using genomic data and large trees have suggested that when genetic exchange has occurred between ancestral nodes, hybrid detection methods are good at identifying taxa in which gene flow has occurred, but these methods are not as effective to unambiguously picking out parental or sister species (Kubatko and Chifman 2015). Given that ABC results suggest that reticulations have occurred more than four times and some of these represent ancestral events, species relationships might be challenging to infer with current tools, especially among lineages that recently diverged.
Our results emphasize the need for new phylogenetic tools for species tree inference that can accommodate genetic exchange among lineages. While most species tree methods assume that incongruence between gene trees and species trees results from ILS, our investigation (among others) demonstrates that such assumptions might not suffice. Phylogenetic networks provide one alternative and current implementations represent a great advance to infer whether reticulation events occur between ancestral or more derived nodes, but have limitations, including the amount of data, number of individuals per taxon and number of taxa. We have addressed some of these issues by implementing a subsampling approach, where one or a few alleles per species were sampled hundreds of times for each locus and analyzed independently. Even when we drastically reduced our data set and used only a single individual per species, computational time was substantial, and it became a constraint as we attempted to analyze multiple individuals per lineage. Tools like ASTRAL have addressed this problem allowing multiple individuals per lineage, missing taxa from gene trees and still being able to include them in the species tree estimation at the cost of ignoring genetic exchange. Alternatively, methods such as PHRAPL (Jackson et al. 2017b) can handle big data sets, allow missing data and account for parameters such as divergence and gene flow but are limited by the number of lineages that can be analyzed due to model space conflicts.
Taxonomic and Conservation Implications on North American Myotis
Myotis bats are one of the most taxonomically complicated groups within the order Chiroptera. The apparent morphological resemblance between species that occur in sympatry and potential interspecific gene flow have led to contradicting species relationships in studies based on generic morphological features (i.e., length of the forearms, length of the skull) and a few loci. It is not surprising that novel phylogenetic analyses with genomic data are contradicting previously described species relationships and species limits. Here, we have shown that the M. lucifugus subspecies recognized in Simmons (2005) are lineages evolving independently that seem to be paraphyletic, we thus suggest they should be considered independent species, under the phylogenetic species concept, as follows: M. alascensis, M. carissima, M. lucifugus, M. relictus, and M. pernox. For five lineages that were only described as subspecies on the basis of morphological similarities and have been studied by several decades as the same entity, this result is unexpected. Even though these results might represent a big change in the current taxonomy of North American Myotis, similar changes have been derived from studies using molecular data. For example, when DNA analyses of two mitochondrial loci were used for the first time to infer a phylogeny of this genus, species thought to be closely related, given their phenotypic resemblance, were found to be paraphyletic (Ruedi and Mayer 2001). Moreover, genomic studies using UCEs for phylogenetic inferences of Myotis in the Americas have also found paraphyletic patterns in M. lucifugus lineages and estimated species relations that contradict previous findings based on mitochondrial loci (Platt et al. 2018). However, we do not argue that current topologies, estimated here or in other studies, represent conclusive evidence given the lack of methods to incorporate gene flow into phylogenomic estimations. The morphological resemblance among nonsister species leads to suspect that other evolutionary forces might have a great effect on the phenotypic similarities within this group. For example, strong environment-driven selection might be occurring in some genomic regions, favoring similar phenotypes among sympatric species that exchanged genes at some point in their evolutionary history. To address current confusions due to apparent morphological similarity among these lineages, exhaustive taxonomic revisions based on geometric morphometric and genomic analyses including coding regions coupled with novel analytical approaches are required to identify features that could help to reclassify specimens within this genus. Finally, our findings also shed light upon conservation priorities. For instance, M. relictus and M. pernox have small distributions and biological surveys are urgent to evaluate the status of their populations, which could be vulnerable or under threat. Myotis pernox represents a priority due to their range is less than 1000 km from localities in Washington were white-nose syndrome occurrences were reported in March 2016 (https://www.whitenosesyndrome.org).
In summary, we document an example of interspecific gene flow among nonsister species in a genus of mammals, where genetic exchange may be hindering phylogenetic signal. Our findings demonstrate that there are at least five independent lineages within the M. lucifugus group, which are paraphyletic and should be considered different species. In addition, results of simulation analyses and model testing demonstrated that gene flow should be considered while species relationships are inferred because it is a common process that has occurred throughout the evolutionary history of Myotis bats in North America. We implemented an approach for explicit modeling of genetic exchange among sympatric and allopatric lineages that can be extended to other taxonomic groups to explore if gene flow should be accounted for during phylogenetic estimations.
Supplementary Material
Data and Supplementary Material available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.9d6b2.
Funding
The National Science Foundation funded this research (DEB-1257784) and additional founding was provided through a Doctoral Dissertation Improvement Grant (DEB-1701810) awarded to A.E.M. and B.C.C. The Ohio Supercomputer Center allocated resources to support part of this study (PAS1184-1). Support for A.E.M. was provided in part by a graduate fellowship at The Ohio State University funded by Consejo Nacional de Ciencia y Tecnologia (Reg. 217900 CVU 324588).
Acknowledgments
We thank museum curators who lent us tissue samples of specimens under their care: AMNH, American Museum of Natural History, New York, USA; ASNHC, Angelo State Natural History Collection, Texas, USA; CIIDIR-Durango, Colección de Mamíferos, Centro Interdisciplinario de Investigación para el Desarrollo Integral Regional, Durango, México; UABC, Colección de Mamíferos, Universidad Autónoma de Baja California, México; UMMZ, University of Michigan Museum Zoology; Tanya Dewey donated DNA extractions. We thank Brian O’Meara, Nathan Jackson, and members of the Carstens lab for conversations related to this work.