Data for: PickMe: sample selection for species tree reconstruction using coalescent weighted quartets
Creators
- 1. Hobart and William Smith Colleges
- 2. University of Kentucky
- 3. A2Bio2
- 4. Oklahoma State University
Description
After collecting large data sets for phylogenomics studies, researchers must decide which genes or samples to include when reconstructing a species tree. Incomplete or unreliable data sets make the empiricist's decision more difficult. Researchers rely on ad hoc strategies to maximize sampling while ensuring sufficient data for accurate inferences. An algorithm called PickMe formalizes the sample selection process, assuming that the samples evolved under the Tree Multispecies Coalescent model. We propose a Bayesian framework for selecting samples for species tree analysis. Given a collection of gene trees, we compute a posterior probability for each quartet, describing the likelihood that the species tree displays this topology. From this, we assign individual samples reliability scores computed as the average of a scaled version of the posterior probabilities. PickMe uses these weights to recommend which samples to include in a species tree analysis. Analysis of simulated data showed that including the samples suggested by PickMe produced species trees closer to the true species trees than both unfiltered data sets and data sets with ad hoc gene occupancy cut-offs applied. To further illustrate the efficacy of this tool, we apply PickMe to gene trees generated from target capture data from milkweeds. PickMe indicates more samples could have reliably been included in a previous milkweed phylogenomic analysis than the authors analyzed without access to a formal methodology for sample selection. Using simulated and empirical data, we also compare PickMe to existing sample selection methods. Inclusion of PickMe will enhance phylogenomics data analysis pipelines by providing a formal structure for sample selection.
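The scoring step described above (a per-sample average of scaled quartet posteriors) can be illustrated with a minimal Python sketch. This is not the PhyloPickMe.jl implementation. The input format, the linear scaling that maps 1/3 to 0 and 1 to 1, the 0.5 threshold, and the function names `reliability_scores` and `recommend` are all assumptions made for illustration. The description does not specify how the posteriors are computed or scaled, so the posteriors are taken as given.

```python
from collections import defaultdict


def reliability_scores(quartet_posteriors):
    """Average a scaled quartet posterior over every quartet containing each sample.

    quartet_posteriors maps a frozenset of four sample names to the posterior
    probability of that quartet's best-supported topology (hypothetical input).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for quartet, posterior in quartet_posteriors.items():
        if len(quartet) != 4:
            raise ValueError("each key must contain exactly four samples")
        # Assumed scaling: 1/3 (no signal among three topologies) -> 0, 1 -> 1.
        scaled = (3.0 * posterior - 1.0) / 2.0
        for sample in quartet:
            totals[sample] += scaled
            counts[sample] += 1
    return {sample: totals[sample] / counts[sample] for sample in totals}


def recommend(scores, threshold=0.5):
    """Return samples whose score meets an illustrative (assumed) threshold."""
    return sorted(s for s, value in scores.items() if value >= threshold)


if __name__ == "__main__":
    # Toy posteriors for two quartets over five hypothetical samples.
    toy = {frozenset("ABCD"): 1.0, frozenset("ABCE"): 1.0 / 3.0}
    toy_scores = reliability_scores(toy)
    print(toy_scores)
    print(recommend(toy_scores))
```

In this toy input, a sample that appears only in poorly supported quartets receives a low score, which is the behavior the description attributes to the reliability weights.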
Notes
Methods
We obtained targeted sequence data for 763 putatively single-copy nuclear loci for samples of 59 North American milkweed species, three African outgroup species (\textit{Asclepias physocarpa}, \textit{A. fruticosa}, and \textit{A. fornicata}), and one additional outgroup, \textit{Pergularia daemia}, using the target enrichment baits of Weitemier et al. (2014) (Supplemental Material~\protect\ref{app:milkweed}). Data for 32 of these samples and orthologs from the genome sequence of \textit{Asclepias syriaca} \citep{weitemier2019draft} were included in the analyses of \cite{BOUTTE2019106534}, and nuclear sequence data for the additional 30 samples were generated using the DNA sequencing and assembly methods described therein. \cite{BOUTTE2019106534} had excluded the 30 newly analyzed samples based on an ad hoc minimum gene recovery criterion of 600 genes (79\%), with the goal of high gene occupancy for all samples in species tree analyses. For the analyses conducted here, we masked assembled sequences with Ns at sites with very low read depth ($\le 2$ reads) and at heterozygous sites (i.e., intra-individual SNPs). For each gene, we aligned the masked sequences using MAFFT version 7.245 with default parameters \citep{katoh2013mafft}, and removed sequences covering less than 50\% of the total alignment length \citep[i.e., Type II missing data;][]{hosner2016avoiding} to reduce gene tree error, following \cite{sayyari2017fragmentary} and \cite{mirarab2019species}.
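The 50\% length filter described above can be sketched in Python as follows. This is an illustrative reading of the criterion, not the authors' script. The function name `filter_short_sequences`, the dictionary input, and the choice to count gaps, `?`, and masked `N` characters as missing are assumptions.

```python
def filter_short_sequences(alignment, min_fraction=0.5, missing="-?Nn"):
    """Keep aligned sequences whose non-missing sites cover at least
    min_fraction of the alignment length.

    alignment maps sample names to equal-length aligned sequences
    (hypothetical input format).
    """
    if not alignment:
        return {}
    length = len(next(iter(alignment.values())))
    if length == 0:
        return {}
    kept = {}
    for name, seq in alignment.items():
        if len(seq) != length:
            raise ValueError(f"{name} is not aligned to length {length}")
        # Assumption: gaps, '?', and masked N sites all count as missing data.
        informative = sum(1 for ch in seq if ch not in missing)
        if informative / length >= min_fraction:
            kept[name] = seq
    return kept


if __name__ == "__main__":
    toy = {
        "s1": "ACGTACGT",  # fully covered
        "s2": "AC------",  # mostly gaps
        "s3": "ACGT----",  # exactly half covered
        "s4": "NNNNNACG",  # mostly masked
    }
    print(sorted(filter_short_sequences(toy)))
```

A sequence covering exactly half of the alignment is retained here, because the text removes only sequences with less than 50\% coverage.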
For further analysis, we selected a subset of 703 genes, which had been identified by \cite{BOUTTE2019106534} as producing the best-resolved milkweed phylogenies based on bootstrap support across the gene trees. For the complete data set of 63 species, we first estimated the 703 gene trees using Neighbor-Joining on uncorrected distances (the proportion of observed differences in the aligned sequences) as implemented in the ape package \citep{paradis2018ape} in R v. 3.5.1 \citep{R}. Using these estimated gene trees, we identified the samples to be included in species tree analyses using \emph{PickMe}. To determine whether the gene tree inference method affected the sample selection results, we also estimated the initial gene trees using the GTR+Gamma model in RAxML v. 8.2.12 \citep{stamatakis2014raxml}. For the set of samples identified as reliable by \emph{PickMe}, we realigned the sequences and then removed small alignments ($< 100$ bp) following \cite{BOUTTE2019106534}. We then used IQ-TREE v. 1.5.4 \citep{nguyen2014iqtree,chernomor2016terrace} to select the best model of molecular evolution for the retained alignments and inferred the gene tree for each locus using the same parameters as \cite{BOUTTE2019106534}. Using ASTRAL-II v. 4.10.12 \citep{mirarab2015astral} with default parameters, we inferred a species tree and calculated local posterior probability support \citep{sayyari2016fast}. We calculated gene concordance factors using the method of \cite{Minh2020new}, implemented in IQ-TREE v. 2.1.2 \citep{nguyen2014iqtree,chernomor2016terrace}. For comparison, we repeated these gene tree and species tree analyses on the full data set, using the same methods as for the subset of samples that \emph{PickMe} identified as reliable.
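The uncorrected distance used for the Neighbor-Joining gene trees is the proportion of observed differences between two aligned sequences. The following Python sketch shows that quantity for illustration only; the analyses above used the ape package in R, not this code. The function names and the pairwise handling of gaps and masked sites (skipping any site where either sequence is missing) are assumptions and may differ from ape's defaults.

```python
from itertools import combinations


def p_distance(seq_a, seq_b, missing="-?N"):
    """Proportion of differing sites among sites present in both sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Assumption: skip sites where either sequence has a gap or masked base.
        if a in missing or b in missing:
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        return float("nan")  # no shared sites, so the distance is undefined
    return differences / compared


def distance_matrix(alignment):
    """Return {(name_i, name_j): p-distance} for every unordered pair."""
    return {
        (i, j): p_distance(alignment[i], alignment[j])
        for i, j in combinations(sorted(alignment), 2)
    }


if __name__ == "__main__":
    toy = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "ACGA--NN"}
    for pair, dist in distance_matrix(toy).items():
        print(pair, dist)
```

A matrix of this kind is the input that a Neighbor-Joining implementation would cluster into a gene tree.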
Files
(296.9 kB)
MD5 checksum | Size |
---|---|
md5:d02ed1072765c2c4d334314e6f14e590 | 7.7 kB |
md5:f52fd396d47cc66d6a5bc08cbea07dff | 11.0 kB |
md5:a50906bab57ec2a92c4cd2f37a961ffc | 1.2 kB |
md5:7d9692c64e948d389f9f71ee089a18ed | 1.5 kB |
md5:bddb002824255ebc18ff11c8b6e1925e | 164.1 kB |
md5:7380f3179b71de515ac8db65ca57cb4d | 8.6 kB |
md5:52dcaf364b3506415d3e2bf343eb2fff | 546 Bytes |
md5:398ce831ea5d332c9cd67c644fe42e95 | 8.6 kB |
md5:4a70c8e7c36cef4870f9f28ef6c70952 | 264 Bytes |
md5:6e3eaf22a573fd538db39af6c3a5b670 | 68.9 kB |
md5:dbb1161cc391eb98432f0531c6dab8bd | 757 Bytes |
md5:1257a6d31165b39ee72194d3ee761d2e | 13.8 kB |
md5:a24e64554b1348f22b7a807de0478d26 | 675 Bytes |
md5:6ea6a8ee991ee4595ae77b8047681559 | 5.2 kB |
md5:284b625b922f04e0a96c3780101ba8a5 | 3.9 kB |
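A downloaded file can be checked against a checksum listed above. The Python sketch below uses the standard `hashlib` module. The function names and any file path passed to them are hypothetical, since the record does not preserve the file names here.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a listed value such as 'md5:d02ed1...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listed_checksum(local_path, "md5:d02ed1072765c2c4d334314e6f14e590")` would return `True` only if `local_path` (a hypothetical local copy) holds the 7.7 kB file with that checksum.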
Additional details
Related works
- Is derived from
- https://github.com/jrusinko/PhyloPickMe.jl (URL)
- Is source of
- 10.5061/dryad.3r2280ggv (DOI)