Published October 14, 2024 | Version v1
Software Open

Data for: PickMe: sample selection for species tree reconstruction using coalescent weighted quartets

  • 1. Hobart and William Smith Colleges
  • 2. University of Kentucky
  • 3. A2Bio2
  • 4. Oklahoma State University

Description

After collecting large data sets for phylogenomics studies, researchers must decide which genes or samples to include when reconstructing a species tree. Incomplete or unreliable data sets make this decision more difficult, and researchers rely on ad hoc strategies to maximize sampling while ensuring sufficient data for accurate inferences. An algorithm called PickMe formalizes the sample selection process, assuming that the samples evolved under the Tree Multispecies Coalescent model. We propose a Bayesian framework for selecting samples for species tree analysis. Given a collection of gene trees, we compute a posterior probability for each quartet, describing the probability that the species tree displays this topology. From these probabilities, we assign each sample a reliability score computed as the average of a scaled version of the posterior probabilities. PickMe uses these scores to recommend which samples to include in a species tree analysis. Analysis of simulated data showed that including the samples suggested by PickMe produced species trees closer to the true species trees than both unfiltered data sets and data sets with ad hoc gene occupancy cut-offs applied. To further illustrate the efficacy of this tool, we apply PickMe to gene trees generated from target capture data from milkweeds. PickMe indicates that more samples could reliably have been included in a previous milkweed phylogenomic analysis than its authors, who lacked a formal methodology for sample selection, chose to analyze. Using simulated and empirical data, we also compare PickMe to existing sample selection methods. Inclusion of PickMe will enhance phylogenomics data analysis pipelines by providing a formal structure for sample selection.
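
The scoring step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PickMe implementation: the input dictionary of quartet posteriors, the linear rescaling from [1/3, 1] to [0, 1], the function names, and the 0.5 threshold are all hypothetical choices made for the example.

```python
from collections import defaultdict


def sample_reliability(quartet_posteriors):
    """Average a rescaled quartet posterior over all quartets containing each sample.

    quartet_posteriors (hypothetical input format) maps a frozenset of four
    sample names to the posterior probability of that quartet's best topology.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for quartet, posterior in quartet_posteriors.items():
        if len(quartet) != 4:
            raise ValueError("each key must contain exactly four samples")
        # Assumed scaling: 1/3 (no preference among three topologies) -> 0, 1 -> 1.
        scaled = (posterior - 1.0 / 3.0) / (2.0 / 3.0)
        scaled = min(1.0, max(0.0, scaled))
        for sample in quartet:
            totals[sample] += scaled
            counts[sample] += 1
    return {sample: totals[sample] / counts[sample] for sample in totals}


def recommend_samples(scores, threshold=0.5):
    """Return the samples whose score meets a (hypothetical) threshold."""
    return sorted(sample for sample, score in scores.items() if score >= threshold)
```

A sample that appears only in poorly resolved quartets receives a score near 0 under this scaling, while a sample whose quartets are all strongly supported receives a score near 1.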

Notes

Funding provided by: National Science Foundation
ROR ID: https://ror.org/021nxhr62
Award Number: DMS 1616186

Funding provided by: National Science Foundation
ROR ID: https://ror.org/021nxhr62
Award Number: DEB 1457510

Funding provided by: National Science Foundation
ROR ID: https://ror.org/021nxhr62
Award Number: DEB 1457473

Funding provided by: National Science Foundation
ROR ID: https://ror.org/021nxhr62
Award Number: DMS 1929284

Methods

We obtained targeted sequence data for 763 putatively single-copy nuclear loci for samples of 59 North American milkweed species, three African outgroup species (Asclepias physocarpa, A. fruticosa, and A. fornicata), and one additional outgroup, Pergularia daemia, using the target enrichment baits of Weitemier et al. (2014) (Supplemental Material). Data for 32 of these samples and orthologs from the genome sequence of Asclepias syriaca (Weitemier et al. 2019) were included in the analyses of Boutte et al. (2019), and nuclear sequence data for the additional 30 samples were generated using the DNA sequencing and assembly methods described therein. Boutte et al. (2019) had excluded the 30 newly analyzed samples based on an ad hoc minimum gene recovery criterion of 600 genes (79%) with the goal of high gene occupancy for all samples for species tree analyses. For the analyses conducted here, we masked assembled sequences with Ns for very low read depth (≤ 2 reads) and at heterozygous sites (i.e., intra-individual SNPs). For each gene, we aligned masked sequences using Mafft version 7.245 with default parameters (Katoh and Standley 2013), and removed sequences with less than 50% of the total alignment length (i.e., Type II missing data; Hosner et al. 2016) to reduce gene tree error following Sayyari et al. (2017) and Mirarab (2019).
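
As an illustration of the 50% length filter described above, the following Python sketch drops short sequences from one gene alignment. It is a minimal sketch, not the script used for this data set: the function name, the dictionary input format, and the choice to treat gaps, `?`, and masked `N` characters as missing are assumptions.

```python
def drop_short_sequences(alignment, min_fraction=0.5, missing="-?Nn"):
    """Keep sequences covering at least min_fraction of the alignment length.

    alignment maps sample names to aligned sequences of equal length.
    Characters listed in `missing` do not count toward coverage (assumption).
    """
    if not alignment:
        return {}
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    aln_len = lengths.pop()
    kept = {}
    for name, seq in alignment.items():
        informative = sum(1 for ch in seq if ch not in missing)
        # Sequences with less than min_fraction coverage are removed.
        if aln_len > 0 and informative / aln_len >= min_fraction:
            kept[name] = seq
    return kept
```

Whether masked Ns should count toward sequence length is a judgment call; passing `missing="-?"` would count them instead.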

For further analysis, we selected a subset of 703 genes, which had been identified by Boutte et al. (2019) as producing the best-resolved milkweed phylogenies based on bootstrap support across the gene trees. For the complete data set of 63 species, we first estimated the 703 gene trees using Neighbor-Joining on uncorrected distances (the proportion of observed differences in the aligned sequences) as implemented in the ape package (Paradis and Schliep 2018) in R v. 3.5.1 (R Core Team). Using these estimated gene trees, we identified the samples to be included in species tree analyses using PickMe. To determine whether the gene tree inference method affected the sample selection results, we also used the GTR+Gamma model in RAxML v. 8.2.12 (Stamatakis 2014) to estimate the initial gene trees. For the set of samples identified as reliable by PickMe, we realigned the sequences and then removed small alignments (< 100 bp) following Boutte et al. (2019). We then used IQ-Tree v. 1.5.4 (Nguyen et al. 2014; Chernomor et al. 2016) to select the best model of molecular evolution for the retained alignments and inferred the gene tree for each locus using the same parameters as Boutte et al. (2019). Using ASTRAL-II v. 4.10.12 (Mirarab and Warnow 2015) with default parameters, we inferred a species tree and calculated local posterior probability support (Sayyari and Mirarab 2016). We calculated gene concordance factors using the method of Minh et al. (2020), implemented in IQ-Tree v. 2.1.2 (Nguyen et al. 2014; Chernomor et al. 2016). For comparison, we repeated, for the full data set and with identical methods, the gene and species tree analyses performed on the subset of samples that PickMe identified as reliable.
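
The uncorrected distance used for the initial Neighbor-Joining trees is simply the proportion of aligned sites at which two sequences differ. The Python sketch below shows that calculation for illustration only. It is not the ape implementation: the function names are hypothetical, and the sketch assumes pairwise deletion, meaning that a site is skipped whenever either sequence has a gap, N, or other non-ACGT character.

```python
def p_distance(seq_a, seq_b, valid="ACGT"):
    """Proportion of differing sites among sites where both sequences have a base."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same aligned length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in valid and b in valid:
            compared += 1
            if a != b:
                differences += 1
    if compared == 0:
        return float("nan")  # no comparable sites
    return differences / compared


def distance_matrix(alignment):
    """Pairwise p-distances for a dict of sample name -> aligned sequence."""
    names = sorted(alignment)
    return {
        (x, y): p_distance(alignment[x], alignment[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }
```

The resulting pairwise distances are the kind of input a Neighbor-Joining routine takes; pairs with no comparable sites are returned as NaN and would need to be handled before tree building.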

Files

Files (296.9 kB)

Checksum (MD5) Size
md5:d02ed1072765c2c4d334314e6f14e590 7.7 kB
md5:f52fd396d47cc66d6a5bc08cbea07dff 11.0 kB
md5:a50906bab57ec2a92c4cd2f37a961ffc 1.2 kB
md5:7d9692c64e948d389f9f71ee089a18ed 1.5 kB
md5:bddb002824255ebc18ff11c8b6e1925e 164.1 kB
md5:7380f3179b71de515ac8db65ca57cb4d 8.6 kB
md5:52dcaf364b3506415d3e2bf343eb2fff 546 Bytes
md5:398ce831ea5d332c9cd67c644fe42e95 8.6 kB
md5:4a70c8e7c36cef4870f9f28ef6c70952 264 Bytes
md5:6e3eaf22a573fd538db39af6c3a5b670 68.9 kB
md5:dbb1161cc391eb98432f0531c6dab8bd 757 Bytes
md5:1257a6d31165b39ee72194d3ee761d2e 13.8 kB
md5:a24e64554b1348f22b7a807de0478d26 675 Bytes
md5:6ea6a8ee991ee4595ae77b8047681559 5.2 kB
md5:284b625b922f04e0a96c3780101ba8a5 3.9 kB

Additional details

Related works