Published August 10, 2015 | Version v1
Dataset Open

Data from: How should genes and taxa be sampled for phylogenomic analyses with missing data? An empirical study in iguanian lizards

  • 1. University of Arizona
  • 2. Clarkson University

Description

Targeted sequence capture is becoming a widespread tool for generating large phylogenomic data sets to address difficult phylogenetic problems. However, this methodology often generates data sets in which increasing the number of taxa and loci increases amounts of missing data. Thus, a fundamental (but still unresolved) question is whether sampling should be designed to maximize sampling of taxa or genes, or to minimize the inclusion of missing data cells. Here, we explore this question for an ancient, rapid radiation of lizards, the pleurodont iguanians. Pleurodonts include many well-known clades (e.g., anoles, basilisks, iguanas, and spiny lizards) but relationships among families have proven difficult to resolve strongly and consistently using traditional sequencing approaches. We generated up to 4921 ultraconserved elements with sampling strategies including 16, 29, and 44 taxa, from 1179 to approximately 2.4 million characters per matrix and approximately 30% to 60% total missing data. We then compared mean branch support for interfamilial relationships under these 15 different sampling strategies for both concatenated (maximum likelihood) and species tree (NJst) approaches (after showing that mean branch support appears to be related to accuracy). We found that both approaches had the highest support when including loci with up to 50% missing taxa (matrices with ∼40–55% missing data overall). Thus, our results show that simply excluding all missing data may be highly problematic as the primary guiding principle for the inclusion or exclusion of taxa and genes. The optimal strategy was somewhat different for each approach, a pattern that has not been shown previously. For concatenated analyses, branch support was maximized when including many taxa (44) but fewer characters (1.1 million). For species-tree analyses, branch support was maximized with minimal taxon sampling (16) but many loci (4789 of 4921). 
We also show that the choice of these sampling strategies can be critically important for phylogenomic analyses, since some strategies lead to demonstrably incorrect inferences (using the same method) that have strong statistical support. Our preferred estimate provides strong support for most interfamilial relationships in this important but phylogenetically challenging group.
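As a rough illustration of the kind of filtering described above (keeping loci with at most a given proportion of missing taxa, then measuring the missing data in the resulting matrix), here is a minimal Python sketch. It is not the authors' pipeline: the input layout (a dict mapping locus names to per-taxon aligned sequences), the function names `filter_loci` and `missing_cell_fraction`, and the choice to count `-`, `?`, and `N` as missing cells are all assumptions made for this example.

```python
from typing import Dict, List

# Hypothetical input layout: locus name -> {taxon name -> aligned sequence}.
# A taxon that is absent from a locus's dict is treated as missing for that locus.
Alignment = Dict[str, str]


def filter_loci(loci: Dict[str, Alignment], taxa: List[str],
                max_missing_taxa: float) -> Dict[str, Alignment]:
    """Keep loci whose fraction of missing taxa is <= max_missing_taxa."""
    kept = {}
    for name, aln in loci.items():
        n_missing = sum(1 for t in taxa if t not in aln)
        if n_missing / len(taxa) <= max_missing_taxa:
            kept[name] = aln
    return kept


def missing_cell_fraction(loci: Dict[str, Alignment], taxa: List[str]) -> float:
    """Fraction of cells in the concatenated matrix that carry no data.

    Assumption: absent taxa contribute a full row of missing cells for the
    locus, and '-', '?', 'N' within a sequence also count as missing.
    """
    total = 0
    missing = 0
    for aln in loci.values():
        length = len(next(iter(aln.values())))  # all sequences in a locus are aligned
        for t in taxa:
            total += length
            seq = aln.get(t)
            if seq is None:
                missing += length
            else:
                missing += sum(1 for c in seq.upper() if c in "-?N")
    return missing / total if total else 0.0


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    taxa = ["A", "B", "C", "D"]
    loci = {
        "uce-1": {"A": "ACGT", "B": "ACGT", "C": "AC-T", "D": "ACGT"},
        "uce-2": {"A": "ACG", "B": "ACG"},   # 50% of taxa missing
        "uce-3": {"A": "AC"},                # 75% of taxa missing
    }
    kept = filter_loci(loci, taxa, max_missing_taxa=0.5)
    print(sorted(kept), missing_cell_fraction(kept, taxa))
```

Varying `max_missing_taxa` and the taxon list is one way to generate a grid of matrices that trade off taxon sampling, locus sampling, and missing data, which is the comparison the description refers to.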

Files

16_taxa_0.20_bootstrap_zip.zip

Files (215.2 MB)

Checksum Size
md5:b9c48908d15a357dd2df13edc0bcf843 4.4 MB
md5:fca07545922e29bc8675ed7a922a8179 8.2 MB
md5:e4199f1bd13245b23ff8753a401128b8 9.2 MB
md5:ef42aae67457ac9e923b875c1c3ff121 9.8 MB
md5:c59047e6dd4b70b595753ff81a5431c0 10.2 MB
md5:ca4eb2d74c3595c2e59c99dc459ebe5d 1.4 MB
md5:37ab4c2f152f4ae04e2c530745248718 6.7 MB
md5:b00313a36134d21364453e449279faf8 12.3 MB
md5:d32bdb0080b0bf2b8645581020c56814 14.6 MB
md5:b2fe17e6e3aa9befd9dcf9522b1bb689 15.6 MB
md5:88348426aeca671e51d2851f5fe36f34 21.3 kB
md5:d127a598074f5c9d36a819c7f107290c 408.7 kB
md5:4f5bc821a6744add77b6af6974a938cb 4.3 MB
md5:da208e614a7cb15741f7e96fd27f8253 12.2 MB
md5:f062d92478bea7aa1b67d280b351be95 13.7 MB
md5:394b47f014d27ea9a65466bbb79e3ce5 6.6 MB
md5:8767e51732b06318c5f799bbd11ee077 84.6 MB
md5:1b936ecec515326b8d9cc5ff1f0112f3 712.3 kB
md5:2e525b5596cf9a1c738affe5aa0153a6 78.8 kB
md5:2b1faac532b9989a4575b40327790fa7 74.2 kB
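
The md5 values listed for each file can be used to confirm that a download arrived intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `verify_download` are hypothetical, and the file path in the usage comment is a placeholder, since this listing does not pair most checksums with file names.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected: str) -> bool:
    """Compare a file's MD5 with a listed value such as 'md5:b9c4...'.

    The optional 'md5:' prefix used in the listing is stripped first.
    """
    expected_hex = expected.strip().lower()
    if expected_hex.startswith("md5:"):
        expected_hex = expected_hex[len("md5:"):]
    return md5_of_file(path) == expected_hex


# Usage (placeholder path; substitute the file you actually downloaded):
# ok = verify_download("downloaded_file.zip", "md5:b9c48908d15a357dd2df13edc0bcf843")
```

Reading in chunks keeps memory use flat, which matters for the larger archives in the listing.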

Additional details

Related works

Is cited by
10.1093/sysbio/syv058 (DOI)