Published December 4, 2018 | Version v1
Dataset Open

Data from: A universal probe set for targeted sequencing of 353 nuclear genes from any flowering plant designed using k-medoids clustering

  • 1. Texas Tech University
  • 2. Royal Botanic Gardens
  • 3. Plant Science and Conservation, Chicago Botanic Garden, 1000 Lake Cook Road, Glencoe, IL 60022, USA*
  • 4. University of Georgia
  • 5. University of Florida
  • 6. University of Alberta
  • 7. Northwestern University

Description

Sequencing of target-enriched libraries is an efficient and cost-effective method for obtaining DNA sequence data from hundreds of nuclear loci for phylogeny reconstruction. Much of the cost of developing targeted sequencing approaches is associated with the generation of preliminary data needed for the identification of orthologous loci for probe design. In plants, identifying orthologous loci has proven difficult due to a large number of whole-genome duplication events, especially in the angiosperms (flowering plants). We used multiple sequence alignments from over 600 angiosperms for 353 putatively single-copy protein-coding genes identified by the One Thousand Plant Transcriptomes Initiative to design a set of targeted sequencing probes for phylogenetic studies of any angiosperm group. To maximize the phylogenetic potential of the probes while minimizing the cost of production, we introduce a k-medoids clustering approach to identify the minimum number of sequences necessary to represent each coding sequence in the final probe set. Using this method, five to 15 representative sequences were selected per orthologous locus, representing the sequence diversity of angiosperms more efficiently than if probes were designed using available sequenced genomes alone. To test our approximately 80,000 probes, we hybridized libraries from 42 species spanning all higher-order groups of angiosperms, with a focus on taxa not present in the sequence alignments used to design the probes. Out of a possible 353 coding sequences, we recovered an average of 283 per species and at least 100 in all species. Differences among taxa in sequence recovery could not be explained by relatedness to the representative taxa selected for probe design, suggesting that there is no phylogenetic bias in the probe set. 
Our probe set, which targeted 260 kbp of coding sequence, achieved a median recovery of 137 kbp per taxon in coding regions, a maximum recovery of 250 kbp, and an additional median of 212 kbp per taxon in flanking non-coding regions across all species. These results suggest that the Angiosperms353 probe set described here is effective for any group of flowering plants and would be useful for phylogenetic studies from the species level to higher-order groups, including the entire angiosperm clade itself.
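The clustering idea in the description can be illustrated with a minimal sketch. This is not the authors' pipeline: the distance measure (proportion of differing sites in an alignment), the coverage threshold `max_dist`, the simple alternating update, and all function names are assumptions chosen for illustration. The sketch raises k until every sequence lies within `max_dist` of its nearest medoid and returns those medoids as the representatives.

```python
import random

def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences,
    ignoring columns where either sequence has a gap ('-')."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0
    return sum(x != y for x, y in pairs) / len(pairs)

def assign(dist, medoids):
    """Assign every item to its nearest medoid; a medoid stays in its own cluster."""
    clusters = {m: [] for m in medoids}
    medoid_set = set(medoids)
    for i in range(len(dist)):
        nearest = i if i in medoid_set else min(medoids, key=lambda m: dist[i][m])
        clusters[nearest].append(i)
    return clusters

def k_medoids(dist, k, seed=0, max_iter=100):
    """Alternating k-medoids on a precomputed distance matrix (illustrative only)."""
    rng = random.Random(seed)
    medoids = sorted(rng.sample(range(len(dist)), k))
    for _ in range(max_iter):
        clusters = assign(dist, medoids)
        # The new medoid of a cluster is the member with the smallest
        # total distance to the other members of that cluster.
        updated = sorted(
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            for members in clusters.values()
        )
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(dist, medoids)

def min_representatives(seqs, max_dist, seed=0):
    """Smallest medoid set such that every sequence is within max_dist
    of a medoid. max_dist is a hypothetical threshold."""
    dist = [[p_distance(a, b) for b in seqs] for a in seqs]
    for k in range(1, len(seqs) + 1):
        medoids, _ = k_medoids(dist, k, seed=seed)
        worst = max(min(dist[i][m] for m in medoids) for i in range(len(seqs)))
        if worst <= max_dist:
            return medoids
    return list(range(len(seqs)))

# Toy alignment with two clearly separated groups of sequences.
toy = ["AAAAAAAA", "AAAAAAAT", "AAAAAATT", "CCCCCCCC", "CCCCCCCG", "CCCCCCGG"]
representatives = min_representatives(toy, max_dist=0.15)
print([toy[i] for i in representatives])
```

Because a medoid is always one of the input sequences, each representative is a real sequence that could serve as a probe template, which is the practical reason to prefer medoids over averaged centroids in this setting.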

Notes

Funding provided by: National Science Foundation
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000001
Award Number: DEB-1239992

Files

onekp_only_angios_degapped.zip

Files (684.5 MB)

Checksum Size
md5:686435057042f7aec2d70815cae5e130 7.4 MB
md5:099aae88d79e4a62b4ccb170e3b74386 49.6 MB
md5:6e512bb9d782eeb6fc61a429f2598a50 627.4 MB
md5:5648a9b3b9c42c9a821c4b709a975af6 35.1 kB
md5:157c2e01050d7b843bc7eb973eca6b10 99.9 kB
md5:a4ee8b3e53fce90f3d4e0d0ff3b15ac3 39.1 kB
md5:7850774e4b3401390564860d25807fb7 18.7 kB
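
The MD5 checksums in the file listing can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib` module. The helper names are assumptions, and because the listing does not pair file names with checksums, the local path passed in is left to the reader.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks so
    that large archives do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_listing(path, listed):
    """Compare a local file against a listed value such as 'md5:686435...'.
    The 'md5:' prefix is optional and the comparison ignores case."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_listing(local_path, "md5:6e512bb9d782eeb6fc61a429f2598a50")` would return `True` only if `local_path` (a hypothetical location of the downloaded 627.4 MB file) has exactly that digest.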

Additional details

Related works

Is cited by
10.1093/sysbio/syy086 (DOI)