A multitiered sequence capture strategy spanning broad evolutionary scales: Application for phylogenetic and phylogeographic studies of orchids

With over 25,000 species, the drivers of diversity in the Orchidaceae remain to be fully understood. Here, we outline a multitiered sequence capture strategy aimed at capturing hundreds of loci to enable phylogenetic resolution from subtribe to subspecific levels in orchids of the tribe Diurideae. For the probe design, we mined subsets of 18 transcriptomes, to give five target sequence sets aimed at the tribe (Sets 1 & 2), subtribe (Set 3), and within subtribe levels (Sets 4 & 5). Analysis included alternative de novo and reference‐guided assembly, before target sequence extraction, annotation and alignment, and application of a homology‐aware k‐mer block phylogenomic approach, prior to maximum likelihood and coalescence‐based phylogenetic inference. Our evaluation considered 87 taxa in two test data sets: 67 samples spanning the tribe, and 72 samples involving 24 closely related Caladenia species. The tiered design achieved high target loci recovery (>89%), with the median number of recovered loci in Sets 1–5 as follows: 212, 219, 816, 1024, and 1009, respectively. Interestingly, as a first test of the homologous k‐mer approach for targeted sequence capture data, our study revealed its potential for enabling robust phylogenetic species tree inferences. Specifically, we found matching, and in one case improved phylogenetic resolution within species complexes, compared to conventional phylogenetic analysis involving target gene extraction. Our findings indicate that a customized multitiered sequence capture strategy, in combination with promising yet underutilized phylogenomic approaches, will be effective for groups where interspecific divergence is recent, but information on deeper phylogenetic relationships is also required.

that the evolution of pollinia, the epiphytic habitat, CAM photosynthesis in epiphytes, and tropical distributions are major contributors to the diversification of the family (Givnish et al., 2015(Givnish et al., , 2016. However, these factors are unlikely to fully explain local patterns of diversification. For example, neither epiphytism, nor a tropical distribution, can explain the rich terrestrial orchid flora of South Africa (>500 spp.) or temperate southern Australia (>1500 spp.) (Backhouse et al., 2019;Johnson et al., 2015). Instead, localized pollinator-driven speciation may be an important contributor to diversification in these orchid floras (Givnish et al., 2015;Van der Niet et al., 2014;Peakall et al., 2010). It follows that to more fully address the question of the drivers of species diversity across the Orchidaceae, investigations spanning different taxonomic and geographic scales will be required.
Recent initiatives such as the 1000 Plant Genomes Project (1KP) have generated a wealth of transcriptome data spanning >1000 plants, aiding both our understanding of the evolution of green plants , and providing relevant sequences for the construction of a universal set of target enrichment probes for angiosperms . Existing orchid genomic and transcriptomic resources also have the potential to be mined for phylogenetic and evolutionary investigations. For example, Zhang et al. (2017) utilized DNA sequence variation across 132 single-copy gene families identified in the draft genome of Apostasia shenzhenica (and confirmed across 14 other plant species), to build a high-confidence phylogenetic tree of 12 representative species spanning the five subfamilies of the Orchidaceae.
Drawing on whole transcriptomes of eight Cypripedium species, Guo et al. (2018) were able to clarify phylogenetic relationships among six closely related species. Unruh et al. (2018) further confirmed the feasibility of using transcriptomes to investigate the phylogenetic relationships among five Cypripedioideae genera and to anchor the phylogeny within the Orchidaceae. In addition, previously uncertain phylogenetic relationships among some species of Ophrys and Gymnadenia orchid have also been clarified by the phylogenomic analysis of floral transcriptomes (Piñeiro Fernández et al., 2019).
Notwithstanding these few examples of orchid phylogenetic studies that have utilized genomic and transcriptomic resources, studies involving hundreds of samples appear to be lacking. NGSbased targeted sequence capture, also called hybrid enrichment, may offer one potential solution to achieve such a goal. The potential benefits of targeted sequence capture have been advocated in several reviews (e.g., de La Harpe et al., 2019;Grover et al., 2012;Jones & Good, 2016;Lemmon et al., 2012;Lemmon & Lemmon, 2013;Léveillé-Bourret et al., 2017;McCormack et al., 2013). Some key benefits include the flexibility of target selection (e.g., type, size, and numbers of regions targeted), the potential to recover phylogenetically informative content for enriched F I G U R E 1 The availability of genomic and transcriptomic resources mapped onto a schematic phylogeny of the five Orchidaceae subfamilies. Species diversity is based on the estimated number of species provided in Chase et al. (2015). See the main text for the supporting literature on the various orchid genomes. The transcriptome data is based on publicly available data sets deposited in NCBI Sequence Reads Archive (last accessed 01 February, 2020) loci across wide phylogenetic depth and breadth, and missing data are often minimized relative to other methods. However, to fast-track a targeted sequence capture design, one potential constraint for many studies is the need for robust sequence knowledge, typically drawn from existing genomic/transcriptome resources.
We have generated transcriptomic resources as part of our ongoing research on the biochemistry and molecular biology of Australian terrestrial orchids (Wong et al., 2018Wong, Amarasinghe, et al., 2017;Xu et al., 2017). Using these and additional targeted transcriptomes, our overarching objective was to develop customized NGS-based sequence capture for phylogenetic studies of Australian terrestrial orchids. Here, we outline the design and implementation of a multitiered sequence capture strategy for Australian terrestrial orchids belonging to the Diurideae (with >900 species). Our goal was to simultaneously capture in a single hybridization step hundreds of informative loci that collectively enable the phylogenetic resolution of relationships spanning among subtribes, through genera and species, to phylogeographic structure within species. We also evaluate the sequencing capture success and the preliminary phylogenetic outcomes. Our evaluation considers 134 samples, comprising 87 taxa across two test data sets: (i) A set of 67 samples spanning the Diurideae, and an outgroup, to assess informativeness across the entire tribe (D67). (ii) From within the species-rich subtribe Caladeniinae, a set of 72 Caladenia samples comprising 24 species, with duplicates for each species, and for some species additional samples from geographically distinct sites (C72) (Table S1).
We further consider the sequence capture success and the congruence of the phylogenetic results obtained via two widelyadopted approaches: de novo (Jones & Good, 2016;McCormack et al., 2013) and reference-guided (Johnson et al., 2016) assembly of target/exon sequence reads followed by sequence annotation and alignment. We also tested the application of a promising, yet underutilized, phylogenomic approach based on homologous substrings of sequence (k-mer) (Sanderson et al., 2017). Gene and species trees were generated by maximum likelihood in IQ-TREE 2 (Minh, Schmidt, et al., 2020) with branch support estimated by ultrafast bootstrap, and topological variation measured by gene concordance and site concordance factors (Minh, Hahn, et al., 2020).
Complementary multispecies coalescence-based species tree inference was performed in ASTRAL III . Finally, we conclude with a brief discussion of the phylogenetic insights uncovered for the study species, and conclude with consideration of the lessons learnt for developing customized sequence capture designs.
TA B L E 1 The species, including tribe, and subtribe within the Orchidoideae, for which transcriptome sequencing was used to obtain coding DNA sequences (CDS) later mined for target gene sets Transcriptomes of the core species were used as a source of target gene sequences in two or more target gene sets. c Alternative generic nomenclature as used in Weston et al. (2014).

| Background to the study system
Australia exhibits a rich diversity of orchids, with a current estimate of more than 1860 species (Backhouse et al., 2019). The majority of Australian terrestrial orchids belong to the Diurideae, with conservative estimates of between 900 and 1000 species in this tribe (Chase et al., 2015;Weston et al., 2014). Within the Diurideae, the Caladeniinae represents the largest subtribe, with estimates of the number of species varying from 297 worldwide (Chase et al., 2015), to 425 species and subspecies in Australia and Island Territories (Backhouse et al., 2019). The genus Caladenia sensu lato accounts for the largest share of the species in the subtribe (e.g., Backhouse et al. (2019) list 382 taxa). Please see Methods S1 for more details.

| Transcriptome resources and sampling
With a focus on the Australian terrestrial orchids of the Diurideae, our goal was to obtain phylogenetic resolution spanning subtribes to within species levels. To ensure wide applicability, additional transcriptomes, beyond those already in hand (Wong, Amarasinghe, et al., 2017;Wong et al., 2018Wong et al., , 2019Xu et al., 2017), were generated. All raw sequence reads have been deposited in NCBI Sequence Read Archive (http://www.ncbi. nlm.nih.gov/sra) under the BioProject and SRA study accession of PRJNA661963 and SRP281918, respectively. The choice of species to target for these extra transcriptomes was guided by the inferred, albeit often uncertain, phylogenetic relationships indicated by Weston et al. (2014). (See Figure 2 and Methods S1 for more details).

| Target sequence sets
The final set of species for which transcriptomes were available or have been generated, is shown in Table 1 and Figure 2. Table S1 provides further details of the associated sample source, location, and voucher specimen details. Based on our objectives, and transcriptome availability, we identified five different target sequence sets for use in the final probe design (Tables 1 and 2). It is important to stress that while we will often refer to each of these target sequence sets individually, in practice a single DNA probe set was used in the hybridization step to simultaneously capture the target loci across all target sequence sets. Below we briefly outline the purpose of each sequence set:

| Target sequence set 1 -Diurideae
Genes matching loci from Deng et al. (2015), who identified 315 putative orchid wide single locus orthologues, were the target of this set, drawing on sequences from our 11 "core" transcriptomes ( Figure 2 and Table 1). Target genes were located using the Deng et al. (2015) multispecies orthologue sequences data as the BLAST query. Set 1 ultimately targeted 230 loci ( Table 2). The remaining 85 orthologues failed to pass the manual alignment checking phase, with many deemed to pose a high paralogue risk, as indicated by alignments containing evidence for gene duplicates. While in this study the target group was limited to the tribe Diurideae and outgroup, the strategic inclusion of these target genes was intended to allow data sharing among studies and allow use beyond Australian orchids.

| Target Sequence Set 2: Diurideae
To supplement sequence Set 1, additional target genes common across our 11 core transcriptomes spanning the Diurideae and outgroup ( Figure 2

| Target Sequence Set 3: Other Diurideae
The goal of this set was to expand the number of genes targeting the Acianthinae, Diuridinae, Cryptostylidinae, Thelymitrinae, Megastylidinae, and Prasophyllinae, for which phylogenetic relationships at the tribe level were not fully resolved by Weston et al. (2014) (Figure 2, Table 1). Reciprocal BLAST across six of the 11 core transcriptomes, each representing one of these subtribes, yielded a final set of 916 gene loci (Table 2).

| Target Sequence Set 4: Drakaeninae
To obtain a sequence set aimed at the Drakaeninae, reciprocal BLAST was performed across six transcriptomes representing three of the 11 core transcriptomes, plus three additional transcriptomes. Collectively, the transcriptomes covered the genera Caleana, Chiloglottis, and Drakaea as representatives of the subtribe, and Rimacola as the outgroup (Figure 2). This yielded a final set of 1,058 gene loci ( Capture statistics are based on the Illumina sequencing outcomes following the targeted enrichment of the sequencing libraries using the custom SeqCap EZ Developer Library probes for one representative sample of each species used in the original gene set. For example, the Set 1 analysis is based on targeted enrichment of the sequencing libraries representing the same 11 core transcriptome species (see Table 1). Data are shown for the outcomes obtained via the analysis Option 1 (see Figure 3), under high stringency parameters. c The "Other Diurideae" encompass the subtribes Acianthinae, Diuridinae, Cryptostylidinae, Thelymitrinae, Megastylidinae, and Prasophyllinae. The relationships among these subtribes remain somewhat uncertain within the Weston et al. (2014) phylogeny used here as the working hypothesis of possible subtribe and generic relationships within the tribe Diurideae (see also Figure 2 and Table 1). d See Figure 2 for the provisional phylogenetic placement of the outgroup taxa, as indicated by the phylogenetic analysis of Weston et al. (2014 Diuris and Microtis was also attempted, but yielded fewer target genes (full data not shown).

| Sequence selection for target sequence sets
Our methods for obtaining the transcriptomes follow methods we have already published (Wong et al., 2018;Wong, Amarasinghe, et al., 2017). Following the assembly of paired-end reads, trimming and conversion to CDS sequences, customized scripts taking ad-

| Probe design
This phase of the project started with the submission of the target gene sequences for each of the five sequence sets to Roche NimbleGen. Subsequently, SeqCap EZ Developer Library probe design was performed by Roche NimbleGen using proprietary algorithms. These algorithms also took full advantage of the orchid genomes of Phalaenopsis equestris (Cai et al., 2015) and Dendrobium catenatum (Zhang, Xu, et al., 2016). Following our review and approval of the probe design, as the customer, probe manufacture and delivery followed.

| DNA extraction, library preparation, and Illumina sequencing
Samples were sourced from both wild and cultivated plants (Table   S1), with DNA extracted using the Qiagen DNeasy Plant mini kit (Cat. No. 69106) following the manufacturer's protocol. Comprehensive details on the library preparation to Illumina sequencing steps of: (i) library preparation, (ii) hybridization, (iii) capture and washing, (iv) enrichment, indexing and quality control, (v) preparation for multiplexed Illumina DNA sequencing, and (vi) initial post sequence processing, are provided in the Methods S1.

| Option 1-Target sequence assembly and extraction
Following the initial post-sequencing processing and quality control, the surviving sequence capture reads of each species were assembled using Trinity v2.6.5 with fastq format as input (--seqType fq), k-mer size of 25 (--KMER_SIZE 25), and the read normalization option disabled (--no_normalize_reads) (Grabherr et al., 2011). Post assembly, to extract DNA sequences for the downstream phylogenetic analysis, we employed customized Bioperl scripts and Standalone BLAST+ in a similar way to that used in sequence selection for the target sequence sets (see Methods S1). Before phylogenetic analysis, the automated alignments were carefully checked by eye in geneious V10 (Kearse et al., 2012). For a small fraction of the alignments (<2%), manual adjustments and/or locus exclusion, as deemed necessary, was performed at this stage. We also considered the outcomes of an independent assessment of paralogue risk loci via Option 3 (see below), leading us to exclude several additional loci not already excluded by our manual checks.

| Option 2-Homologous k-mer block discovery
Following trimming and de novo Trinity-assembling as described (see also Methods S1 for further details).

| Option 3-HybPiper target sequence assembly and extraction
For comparison with analysis Options 1 and 2 (Figure 3), the standard HybPiper (Johnson et al., 2016), coupled with an automated phylogenomic data preprocessing pipeline was performed. We used the recommended (default) settings and the multiple relevant reference sequences (e.g., Set 1-2, 11 species; Set 3-4, six species, Set 5, seven species) for each target sequence set. As already noted, we also used F I G U R E 3 A diagram showing the three parallel pathways, analysis Options 1 to 3, employed in this study for the sequence capture assembly and target exon (and intron) extraction for subsequent phylogenetic analysis [Colour figure can be viewed at wileyonlinelibrary.com] the HybPiper pipeline to identify loci with high paralogue risk, with this analysis performed across the 67 Diurideae wide samples (D67).
By our criterion of requiring more than six of the 67 samples (>10%) at a locus to be flagged as potentially containing paralogues, very few paralogue risk loci were detected. See the Methods S1 for further details on the HypPiper workflow and results, and the Dryad repository (https://doi.org/10.5061/dryad.z08kp rrbj) for additional paralogue analysis.

| Phylogenetic analysis
To evaluate the phylogenetic utility of the sequence capture strategy, we conducted phylogenetic analysis of two test data sets with partially overlapping samples, as previously outlined above. In total, these phylogenetic analyses involved 134 samples, comprising 87 taxa. The D67 sample set, comprising 67 samples spanning the Diurideae, and an outgroup, was designed to assess phylogenetic informativeness across the entire tribe. Our C72 sample set, comprising 72 samples of Caladenia from subgenus Calonema, and C. denticulata from the subgenus Phlebochilus as the outgroup, was intended for exploratory analysis of sequence Set 5, targeting the Caladeniinae. One basis for sample inclusion in C72 was the requirement that two or more samples per species were successfully sequenced. Given a lack of species-level resolution in previous molecular studies Swarts et al., 2014), we reasoned that a first test of the effectiveness of Set 5 would be if the phylogenetic analysis could group individuals of the same species together. We also included examples of species that, based on morphology, appear to be closely allied as members of "species complexes" but can still be confidently distinguished in the field (Brown et al., 2013). In the absence of morphological convergence, we predicted that taxa within species complexes should group together. To assess whether we could obtain a phylogeographic signal, for some species we also included replicate samples drawn from geographically distinct sites (Table S1).
For the purpose of this study, most phylogenetic inferences were based on a concatenated data set for each sequence set and data source (i.e., Option 1 exon and intron, as well as Options 2 and 3) to build a supermatrix of sequences. For analysis Option 1 data sets ( Figure 3), this followed thorough manual checks of alignments at the locus level. For Options 2 and 3, only cursory checks were made so as to provide a test of the fully automated processes provided by these analysis pathways.
Phylogenetic gene trees and species trees based on the concatenated supermatrix were inferred by Maximum Likelihood analysis in iq-tree 2 (hereafter simply IQ-TREE) (Minh, Schmidt, et al., 2020;Nguyen et al., 2015). For in-frame exon data, the best substitution model for each locus was automatically selected by ModelFinder within IQ-TREE according to the Bayesian Information Criterium Subsequently, gene concordance (gCF) and site concordance (sCF) factors were estimated within IQ-TREE using the single locus (gene) and concatenated (species) tree as input (Minh, Hahn, et al., 2020).
IQ-TREE is currently the only package available to estimate gCF while taking into account unequal taxon sampling, and provide the new concordance measure sCF (Minh, Schmidt, et al., 2020). The gene concordance factor (gCF) represents the percentage of gene trees containing that branch, while the site concordance factor (sCF) represents the percentage of parsimony informative sites supporting a branch. These concordance factors complement bootstrap estimates by offering additional information about the underlying variability in the data (Minh, Hahn, et al., 2020). For the in-frame exon data, the ML gene trees generated via IQ-TREE were also used as inputs into ASTRAL III (hereafter simply ASTRAL), to generate an alternative multispecies coalescent-based estimate of the species tree, with support estimated using local posterior probabilities with default parameters . For the intron, k-mer (Option 2), HybPiper (Option 3) and combined data sets, a supermatrix was created before selection of the substitution model and IQ-TREE inference with ultrafast bootstrap.

| Comparison of target and capture statistics
A summary of key target sequence assembly statistics for the 67 samples spanning the Diurideae (D67) and the 72 Caladenia samples (C72) is shown in Figure S1. For example, the median number of paired-end reads (trimmed and quality filtered) across our D67 target sequence assembly was approximately 636,000. Additionally, the median total number of assembled contigs across the 67 species was approximately 46,000 with respective median, average, and N50 contig lengths of 287, 416, and 471, respectively. in the original design. For example, the Set 1 analysis was based on the sequencing outcomes for 11 samples each representing one of the core transcriptome species (see Table 1 for species involved).
The results show very high capture rates averaging 85% of the target sequence for Sets 1 and 2, and an average of 73% coverage for Sets 3 to 5 (  Figure 3), with the Set relevance indicated for each species (see Figure 2). Please refer to where they range between 228-283 and 3.9-4.8, respectively (Cai et al., 2015;Chao et al., 2018;Zhang et al., 2017;Zhang, Xu, et al., 2016).
The capture statistics shown in Table 2, which were achieved via analysis Option 1, cannot be directly compared with outcomes achieved via Options 2 and 3, because different methods and parameters are used. Nonetheless, it is of interest to compare the number of loci obtained by species for the three analysis options, across each sequence set. Figure 4 shows the number of loci recovered for each of the target sequence Sets 1 to 5 across 18 genomic samples representing the transcriptomes used in the sequence capture design (with replicates shown for two taxa). As expected, the number of loci recovered was more uniform across the test samples for Sets 1 and 2, aiming across the entire Diurideae, than for Sets 3-5, which aimed for phylogenetically narrower subsets of the tribe.
In a more extensive analysis, based on 67 samples spanning the tribe Diurideae (D67), box plots of the number of loci successfully captured are shown in Figure 5. This expanded analysis is further split into target (i.e., species belonging to the intended phylogenetic range as outlined above) and non-target species for Sets 3-5 (See also Figure S2 for heat maps of the number of loci, exons and length of sequence recovered for each D67 sample and Set 1-5 report on the phylogenetic analysis that follows will primarily focus on the outcomes arising from analysis Options 1 and 2. Three major clades are indicated:

| Outcomes of phylogenetic analysis across the Diurideae
Clade 1 comprising subtribe Prasophyllinae and represented by three samples drawn from two genera, Microtis and Prasophyllum.
Virtually all nodes had 100% bootstrap support, with a notable exception of lower support for the split of clades 2 and 3 ( Figure 6a).
Note that with the ultrafast bootstrap method of IQ-TREE (Hoang et al., 2018;Minh, Schmidt, et al., 2020), we consider strong support when bs ≥95. However, we show bootstrap values and concordance factors (where applicable) for bs ≥70, on our figures in order to provide more information on the uncertainty. As expected theoretically (Minh, Hahn, et al., 2020), across the entire D67 phylogeny the gene concordance factor (gCF) and site concordance factor ( (2020) for phylogenies from large data sets. The equivalent IQ-TREE phylogenies obtained via analysis Options 2 and 3, revealed highly comparable topologies as evident in the tangle plots shown in the inserts, respectively. These tangle plots reveal only one minor difference in tree topology between analysis Option 1 and 2, involving two closely related Drakaea species (Figure 6b,c, see also Figure S5 for full-size trees, and Figure S6 for full-size tangle plots).
Across the D67 samples almost identical groupings of species and genera into the same three major clades, were obtained via IQ-TREE for the Option 1 sequence Sets 1 to 3 and via ASTRAL (c.f. Figure 6a with Figure S7 and S8). However, in all cases there is considerable uncertainty about the precise relationships among the three major clades, with these adjoining nodes being characterized by very short branches, and low bootstrap and local posterior probability support values. Concordance factors were also low at these nodes. For example, a gCF value of 19.1, and sCF value of 33.5, was found at the node joining Clade 1, with Clades 2 and 3 in Figure 6a. This sCF value is very close to the expected minimum of 33, indicating there is no consistent information in the alignment concerning the relationships among clades 1, 2 and 3 (Minh, Schmidt, et al., 2020). The ASTRAL results reinforce these findings with a low local posterior probability of 0.65 on the branch joining clades 2 and 3 ( Figure S8).

| Outcomes of phylogenetic analysis in Caladenia
A summary of key target sequence assembly statistics for the 72 Caladenia orchid samples (C72) is shown in Figure S1. For example, the median number of paired-end reads (trimmed and quality filtered) used for the target sequence assembly of the C72 sample set was approximately 1,236,000. Additionally, the median total number of assembled contigs across the 72 species was approximately 53,000 with respective median, average, and N50 contig lengths of 287, 430, 502.  Table S1 for site locations and species codes). Notably, these phylogeographic groupings within species tended to show higher gCF and sCF values than the average within taxon values (see Figure 7a).  Figure S10 for intron based tree, Figures S11-S13 for full-size tangle plots, and Figure   F I G U R E 6 An IQ-TREE phylogeny of 67 samples spanning the Diurideae and outgroup Pterostylis taxa. The phylogeny is based on the coding exon sequences at 221 loci targeted by sequence Set 2 aiming to cover the Diurideae (alignment length = 250,431 bp, with 81,989 parsimony-informative and 26,241 singleton sites). Branch labels show ultrafast II bootstrap (bs) values, gene concordance factors (gCF) and site concordance factors (sCF), respectively, as estimated in IQ-TREE 2 (Minh, Schmidt, et al., 2020), when bs >70. Clades 1 to 3 are colour coded with a thumbnail orchid flower image illustrating one representative of the clade: Prasophyllum elatum, Diuris recurva and Caladenia plicata, respectively. Illustrative numbered images, one species per subtribe, are shown with corresponding positions marked on the phylogeny: 1. Microtis parviflora, 2. Drakaea glyptodon, 3. Leporella fimbriata, 4. Thelymitra crinita, 5. Cryptostylis leptochila, 6. Diuris orientis, 7. Elythranthera brunonis, 8. Corybas hispida and 9. Pterostylis nutans. The aligned sequences for the phylogenetic analysis were obtained via analysis Option 1, after exclusion of exons that failed to be fully sequenced for 50% or more of the samples. The inserts show tangle plots comparing Figure 6a (on the left) with that generated via the k-mer analysis Option 2 (Figure 6b), and the HybPiper analysis Option 3 ( Figure  6c), respectively. Full-size figures of inserts b and c, and the supporting phylogenetic trees are provided in the Supporting Information. Scale bar indicates nucleotide substitutions per site. Please refer to Table S1 for the full species names. Photographs 1, 6 and 8 by Tobias Hayashi, all others by Rod Peakall [Colour figure can be viewed at wileyonlinelibrary.com] S14). From these plots it is evident that the k-mer based tree showed broad congruence in topology with both the exon and intron trees.
The few differences evident in the tangle plots either comprised: (i) subtle difference in relationships among samples within a species (e.g., C. attingens), or (ii) variation in the precise relationships among species, but with paired replicates always clustering together (e.g.,

F I G U R E 7
An IQ-TREE phylogeny of 72 Caladenia orchid samples, including replicates within species, from the subgenus Calonema with Caladenia denticulata from the subgenus Phlebochilus as the outgroup. The phylogeny is based on the coding exon sequences at 719 loci targeted by sequence Set 5 aiming to cover the Caladeniinae (alignment length = 755,184 bp, with 33,191 parsimony-informative and 22,654 singleton sites). Branch labels show ultrafast II bootstrap (bs) values, gene concordance factors (gCF) and site concordance factors (sCF), respectively, as estimated in IQ-TREE 2 (Minh, Schmidt, et al., 2020), when bs >70. The multiple species of the morphologically defined species complexes are colour coded with a thumbnail orchid flower image illustrating one representative of the complex: Caladenia pectinata, C. longicauda, C. rhomboidiformis, respectively. Three examples where phylogeographic structure was indicated, are also colour coded, with the distance between collection sites shown in km. Note that C. longicauda subsp. insularis is represented by just one sample (CALA05), so is not labelled, but falls with the duplicates of subspecies crassa and rigidula. The aligned sequences for the phylogenetic analysis were obtained via analysis Option 1, and contain no missing loci for any sample, demonstrating the high sequencing coverage possible with probe Set 5. The inserts show tangle plots comparing Figure 7a (on the left) with the phylogeny generated via the k-mer analysis Option 2 (Figure 7b), and the phylogeny generated for the matching Set 5 intron sequences (Figure 7c), respectively. See Figure 8 for the full-size phylogeny based on the k-mer analysis. Full-size figures of the inserts B and C, and the supporting phylogenetic tree for C, are provided in the Supporting Information. Scale bar indicates nucleotide substitutions per site. Please refer to Table S1 for (Figure 8a).

F I G U R E 8 An IQ-TREE phylogeny for 72
Caladenia orchid samples including replicates within species, from the subgenus Calonema with Caladenia denticulata from the subgenus Phlebochilus as the outgroup. The phylogeny is based on the k-mer analysis Option 2 (total alignment = 2,482,350 bp, with 296,534 parsimony-informative and 291,095 singleton sites, hakmer parameters k = 25, c = 18, q = 2, w = 25). Branch labels show ultrafast II bootstrap (bs) values as estimated in IQ-TREE 2 (Minh, Schmidt, et al., 2020), when bs >70. See Figure 7 caption for colour coding details. The inserts show tangle plots comparing Figure 8a with the phylogeny generated via Option 1 for the sequence Set 5 exons (Figure 8b), and the phylogeny generated for the matching Set 5 intron sequences (Figure 8c), respectively. See Figure  7 for the full-size phylogeny based on the exon analysis. Full-size figures of the inserts b and c, and the supporting phylogenetic tree for c, are provided in the Supporting Information. Scale bar indicates nucleotide substitutions per site. Please refer to
Furthermore, this apparent delay in wide uptake seems to be despite: (i) the availability of informative Angiosperm wide probe sets as a promising starting point (Bogarín et al., 2018;Buddenhagen et al., 2016;Johnson et al., 2019;Léveillé-Bourret et al., 2017;Mendoza et al., 2020) and (ii) increasing availability of "off the shelf" bioinformatic solutions and pipelines, such as HybPiper designed initially for plant sequence capture studies, but more broadly applicable to any group of organism (Johnson et al., 2016). This may suggest that for many plant families the lack of relevant genomic and transcriptomic resources has been perceived as a key impediment to building a custom sequence capture strategy for specific plant groups of interest.
On the other hand, within the diverse Orchidaceae, despite the availability of a growing number of genomic and transcriptomic resources (Figure 1), the uptake of sequence capture also appears to be rare. Mendoza et al. (2020)  set. Chloroplast targeted sequence capture has also been used effectively to investigate the degree and rate of plastome degeneration in a complex of Corallorhiza heterotrophic orchids (Barrett et al., 2018).
The goal of this present study was to develop, implement, and evaluate a multitiered target sequence capture strategy for orchid phylogenetic studies spanning from subtribe through to phylogeographic analysis within species. Our target group was the rich terrestrial orchid flora belonging to the Diurideae within the Orchidoideae, with >900 species. At the onset of our project in 2015, no orchid genomes were available, and currently, there is still no genome for any species within the Orchidoideae (see Figure 1). Nonetheless, we were able to draw on our existing transcriptomes generated for other research purposes (Wong et al., 2018;Wong, Amarasinghe, et al., 2017;Xu et al., 2017). To extend the applicability of our sequence capture design, we also completed additional transcriptome sequencing of a Pterostylis (Cranichideae) as the outgroup, and additional representatives of other Diurid subtribes ( Figure 2, Table 1).
The inclusion of Pterostylis is expected to extend applicability to the more than 350 species in that genus (Backhouse et al., 2019). Our Set 1, targeting putative Orchidaceae wide orthologues (Deng et al., 2015), should also allow the construction of phylogenies for a common set of orthologous loci beyond the target group, with the caveat that capture success is expected to decline with increasing phylogenetic distance. Thus, despite our prime focus on the Diurideae, we expect our sequence capture strategy will offer wider applicability.
Our evaluation of sequence capture success confirmed that the goal of a wide phylogenetic span was broadly achieved. For example, the phylogeny across the D67 test samples shown in Figure 6a is

| An evaluation of homologous k-mer block phylogenies from target sequence capture derived data sets
With the increasingly widespread use of phylogenomic data sets  (Bernard et al., 2019). This is in part due to the inherent difficulties of scaling up standard phylogenetic procedures to gigabase scales, but is also due to complications in sequence annotation, orthologue detection, and sequence alignment, among others (Sanderson et al., 2017). In this context, the use of k-mers (i.e., substrings of length k in a given biological sequence), is gaining popularity in phylogenomic analysis workflows as an independent way to infer phylogenies based on concatenated "supermatrices" and/or sets of "gene trees" (Bernard et al., 2019;Bertels et al., 2014;Fan et al., 2015;Leimeister & Morgenstern, 2014;Sanderson et al., 2017). Sanderson et al. (2017) recently assessed the utility of homologous k-mer blocks across seven gigabase scale data sets (e.g., nuclear genomes and transcriptomes), confirming the potential to infer robust phylogenies across tens to hundreds of species spanning wide taxonomic breadths. By way of example using just transcriptomes, a fully resolved tree of 67 Caryophyllales species was obtained using Hakmer (with parameters k = 28; N min = 15 of 67 taxa; and q = 2).
Furthermore, the k-mer-based tree showed agreement at every node, except one, with the published tree obtained via the conventional approach (based on 1,122 putative orthologous loci shared among the Caryophyllales transcriptomes ).
To our knowledge, to date, no study appears to have applied homologous k-mer blocks to the analysis of sequence capture data for phylogenetic inference. Indeed, the need to apply a k-mer approach to such data may at first seem counterintuitive, since by design the target loci are known. Here, we considered the utility of this approach in three ways: (i) As part of an evaluation and validation of our sequence capture assembly and extraction strategies (analysis Options 1 and 3). (ii) As a way to more fully utilise both the target and off-target sequence data obtained by sequence capture. (iii) To test an alternative, complementary, and potentially faster way to enable species tree inferences.
To this end, we used Hakmer (Sanderson et al., 2017) to first identify homologous k-mer blocks within de novo assembled target sequence capture reads (analysis Option 2), before using the concatenated "supermatrix" of k-mer blocks in the same type of IQ-TREE analysis as applied in Options 1 and 3. While analysis Option 1 and 3 took weeks, this approach took just days. It also sidestepped the arduous annotation, alignment, and quality checks that in our view are critical components of analysis Options 1.
Three key outcomes emerged from our evaluation of the homol-  Figures 7a and 8a). This may be attributable to the addition of informative off-target homologous sequences, as observed in a recent sequence capture study of Epidendrum orchids (Mendoza et al., 2020).
Thus, our preliminary evaluation of the application of homologous k-mer blocks approach to targeted sequence capture data revealed its potential to recover phylogenetic species tree inferences matching conventional analysis involving target gene extraction, alignment, and supermatrix phylogenetic analysis. This was also achieved with dramatically increased speed and simplicity of data analysis, over analysis Options 1 and 3. Furthermore, the inferred k-mer-based phylogeny appeared to better group morphologically similar members of species complexes. At the same time, consistent with the collective evidence, the polyphyly of some taxonomic entities was indicated.

| New phylogenetic insights
Our prime objective in this method focused paper was to evaluate the performance of our multipurpose targeted sequence capture strategy, rather than undertaking a comprehensive phylogenetic analysis. With an expanded data set, such analyses will be the subject of subsequent studies. Nonetheless, our exploratory phylogenetic analysis based on concatenated captured gene sequences from a set of 67 samples spanning the Diurideae and outgroup is of interest ( Figure 6). While broadly recovering similar groupings of species and genera at the subtribe levels to Weston et al. (2014) (c.f. Figures   2 and 6), our results indicate better resolution, particularly among closely related genera (e.g., among genera within the Drakaeinae, Megastylidinae, and Caladeniinae, see Clements et al., 2015;Weston et al., 2014)). However, our inability to resolve the relationships between the three major clades comprising the Prasophyllinae as Clade 1, Clade 2 as (Diuridinae(Cryptostyli dinae (Thelymitrinae(Megastylidinae(Drakaeinae))))) and Clade 3 as (Acianthinae(Caladeniinae)), mirrors earlier findings (Clements et al., 2002;Kores et al., 2001;Weston et al., 2014, Figure 2), despite the very substantial increase in the number of loci. Furthermore, this phylogenetic pattern was found across the Option 1 to 3 IQ-TREE phylogenies, across sequence Sets 1 to 3 (e.g., Figure 6, Figures S5 and S7), and in the ASTRAL analysis ( Figure S8). As already noted, very short branches, and low gCF and sCF scores characterize these nodes (Figure 6a). Unfortunately, low gCF scores can be due to either strong gene tree discordance or weak phylogenetic signal (Minh, Hahn, et al., 2020), which can be difficult to tease apart. Campanulaceae (Bagley et al., 2020), but see both these studies for discussion on possible methodological and phylogenetic analysis solutions towards solving this problem. In common with our observations in orchids, these problematic backbone nodes are characterized by short branches, low bootstrap support, and high levels of gene tree discordance. This discordance may be due to several factors including incomplete lineage sorting (ILS), short times between divergences, and use of loci with low phylogenetic information. Notwithstanding these phylogenetic analysis challenges, collectively, our findings indicate that with expansion to additional samples, and expanded locus filtering (e.g., to remove loci with low phylogenetic information [Burbrink et al., 2019]) prior to further detailed phylogenetic analysis, this data set provides a valuable resource to gain insights into the phylogeny of the Diurideae.
While a more comprehensive phylogenetic analysis of Caladeniinae will follow, some new insights are already apparent from our analysis of the 72 Caladenia samples comprising 24 species. As already noted, we included both replicate samples per taxon (species or subspecies), and multiple examples of species that, based on morphology appear to be closely allied as members of "species complexes", but can still be confidently distinguished in the field.
While in most phylogenetic analyses (Set 5 exons- Figure 7a, kmer- Figure 8a, ASTRAL- Figure S9, introns- Figure S10), replicate samples grouped together, only in the k-mer based phylogeny did all members of the C. huegelii and C. longicauda species complexes, each fall into their own well supported clades. Across all analyses C. longicauda redacta did not group with the other subspecies. However, it did group consistently with Caladenia uliginosa, another species within the C. longicauda complex (Figures 7-8, Figures S9 and S10).
Thus, as presently defined, based on morphology, the subspecies of Caladenia longicauda appear to be polyphyletic. Furthermore, there is also evidence for deeper phylogenetic structure within the more broadly defined C. longicauda species complex. With a total of 14 named Caladenia longicauda subspecies (Brown & Brockman, 2015), and more than 11 other morphologically similar species, it is evident that a more in depth phylogenetic analysis is not only warranted, but now feasible.
Nonetheless, resolving relationships among other closely related Caladenia species will probably remain an ongoing biological and technical challenge. This is evident from the trend of lower-thanaverage support values on the nodes joining taxa (species and subspecies) within the major clades (Figures 7 and 8, Figure S9-S10, S14) for all of the following measures: bootstrap, local posterior probabilities, gCF and sCF scores. It is likely that many of the clades within Caladenia have diversified rapidly, with much incomplete lineage sorting, and potentially with ongoing hybridization. However, as we have proposed for the Diurideae, smart locus filtering informed by more detailed analysis of phylogenetic informativeness and additional insights from gene and site level concordance analysis, may hold promise for improving resolution.

| Future phylogenetic directions
We have yet to evaluate how the IQ-TREE and ASTRAL species trees reported in this study will compare with the trees obtained after compositional heterogeneity testing (exon data sets), other forms of locus filtering, and with other coalescence-based phylogenetic analysis. We do note that target sequence capture is particularly well suited to a wide range of phylogenetic methods, given that in-frame exon regions are readily available. Not only do in-frame loci offer superior alignment opportunities based on their amino acids (Abascal et al., 2010;Bininda-Emonds, 2005;Hall, 2005), model inference performs better when codon positions are known (Shapiro et al., 2005). Different rates of evolution can also be allowed for each exonic region, and when DNA saturation is high (e.g., in phylogenetic inference of older lineages), amino-acid models may be used instead (Simmons, 2017). Finally, to maximize phylogenetic information, the evolutionary rate of loci can be matched to the phylogenetic scale being explored, with slow evolving sequences targeted for deep divergences, and fast evolving sequences for recently diverged lineages (Philippe et al., 2011). The sequence Sets 1 and 2 developed here, represent the two most conserved sets of loci, and should be highly suitable to resolve the deeper divergences in the Diurideae, whereas sequence Sets 4 and 5 are more appropriate for species level divergences in the subfamilies they target. The latter type of locus is ideally suited to coalescent-based phylogenetic methods, which can account for differential lineage sorting among loci in rapidly diverging lineages (Carstens and Knowles 2007). However, it remains to be seen whether these analyses will result in higher phylogenetic resolution compared to that already obtained with the k-mer data set.

| Lessons and tips for future studies
Achieving a high level of sequence capture success across moderately wide phylogenetic scales has been a common objective of many sequence capture projects. While noting there is no stand- In this study, our goal was to simultaneously capture in a single hybridization step hundreds of informative loci that collectively enable the phylogenetic resolution of relationships spanning among subtribes, through genera and species to subspecific phylogeographic structure within species. To achieve this goal, we employed a tiered approach drawing on a total of 18 transcriptomes that we then subdivided into five different sets, spanning different phylogenetic scales, in an effort to achieve the sequence capture of loci offering phylogenetically informativeness at different levels.
Our evaluation of the performance of our five-target sequence sets in capturing sequences from target and non-target taxa, demonstrates the importance of ensuring probe design incorporates sequence variation spanning the phylogenetic breadth of interest.
For example, for Sets 1 and 2, targeting the entire tribe and outgroup, and for the target species relevant to Sets 3 to 5, sequence capture success was uniformly high (>89%). However, across Sets 3 to 5, sequence capture success fell (66%-70%) for the non-target species (Figures 4 and 5).
One alternative way to evaluate probe set performance across target and non-target taxa is to consider IQ-TREE estimates of concordance factors, and ASTRAL estimates of normalized quartet score across the D67 samples. As expected (Minh, Hahn, et al., 2020), gCF and sCF factors were correlated across the species trees for Set 1 to 5 ( Figure S3), and showed a strong positive relationship with branch length (Figures S3 and S4). Nonetheless, average gCF values were highest for Set 2, which was designed to span the Diurideae (median of 77), and lowest for Set 5 (median of 58), which was designed for the much narrower taxonomic target of the Caladeniinae. Median sCF scores were similar across the sets ( Figure S3). The ASTRAL normalized quartet scores were similar for Set 1 and 2 (0.903 vs. 0.906, respectively), marginally lower for Set 3 (0.898) and Set 4 (0.875), and lowest for Set 5 (0.858). Collectively, these findings demonstrate an increase in gene tree discordance as the number of non-target taxa increased. It follows, that in practice sequence Sets 4 and 5 should not be chosen as a whole for phylogenetic analysis beyond their target groups of the Drakaeinae and Caladeniinae, respectively. However, given that there was only a modest decrease in the median gCF scores and in the ASTRAL normalized quartet scores across the D67 samples for Sets 4 and 5, these sequence sets may still contain many loci that will be informative beyond the initial taxonomic targets. Therefore, in order to maximize high levels of uniform coverage we recommend that probes are designed from sequence data spanning the phylogenetic breadth of interest.
Nonetheless, some probe loci are likely to be successfully captured and be potentially phylogenetically informative far beyond initial narrower phylogenetic targets.

| CON CLUS IONS
Three complementary sequence sets targeting the Diurideae (Sets 1 to 3) have been confirmed to successfully achieve the sequence capture of more than 1,000 nuclear loci. Our exploratory phylogenetic analyses, based on concatenated coding gene sequences, indicate congruence at deeper phylogenetic levels with earlier molecular studies based on just one or a few genes (Kores et al., 2000(Kores et al., , 2001Weston et al., 2014). At the same time, access to hundreds more nuclear loci promise to help fill remaining gaps in our understanding of the phylogenetic relationships across the tribe. Set 5, targeting closely related species of Caladenia, revealed unprecedented phylogenetic resolution when compared with earlier studies (e.g., Clements et al., 2015;Swarts et al., 2014) based on just a few loci), and also indicates potential to detect phylogeographic structure within some species. The uniform coverage of hundreds of loci across every sample was also achieved (Figure 7). Our exploration of homologous k-mer analysis reveals overlooked advantages for phylogenetic inference from supermatrices of sequence capture data, including: (i) the rapid assembly of data while maximizing the use of homologous sequences (e.g., target exons and introns, untranslated regions, and other informative off-target sequences), and (ii) the recovery of robust phylogenetic structure that is biologically meaningful.
Given that many phylogenetic, phylogenomic and taxonomic uncertainties remain across this hyperdiverse family (e.g., Chen et al., 2019;Hu et al., 2020;Li et al., 2019;Mendoza et al., 2020;Perez-Escobar et al., 2020;Salazar et al., 2018;Simo-Droissart et al., 2018), the Orchidaceae is a prime candidate for wider testing of the phylogenetic and phylogenomic utility of targeted sequence capture methods. Beyond orchids, our findings indicate that by drawing on the increasing availability of transcriptomes, a customized multitiered sequence capture strategy, in combination with promising, yet underutilized phylogenomic approaches, will be effective for many plant groups.

DATA AVA I L A B I L I T Y S TAT E M E N T
The Dryad repository (https://doi.org/10.5061/dryad.z08kp rrbj) contains: (i) DNA sequences in Fasta format for each of sequence Sets 1 to 5, as provided to Roche NimbleGen for the probe design.