The impact of estimator choice: Disagreement in clustering solutions across K estimators for Bayesian analysis of population genetic structure across a wide range of empirical data sets

The software program STRUCTURE is one of the most cited tools for determining population structure. To infer the optimal number of clusters from STRUCTURE output, the ΔK method is often applied. However, a recent study relying on simulated microsatellite data suggested that this method has a downward bias in its estimation of K and is sensitive to uneven sampling. If this finding holds for empirical data sets, conclusions about the scale of gene flow may have to be revised for a large number of studies. To determine the impact of method choice, we applied recently described estimators of K to re‐estimate genetic structure in 41 empirical microsatellite data sets; 15 from a broad range of taxa and 26 from one phylogenetic group, coral. We compared alternative estimates of K (Puechmaille statistics) with traditional (ΔK and posterior probability) estimates and found widespread disagreement of estimators across data sets. Thus, one estimator alone is insufficient for determining the optimal number of clusters; this was regardless of study organism or evenness of sampling scheme. Subsequent analysis of molecular variance (AMOVA) did not necessarily clarify which clustering solution was best. To better infer population structure, we suggest a combination of visual inspection of STRUCTURE plots and calculation of the alternative estimators at various thresholds in addition to ΔK. Disagreement between traditional and recent estimators may have important biological implications, such as previously unrecognized population structure, as was the case for many studies reanalysed here.


| INTRODUC TI ON
To date, one of the most cited tools to determine genetic population structure is the software program STRUCTURE (Pritchard et al., 2000). STRUCTURE is a free software package that uses multilocus genotype data and a Bayesian clustering approach relying on a Monte Carlo Markov Chain (MCMC) algorithm to infer population structure and assign individuals to populations based on their genotypes. The specification of models and the use of a random walk approach allows users to more easily incorporate prior information and account for uncertainty when clustering. In addition, STRUCTURE accepts common genetic marker types as input such as amplified fragment length polymorphisms (AFLPs), restriction fragment length polymorphisms (RFLPs), single nucleotide polymorphisms (SNPs), and microsatellites. Falush et al. (2003) built upon STRUCTURE by developing models that allow inference of population structure with linked loci and correlated allele frequencies. Using the correlated allele frequencies method quickly became the gold standard for parsing samples into population clusters, because it assumes a level of nonindependence. This model could uncover previously undetected correlation without impacting the results if the correlation did not exist (Falush et al., 2003;Porras-Hurtado et al., 2013).
Important to the function of STRUCTURE is the identification of clusters, which represent the main genetic divisions or "subpopulations" within a species (Kalinowski, 2011;Puechmaille, 2016). A common problem for clustering algorithms is to determine which clustering solution is the best (Hoban et al., 2012;Novembre, 2016). The K estimation method implemented in STRUCTURE is the posterior probability of the data for a given K (ln Pr(X|K) and it has been widely used for determining the optimal number of clusters and assigning individuals to clusters.
However, determining the maximal value from the posterior probability distribution is difficult, as peaks are not always clear (Evanno et al., 2005;Pritchard et al., 2000). To complicate matters further, in cases in which STRUCTURE model assumptions are violated, such as the presence of hierarchical population structure, clustering solutions may be affected and subject to overinterpretation (Lawson et al., 2018).
To solve this issue, Evanno et al. (2005) developed the ΔK statistic which is an ad hoc quantity related to the second order rate change of the log probability of data with respect to the number of clusters (Evanno et al., 2005). The ΔK statistic has since been a popular method for determining the number of clusters and has been cited over 12,000 times. Evanno et al. (2005) state that when the ΔK method was used on their simulated data, ΔK accurately estimated the true K, with the reservation that partial or uneven sampling could compromise the statistic from revealing the true number of clusters.
In addition, the ΔK method makes some biologically simplistic assumptions, which may not hold with real populations and their complex relationships. Specifically, Evanno et al. (2005) used a hierarchical island model of gene flow which assumed that all groups of populations were equally different from each other (Kalinowski, 2011). Overlying complex biological relationships, and uneven sampling appears to affect the accuracy of the ΔK method, as well as the program STRUCTURE itself (Puechmaille, 2016;Toyama et al., 2020). For instance, Kalinowski (2011) states that in some cases, STRUCTURE simply put all the individuals from the largest population sample in the same cluster. To remedy the uneven sampling problem, four alternative best K estimators, commonly referred to as Puechmaille statistics, were created (Puechmaille, 2016). Puechmaille (2016) tested the robustness of ΔK when hierarchical levels of population structure were detected in simulated and empirical data sets and found that ΔK did not compensate for the inability of STRUCTURE to cluster subpopulations correctly, and thus ΔK could not reliably recover the true number of clusters. This is crucial because many empirical data sets display hierarchical population structure and using the ΔK method without a proper hierarchical analysis could lead to a faulty conclusion of the number of clusters. In a meta-literature review of 1,264 studies that used ΔK, the authors found that very few studies performed the hierarchical analysis recommended by the authors of both ΔK and STRUCTURE to fully explore population subdivision (Janes et al., 2017). Janes et al. (2017 also found that over half of the studies that used ΔK concluded that the best K was 2. Further investigation of this issue revealed that ΔK was biased towards 2 due to either the presence of hierarchical populations structure, or when structure is limited (K = 1) (Cullingham et al., 2020). This echoes previous work on best practices for running STRUCTURE in which authors advise paying special attention to cases of K = 1 due to the inability of the ΔK method to detect such a case (Gilbert et al., 2012). Puechmaille (2016) tested the alternative K estimators using almost exclusively simulated data modelled on microsatellite markers.
Yet, simulated data may not reflect empirical data well, particularly in organisms with complex population structure. Thus, with many available K estimation tools, a large-scale meta-analysis of empirical data comparing the functional outcome of estimator choice could assist researchers in methodology decisions. Previous work has evaluated the impact of different STRUCTURE parameters on determining the optimal K in empirical data (Funk et al., 2020); however, to date no study has evaluated the impact of choice of K estimator across a wide range of empirical data sets. If estimators largely disagree, greater emphasis on methodology decisions is needed and a large number of population genetic studies may need to be revised. To provide a comprehensive analysis of the choice of method to determine the optimal K on the outcome of population genetic studies, we re-estimated genetic structure patterns based on a total of 41 microsatellite data sets; 15 from a broad range of taxa and 26 derived from one taxonomic group, corals. We tested the alternative K estimators of Puechmaille (2016) and compared the results to the outcomes of using traditional best K estimation methods (ΔK and posterior probability). Our objectives were: (1) determine the degree of disagreement between alternative K estimators and traditional K estimators in empirical data sets from a phylogenetically broad and a focused group of species (ΔK and posterior probability), (2) analyse potential causes of any disagreement between K estimation methods across data sets (sampling scheme and study organism), and (3) determine the best way to reconcile traditional K estimation methods with newer methods.

| Data set selection
To determine whether study organism impacts disagreement between K estimation methods, two data set collections were compiled ("focused" and "broad"). The "focused" category was comprised of microsatellite studies on corals known to have complex population structures influenced by ocean currents. To test if findings in the "focused" group are extendable to other systems, this was complemented by the "broad" category of microsatellite studies on a wide range of other terrestrial, freshwater, and marine taxa. To compare the four alternative K estimators (Puechmaille, 2016) to traditional methods (ΔK and ln Pr(X|K)), we first conducted a literature review of coral population genetics studies by searching the Web of Science using keyword combinations "coral population genetics" and "coral AND population genetics". From these searches we assembled a database of coral microsatellite data sets to represent our focused study system. To assemble a database of broad representation of taxa, we performed a search on The Dryad Digital Repository using the keywords "microsatellite population genetic structure". Studies based on single nucleotide polymorphism (SNP) data were excluded, as Puechmaille (2016) tested the alternative estimators using only microsatellite data. Puechmaille (2016) states that further testing is necessary to confirm whether conclusions about the alternative estimators can be extended to SNP data sets. Further, since Puechmaille (2016) created these estimators to analyse output from the software program STRUCTURE (Falush et al., 2003), data sets were selected if they had been analysed using STRUCTURE. Additionally, we selected data sets that met two criteria: loci were not found to be under selection and population structure was analysed using a minimum of five microsatellite loci.

| Broad data sets
The "broad" category included 15 studies, each targeting a different species from a wide range of taxonomic groups including plants and animals of marine, freshwater, and terrestrial habitats. The sample size across these data sets ranged from 73 to 913 individuals, and thus, sampling effort differed among studies (See Table S1). This group serves to provide a benchmark against which to compare the data sets focused on one phylogenetic group outlined below.

| Focused data sets
The "focused" category included 26 data sets targeting 20 coral species. The sample size of data sets in the "focused" category also varied (64 to 2,014 individuals; Table S1). Corals were specifically chosen to represent the "focused" category of data sets for the reasons outlined below. STRUCTURE and the ΔK method have been widely applied to the detection of population genetic structure in marine organisms with planktonic dispersal and complex life histories (Palumbi, 2003).
Corals are chief among them (Baums et al., 2012;Ledoux et al., 2015;Nakajima et al., 2017;Ruiz-Ramos et al., 2015). Corals' diverse life histories include asexual and sexual reproductive modes for some species (Baird et al., 2009). STRUCTURE plots often show complex patterns and determination of the best K results can be problematic in such cases (Lukoschek et al., 2016;Warner et al., 2015). It is unclear, however, whether the complex patterns are the result of biological phenomena such as unidentified cryptic species (Boulay et al., 2013), violations of the corresponding model assumptions such as overlapping generation times (Potts, 1984), extensive inbreeding (Richards & Oppen, 2012), isolation by distance (Aurelle & Ledoux, 2013), lack of strong differentiation, or poorly performing genetic markers (i.e., null alleles) (Dubé et al., 2017).
The complexity and diversity of corals makes for an excellent taxonomic group with which to test the performance of best K estimators under less simplistic conditions than represented by simulated data. Yet, the taxa in the coral group have a shared evolutionary history. .

| Population structure analysis
To assess the performance of each estimator on empirical data, we analysed each microsatellite data set using ParallelStructure (Besnier & Glover, 2013). To ensure comparability of the results, we ran our analysis with the STRUCTURE parameters described in the corresponding study. All studies considered each repeated multilocus genotype only once before running STRUCTURE. In the "focused" category, all 26 studies ran STRUCTURE under the admixture model and 24 studies used the correlated allele frequencies model. Seventeen of the studies used a location prior (Hubisz et al., 2009) to assist with clustering. In the "broad" category, all 15 studies used the admixture model, 14 studies used the correlated allele frequencies model, and three were run using a location prior. Complete details for the parameter settings of each data set can be found in Supporting Information on Dryad (https://doi.org/10.5061/dryad. zgmsb ccck).
First, we calculated the ΔK and the posterior probability (which relies on ln Pr(X|K)) estimate using Puechmaille's (2016) R script Kestimator V-1.13. We then estimated the best K according to Puechmaille's four alternative K estimators using the same R script (Puechmaille, 2016) Table S1). An ANOVA was performed on a linear mixed model fit by residual maximum likelihood (REML) to determine the effect of threshold on disagreement with ΔK. Following the same method, each Puechmaille statistic was compared to the posterior probability estimate based on ln Pr(X|K) described in (Pritchard et al., 2000). CLUMPAK (Kopelman et al., 2015) was used to visualize STRUCTURE plots.
In addition, to assess support for clustering solutions, analysis of molecular variance (AMOVA) (Excoffier et al., 1992) was conducted using Poppr v2.9.1 (Kamvar et al., 2014) with 999 permutations for a randomly selected subset of two data sets from each category ("focused" and "broad") in which all alternative estimators disagreed with both the ΔK and the ln Pr(X|K). For each of the four data sets, individuals were assigned by majority rule according to their membership coefficients into the number of clusters identified by the different K estimation methods (the alternative estimators, the ΔK, and the ln Pr(X|K)). AMOVA was run on each clustering solution for each data set, with only two exceptions. For the data set Baums et al. (2010), all alternative estimators found K = 1. AMOVA requires >1 group in order to compare variation between groups, thus, it was not run on a clustering of individuals into one singular population. For the data set Perez et al. (2014), majority rule assigned individuals to only 11 clusters, with no single individual having a membership coefficient high enough for assignment to a twelfth cluster. Thus, K = 12 as identified by the posterior probability method was excluded from AMOVA.

| Assessment of sampling strategy
The program STRUCTURE may not reliably estimate the true number of clusters when sampling is uneven (Puechmaille, 2016).
Consequently, we calculated sampling evenness scores for each data set to test whether the alternative estimators perform differently than traditional methods in situations of uneven sampling. We calculated the evenness score, E, for each data set using Shannon's diversity index (Equation 1). The number of multilocus genotypes (MLGs) per sampling site was used to calculate the evenness of each data set with respect to sampling scheme.
The result of the equation below yields a score from 0 to 1 where higher scores indicate a more even sampling scheme (see Table S4 for calculations).
where E = eveneness, N isite = number of MLGs at site, and N itotal = total number of MLGs.
Next, we tested if there was a relationship between the proportion of the new K estimators that were congruent with each traditional method (ΔK and ln Pr(X|K)) and the evenness of the sampling strategy. To do so, we performed a linear regression with sampling evenness as a predictor for proportion agreement.

| Comparison of K estimator performances: Focused category
For each data set in the "focused" coral category, 16 estimates of K were calculated from the R script Kestimator V-1.13 (the four alternative K estimators, each at four membership thresholds). The script also calculated the traditional ΔK and the posterior probability estimates.
The proportion of these 16 alternative K estimators that were congruent with the ΔK method varied across studies. The relative frequency of coral studies in which less than 20% of the 16 new K estimators agreed with the ΔK estimate was 50% ( Figure 1a). Additionally, most (62%) of the studies had less than 50% agreement with ΔK.
The alternative K estimators tended to return higher values of K than the ΔK method, with only two exceptions. On average, the MaxMeaK and the MedMeaK estimates, each at a membership threshold of 0.8, returned lower values of K than the ΔK method ( Figure 2a). In the empirical "focused" category data sets we analysed here, lower membership coefficient thresholds led to a higher magnitude of disagreement from ΔK across all four new estimators ( Figure 3). The effect of threshold was significant on the disagreement from ΔK according to ANOVA on a linear mixed model fit by

| Comparison of K estimator performances: Broad category
Nearly all the patterns present in the "focused" data set analysis were mirrored in the "broad" data set category analysis. Nine of the data sets in the "broad" category had lower than 20% proportion agreement between the alternative Puechmaille statistics and the ΔK estimate ( Figure 1c). Additionally, on average, all alternative K estimators were higher than the ΔK estimate regardless of threshold ( Figure 2c).
Proportion agreement between the posterior probability estimate and the alternative statistics was similarly low, with 11 out of the 15 "broad" category data sets showing less than 20% proportion agreement ( Figure 1d). In comparison to the posterior probability estimate, the Puechmaille statistics resulted in lower estimates of K on average (Figure 2d)-again, consistent with the trend present in the "focused" category of data sets ( Figure 2b). As in the "focused" data sets, lower membership coefficient thresholds led to a higher magnitude of disagreement from ΔK ( Figure 3).

| Influence of sampling effort on K estimates: Focused category
In the coral data sets, we found no significant relationship between sampling evenness and the proportion of alternative K estimators that agree with the ΔK estimator (Figure 4a) or the posterior probability ( Figure 4b). We compared sampling evenness and proportion F I G U R E 2 Average discrepancy from ΔK and posterior probability (+/-SEM). The alternative estimators include MaxMeaK (maximum of means), the MaxMedK (maximum of medians), the MedMeaK (median of means), and the MedMedK (median of medians). Each estimator was calculated at four membership coefficient thresholds (0.5, 0.6, 0.7, 0.8) which are shown on the x-axis. For the "focused" category (n = 26), the difference from each of the 16 alternative estimators to (a) the ΔK estimate and to (b) the posterior probability was calculated and averaged across all 26 studies for each estimator (shown on the y-axis). For the "broad" category (n = 15), the difference from the alternative estimators to (c) the ΔK estimate and to (d) the posterior probability is also shown. SEM =standard error of the mean F I G U R E 3 Difference from ΔK by threshold. A randomly selected subset of the alternative K estimators from a randomly selected subset of data sets from both the "focused" and "broad" categories is shown here to illustrate the effect of threshold for the alternative estimators (Puechmaille, 2016) on the magnitude of deviation from ΔK

F I G U R E 4
Sampling scheme and estimator precision. For each data set, an evenness score was calculated using the Shannon diversity index (plotted on the x-axis). Studies which had a more even number of samples taken from each site had a higher score (between 0 and 1). The proportion of alternative estimators that agreed with (a) the ΔK estimate and (b) the posterior probability was calculated for each data set in the "focused" category (n = 26) and plotted on the y-axis. A linear regression was plotted (dotted line) to show the relationship between evenness and proportion agreement with the (a) ΔK estimate (Adj. R 2 = -0.016; intercept =1.049 ; slope = -0.781; p-value =0.444) and (b) the posterior probability (Adj. R 2 = 0.079; intercept =2.415; slope = -2.182; p-value =0.088 ). Similarly, for data sets in the "broad" category (n = 15) a linear regression between evenness and proportion agreement with (c) ΔK estimate (Adj.

| Influence of sampling effort on K estimates: Broad category
Echoing the trends found in the "focused" category, the 15 data sets included in the "broad" category also returned no significant relationship between sampling evenness and proportion agreement for either the ΔK estimator (Figure 4c; R 2 = 0.231, p = 0.070) or the posterior probability (Figure 4d; R 2 = 0.10, p = 0.258) under a linear model. Each data set was weighted by sample size for all tests.
Additionally, we visualized the STRUCTURE plots for the Testudo hermanni data set of Perez et al. (2014) as an example with a relatively low evenness score (E = 0.84, see Table S1). The reanalysis yielded complete agreement between the Puechmaille statistics that contrasted with published findings using traditional methods (ΔK and ln Pr(X|K)). The authors reported a K = 5 (Figure 5a), however, alternative estimators reported a K = 7 (Figure 5b).

| Precision of Puechmaille estimates
Across all 41 data sets, the 16 Puechmaille estimates most commonly offered two (13/41 data sets) or one (11/41 data sets) K estimate(s) (See Table S1). In 75.6% of cases, the range of solutions offered by the Puechmaille estimators was ≤3. The largest range of K estimates provided by the Puechmaille statistics was 6 (1/41 data sets) and was found in only one data set, Kurita et al. (2014).

| Analysis of molecular variance
From the "broad" category, the data sets Kim et al. (2017) andPerez et al. (2014) were randomly selected. From the "focused" category, data sets Baums et al. (2010) and Rippe et al. (2017) were randomly selected. K estimation for each data set included a range of K values each supported by different K estimation methods (Table 1). Across all data sets and all clustering solutions, the majority of the variation was explained by differences within samples (Table 2). Additionally, the proportion of variation across all strata levels (between clusters, between samples within clusters, and within samples) were significant (p <.01 in all cases; Table S5) across all data sets and clustering solutions. The magnitude of the proportion of variation explained by differences between clusters varied by data set (Table 2). However, a notable trend found in all clustering solutions across all data sets, was the slight increase in the magnitude of the proportion of variation attributed to differences between clusters with an increase in K (Table 2). breeding system, and demographic history (Bohonak, 1999;Dillane et al., 2008;Les, 1988). The development of cost-effective molecular markers for non-model organisms combined with the adoption of Bayesian methods to detect even weak signals of population genetic structure has propelled the field forward (Baums et al., 2005;Garris et al., 2005;Latch et al., 2006). Yet, especially in non-model organisms, the determination of the best solution among the tested number of clusters in a Bayesian model can be difficult. Here, we report that the best K estimators (Puechmaille, 2016) suggest more population genetic structure across the majority of empirical coral microsatellite data sets tested compared to the most popular K estimation method, ΔK estimator. In contrast, the alternative estimators suggested less genetic structure than the posterior probability (ln Pr(X|K)). These patterns hold when extended to a broad group of unrelated taxa, and results agree with a previous study using simulated data sets (Puechmaille, 2016). Even sampling effort among populations is expected to lead to more accurate determination of best K and yet we found no significant relationship between sampling evenness and proportion agreement in the empirical data.

| Comparison to ΔK
Because we used the same parameters for STRUCTURE modelling that were used in the original studies, if there was hierarchy among clusters present, it remained intact. In other words, genotypes in the original and in our reanalysis were always split between the first two clusters in the same way, and then were assigned to the next cluster in the same way, and so forth for each higher number of K. This design allowed us to compare the solution suggested by the alternative K estimators to the results of the original studies. Alternative estimators agreed with the ΔK method across thresholds only when the best K was one or two ("focused" category: five out of 20 species, "broad" category: one out of 15 species, Table S1). In most other cases, alternative K estimators suggested that species may have more pronounced population structure than previously thought. In the "focused" data set category, 11 out of 20 species of varying habitat type and study design displayed this phenomenon. In the "broad" category, in 10 out of 15 studies alternative K estimators returned higher K solutions. Thus, across a wide range of taxa, the alternative K estimators indicated more population genetic structure than the ΔK method.
One notable case where we found evidence for more pronounced population structure was in the "focused" category data set corresponding to the coral Porites lobata (Baums et al., 2012). Initially, the ΔK method returned a best K = 5. Porites lobata from the Eastern Tropical Pacific were distinct from colonies from the central Pacific and Hawaii (Baums et al., 2012). Within Hawaii, there existed three co-occurring clusters that remained distinct from the remainder of central Pacific clusters. Another clustering algorithm, GENELAND (Guillot et al., 2005), returned a best K of seven with the possibility of an additional cluster being split in two, yielding nine clusters in total (Baums et al., 2012). Upon reanalysis with the alternative K estimators, the clear majority (14/16 estimators) pointed to a best K between seven and nine. One estimator agreed with ΔK and another reported a lower estimate of K = 4. The main conclusion of the study that there is a lack of geneflow across the eastern pacific barrier was upheld (see also Wood et al., 2016), but our reanalysis suggested additional population structure in the central Pacific with putative conservation implications at the regional scale.
Conversely, in some select cases the ΔK estimate was higher than the alternative estimates. One such case in the "focused" category was the data set corresponding to the coral Acropora digitifera (Nakajima et al., 2012). Though the ΔK estimate returned a best K of 2, the authors found evidence to suggest there was only one population. ΔK is known to be unable to report when the best K is 1 and instead most often reports K = 2 (Cullingham et al., 2020). However, the alternative Puechmaille statistics identified the best K = 1, except for those at the lowest (0.5) threshold. This same phenomenon can be found in the "broad" category of data sets in the New Zealand Sea Lion, Phocarctos hookeri (Osborne et al., 2016). Again, here the ΔK estimate suggested two populations. However, Osborne et al. (2016) found weak population differentiation with F ST values low enough to suggest no population structure and concluded that the result was more consistent with one population of individuals living in familial clusters. All of the alternative Puechmaille statistics again identified a best K of 1, except those at the lowest threshold.
This highlights the benefit to calculating these alternative statistics, while considering a range of thresholds. In adding to the recommendations by Cullingham et al. (2020) for determining when K = 2, we propose using the alternative estimators to help determine the level of support.
In four cases within the "focused" category, the alternative estimators showed little agreement amongst themselves and with ΔK in their best K solutions (Table S1). We suggest that difficulties in determining the best K can arise from hidden genetic diversity in the investigated species (Hajibabaei et al., 2007;Hebert et al., 2004).
Seriatopora hystrix had a wide spread of best K estimates with ΔK suggesting K = 3. The authors conducted a hierarchical analysis investigating all three clusters further. Clusters were grouped based on regional scales of clustering and five major genetic clusters were distinguished. However, reanalysis with new estimators suggested a minimum K = 4 and a maximum K = 9 (Table S1). The authors mention that cryptic species may have masked the true population connectivity signals, further investigation of which may be warranted based on our reanalysis of population structure. Corals hybridize frequently and the speciation process in this group may follow a pattern of reticulate evolution and thus cryptic lineages at all taxonomic levels are expected to be common (Kenyon, 1997;Veron, 1995;Vollmer & Palumbi, 2002;Willis et al., 2006).
In the "broad" category, one case in which all alternative estimators agreed with one another, but disagreed with ΔK occurred in a data set for Testudo hermanni, an endangered tortoise species in Mediterranean (Perez et al., 2014). All alternative Puechmaille statistics indicated K = 7, while the posterior probability indicated K = 12.
However, the ΔK estimate found the best K = 2. Perez et al. (2014) used STRUCTURE and GENELAND (Guillot et al., 2005) to draw their conclusions about population structure in this study. Using STRUCTURE, the authors found evidence for K = 2 and K = 5 by employing several traditional K estimation methods (ΔK and ln Pr(X|K)).
However, GENELAND analysis indicated a best K = 6. Curiously, under K = 5, samples from geographically distant localities (Spain, Sicily and Corsica) clustered together according to STRUCTURE (Figure 5a). The authors assert that massive translocations between Spain, Sicily, and Corsica are unlikely for this sedentary species of tortoise and instead suggest that prehistoric events could explain the admixture (Perez et al., 2014). However, the alternative estimators suggest K = 7 (Figure 5b). At K = 7, Spain, Sicily and Corsica con-

| Comparison to posterior probability
In both the "focused" and the "broad" categories, the posterior probability method yielded higher estimates of K than the alternative estimators and the ΔK method. This result could be due to the fact that ln Pr(X|K), the basis for calculating the posterior probability according to Bayes rule, is known to be sensitive to the STRUCTURE model which allows for allele frequencies to be correlated between subpopulations (Falush et al., 2003). The STRUCTURE manual recommends that default settings should include allowing for correlated allele frequencies, and indeed most (38/41) data sets reanalysed here, regardless of category, followed this recommendation.
However, Falush et al. (2003) find that this could result in a higher risk of overestimating K compared to the independent allele frequencies model. Since the alternative estimates of K are lower, on average, than the estimates calculated by the posterior probability method, it is possible that the Puechmaille statistics are less sensitive to such deviations in model assumptions. This corroborates the simulation study by Puechmaille (2016), which exclusively used the correlated allele frequencies model, showing that the posterior probability method overestimated the true K.

| Analysis of molecular variance
As the true K cannot be known in empirical data, we applied analysis of molecular variance (AMOVA) to a subset of data sets to evaluate its use as a method for determining which clustering solution was most supported. Data sets in which there was full disagreement between the Puechmaille statistics and both traditional K estimation methods were selected, as these cases are the most difficult to interpret and additional analysis is warranted to determine the best clustering  (Meirmans, 2015). However, Meirmans (2015) indicate that reporting F ST values is acceptable. With the expectation that the magnitude of significant variance explained by differences between clusters should be maximized in the best solution, we compared AMOVA results across clustering solutions for each data set. Perhaps not unexpectedly, we noted that across data sets, the proportion of variance explained by differences between clusters increased slightly with increasing number of clusters, K (Table 2). This finding is similar to the results of a recent simulation-based study which found that the magnitude of ΔK was correlated with F ST , with higher values of ΔK having more supported population structure (Cullingham et al., 2020). It may well be that the magnitude of variance explained is simply always maximized at the highest value of K. This could be the case if increasing the number of model parameters by adding more clusters increases the distance between clusters. Thus, a simulation study is necessary TA B L E 2 Analysis of molecular variance (AMOVA) across clustering solutions for randomly selected data sets in the "broad" category (a) and the "focused" category (b) to assess whether AMOVA can assist with identifying the best clustering solution.

| Evenness assessment
Since the inception of STRUCTURE, Evanno (2005) and others have warned users that uneven sampling across strata may influence the accuracy of determining the best K (Evanno et al., 2005;Kalinowski, 2011;Puechmaille, 2016). In fact, previous work has recommended modifying alpha values when running STRUCTURE to address this (Wang, 2017). Because STRUCTURE can detect weak population signals (Latch et al., 2006), Puechmaille (2016) theorized that uneven sampling was the main contributor to ΔK's inability to identify the correct K. Further, previous work has found that uneven sampling design in a multispecies empirical data set did impact STRUCTURE results (Meirmans, 2019). Thus, we initially hypothesized that the discrepancy between Puechmaille's estimators and ΔK was due to uneven sampling across clusters.
ΔK is affected by uneven sampling because STRUCTURE tends to place individuals from an oversampled subpopulation into their own cluster while putting a sparsely sampled subpopulation into its own cluster, regardless of the true evolutionary history.
Puechmaille's new estimators avoid this by implementing a range of cluster membership coefficients (from least stringent, 0.5 to most, 0.8) and accounting for maximum population subdivision via the estimators MaxMeaK and MaxMedK, thus correcting for STRUCTURE's downward biased estimates of K.
To test how sampling evenness affects best K estimates, we calculated evenness scores for each study ( Figure 4) and correlated these scores with the proportion of estimators that agreed among all best K estimators. Surprisingly, we found no significant relationship between sampling evenness and proportion agreement among best K estimators for both the "focused" and "broad" category data sets. In fact, a subset of studies at all levels of sampling evenness had high proportion agreement scores. The study that was the least evenly sampled, had one of the highest proportion agreement scores.
The poor power of sampling evenness to predict the ease of which the best K could be determined may stem from overestimating evenness. In human studies, populations are typically grouped based on linguistic, cultural, or physical characters and then sampled as evenly as possible (Pritchard et al., 2000). However, a priori stratification of many non-model organisms into sampling groups is often not possible due to a lack of obvious phenotypes and poor understanding of metapopulation structure. In fact, the latter is often a motivation to conduct a STRUCTURE analysis.

TA B L E 2 (Continued)
while others were under-sampled. Therefore, evenness scores as calculated here for a given study might be high and yet do not reflect even sampling of populations. Additionally, even sampling of populations across a species' range is logistically challenging.
Oversampling may occur at the center of a species' range because there are more individuals per unit area making sampling easier.
Likewise, undersampling may occur at the margins of the range because, by definition, organisms occur at lower density requiring higher sampling effort.
Regardless of the reason why there is a lack of correlation between evenness and ease of determining the best K in this meta-analysis, it is very difficult to achieve even sampling across populations in practice even if it is desirable. It thus behooves us to use population genetics tools that can deal with reality by correcting for sampling unevenness ex post facto, as the alternative estimators do. We recommend using ΔK and the posterior probability to get a basic cluster estimation, followed by an analysis that uses all alternative K estimators at a range of thresholds. Since each estimator has different sensitivities and choice of threshold has a significant effect on result, comparing each to ΔK and the posterior probability during analysis offers the most robust procedure for estimating K in the case of potentially ambiguous sampling evenness. We additionally recommend inspecting STRUCTURE plots to tease out the best estimation of K in case new estimators give an ambiguous result (rare). Combining all four strategies-the ΔK, the posterior probability, the alternative K estimators, and examination of STRUCTURE plots-ensures the most robust estimation of K and will allow researchers to detect biological subtleties that may not be recognizable using the ΔK estimate alone.

| Final thoughts
Our comprehensive reanalysis of population genetic structure across both a focused group of taxa (corals) and a broad group of unrelated taxa from across the Tree of Life indicates that population genetic structure may be more pronounced than previously described. The alternative K estimators typically agreed with each other across thresholds and ΔK when there was clear population structure across space. However, there were cases showing disagreement amongst estimators when population structure was more complicated, for example when sympatric samples were assigned with high probability to different clusters. Since the new estimators more accurately predicted K than ΔK's and the posterior probability's predictions in studies where the best K was known (i.e., simulated data; Puechmaille, 2016) and there were substantially more empirical studies whose alternative K estimates differed drastically from traditional K estimation predictions than agreed with it (See Table S2), we recommend the incorporation of the alternative estimators to determine the best K.
Our finding of little agreement between K estimation methods across a wide range of data sets and sampling schemes indicate that choice of estimator has a substantial impact on the results in empirical data. We find here that due to large scale disagreement between estimator solutions across data sets, a multiestimator approach is always required. Additionally, broader reanalysis of existing microsatellite data sets may be warranted and has the added benefit of preserving these data sets for future use as many of these data sets were published before the advent of online repositories.

ACK N OWLED G EM ENTS
We thank all authors who made their microsatellite data publicly available.