Methodological advances for hypothesis‐driven ethnobiology

Ethnobiology as a discipline has evolved increasingly to embrace theory‐inspired and hypothesis‐driven approaches to study why and how local people choose plants and animals they interact with and use for their livelihood. However, testing complex hypotheses or a network of ethnobiological hypotheses is challenging, particularly for data sets with non‐independent observations due to species phylogenetic relatedness or socio‐relational links between participants. Further, to account fully for the dynamics of local ecological knowledge, it is important to include the spatially explicit distribution of knowledge, changes in knowledge, and knowledge transmission and use. To promote the use of advanced statistical modelling approaches that address these limitations, we synthesize methodological advances for hypothesis‐driven research in ethnobiology while highlighting the need for more figures than tables and more tables than text in ethnobiological literature. We present the ethnobiological motivations for conducting generalized linear mixed‐effect modelling, structural equation modelling, phylogenetic generalized least squares, social network analysis, species distribution modelling, and predictive modelling. For each element of the proposed ethnobiologists quantitative toolbox, we present practical applications along with scripts for a widespread implementation. Because these statistical modelling approaches are rarely taught in most ethnobiological programs but are essential for careers in academia or industry, it is critical to promote workshops and short courses focused on these advanced methods. By embracing these quantitative modelling techniques without sacrificing qualitative approaches which provide essential context, ethnobiology will progress further towards an expansive interaction with other disciplines.


I. INTRODUCTION
It is becoming increasingly clear that the human dimension of conservation science plays a significant role in solving various traditional and emerging conservation crises (Poe, Norman & Levin, 2014). To be successful, solutions to these conservation problems must be culturally appropriate, judged by how they are adopted and implemented by and with the local people who live with the biodiversity we want to conserve (Winter, Ticktin & Quazi, 2020). Therefore, it is critical that we study, understand, and value local people's knowledge of their environment and how they manage it. Embracing the contribution that ethnobiology can make in this context is often treated in parallel in the literature with the discipline often misunderstood or even not considered. For example, studies on the human dimension in conservation list numerous integrated or combined disciplines (e.g. environmental anthropology, environmental sociology, human-environment geography, environmental humanities). However, they fail to acknowledge ethnobiology as a key discipline within the domain of social conservation science (Bennett et al., 2017). This suggests that ethnobiology needs further development to engage with the biological and mathematical sciences. To achieve this, ethnobiology must embrace the theory-inspired and hypothesis-driven approach we recently advocated (McClatchey et al., 1999;Gaoue et al., 2017) along with several other authors spanning more than half a century (Ford, 1978;Bennett, 2005;Martin, 2007;Albuquerque, 2009;Albuquerque & Hanazaki, 2009;Hurrell & Albuquerque, 2012;Albuquerque et al., 2019a). Moreover, to improve methodological rigour, it is important that ethnobiology also embraces advanced methodological approaches, particularly new statistical modelling methods without sacrificing the strength of qualitative methods that provide important contexts.
Ethnobiology is the study of the interactions between people, organisms, and culture. Because of its interdisciplinary nature, most ethnobiologists are either trained biologists who are comfortable conducting interviews and asking questions about plants/animals or environmental use, or trained anthropologists who are interested in learning about, identifying and cataloguing plant or animal species. This double origin of ethnobiologists is consistent with the hybrid methodological culture of this discipline and has led to some of the methodological challenges that it now faces. Ethnobiologists have been exposed to a diversity of fields and analytical methods of inquiry for several decades. These methods are well established in series of excellent books (e.g. Alexiades, 1996;Cunningham, 2001;Martin, 2007;Albuquerque et al., 2014) and are regularly used in this field. However, most ethnobiological studies tend to include more tables than figures and lack statistical rigour. The seminal papers of Phillips & Gentry (1993a,b) led to several studies seeking to test hypotheses using classic statistical methods such as one-way analysis of a variance, Student t-test and chi-squared. However, most quantitative methods in ethnobiology were limited to metrics such as species importance indices (Phillips & Gentry, 1993a,b;Hoffman & Gallaher, 2007). With the development of new statistical methods such as classification, ordination, and general linear models, tremendous progress has been made in ethnobiology (Moerman, 1979;Voeks & Leony, 2004). Begossi (1996) introduced the well-known ecological methods of diversity rarefaction techniques which gained much support in ethnobotanical studies. As a result, several studies on the diversity of local knowledge of plant use have conducted rarefaction analyses to estimate plant knowledge richness (Colwell, Mao & Chang, 2004). Despite this progress, most quantitative ethnobiological methods are used without a clear conceptual framework, constraining our understanding of the mechanisms behind plant use and selection. Recently Gaoue et al. (2017) proposed a synthesis of 17 testable hypotheses in ethnobotany and thereby built a foundation upon which advanced statistical methods can be applied to enhance ethnobiology as a scientific discipline. There is now a small but increasing number of hypothesis-driven ethnobiological studies (Almeida et al., 2005;Alencar et al., 2010;Soldati & Albuquerque, 2012;Hart et al., 2017;Seyler et al., 2019;Bond & Gaoue, 2020;Muleba, Yessoufou & Rampedi, 2021).
Because most Ph.D graduates are more likely to find careers in industry rather than tenure-track academic positions, and the ethnobiologists that do find academic jobs tend to join life sciences or ecology and evolutionary biology departments, it is important that ethnobiological education includes a robust quantitative analysis toolkit (Beans, 2018). This toolkit must go beyond classical statistical analyses to include advanced statistical modelling tools. An understanding of these advanced analytical approaches will make ethnobiologists more competitive in the job market and also allow them to advance our understanding of how people select plants and animals for use, thereby shaping their environment in a non-random way. We postulate that the slow adoption of modern statistical tools in ethnobiology is due to the apparent complexity they present and the need to use syntax-based programming software such as R (R Development Core Team, 2019) to run these analyses. We also postulate that a clear presentation of the ethnobiological applicability of these advanced methods, practical examples using these methods to address ethnobiological research questions and well-commented seed R scripts implementing these methods will improve the likelihood that ethnobiologists embrace these new tools.
Herein, we synthesize methodological advances for hypothesis-driven research in ethnobiology. This includes how to test hypotheses in ethnobiology using (i) generalized linear mixed-effect modelling, (ii) structural equation modelling, (iii) phylogenetic generalized least squares, (iv) community phylogenetic tools, (v) social network analysis, (vi) species distribution modelling and spatial ecological tools, and (vii) predictive modelling. For each quantitative method, we include a brief presentation of the methods and its goals, why/how it can be used to test ethnobiological hypotheses and provide example or seed R scripts showing how it can be applied to real field data (see Codes R1-6 and associated data sets in Gaoue et al., 2021). We selected these quantitative methods for two main reasons. First, these methods are commonly used in the biological sciences, and familiarity with these methods will allow ethnobiologists to communicate their science better to a broader audience. Second, these methods are particularly useful to test ethnobiological hypotheses Albuquerque et al., 2019b), enabling ethnobiologists to avoid common statistical dataanalysis errors.

II. OVERVIEW OF THE ADVANCED QUANTITATIVE METHODS
Gaining a mechanistic understanding of why local people select plants or animals for use is a major question of interest for ethnobiologists Albuquerque et al., 2019a). In particular, understanding how and why plant/animal traits and human traits synergistically affect local people's knowledge and use of plants is at the core of ethnobotany inquiries. Several theoretical and empirical studies suggest the likelihood that a given plant species or family is selected and used as medicine depends on the socio-demographic traits of who is using the plant . For example, some plant species are used more by a given gender (Voeks & Leony, 2004) or age group (Albuquerque et al., 2011;Souto & Ticktin, 2012) than others. Medicinal and food plants are not selected randomly but according to their chemical traits (Moerman, 1979(Moerman, , 1996, and also indirectly for their physical traits (Ford & Gaoue, 2017). It is important to be able to investigate these features systematically to gain a mechanistic understanding of plant selection by human societies across historical and contemporary scales.
The data sets used to test these hypotheses often include species or participants as observations. To test hypotheses, one must take care not to violate the assumption of independent observations (Kenny & Judd, 1986;Romano & Kromrey, 2009). However, because plant species have evolutionary relationships and human participants can have kinship or social relationships or some form of spatial autocorrelation, it is critical to control for non-independent observations. Phylogenetic generalized least square (PGLS) methods can be used to control for such phylogenetic relatedness between plant species and confirm the importance of phylogeny in conferring medicinal value to plants Ernst et al., 2015;Yessoufou, Daru & Muasya, 2015). Considering kinship by using subjects as random effects in generalized mixed-effect or hierarchical models can also be used to account for family or genealogical relatedness among participants.
Ethnobiology as a discipline should embrace the role of socio-relational linkages (Bodin, Crona & Ernstson, 2006;Bond & Gaoue, 2020) in the creation, transformation and transmission of knowledge . Of particular interest is the role of social influence, social capital, and homophily in shaping the dynamics of local ecological knowledge (Prell, 2012;Kendal et al., 2018;Kluger et al., 2020). Although social relations are at the core of community dynamics, these interactions are often neglected in ethnobotanical literature [but see Hopkins (2011) and Reyes-García et al. (2013b)] as strong drivers of knowledge and subsequent practice. It is critically important to control for the impact of such relational effects when testing the effect of other drivers of knowledge such as human demographic traits or species physical and chemical traits. Social network analysis can be used in ethnobiology to understand the drivers of plant or animal selection and use for livelihood, and also to assess knowledge transmission mechanisms (Gallois et al., 2018;Castiñeira Latorre, Canavero & Arim, 2020).
Separating correlation from causation is one of the most important discussions in biological and social sciences (Shipley, 2016). However, demonstrating causation is not possible using most common statistical methods, which only measure correlation. Finding evidence of causation is particularly challenging when working with multiple predictor variables. Multiple predictors are often rare in ethnobiological Biological Reviews 96 (2021) 2281-2303 © 2021 Cambridge Philosophical Society.
Statistical advances in ethnobiology studies, which sometimes use multiple one-way analysis of variance where multiple regression could have been valuable [but see Gaoue & Ticktin (2009) and Souto & Ticktin (2012)]. Structural equation modelling is a powerful technique that allows testing of networks of hypotheses, hence using multiple predictors, some of which may be correlated, to demonstrate causation (Shipley, 2003).
A critical component of hypothesis testing and theory building is testing predictions to identify generalizable principles. Most statistical analyses in ethnobiology focus on testing hypotheses by explaining patterns in data. However, a more direct approach to theory development is testing how well statistical models generalize to data that were not used to develop the model. This approach, known as predictive modelling, has been used primarily to develop species distribution models, which create species range maps under current and future climate conditions. However, predictive modelling approaches can be integrated with many other kinds of statistical models, including all models discussed in this review.

III. ADVANCED STATISTICAL METHODS FOR HYPOTHESIS TESTING IN ETHNOBIOLOGY
The first set of methods tests for significant effects of drivers of ethnobiological knowledge (generalized mixed effect-models and structural equation modelling). The second set addresses limitations in the way these analyses are used by correcting for phylogenetic relatedness or accounting for social and family ties or spatial links (phylogenetic generalized least squares and social network analysis). The third set evaluates how well models generalize to data that were not used to build the models (species distribution models and predictive modelling).
Most ethnobiological studies collect data with inherent multilevel structure such as villages, plots, communities, and taxonomic groups. For example, individuals interviewed in villages are generally chosen non-randomly, or within social groups where it is impossible to ensure independence among villages, and also among interviewees. Here, the study design presents a hierarchy with multiple levels from the individual to the neighbourhood, village, community and district. Each level of these structured data may impact the information delivered by an individual. Analysis of this type of data is feasible with multilevel models. Also, different ethnobiological studies are conducted in the same communities but vary temporally (time-series data). These data cannot be considered as independent since they are all from the same communities (pseudo-replication). Such data are, however, interesting to understand the dynamics of ethnobiological knowledge over time. Thus, a statistical control for time becomes necessary in the analysis of such time-series data to reveal the effects of time on ethnobiological knowledge generation dynamics within a community.
Multilevel models, unlike classical analysis of variance, incorporate both fixed and random effects, with variance components that can be partitioned according to the structure of the data (Pinheiro & Bates, 2000). Fixed effects are "factors whose levels are experimentally determined or whose interest lies in the specific effects of each level, such as effects of covariates, differences among treatments and interactions" (Bolker et al., 2009, p. 128). Random effects are "factors whose levels are sampled from a larger population, or whose interest lies in the variation among them rather than the specific effects of each level" (Bolker et al., 2009, p. 128).
Application -One of the major hypotheses in ethnobiology predicts that traditional knowledge will increase with age and in some cases, women will have more local ecological knowledge than men . This is one of the most tested hypotheses in ethnobiology, often indirectly because the primary goal of most studies is to demonstrate rather than partition local knowledge. To test this hypothesis, we can use data on the age, gender and educational level of participants selected randomly in different villages within random regions of the study area. The response variable is binary (knows/does not know) and captures the probability that a participant knows a given medicinal plant. Age, gender, and education level are the fixed terms while the village, region and the participant selected represent random terms. If the intention is to test the same hypothesis in three climatic regions, then a given region becomes a fixed factor. For these two scenarios, we used the package glmmTMB (Brooks et al., 2017) in R to fit multilevel models ( Fig. 1) since the response variable is binary and requires a generalized linear mixed-effects model (GLMM) with a binomial error structure (Crawley, 2013). We illustrate this and provide data and seed code to reproduce this analysis as Code R1 in Gaoue et al. (2021). A direct implication of multilevel data is the need for random effects models. The variables which stand as levels or random effects are frequently left out of data analyses when they should be included as random effects. In such cases, traditional linear models are less adequate than multilevel models which incorporate random effects. Multilevel models integrate different structures for random effects such as individual, crossed and nested structures as well as various types of models accounting for different levels of complexity including a random intercept, random slope, and both random intercept and slope models. For example, a nested random effect could arise from establishing plots as nested in sites (Wall et al., 2019) or from genera nested within a given plant family (Bebber et al., 2007). Another example is related to testing the theory of non-random medicinal plant selection (Moerman, 1979;Ford & Gaoue, 2017;Robles Arias et al., 2020;Muleba et al., 2021). Data used to test this theory include the total number of species per family (predictor variable) and the number of medicinal plant species per family (response variable). In this context, plant family can be used as a random term which will require a random intercept model. Here, a GLMM with a Poisson error structure is appropriate to analyse the data (Fig. 1).
Another advantage of multilevel models for ethnobotanical data analysis is the wide range of complex data that can be handled. The dependent or response variables can be measurements, counts, binomial, multinomial, or ordinal. Multilevel models offer more parameters to specify highly complex families such as zero-inflated or one-inflated distributions. Many R packages are available to fit multilevel models including lme4 (Bates et al., 2015), nlme (Pinheiro & Bates, 2000), glmmTMB (Brooks et al., 2017), glmmADMB (Bolker et al., 2012), MCMCglmm (Hadfield, 2010) and brms (Bürkner, 2017).

(2) Structural equation modelling (SEM)
Ethnobiological motivation -Structural equation modelling (SEM) is a statistical method that can help elucidate the mechanisms underlying plant selection (Loken, 2009). SEM fits a network of hypotheses to data and is well known in the social sciences for its ability to handle complex models (Moreira et al., 2016). SEM has three advantages over traditional statistical methods (Grace et al., 2010). First, unlike most of methods that measure correlation, SEM can assess causation (Lefcheck, 2016). Second, it enables the measurement of errors for correlated variables that are inherent to social science. Finally, SEM can deal with latent variables, i.e. non-observable quantities or factors underlying observed variables that are inherent to social science (Tarka, 2018). R syntax for a traditional generalized linear model (GLM) and a generalized linear mixed-effect model (GLMM) testing the effects of age, gender, and education on knowledge of plant use. We also include a second example of GLMM and associated R syntax fitting a multilevel model for an ethnobiological study testing the theory of non-random plant selection with a random intercept mixed-effect model.
Below we detail three steps to analyse the drivers of plant selection for medicinal purposes using SEM and major theories in ethnobiology.
Application: building a SEM -First, we define a metamodel describing how plant secundary chemistry and life forms can mediate the effect of resource availability on plant species use values (Fig. 2). This model is based on logical predictions from the resource availability hypothesis and the ecological apparency hypothesis. The ecological apparency hypothesis predicts that plants that are easily found by herbivores (e.g. long-lived species) should invest heavily in quantitative defences (e.g. tannins) that are effective against all herbivores (Feeny, 1976). By contrast, plants that are difficult to locate (short-lived herbaceous plants) should invest smaller amounts of resources in qualitative defences (e.g. alkaloids) that are effective against all but specialist herbivores. Given that qualitative defences are expensive to produce, short-lived species will be less likely to be used for medicinal purposes than longlived species. The resource availability hypothesis predicts that plant defensive capabilities are mediated by their capacity to replace lost tissue (Coley, Bryant & Chapin, 1985). Therefore, fast-growing plants should invest relatively little in defence and use mobile resources that can be removed rapidly via senescing tissue. By contrast, slow-growing plants, characteristic of low-resource environments, should invest more in chemical defence because plant tissue is more costly to replace. Therefore, long-lived (slow-growing) species will likely be used for medicinal purposes more than short-lived (fast-growing) species.
Second, we estimate the path coefficients. Path coefficients are the strength of the relationship between variables. For example, we predict that species use value will increase with the concentration of secondary compounds (a 22 ) and with increasing complexity of life form (a 12 ) captured here using the proxy of plant height (Fig. 2). This can be implemented using various packages including piecewiseSEM (Lefcheck, 2016), lavaan (Wahren, 1966) or brms (Bürkner, 2017) in R (R Development Core Team, 2019). Finally, to test if there is strong support for the resource availability hypothesis versus the ecological apparency hypothesis, ethnobiologists could compare the path coefficients that represent these two models. If ja 21 *a 22 j > ja 12 j, then one would conclude that there is stronger support for the resource availability hypothesis than the ecological apparency hypothesis in explaining the variation of species use value (Fig. 2).
Structural equation modelling can also be used to analyse the influence of age and gender on the dynamics of local ecological knowledge. Female gender, increasing age, illiteracy, and decreasing formal education are all positively correlated with medicinal plant knowledge (Voeks & Leony, 2004). Older people are expected to have more knowledge of useful plants (Souto & Ticktin, 2012). Women tend to know more species and their specific uses than men (Souto & Ticktin, 2012). The degree of illiteracy is higher in rural areas (Vandebroek & Balick, 2012;Reyes-García et al., 2013a). From these relationships, ethnobiologists can first build a metamodel ( Fig. 3A) and then estimate the path coefficients using R (Fig. 3B). One can then compare the estimated path coefficients to understand which pathway best explains the factors that shape plant knowledge. For example, the metamodel includes the following pathways: plant knowledge is shaped directly by people's age (b 11 ) or gender (b 32 ), or indirectly through the mediated effect of literacy which can be affected by age (b 41 *b 22 ) or gender (b 42 *b 22 ) (Fig. 3A). Similarly, urbanization can directly reduce plant knowledge (b 12 ) due to limited availability of plants to interact with or via the mediated effect of increased literacy (b 21 *b 22 ) which might be associated with a greater opportunity for schooling that might accompany urbanization (Fig. 3A).
Finally, we used the same data set to illustrate how SEM can be implemented in R ( Fig. 4; Code R2 in Gaoue et al., 2021). We used a simplified version of the metamodel linking knowledge, age, gender, and degree of urbanization ( Fig. 3A). Here knowledge is binary and represents the ability of a person to know or not know the importance of plants. Similarly, gender (male or female) is binary. The degree of urbanization is the distance between each participant and the closest large city. We used the package piecewiseSEM to fit the model (Fig. 4). Because our endogenous variables are binary, we used generalized linear models with binomial error distribution. The results of the model (Fig. 3B) show that local knowledge was strongly influenced by gender (β = 0.615 ± 1.078, P < 0.0001), age (β = 0.413 ± 0.026, P < 0.0001) and degree of urbanization (β = 0.1655 ± 0.0069, P < 0.05). These results suggest that local knowledge increases with age; the older someone is, the more they know about the importance of plants. Moreover, people in rural areas had a better knowledge of local plants than people in large cities. We infer from the SEM that age was the stronger driver ( Fig. 3B) since it has a larger path coefficient than the other two paths (the paths linking knowledge directly with urbanization and that linking knowledge with urbanization via a mediation effect of gender). (

3) Phylogenetic generalized least squares (PGLS)
Ethnobiological motivation -All plant species used in traditional ethnomedicine are evolutionarily related. Some plant families will be over-or under-utilized for medicinal purposes due to shared evolutionary traits including the presence or absence of high concentrations of secondary plant compounds (Moerman, 1979;Heinrich & Verpoorte, 2012;Ford & Gaoue, 2017;Souza, Williamson & Hawkins, 2018). Indeed, it has been demonstrated that plant family is a strong predictor of species use values (Phillips & Gentry, 1993a). Although these factors have been acknowledged by ethnobiologists and likely play a significant role in driving plant use and selection among human societies, many analyses of data that include plant taxa have failed to use statistical models that account for species shared evolutionary history. In ethnobiology, general linear models are in widespread use for analyses that consider plant taxa. However, the independence of residuals is a critical assumption of these models (Quinn, Keough & Reid, 2002;Mundry, 2014) that is often overlooked or violated where plant taxa are considered (Revell, 2010), potentially making these approaches misleading because they fail to control for species shared evolutionary history. We caution that PGLS should be used by ethnobiologists where plant taxa are included as observations . Examples include investigations of cultural keystone species (Coe & Gaoue, 2020) and the utilitarian redundancy model , which we highlight below. PGLS provides a mean to fit a linear regression to investigate the effect of a given predictor variable on a response variable while controlling for non-independence between observations due to shared phylogenetic history (Revell, 2010;Mundry, 2014). These models include different assumptions of character evolution (e.g. Brownian motion) and branch-length scaling parameters such as λ (Pagel, 1999;Revell, 2010;Mundry, 2014), which are important considerations as they can affect the results, and thereby impact the reliability of PGLS analyses (Revell, 2010;Mundry, 2014) and statistical inference in ethnobiology.
Application -The first step in using a PGLS model for ethnobiological data is to create a phylogeny of the medicinal   Fig. 3, to test hypotheses using ethnobiological data. This SEM was implemented using the R package piecewiseSEM which allows the inclusion of endogenous variables with non-normal residuals. In this specific case, we use binomial error for all endogenous variables that are binary.  (Fig. 5). It was constructed from a comprehensive phylogeny for vascular plants (Jin & Qian, 2019) consisting of 31389 tip labels and 31387 internal nodes. The resulting phylogenetic tree contains 62 tips and 59 internal nodes .
After the construction of a phylogenetic tree, PGLS models can be implemented in R using packages such as ape (Paradis et al., 2020). We recommend starting with a saturated model that includes all predictors followed by the development of subsequent nested model candidates created by sequentially removing one of the predictors. Use of the Akaike information criterion (AIC) will allow the selection of the best-fitting models (Crawley, 2013) by comparing models' ΔAIC (the difference in AIC between each model), and identifying the model with the lowest AIC. In this approach, models with ΔAIC less than 2 are considered as the best models with the greatest explanatory power.
Coe & Gaoue (2021) discuss PGLS model selection and the importance of controlling for species' shared evolutionary history using ethnobiological data. They investigated the applicability of the utilitarian redundancy model, an ethnobiological theory that provides a framework for understanding the effects of species' shared evolutionary history, therapeutic redundancy, use values and local preferences on medicinal species use pressure . The study analysed ethnobiological data collected from 30 participants during fieldwork (interviews, focus groups and plant collections, participant observations) in the Shipibo-Konibo community of Paoyhan located in the Ucayali river region of the Peruvian Amazon. Of particular interest,  demonstrate how failure to control for shared evolutionary history between medicinal species used in ethnomedicinal contexts in the Ucayali river region of the Peruvian Amazon rainforest may result in misleading conclusions. While previous studies (Ferreira Júnior et al., 2012) of the utilitarian redundancy model demonstrated that preference by local people was a strong predictor of medicinal species use pressure,  demonstrate that when controlling for phylogenetic history in predictive models, the effect of local preference on medicinal species use pressure depends more on therapeutic redundancy. Further, the authors demonstrate that the absence of phylogenetic correction in predictive models led to an erroneous conclusion that local preference alone was a significant predictor of medicinal species use pressure (Table 1). We provide the data (see also  and seed code to reproduce this analysis for PGLS and phylogenetic tree construction as Code R3 in Gaoue et al. (2021).

(4) Testing for a phylogenetic signal
Ethnobiological motivation -Alternative approaches to PGLS include testing for a phylogenetic signal in a given ethnobiological variable. This test can reveal whether closely related species share similar values for a variable due to their shared evolutionary history. For example, one may ask if closely related species tend to have similar use value in a given community or village. This approach has been used to uncover cross-cultural convergence of medicinal plant use as well as the phylogenetic underpinning of plant therapeutic value (Saslis-Lagoudakis et al., 2012;Haris Saslis-Lagoudakis et al., 2014;Yessoufou et al., 2015;Teixidor-Toneu, Jordan & Hawkins, 2018;Lei et al., 2020). Where there is a phylogenetic signal in use value, phylogenetic information could be used to predict the use value of a species for which we lack such information or to group species in different use value categories. Thus, a species belonging to a 'hot node' of high use-value species on a phylogeny has a high probability of also having a high use value (Fig. 6).
The ethnobiological variables at hand dictate the options of analyses to use. For example, when dealing with a continuous variable (e.g. use value, redundancy, use pressure, etc.), testing for a phylogenetic signal can be done with the phylosig function of the R package phytools (Revell, 2012). Ethnobiologists can estimate Blomberg's K value or Pagel's λ, which range from 0 (no phylogenetic signal) to 1 (strong phylogenetic signal) (Pagel, 1999;Blomberg, Garland & Ives, 2003). For Pagel's λ, the log likelihood ratio can be used to examine if λ is significantly greater than zero. To test for significance of the K test, values of the ethnobiological variables at hand can be reshuffled multiple times across the phylogeny to calculate whether the observed K value departs significantly or not from random expectations. Both λ and K tests can be implemented using the phylosig function in the R package phylotools.
However, for binary variables, e.g. medicinal status (medicinal versus non medicinal), it is more appropriate to use the D statistic (Fritz & Purvis, 2010). This can be calculated using the R function phylo.d implemented in the package caper (Orme et al., 2014). A value of D = 0 is indicative of a phylogenetic signal as expected under a Brownian model; D < 0 indicates a strong signal (in the scenario shown in Fig. 6, medicinal status is likely to be highly conserved and thus to have a strong signal); and D ≥ 1 indicates overdispersion (absence of a signal). The significance of the D value is estimated by comparison of the observed D with the distributions of scaled D from simulations.
The net relatedness index (NRI) and net taxon index (NTI) metrics from community phylogenetics (Webb et al., 2002)

Ma ns oa all iac ea
Ocim um camp echi anum

Statistical advances in ethnobiology
can also be used to test for a phylogenetic signal in a binary variable. Both NRI and NTI are used for sets of species, for example to investigate whether a set of medicinal plants in a community or village is phylogenetically more closely related than predicted at random (i.e. are medicinal plants a random subset of the total flora of a village?). One may also be interested in testing whether a set of plants, for example those with red sap, is more related than expected by chance. These metrics are very similar, with the only difference being that NRI assesses the phylogenetic signal deeper in the phylogeny while NTI estimates the signal towards the tips of the phylogeny. They can be calculated in the R package picante (Kembel et al., 2010) using the functions ses.mpd and ses.mntd, respectively. These metrics calculate a Z-value, where a positive value suggests that the set of species of interest is more closely related than expected (i.e. a phylogenetic signal is present); the significance level can be estimated by reshuffling species across the phylogeny n times (e.g. 100 times). These metrics have the advantage of accommodating binary variables as well as species abundance.
Application -To illustrate how to test for a phylogenetic signal in ethnobiological traits, we use the ethnobiological data collected by Coe & Gaoue (2021) on plants' use value, redundancy and use pressure discussed in Section III.3. The required phylogeny can be generated as explained in Section III.3, and then Code R4 in Gaoue et al. (2021) (see Fig. 7) can be used to calculate the K value. We found no evidence for a significant phylogenetic signal in use value (K = 0.21, P = 0.47), redundancy (K = 0.11, P = 0.79), and use pressure (K = 0.283, P = 0.33). The same R script can be used to calculate λ. To illustrate the use of NRI or NTI to test for a phylogenetic signal in presence/absence data, we can run the R script provided in Fig. 7 for the binary variable in our data (species preference). For this analysis NRI = 1.48 (P = 0.07; 1,000 randomizations) and NTI = 0.13 (P = 0.41), suggesting that the set of medicinal plants preferred by the local human community was not significantly clustered according to phylogeny (Fig. 8).
Note that this measure of phylogenetic signal differs from that inferred by PGLS models. PGLS models specifically account for phylogenetic effects within the context of a regression, which may not be related to the degree of phylogenetic signal in the response or predictor variables themselves (Revell, 2010). Thus, it is important to consider the context in which either method should be used and to interpret the results cautiously (Revell, 2010).

(5) Social network analysis and exponential random graph modelling (ERGM)
Ethnobiological motivation -Social network theory states that the flow of information through social networks is determined by the configuration of human relationships comprising the network (Granovetter, 1973). In the context of ethnobiology, social network theory hypothesizes that a person's knowledge is influenced by their location in their social network, the connectivity of that network, and the knowledge of other people in that network . Social networks can be analysed by representing people as nodes connected by ties or links which represent sharing of knowledge or biological materials (Barnes et al., 2017;Kluger et al., 2020). Social network analysis has been applied in ethnobiology to test two major groups of predictions. The first is based on the hypothesis that connections between nodes in a network are associated with how much knowledge each person has and how evenly distributed knowledge is across the network (Janssen et al., 2006). The second is based on the hypothesis that network structural patterns are the result of sociocultural processes which affect the formation of knowledge-sharing relationships (Lusher, Koskinen & Robins, 2013).
(a) Multivariate regression using social network connectivity to predict individual knowledge and knowledge distribution of networks At the level of individuals, nodes with greater connectivity (e.g. degree centrality) are predicted to have more knowledge (e.g. the number of plant uses known) because well-connected nodes have more nodes they can learn from. Several nodelevel predictions have been supported in previous ethnobiological studies (Hopkins, 2011;Calvet-Mir et al., 2012;Reyes-García et al., 2013b). However, we are unaware of any existing ethnobiological studies that test these predictions at the network level. An example network-level prediction is that networks with greater connectivity (e.g. higher density, lower diameter) have less variation in knowledge (e.g. the standard error of the number of plant uses known by each person in the network) because connectivity increases the spread of information. One potential reason for the lack of networklevel tests is the effort required to collect a data set with many networks. While this is possible in a single study (e.g. Reyes-García et al., 2013b), meta-analysis may be a simpler option.
Connectivity metrics can be calculated at node level (e.g. degree, broker, and betweenness centrality) or network level (e.g. density, centralization, clustering, modularity, and diameter). Multivariate regression techniques (such as GLMM) are ideal statistical tests for these predictions, since they allow multiple network metrics to be included, as well as potentially confounding variables. Confounding variables at node level include age, gender, education, and ethnicity (Hopkins, 2011;Calvet-Mir et al., 2012;Reyes-García et al., 2013b;Díaz-Reviriego et al., 2016), while variables such as biomedical access, distance from metropolitan areas, and plant availability may confound network-level analyses (Vandebroek et al., 2004;C amara-Leret et al., 2014).
(b) ERGMs using social learning processes to predict network configuration This second group of predictions tests mechanistic explanations for how nodes become connected and why some networks have more connectivity than others. Although there are many mechanisms of social learning, examples include homophily (learning from similar people, such as those of the same age, same family, or same gender), prestige (learning from people that are perceived as successful or knowledgeable), and rarity (learning from people with unusual information) (Kendal et al., 2018). Predictions from causal mechanisms can involve three types of variables: configuration variables, node connectivity variables, and node pair variables. For each type of variable, sample predictions based on the mechanisms of homophily, prestige, and rarity are provided in Table 2. Configuration variables test whether or not the configuration is more or less common in the observed network than in a randomly generated network with the same number of nodes and edges. Node connectivity variables are predicted to correlate with the number of edges corresponding to each node. Node pair variables are predicted to correlate with whether there is a link between each possible pair of nodes in the network.
Exponential random graph models (ERGMs) are the primary methods of testing these predictions, because a single ERGM can include configuration variables, node connectivity variables, and node pair variables, which controls for the confounding effects of each variable type (Lusher et al., 2013). However, several studies have used alternative methods to test predictions using only node pair variables, with interesting results (Salali et al., 2016;de Brito et al., 2019;Santoro, Chaves & Albuquerque, 2020). Several predictions involving all three types of variables have been supported in previous ethnobiological studies (Henrich & Broesch, 2011;Labeyrie et al., 2015;Thomas & Caillon, 2016;Bond & Gaoue, 2020).
(c) Future of social network analysis A weakness of most existing ethnobiological social network analyses is that they fail to explore how networks and learning  . There was no evidence for a phylogenetic signal (NTI = 0.13, P = 0.41). Red dots represent species with high use value (UV). See Fig. 7 and Code R4 in Gaoue et al. (2021) for details of the R script used. For illustration purposes we assembled the phylogeny here using the online Phylomatic tool with the angiosperm phylogenetic tree of Zanne et al. (2014) as a backbone. This yields a phylogeny which is exactly the same as that shown in Fig. 5. Species names are omitted from this phylogeny to highlight the result of the phylogenetic signal test. processes evolve over time (but see Violon, Thomas & Garine, 2016;Santoro et al., 2020). Although diachronic methods such as temporal ERGM exist, they have yet to be applied in ethnobiological studies (Kluger et al., 2020). Another emerging technique is network analysis for linked social and ecological systems (Barnes et al., 2017;Kluger et al., 2020). Although bipartite socio-ecological networks have provided information about the resilience of plant-use networks (C amara- Leret, Fortuna & Bascompte, 2019;Castiñeira Latorre et al., 2020), these approaches have yet to be used to test predictions about integrated socio-ecological networks of ethnobiological knowledge of useful plants.
(d) Application: constructing a network, calculating connectivity metrics, and running an ERGM The first step in social network analysis is to establish predictions based on social learning mechanisms and the sociocultural context of the study (for examples see Table 2). The second step is to define criteria for who will be included in the network; this can be challenging because social network boundaries are not always clear (Laumann, Marsden & Prensky, 1991). The third is to collect data from at least 90% of the network on who they share information or biological material with (Kolaczyk & Krivitsky, 2015), and additional data required to test each prediction (e.g. knowledge, Table 2. Example predictions of how social learning processes affect the formation of knowledge-sharing relationships. Sample predictions are based on the mechanisms of homophily, prestige, and rarity. Node connectivity variables are predicted to correlate with the number of edges each node has. Node pair variables are predicted to correlate with whether there is a link between each possible pair of nodes in the network. In ERGM, similarity of continuous variables is indicated by a negative coefficient; positive coefficients indicate dissimilarity. Terms are included for exponential random graph modelling (ERGM) in the ergm R package Visual representation ERGM term Predicted sign of ERGM coefficient and explanation

Statistical advances in ethnobiology
demographics, family relationships). The fourth step is the creation of an adjacency matrix or edgelist csv file and a node-attribute csv file (Hoffman, 2020). These can be loaded into R using the sna (Butts, 2008b), network (Butts, 2008a), or igraph (Csardi & Nepusz, 2006) packages to create a network and add attributes to nodes in network. Finally, connectivity metrics are calculated using the sna (Butts, 2008b) or igraph (Csardi & Nepusz, 2006) packages. ERGMs are commonly  Table 2. Data used in this example are available as Code R5 in Gaoue et al. (2021).
fitted in the ergm R package, then evaluated using goodnessof-fit  and degeneracy . Like GLMM, SEM, and PGLS, ERGMs can be compared using the Akaike information criterion (AIC) (Crawley, 2013) for each model. To guide use of social network analysis and ERGM for ethnobiological data we provide data and seed code (see Fig. 9) as Code R5 in Gaoue et al. (2021). For more information, the reader is referred to online tutorials (e.g. Morris et al., 2019;Hoffman, 2020).

(6) Species distribution modelling (SDM)
Ethnobiological motivation -The resource availability hypothesis in ethnobiology predicts that plants are used medicinally because they are locally accessible or have a higher abundance (Voeks, 2004;Lucena, Lima Araújo & Albuquerque, 2007;Hart et al., 2017). One of the most accessible spatial tools that could be used to test this hypothesis is the species distribution model (SDM). While SDMs do not quantify abundance, they can be used to estimate a maximum likelihood surface of species occurrences using environmental data and presences or absences of an organism. For example, SDMs allow for correlation analysis to be performed using environmental variables and species occurrence data. In this context, these data can be used to identify areas of climatic suitability or unsuitability for a species currently or projected into the future under different climate change scenarios. Quantifications of environmental niches from SDMs have multiple uses in ecology and conservation, including predicting distances between patches of metapopulations, species assemblage modelling to estimate diversity and community composition (Guisan & Thuiller, 2005), and hypothesis testing (Anderson, Peterson & G omez-Laverde, 2002). Combined SDMs can be used not only to locate new populations or suitable habitats for potential relocations of culturally important plants, but also as a flexible and adaptable tool to test ethnobotanical hypotheses such as the resource availability hypothesis or ecological apparency hypothesis in novel ways (Dai et al., 2017;Bond et al., 2019). The temporal component of SDMs can be used to address drivers of change in biodiversity of ecological communities in an ethnobiological context. For example, a framework has been proposed to amalgamate species occurrences, environmental and geographic space, and time from heterogeneous data sources to infer essential biodiversity structure and function (Jetz et al., 2019). This conceptual framework could be used to support or refute hypotheses regarding the impacts of plant selection by human societies. In particular, the cultural keystone species theory suggests that there are essential plant species that, among other elements, are unique and irreplaceable to human communities (Garibaldi & Turner, 2004). It is challenging to discriminate between the relative values of different species to a community, but this framework can help simplify this process as well as reveal quantitative insights about the roles people play in ecological systems.
Optimal foraging theory, rooted in ecology, predicts there is a trade-off between plant value and effort spent foraging or processing related (time and distance) (Sih & Christensen, 2001). One of the greatest limitations to testing optimal foraging theory in ethnobiology is the lack of geographic diversity in existing spatial analyses . With an increase in remote-sensing data availability and quality, coupled with new spatial analysis methodologies, the potential for testing across different regional and cultural contexts is expanding. For example, when interviewing community members to estimate distances travelled to harvest plants, distance becomes a subjective measure that varies with plant, accessibility, abundance, and movement path. While distances travelled to harvest plants can be measured in the field, researchers can model travel data using Lévy distributions (Rhee et al., 2011) or custom-made movement models linking environmental variables to paths (McClintock et al., 2012). These models can then be tested for informative quantities such as straightness (McCarthy et al., 2010), first-passage time (Fauchald & Tveraa, 2003), or more complex fractal dimensions of movement times (Namazi, 2017). These quantitative techniques could be useful and versatile tools for testing optimal foraging theory in different spatiotemporal and cultural contexts.
It is also common for traditional ecological knowledge (Berkes, Colding & Folke, 2000) to encompass correlations between plant locations and plant use associated with environmental or abiotic variables. For example, an ethnobiological survey of indigenous communities in Nepal found an increase in species richness of medicinal plants gathered and used at higher altitudes (Kunwar & Bussmann, 2008). There also are ethnobiological hypotheses linked to plant phytochemistry, such as the diversification hypothesis which predicts that local communities introduce species to increase the diversity of medicines available to treat various ailments (Alencar et al., 2010). Incorporating a spatial component into such questions allows researchers to investigate them using SDM: for example, how does the medicinal quality of plant species vary across space? A potential tool that could be used to address this question is genomic spatial modelling, which quantifies population-level biodiversity across landscapes (Fitzpatrick & Keller, 2015). Entire genomes of plant species have been sequenced and analysed for potentially adaptive single nucleotide polymorphisms (SNPs) linked to environmental variables. For SNPs linked to plant phytochemistry, spatially or environmentally correlated medicinal plant quality as a function of genomic similarity can be projected and tested.
Conceptualizing and testing how, why, and which resource values of communities are distributed in geographic and environmental space can also be facilitated through adaptation of a suite of models by inVEST (Sharp et al., 2018). If medicinal plants are contextually designated as provisional ecosystem services, questions regarding the balance between population supply, human value, and human consumption can be addressed. InVEST could be particularly useful for accounting for the complexity of interaction dynamics and economics in ethnobiological systems, such as spatial variation of medicinal plant value as a function of culturally important sites or other communityspecific cultural metrics.

(7) Predictive modelling
Ethnobiological motivation -One of the greatest challenges across biological and cultural science is identifying generalizable principles of socio-ecological systems and understanding their predictive power (Linquist et al., 2016;Hofman, Sharma & Watts, 2017;Magliocca et al., 2018;Forey & Linquist, 2020). While these insights can be gathered through reviews or meta-analyses, they can also be generated using predictive models (Kuhn & Johnson, 2013). Predictive models predict future or unseen events by understanding the patterns in existing data. By contrast, descriptive models explain the nature and cause of patterns in existing data.
Although the techniques presented herein are mostly descriptive models (except for SDMs), any model can all be adapted to produce predictive insights. Key to predictive models is train/test splitting, which randomly splits data into two sets using predefined percentages (Kuhn & Johnson, 2013). The larger data set (the training set) is used to train the model, while the smaller data set (the testing set) evaluates how well the model generalizes to predict new data. To reduce possible bias in the testing data, the train/ test split is typically performed in multiple rounds so that each part of the data is used to test the model in at least one round. The evaluation metrics from each round then are combined (e.g. averaged); this robust evaluation method is called cross validation (Kuhn & Johnson, 2013).
Comparing model performance on the training and testing sets ensures that predictive models avoid two problems encountered by all statistical models: overfitting and the inclusion of non-causal variables (Lever, Krzywinski & Altman, 2016). Predictive models with evaluation statistics that are much better for training data than testing data indicate overfitting, i.e. a model that matches the training data so precisely that it fails to capture the overall data trend in a generalizable way (Quinn et al., 2002). Overfitted predictive models can be adjusted and re-run; the model with the best test data evaluation statistics representing the best (most generalizable) model. Predictive models can reduce overfitting in three ways: early stopping reduces the number of model iterations, regularization changes the number and effect size of predictor variables, and ensembling aggregates models with different levels of complexity to identify the overall data trend (Kuhn & Johnson, 2013).
The advantages of predictive modelling have led to breakthroughs in ecology, psychology, linguistics, molecular biology, and genetics (Jung, 2016;Weinstein et al., 2016Weinstein et al., , 2020Korotcov et al., 2017;Demirci et al., 2018;Jäger, 2018). However, we are unaware of any ethnobiological applications of predictive modelling, except for SDM. One possible reason for the lack of its widespread use in ethnobiology is that large sample sizes are required to ensure model accuracy and increase statistical power when using a train/test splita minimum of several hundred samples, but ideally several thousand samples (Varoquaux, 2018) are required for robust predictive models. A second potential reason may be that data scientists who are the primary users and developers of predictive modelling software packages primarily use Python instead of R, which tends to be used by academic scientists (Ozgur et al., 2017).
Application -Although predictive models are not currently used in ethnobiology (aside from SDM), they have many potential applications. The most common predictive models are regression models (both linear and non-linear) and decision tree regression (Kuhn & Johnson, 2013), for which train/test splits and cross validation are easily completed in R using the caret package (Kuhn, 2008). Predictive models can also be applied to data with phylogenetic (e.g. PGLS), grouping (e.g. GLMM), or spatial (e.g. SDM) structures, although the train/test split must be carefully planned to account for each dependence structure (Roberts et al., 2017). Predictive models can also be integrated with causal inference models (e.g. SEM) (Jacobucci, Grimm & McArdle, 2016;Kyriazos, 2018) or social network analysis (e.g. ERGM) (Shojaie, 2013;Pinter & Eisenstein, 2020). To illustrate the use of train/test splitting and cross validation for ethnobiological data we provide data and a seed code (see Fig. 10) as Code R6 in Gaoue et al. (2021). For more information, readers are referred to online tutorials (e.g. Dalpiaz, 2020).

IV. DISCUSSION
(1) Synthesis of new methodological approaches to hypothesis-driven ethnobiology Herein we present a set of advanced statistical modelling approaches and quantitative analysis tools that can be used in hypothesis-driven ethnobiology. Although theories and hypotheses in ethnobiology can be tested in various ways, some of these methods provide more detail or statistical power than others. For example, although GLMMs can control for some forms of non-independence of observations, they are not able to test the power of indirect or mediating effects to the extent that is possible with SEM. For example, mixed-effect models can test the hypothesis that local knowledge or use values increase with participant age and with years of residence in a region but are different for men and women. However, because age and residence time can be strongly correlated, the latter might need to be removed from the model; even if residence time is included, mixed-effect models will not be able to demonstrate a causal link between age and residence time. By contrast, using SEM provides causal evidence for how the effect of age on local knowledge or use values is mediated by residence time, and thus that age may not be relevant if it is unrelated to residence time and possibly to years of experience interacting with the local environment.
The use of SEM thus is powerful and should be embraced in ethnobiology to disentangle the effects of various drivers of knowledge but also to provide a mechanistic understanding of how these drivers influence plant knowledge and how that translates into the differential use of plants and ecological systems. Simple SEM using the lavaan library (Wahren, 1966) does not account for non-independence in observations that mixed-effect models effectively address. However more advanced piecewise SEM (Lefcheck, 2016) includes random effects and temporal or spatial correlation structures, all of which are likely to yield more informative results.
A unique advantage of SEM is the possibility to test and compare the strength of alternative pathways for the same global model. For example, the resource availability and ecological apparency hypotheses offer alternative explanations of why some plant lineages may produce a greater quality of medicinal products than others. The ecological apparency hypothesis proposes that more apparent species such as trees would produce a higher quality of medicinal products and consequently local people will disproportionately select woody species for medicinal purposes. The resource availability hypothesis also suggests that woody species are more likely to be selected but here the growth rate is the causal mechanism. SEM could be used to compare these pathways to identify which one has the greatest support.
The choice of which model to use requires balancing the study purpose and data availability. For example, although GLMMs can control for non-independent or hierarchical data (such as person and village or plant genus and family), they cannot test hypotheses about variables that are used as random effects. Testing hypotheses involving plant evolutionary relationships or human social relationships requires using community phylogenetic and social network analysis techniques, respectively. However, if genetic or social network data are not available, analysis will be limited to GLMMs.
SDM and other spatial analysis tools are rarely used in ethnobiology (but see Bond et al., 2019). However, SDM can play an important role in spatially quantifying resource knowledge and use, a step that is critical for priority planning and strategic resources management. In addition, to test the resource availability hypothesis or the ecological apparency hypothesis, estimating plant availability or apparency is a labour-intensive process using classical ecological surveying methods such as plot-based or transect-based density or abundance measures. The use of SDM can allow the rapid generation of estimates allowing for a more general test of such hypotheses (Dai et al., 2017).
(2) A call to embrace quantitative modelling without sacrificing qualitative approaches which establish context The definition of ethnobiology as the study of interactions between people, culture, and organisms has perhaps promoted a culture of simple documentation of these interactions. Several authors have critiqued this approach, emphasizing that although documenting interactions is valuable, understanding the processes and mechanisms of planthuman interactions will be essential to finding solutions for the Anthropocene crises (Bennett, 2005;Albuquerque & Hanazaki, 2009;Gaoue et al., 2017). We agree with this view and suggest that while qualitative approaches are an important part of ethnobiology, they must not be considered the primary focus of ethnobiological research. This evolution of ethnobiology is similar to the development of other disciplines, such as community ecology (Lawton, 1999;Turchin, 2001;Berryman, 2003). Early studies in ethnobiology focused on documenting patterns of plant/animalpeople interactions. Similarly, early studies in community ecology primarily focused on describing patterns of variation in vegetation to provide support for the organismic concept of communities (Clements, 1936) or the individualist continuum concept (Gleason, 1926). Current studies of community ecology instead focus on processes driving these patterns and understanding what factors promote species coexistence and therefore biological diversity (Vellend, 2010).
Ethnobiology must follow a similar progression with an increasing focus on studying not just the patterns but also the processes underpinning plant/animal-people interactions. This can be achieved by developing clear hypotheses; defining such theoretical arguments is vital to avoiding similar pitfalls to that of community ecology which resulted in Lawton (1999) suggesting that the discipline lacked a unifying conceptual framework. Currently, most ethnobotanical studies are collections of case-specific results without guiding principles. As a result, they fail to provide appropriate visualization in figures, focusing instead on text and tables. Embracing theory-inspired and hypothesis-driven approaches will advance ethnobiology, increasing its status and recognition among other disciplines (McClatchey et al., 1999;Gaoue et al., 2017;Gonçalves-Souza et al., 2019). Testing hypotheses using advanced statistical modelling techniques will also lead to an increasing use of figures to illustrate the results. Ethnobiology must embrace these quantitative tools without sacrificing the use of qualitative methods of inquiry which provide an invaluable context to quantitative results. Consistent with the Nagoya Protocol (Talaat, 2013;Teran, 2016), it is important to reiterate calls from other authors (Bussmann, 2019;Golan et al., 2019) for ethnobiologists to follow ethical protocols, obtain institutional review board (IRB) approval and undergo appropriate training prior to data collection. Under these principles, ethnobiological data collection must ensure that prior consent is obtained before interviews and should guarantee that the locations of culturally sensitive knowledge and culturally important species are protected from public access and that participant names are anonymized to ensure confidentially and the protection of indigenous rights during ethnobiological data collection (Vandebroek, 2017;McCune, 2018;Medinaceli, 2018).
(3) Training in quantitative skills for ethnobiologists Most statistical modelling techniques presented here are rarely included in biometry or biostatistics graduate courses taught in natural science departments (Touchon & McCoy, 2016). In addition, many ethnobiology programs have closed, suggesting a long-term decline of researchers in this field and fewer new graduate students. It thus may be challenging for students to gain access to necessary statistical courses and we advocate a creative approach to bringing this knowledge into the mainstream of ethnobiology.
Students can search for these courses in specialized departments at their universities, and training workshops focused on teaching these skills should be encouraged by ethnobiological professional societies. As a result of Covid-19, many workshops and courses are now available online, increasing access to all (Apostol, 2020;Sikora et al., 2020;Vandebroek et al., 2020) and there are many free tutorials and code snippets available online for R and Python. Learning these techniques may require creativity but their implementation has the potential to change the course of ethnobiology.
The role of mentoring and advising will be critically important to the future of ethnobiological research, both to ensure relevance and the use of approaches grounded in tradition. We urge advisors to encourage and support their students to seek and participate in courses and workshops that help build their quantitative skills. This will also benefit students in terms of their employability (Beans, 2018).

(4) Challenges for implementing hypothesis-driven methods in ethnobiology
In this review, we have presented examples of how methodological advances can be applied to test hypothesis-driven predictions in ethnobiology. Moving the ethnobiology field forward from simply describing plants, animals and their uses to a more in-depth hypothesis-testing discipline requires a paradigm shift amongst ethnobiologists to understand better how people interact with their environment. Testing these theories comes with additional challenges, particularly with data availability (Kindsvater et al., 2018) and the decline in taxonomic expertise (Terlizzi et al., 2003;Costello, May & Stork, 2013). Ethnobiological data were historically scattered across different repositories, research institutions, publications, reports, and personal research databases, making wider access challenging. Recent progress in the creation of centralized online databases such as Ewé (Souza & Hawkins, 2020) will facilitate the use of large-scale data sets and hypothesis-driven meta-analyses. Additionally, new avenues should be explored to collect data, including emerging technologies such as machine learning, remote sensing, and citizen science.
Most ethnobiological studies include large lists of plant or animal species. Accurate taxonomic identification will be directly affected by the decline in numbers of trained taxonomists. Clearly, inaccurate taxonomy will lead to erroneous conclusions even with the best statistical model. The declining numbers of trained taxonomists is primarily a result of the lack of priority given to the study of taxonomy and systematics in many university curricula, due in part to insufficient funding, and the increasing limitations of national and international legislation, which hinders taxonomic research explorations (Gaston & May, 1992). We highlight the importance of mentoring and training of taxonomists, especially from less developed countries, and of infrastructural development in taxonomy in developed countries with large collection of specimens (e.g. Royal Botanic Gardens, Kew, Natural History Museum, Missouri Botanical Garden, etc.) that still have practicing taxonomists.

V. CONCLUSIONS
(1) Ethnobiology is the study of people's interactions with the organisms in their environment. This includes both plants (ethnobotany) and animals (ethnozoology) but also local people's understanding of the ecological patterns and processes in their environment (ethnoecology) and how they develop traditional management strategies (Berkes et al., 2000;Gadgil et al., 2001) involving collective actions that arise from selforganized patterns (Ostrom, 2000) or in response to independent outside forces (Etkin, 1988(Etkin, , 2002. (2) The discipline has long been challenged to develop clear lines of investigation and theoretical foundations (Albuquerque & Hanazaki, 2009). This will be necessary for the advancement of this discipline and for the creation of educational and professional opportunities which are essential to its survival . Ethnobiology has evolved from simple descriptions of plant/animal use and knowledge to investigating how and why local people select the plants that they use. Recent developments include suggestions for theoretical investigations Albuquerque et al., 2019aAlbuquerque et al., , 2019b and empirical testing (Lucena et al., 2012;Nascimento et al., 2015;Ford & Gaoue, 2017;Hart et al., 2017;Seyler et al., 2019;Bond & Gaoue, 2020;Coe & Gaoue, 2020;Robles Arias et al., 2020;Muleba et al., 2021). (3) In this review, we synthesize the advanced statistical methods with which ethnobiologists can test theories and hypotheses, and propose that ethnobiologists embrace these advanced methods which are currently rarely used. Structural equation modelling, social network analysis and phylogenetics methods all provide opportunities to investigate new research questions. This will represent a step towards reconciling ethnobiology with cultural evolution, mathematical sociology, and community ecology.

ACKNOWLEDGEMENTS, AUTHOR CONTRIBUTIONS AND DATA AND CODE AVAILABILITY
Alison Cooper who provided critical comments on an earlier version.
Author contributions: O.G.G. conceived the study, its outline and led the writing, all other authors contributed writing of various sections. All authors edited and approved the final version.
Data and code availability: Extensive data sets and R scripts (Codes R1-6 and associated data sets) are available in Gaoue et al. (2021) [https://doi.org/10.5061/dryad. w6m905qph and https://doi.org/10.5281/zenodo.4726543] to implement the analyses discussed herein with detailed comments to guide users in running these codes and interpreting the results.