Statistical bias control in typology


In this paper, we propose two new statistical controls for genealogical and areal bias in typological samples. Taking as our test case the effect of verb-object (VO) order on affix position (prefixation vs. suffixation), we show how statistical modeling including a phylogenetic regression term (phylogenetic control) and a two-dimensional Gaussian Process (areal control) can be used to capture genealogical and areal effects in a large but unbalanced sample. We find that, once these biases are controlled for, VO-order has no effect on affix position. Another important finding, which is in line with previous studies, is that areal effects are as important as genealogical effects, which underlines the importance of areal or contact control in typological studies built on language samples. We also show that strict probability sampling is not required with the statistical controls that we propose, as long as the sample is a variety sample large enough to cover different areas and families. The crucial practical consequence is that we can include as much of the available information as possible, without artificially restricting the sample.

Introduction
Controlling for bias in sampling has mostly focused on ways of including languages from as many genealogical groupings as possible, while ensuring that they are as unrelated as possible. Different methods have been proposed in e.g. Bickel (2008), Dahl (2008), Dryer (1989, 2011), Jaeger et al. (2011), Maslova (2008), Perkins (1989), and Rijkhoff and Bakker (1998). Controlling for areal effects often relies on similar techniques: Choosing a sample of languages which are assumed to have as little contact with each other as possible. Most studies also try to balance the number of languages selected from each macroarea in some way (cf. Jaeger et al. 2011). These methods all face the same issue: The researcher can only include a portion of her data. We present an alternative approach that, using relatively recent statistical methods, can control for the types of biases mentioned without the need to exclude data.

Materials
For illustration, we focus on the relation between verb-object orders and the preference for affix position in a language. It has been argued that, while VO orders can occur with both prefixes and suffixes, OV orders show a strong preference against prefixation (e.g. Bybee, Pagliuca, and Perkins 1990; Hawkins and Gilligan 1988; Siewierska and Bakker 1996). We use the datasets from WALS chapters 26 and 83 (Dryer 2013a,b).

Method
We fitted a series of Bayesian ordinal models with affix position as the dependent variable (7 levels: strongly suffixing to strongly prefixing) and with verb-object order as the predictor (3 levels: OV, no dominant order, and VO). All models were fitted with Stan (Carpenter et al. 2017) using the brms package (Bürkner 2018) in R. To control for family biases, we included a phylogenetic term (Housworth, Martins, and Lynch 2004) in our regression model. Unlike simple group-level effects, phylogenetic regression can take into account a complete phylogenetic tree, resulting in a gradient representation of genetic relations that includes all (known) genetic relations between languages in the sample. Assuming that closely related languages are generally more likely to share a given linguistic feature than more distantly related languages, the model estimates the effects to be more similar for languages that are closer in the phylogenetic tree, and less similar for languages that are further apart. For instance, Spanish, French, and Farsi are modeled as related, but with a much closer genetic relation between Spanish and French than between those two languages and Farsi. To control for areal bias, we include latitude and longitude information in our model using a two-dimensional Gaussian Process for each macro-area. This allows us to capture areal effects in a non-linear way across geographical areas, addressing the following issue pointed out by Cysouw, Dediu, and Moran (2012), Dryer (2018), Jaeger et al. (2011), and Rijkhoff, Bakker, et al. (1993): Two languages spoken 100 km apart in an area like Siberia may still share properties due to contact, while languages spoken 100 km apart in New Guinea are less likely to have been in contact. Thus, our model can capture that distances between languages (a proxy for the amount of contact between them) do not have a uniform effect across the globe.
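As a rough illustration of the two controls described above, the following Python sketch (not the brms/Stan code used for the models) builds a phylogenetic covariance matrix from a toy tree and evaluates a distance-based correlation under two different length-scales. The tree, its branch lengths, the length-scale values, and all function names are hypothetical. The use of great-circle distance and of a squared-exponential kernel is likewise an assumption made for illustration only.

```python
import numpy as np

# Hypothetical toy tree with made-up branch lengths.
# Each language maps to its root-to-tip path as (branch_name, length) pairs.
PATHS = {
    "Spanish": [("root_to_ie", 0.2), ("ie_to_romance", 0.5), ("romance_to_spanish", 0.3)],
    "French": [("root_to_ie", 0.2), ("ie_to_romance", 0.5), ("romance_to_french", 0.3)],
    "Farsi": [("root_to_ie", 0.2), ("ie_to_farsi", 0.8)],
}


def phylo_cov(paths):
    """Covariance between two tips = total length of the branches they share."""
    names = list(paths)
    cov = np.zeros((len(names), len(names)))
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            shared = set(paths[a]) & set(paths[b])
            cov[i, j] = sum(length for _, length in shared)
    return names, cov


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    h = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(h))


def sq_exp_corr(dist_km, lscale_km):
    """Squared-exponential correlation at a given distance and length-scale."""
    return np.exp(-dist_km ** 2 / (2 * lscale_km ** 2))


if __name__ == "__main__":
    names, cov = phylo_cov(PATHS)
    print(names)
    print(cov)
    # Same 100 km distance, two hypothetical per-macro-area length-scales.
    for label, lscale in [("long length-scale", 500.0), ("short length-scale", 50.0)]:
        print(label, sq_exp_corr(100.0, lscale))
```

Under these made-up values, Spanish and French receive a higher covariance with each other than either does with Farsi, and the same 100 km separation implies a much higher correlation under the longer length-scale than under the shorter one. Estimating a separate Gaussian Process for each macro-area is what lets this kind of difference be learned from the data.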

Results
We compared a model including the controls described above to a hierarchical model (group-level effects for family and macroarea) and to a no-controls model. Our model performed much better than both the hierarchical model and the no-controls model in predicting affix position. With regard to the association between verb-object order and affix position, our model indicated only a very mild effect of verb-object order on the preferred affix position of a language, with most of the variance being accounted for by family and areal effects. In contrast, the group-level effect model and the model without controls strongly overestimated the effect of verb-object order on affixation preferences.

Conclusion
Our paper has two main points concerning sampling in typology: Firstly, we show how statistical bias control can offer an alternative to restricting a language sample and excluding otherwise available information. Secondly, our results show that areal bias, even if modeled in a very crude way, is at least as important as genetic bias and should be controlled for in any quantitative typological study exploring crosslinguistic distributions.