This page consists of two blocks. First, we deal with real data only and calculate the actual numbers of sharings between our East-Caucasian wordlists and two Azerbaijani varieties (A-Azerbaijani and D-Azerbaijani). Then, we apply a permutation test, described in the corresponding section: we create a new sample of wordlists based on our data, but with pan-Daghestanian probabilities of attesting particular lexemes in the lists, and test our real data against this sample.

1. Real Data

Boxplots (Fig. 2 in the paper)

Each point represents an elicitation and is labelled with the name of the village where the elicitation was obtained. On the X-axis, elicitations are grouped by the native language of the consultant, with a boxplot and the median shown for each group. By and large, the amount of lexical borrowing correlates with the language: the Tsakhurs show the highest figures and the Tabasarans the lowest, with the Lezgians and Rutuls sitting in between. Within a language, the amount of loanwords seems to depend on the village. Thus, among the Lezgians, Gdym and Fiy speakers are located far above the median, while the Rutuls of Kiche are far below it. This correlates with the bilingualism rates shown in Table 1 above: the villagers of Gdym and Fiy, living on the border with Azerbaijan, show an almost universal knowledge of Azerbaijani, while the knowledge of Azerbaijani among the villagers of Kiche was visibly lower than in other Rutul villages.
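The grouping behind such a plot can be sketched as follows. This is a minimal Python illustration, not the R code used for the paper: the village labels and loan counts are invented placeholders, and `median_by_language` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: (village label, native language, number of loans).
# All labels and counts are invented for illustration only.
elicitations = [
    ("village_1", "Tsakhur", 40),
    ("village_2", "Tsakhur", 50),
    ("village_3", "Tsakhur", 45),
    ("village_4", "Tabasaran", 10),
    ("village_5", "Tabasaran", 20),
]

def median_by_language(records):
    """Group loan counts by native language and return the median per group."""
    groups = defaultdict(list)
    for _village, language, n_loans in records:
        groups[language].append(n_loans)
    return {language: median(counts) for language, counts in groups.items()}

medians = median_by_language(elicitations)
print(medians)
```

Individual elicitations can then be compared with the median of their own language group, which is how the outlying villages are read off the plot.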

Note, however, that the level of bilingualism and mere distance to the Azerbaijani border are probably not reliable predictors per se. In Arkhit, the number of loans is similar to that in Khlyut, even though Arkhit is located in the Khiv district, the only one in our sample that does not immediately border an Azerbaijani-speaking area, while the Rutul district, where Khlyut is located, borders Azerbaijan. Similarly, in Dyubek, with vicinal bilingualism in Azerbaijani, the number of borrowings is just as low as in Tabasaran villages where the knowledge of Azerbaijani was much lower, e.g. Khiv and Laka (see Table 2). Daniel et al. (submitted) argue that a language is a much more likely lexical donor if it is used as a lingua franca. In the villages in our survey, Azerbaijani was indeed used as a lingua franca in the West (e.g. between Tsakhurs and Rutuls) but not in the East (e.g. between Tabasarans and Lezgians).

Clusterings

In this part, we explore different definitions of a quantitative proxy for the intensity of contact and check whether the choice of definition affects the results. Since all clusterings provide essentially the same results, only the first one is presented in the paper.
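This page does not name the clustering algorithm behind the plots. Purely as an illustration of how elicitations described by two proportions can be split into two groups, here is a plain two-cluster k-means in Python. The choice of k-means, the function `two_means`, and the data points are all assumptions made for this sketch, not the procedure used in the paper.

```python
def two_means(points, n_iter=20):
    """Plain k-means with k = 2 on 2-D points.

    Initial centres are the lexicographically smallest and largest points
    (an arbitrary, deterministic choice made for this sketch).
    """
    centers = [min(points), max(points)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign each point to the nearest centre (squared Euclidean distance).
        labels = [
            min((0, 1), key=lambda k: (p[0] - centers[k][0]) ** 2
                                      + (p[1] - centers[k][1]) ** 2)
            for p in points
        ]
        # Move each centre to the mean of its current members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels, centers

# Invented (A-loan %, D-loan %) pairs, one per elicitation.
points = [(9, 2), (10, 3), (11, 2), (3, 7), (2, 8), (4, 6)]
labels, centers = two_means(points)
print(labels)
```

With such clearly separated toy points the two groups are recovered regardless of the proxy used; real data would need a check of how stable the assignment is.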

Clustering by Proportion of A / D loans out of the total list size (Fig. 3 in the paper)

In the plot below, we use the percentage of A- / D- loans out of the total list size, i.e. the number of lexical items successfully collected for a given speaker.
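A minimal Python sketch of this measure, assuming each speaker is summarised by three counts. The function name `loan_shares` and the numbers are hypothetical and only illustrate the arithmetic.

```python
def loan_shares(n_a, n_d, list_size):
    """Return A- and D-loan shares of the collected list, in percent."""
    if list_size <= 0:
        raise ValueError("list_size must be positive")
    return 100 * n_a / list_size, 100 * n_d / list_size

# Invented counts: 12 A-loans and 6 D-loans in a list of 150 collected items.
a_pct, d_pct = loan_shares(n_a=12, n_d=6, list_size=150)
print(a_pct, d_pct)
```

Dividing by the number of items actually collected, rather than by the length of the questionnaire, keeps speakers with incomplete lists comparable.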

Clustering by Proportion of A / D loans out of total Azeri loans

In the following plot, the number of A- / D- loanwords is divided by the total number of Turkic loans for a given speaker. This way of counting may adjust the resulting numbers to the intensity of Turkic influence on each speaker as well as to the differences between speakers. Note that the results are essentially the same, with only a few dots misattributed, so we stick to the first clustering.
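The effect of this normalization can be illustrated with a small Python sketch. The helper `share_of_turkic` and the counts are invented; returning `None` for a speaker with no Turkic loans is an assumption of this sketch, not something stated in the paper.

```python
def share_of_turkic(n_a, n_d, n_turkic):
    """Return A- and D-loan shares of all Turkic loans, or None if there are none."""
    if n_turkic == 0:
        return None
    return n_a / n_turkic, n_d / n_turkic

# Two invented speakers with different overall amounts of Turkic loans
# but the same internal A / D profile.
speaker_x = share_of_turkic(n_a=10, n_d=5, n_turkic=50)
speaker_y = share_of_turkic(n_a=20, n_d=10, n_turkic=100)
print(speaker_x, speaker_y)
```

Both speakers end up at the same point, although the second has twice as many loans of each type: the measure reflects the mix of sources rather than the overall intensity of borrowing.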

The same clustering, Standard Azerbaijani words removed

The clustering below uses only those A- / D- loanwords that are absent from Standard Azerbaijani. The counts are considerably lower, mainly because (a) Standard Azerbaijani is pan-dialectal and (b) Azerbaijani dialects are fairly similar lexically. The clustering works even better than the previous two (not a single misclustered data point), but the proportions are extremely low, so we prefer to use the first clustering in the paper.
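The filtering step can be sketched in Python as follows. The word identifiers are placeholders rather than real lexemes, and `dialect_only` is a hypothetical helper; the sketch only shows why the resulting proportions become very small.

```python
def dialect_only(loans, standard_lexicon):
    """Keep only loanwords that are not attested in the standard lexicon."""
    return [word for word in loans if word not in standard_lexicon]

# Placeholder identifiers standing in for one speaker's A-loans.
a_loans = ["w01", "w02", "w03", "w04", "w05"]
# Placeholder set of items also found in the standard language.
standard = {"w01", "w02", "w04", "w05"}
list_size = 200  # invented number of collected items

kept = dialect_only(a_loans, standard)
share = 100 * len(kept) / list_size
print(kept, share)
```

Because most dialectal loans also occur in the standard language, only a small remainder survives the filter, which is why the percentages drop so sharply.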

2. Permutation Tests

The plot below presents three separately run permutation tests.

Permutation Tests (Fig. 4 in the paper)

Content-wise, the simulation in the upper row confirms that the amount of sharings with A-Azerbaijani is significantly higher than random in the West and significantly lower than random in the East. Conversely, the amount of D-Azerbaijani loans is significantly lower in the West and significantly higher in the East. This is expected on the basis of geography and is also suggested by the clustering in Fig. 3.
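The general logic of such a test can be sketched in Python. This is a simplified Monte Carlo illustration under stated assumptions, not the procedure used for the paper: each wordlist is reduced to a mapping from concept to a flag (is the attested lexeme shared with A-Azerbaijani?), the sample-wide probabilities are plain relative frequencies, and the toy data are deliberately extreme. All function names are hypothetical.

```python
import random

def pooled_probabilities(lists):
    """Share of lists in which each concept carries an A-type lexeme."""
    concepts = {c for lst in lists for c in lst}
    probs = {}
    for c in concepts:
        flags = [lst[c] for lst in lists if c in lst]
        probs[c] = sum(flags) / len(flags)
    return probs

def simulate_totals(group_lists, probs, n_sim=2000, seed=1):
    """Redraw every list of a group from the pooled probabilities and
    return the simulated total number of A-sharings per run."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_sim):
        total = 0
        for lst in group_lists:
            for c in lst:
                total += int(rng.random() < probs[c])
        totals.append(total)
    return totals

def upper_p_value(observed, simulated):
    """One-sided p-value: how often the simulation reaches the observed total."""
    hits = sum(s >= observed for s in simulated)
    return (hits + 1) / (len(simulated) + 1)

# Invented toy data: four "West" lists where every concept is A-shared,
# four "East" lists where none is.
west = [{f"c{i}": True for i in range(10)} for _ in range(4)]
east = [{f"c{i}": False for i in range(10)} for _ in range(4)]

probs = pooled_probabilities(west + east)
observed = sum(sum(lst.values()) for lst in west)
simulated = simulate_totals(west, probs)
p_value = upper_p_value(observed, simulated)
print(observed, p_value)
```

Subtracting the observed total from each simulated total gives a distribution of differences; the further it lies from zero, the stronger the deviation from what the pooled probabilities would predict.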

The simulation in the second row confirms that A-Azerbaijani is much more present in Tsakhur elicitations than in other elicitations in the West, which corresponds to the horizontal spread of the yellow cluster (the West) towards the Tsakhur elicitations on the right. As to the presence of D-Azerbaijani, there is no statistically significant difference, but there may be a trend, corresponding to the slight cline of the cluster from the left (Lezgians) to the right (Tsakhurs).

Finally, the bottom row shows the results of simulations run within the much more homogeneous blue cluster, the East. We see that Lezgian villages show a presence of A-Azerbaijani that is higher than random at a marginally significant level, but we cannot say that the presence of A-Azerbaijani in Tabasaran villages is significantly lower than expected at random. Conversely, the presence of D-Azerbaijani in the Tabasaran villages is significantly higher than random, while its presence in Lezgian villages in the same area is not significantly lower than random. Note also that the deviation from random expectations (i.e. how far the distribution is displaced along the X-axis relative to x = 0) is visibly smaller than in the other two rows, even when this deviation is statistically significant. We can see this in Fig. 3, where the East is much more compact and less internally structured than the West.
