Journal article Open Access

Automated identification of borrowings in multilingual wordlists

List, Johann-Mattis; Forkel, Robert

Although lexical borrowing is an important aspect of language evolution, there have been few attempts to automate the identification of borrowings in lexical datasets. Moreover, none of the solutions proposed so far identifies borrowings across multiple languages. This study proposes a new method for the task and tests it on a newly compiled large comparative dataset of 48 South-East Asian languages from Southern China. The method yields very promising results and is conceptually straightforward and easy to apply, which makes the approach well suited to computer-assisted exploratory studies on lexical borrowing in contact areas.
