Published October 8, 2025 | Version A.20251008
Dataset Open

Arca Verborum: A Global Lexical Database for Computational Historical Linguistics

  • 1. Department of Linguistics and Philology, Uppsala University

Description

Arca Verborum is a multi-source lexical database for computational historical linguistics. Series A provides analysis-ready comparative wordlist data from 149 Lexibank datasets (List et al., 2022), containing over 2.9 million lexical forms across 9,700+ languages. While CLDF's normalized structure is excellent for data integrity, it requires significant preprocessing before analysis. Series A provides denormalized, pre-joined CSV files for immediate use in research and education.

This release includes three collections: Full (all 149 datasets), Core (13 curated datasets for teaching), and CoreCog (58 datasets with expert cognate judgments).

**Citation:** If you use this dataset, you must cite both Arca Verborum and the Lexibank project (List et al., 2022, DOI: 10.1038/s41597-022-01432-0). See DATASET_DESCRIPTION.md in the archive for complete citation information.

Key features:

- Denormalized CSV files with pre-joined metadata
- 95% Glottolog coverage, 85% Concepticon coverage
- Aggregated cognate judgments from 81 datasets
- Comprehensive bibliographic references
- Quality validation reports

See DATASET_DESCRIPTION.md in the archive for complete documentation.
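To make concrete what "denormalized, pre-joined" saves the user, the following is a minimal sketch of the kind of join a normalized CLDF wordlist typically requires before analysis. It is an illustration only, not the pipeline used to build this release. The column names (`ID`, `Language_ID`, `Parameter_ID`, `Form`, `Name`, `Glottocode`, `Concepticon_ID`) are assumed from common CLDF Wordlist conventions, and the output column names and the helper functions `read_rows` and `denormalize` are hypothetical; the actual layout of the Arca Verborum CSV files is described in DATASET_DESCRIPTION.md and may differ.

```python
import csv
from pathlib import Path


def read_rows(path):
    """Read a CSV file into a list of dicts keyed by the header row."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def denormalize(forms, languages, parameters):
    """Attach language and concept metadata to every form row.

    Assumption: forms carry Language_ID and Parameter_ID columns that
    point at the ID column of the language and parameter tables, as is
    conventional in CLDF wordlists.
    """
    # Index the lookup tables once so each form is resolved in O(1).
    languages_by_id = {row["ID"]: row for row in languages}
    parameters_by_id = {row["ID"]: row for row in parameters}

    joined = []
    for form in forms:
        # Missing references yield empty metadata rather than an error.
        language = languages_by_id.get(form["Language_ID"], {})
        parameter = parameters_by_id.get(form["Parameter_ID"], {})
        joined.append({
            "Form_ID": form["ID"],
            "Form": form["Form"],
            "Language_Name": language.get("Name", ""),
            "Glottocode": language.get("Glottocode", ""),
            "Concept_Name": parameter.get("Name", ""),
            "Concepticon_ID": parameter.get("Concepticon_ID", ""),
        })
    return joined
```

With a normalized dataset one would call `denormalize(read_rows(...), read_rows(...), read_rows(...))` on its form, language, and parameter tables. With pre-joined files, this step should not be needed.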

Files

arcaverborum.A.core.20251008.zip

Files (106.9 MB)

Checksum Size
md5:3160850555007d0499e84c693b5f9afe 7.0 MB
md5:e05d8ed1f4db262e26a751caa4e98ef1 11.5 MB
md5:70e9f048380d3484f377594f03e9b530 88.4 MB
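A downloaded archive can be checked against the MD5 checksums listed above. The sketch below uses only the Python standard library. Because the listing as reproduced here does not pair checksums with file names, it only tests whether a file's digest is one of the three published values; the function names `md5_of_file` and `matches_published_checksum` are hypothetical.

```python
import hashlib

# MD5 checksums copied from the file listing of this record.
# The listing does not say which archive each value belongs to,
# so this sketch checks membership in the set only.
PUBLISHED_MD5 = {
    "3160850555007d0499e84c693b5f9afe",
    "e05d8ed1f4db262e26a751caa4e98ef1",
    "70e9f048380d3484f377594f03e9b530",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path):
    """True if the file's MD5 digest is one of the published values."""
    return md5_of_file(path) in PUBLISHED_MD5
```

For example, `matches_published_checksum("arcaverborum.A.core.20251008.zip")` should return `True` for an intact download of that archive. A `False` result suggests a corrupted or incomplete file.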

Additional details

Related works

Is derived from
Journal article: 10.1038/s41597-022-01432-0 (DOI)