Published October 8, 2025 | Version A.20251008
Dataset
Open
Arca Verborum: A Global Lexical Database for Computational Historical Linguistics
Description
Arca Verborum is a multi-source lexical database for computational historical linguistics. Series A provides analysis-ready comparative wordlist data from 149 Lexibank datasets (List et al., 2022), containing over 2.9 million lexical forms across 9,700+ languages.
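The normalized structure of CLDF (Cross-Linguistic Data Formats) is excellent for data integrity, but it requires significant preprocessing before analysis, because forms, languages, and concepts are stored in separate tables that must be joined first. Series A provides denormalized, pre-joined CSV files for immediate use in research and education.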
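To illustrate what "denormalized, pre-joined" means in practice, the following is a minimal sketch of the kind of join a user would otherwise perform on a CLDF wordlist. The table contents are toy rows, the column names follow common CLDF Wordlist conventions, and the output column names (`Form_ID`, `Language_Name`, `Concept`) are hypothetical; the actual column layout of the Arca Verborum files is documented in DATASET_DESCRIPTION.md and may differ.

```python
import csv
import io

# Toy stand-ins for three CLDF tables. Column names follow common CLDF
# Wordlist conventions; the rows are illustrative, not taken from the dataset.
FORMS = """ID,Language_ID,Parameter_ID,Form
f1,lang1,hand,manus
f2,lang2,hand,main
f3,lang1,water,aqua
"""
LANGUAGES = """ID,Name,Glottocode
lang1,Latin,lati1261
lang2,French,stan1290
"""
PARAMETERS = """ID,Name,Concepticon_Gloss
hand,hand,HAND
water,water,WATER
"""


def read_table(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def denormalize(forms, languages, parameters):
    """Attach language and concept metadata to every form row."""
    lang_by_id = {row["ID"]: row for row in languages}
    param_by_id = {row["ID"]: row for row in parameters}
    joined = []
    for form in forms:
        # Missing foreign keys yield empty metadata instead of an error.
        lang = lang_by_id.get(form["Language_ID"], {})
        param = param_by_id.get(form["Parameter_ID"], {})
        joined.append({
            "Form_ID": form["ID"],  # hypothetical output column names
            "Form": form["Form"],
            "Language_Name": lang.get("Name", ""),
            "Glottocode": lang.get("Glottocode", ""),
            "Concept": param.get("Name", ""),
            "Concepticon_Gloss": param.get("Concepticon_Gloss", ""),
        })
    return joined


rows = denormalize(
    read_table(FORMS), read_table(LANGUAGES), read_table(PARAMETERS)
)
for row in rows:
    print(row)
```

With pre-joined files, each row already carries its language and concept metadata, so this step can be skipped.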
This release includes three collections: Full (all 149 datasets), Core (13 curated datasets for teaching), and CoreCog (58 datasets with expert cognate judgments).
**Citation:** If you use this dataset, you must cite both Arca Verborum and the Lexibank project (List et al., 2022, DOI: 10.1038/s41597-022-01432-0). See DATASET_DESCRIPTION.md in the archive for complete citation information.
Key features:
- Denormalized CSV files with pre-joined metadata
- 95% Glottolog coverage, 85% Concepticon coverage
- Aggregated cognate judgments from 81 datasets
- Comprehensive bibliographic references
- Quality validation reports
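As a rough sketch of how coverage figures like those above could be checked, the snippet below computes the share of rows with a non-empty value in a given column. This assumes coverage is counted per row and that the relevant columns are named `Glottocode` and `Concepticon_Gloss`; both are assumptions, and the release may define coverage differently (for example, per language or per concept). The sample rows are invented.

```python
def coverage(rows, column):
    """Return the share of rows with a non-empty value in `column`."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if (row.get(column) or "").strip())
    return filled / len(rows)


# Invented sample rows; column names are assumptions, not the documented schema.
sample = [
    {"Form": "manus", "Glottocode": "lati1261", "Concepticon_Gloss": "HAND"},
    {"Form": "main", "Glottocode": "stan1290", "Concepticon_Gloss": "HAND"},
    {"Form": "xyz", "Glottocode": "", "Concepticon_Gloss": ""},
    {"Form": "aqua", "Glottocode": "lati1261", "Concepticon_Gloss": ""},
]

print(f"Glottolog coverage: {coverage(sample, 'Glottocode'):.0%}")
print(f"Concepticon coverage: {coverage(sample, 'Concepticon_Gloss'):.0%}")
```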
See DATASET_DESCRIPTION.md in the archive for complete documentation.
Files
arcaverborum.A.core.20251008.zip
Additional details
Related works
- Is derived from: Journal article, DOI 10.1038/s41597-022-01432-0