Parallel text typology dataset

Östling, Robert; Kurfalı, Murathan

doi:10.5281/zenodo.7506220

Published January 5, 2023 | Version 1.0.0

Dataset | Open

Affiliation: Department of Linguistics, Stockholm University

This repository contains data accompanying the following paper:

Neural models can sometimes discover typological generalizations. Computational Linguistics (2023) 49 (4): 1003–1051. https://doi.org/10.1162/coli_a_00491

It contains the following information for 1295 different languages:

- language vector representations from a range of neural models
- automatically derived lists of affixes
- automatically derived lists of inflectional paradigms
- typological features derived from annotation projection, and statistics on dependency relations
- typological features derived from classifiers trained on language vectors and typological databases
- automatically derived word lists
- data needed for automatic evaluation of language representations (code in separate repository)

Note that the multilingual word embeddings described in the paper are very large and are therefore distributed in a separate public repository.

Notes

The computations were enabled by resources provided by the Swedish National Infrastructure for Computing (SNIC) at C3SE, partially funded by the Swedish Research Council through grant agreement no. 2018-05973. This work was funded in part by the Swedish Research Council through grant agreement no. 2019-04129.

Files

Name	MD5	Size
parallel-text-typology.tar.gz	38e65961aec6b0213b38cb2f989045e6	58.1 MB
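
The MD5 checksum listed for parallel-text-typology.tar.gz can be used to confirm that a download is intact before unpacking it. The following is a minimal sketch using only the Python standard library. It assumes the archive has already been downloaded to a local path; the function names (md5_of_file, verify_and_list) are hypothetical helpers, not part of the dataset or its accompanying code, and nothing is assumed about the archive's internal layout.

```python
import hashlib
import tarfile

# Checksum as published in the file listing for version 1.0.0.
EXPECTED_MD5 = "38e65961aec6b0213b38cb2f989045e6"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_list(archive_path, expected_md5=EXPECTED_MD5):
    """Check the archive against the expected MD5, then list its members.

    Raises ValueError on a checksum mismatch. Listing (rather than
    extracting) makes it possible to inspect the contents first.
    """
    actual = md5_of_file(archive_path)
    if actual != expected_md5:
        raise ValueError(
            f"MD5 mismatch for {archive_path}: expected {expected_md5}, got {actual}"
        )
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()
```

A call such as verify_and_list("parallel-text-typology.tar.gz") returns the member names if the checksum matches, after which the archive can be unpacked with tarfile or tar as usual.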

Statistic	All versions	This version
Views	457	454
Downloads	117	115
Data volume	7.4 GB	7.3 GB
