Morphology data for 22 languages

Hämäläinen, Mika; Partanen, Niko; Rueter, Jack; Alnajjar, Khalid

doi:10.5281/zenodo.3928628

Published July 2, 2020 | Version 1.0

Dataset | Open

1. University of Helsinki

Most users need only train_data.zip. It contains one folder per language, named after the language's ISO code. Each language folder contains separate folders for the lemmatization, analysis and generation tasks. Each of these task folders holds source and target files for the train, val and test splits. There is also a pred.txt file with the predictions of the baseline system. If you need more control over how the datasets are created, read on.
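The folder layout described above can be explored with a short script. The following is a minimal sketch, not part of the dataset's own tooling: summarize_layout and read_parallel are hypothetical helper names, the exact folder and file names inside train_data.zip are not listed here and are therefore discovered at run time rather than hard-coded, and the files are assumed to be UTF-8 text with one example per line, aligned between source and target.

```python
from pathlib import Path


def summarize_layout(root):
    """Map each language folder to its task folders and the files they contain.

    `root` is assumed to be the folder that directly contains the
    language folders after train_data.zip has been extracted.
    """
    layout = {}
    for lang_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        layout[lang_dir.name] = {
            task_dir.name: sorted(f.name for f in task_dir.iterdir() if f.is_file())
            for task_dir in sorted(p for p in lang_dir.iterdir() if p.is_dir())
        }
    return layout


def read_parallel(source_path, target_path):
    """Read a source file and a target file into (source, target) pairs.

    Assumption: line i of the source file corresponds to line i of the
    target file. A mismatch in line counts raises ValueError.
    """
    with open(source_path, encoding="utf-8") as src, \
            open(target_path, encoding="utf-8") as tgt:
        sources = [line.rstrip("\n") for line in src]
        targets = [line.rstrip("\n") for line in tgt]
    if len(sources) != len(targets):
        raise ValueError("source and target files have different line counts")
    return list(zip(sources, targets))
```

Calling summarize_layout on the extracted folder shows which file names are actually present for each language and task; those names can then be passed to read_parallel.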

To create the dataset from scratch, run data_formatter.py. The morphological data is stored in fst.zip. To skip some of the steps of building the data from scratch, download fst.zip and extract it into the same folder as data_formatter.py.
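The extraction step can also be done from Python. This is a minimal sketch using only the standard library; extract_next_to is a hypothetical helper name, and both files are assumed to have been downloaded already. Any command-line options of data_formatter.py are not documented here, so the sketch stops at extraction.

```python
import zipfile
from pathlib import Path


def extract_next_to(archive_path, script_path):
    """Extract a zip archive into the folder that contains script_path."""
    target_dir = Path(script_path).resolve().parent
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
    return target_dir
```

For example, extract_next_to("fst.zip", "data_formatter.py") unpacks the archive beside the script, after which the script can be run from that folder.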

Cite:

Hämäläinen, M., Partanen, N., Rueter, J., & Alnajjar, K. (2021). Neural Morphology Dataset and Models for Multiple Languages, from the Large to the Endangered. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa 2021)

Files (1.3 GB)

Name	MD5	Size
data_formatter.py	80de7f4214221272d1f698de9af5c168	16.9 kB
fst.zip	7dd914bd54dc06d17356022a0dd57ec1	585.7 MB
requirements.txt	aa272dab17333f93d940d2d364c6123f	24 Bytes
slurm_template.sh	ce5606ff7b87fd79f0b42e116d49a650	904 Bytes
slurmer.py	60df83e53c6fd9030a0bf847f2b5d92d	179 Bytes
train_data.zip	2df7748243a80a3bf4f6d2408f928db9	664.9 MB
wer++.py	3510f0e066b41e76ce2dd40828e37604	15.2 kB
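The MD5 checksums listed above can be used to check that a large download completed without corruption. The following is a minimal sketch using Python's standard hashlib module; md5_of_file, verify and EXPECTED_MD5 are hypothetical names introduced for illustration, and only the two large archives are included in the lookup table.

```python
import hashlib

# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "train_data.zip": "2df7748243a80a3bf4f6d2408f928db9",
    "fst.zip": "7dd914bd54dc06d17356022a0dd57ec1",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file's MD5 digest matches expected_md5."""
    return md5_of_file(path) == expected_md5.lower()
```

For example, verify("train_data.zip", EXPECTED_MD5["train_data.zip"]) returns True only if the local copy matches the published checksum. Reading in chunks keeps memory use low for archives of several hundred megabytes.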

	All versions	This version
Views	333	333
Downloads	286	286
Data volume	43.0 GB	43.0 GB
