OpenITI-proc corpus

Belinkov, Yonatan (1); Magidow, Alexander (2); Barrón-Cedeño, Alberto (3); Shmidman, Avi (4); Romanov, Maxim (5)

doi:10.1007/s10579-019-09460-w

Published January 8, 2019 | Version v1.0

Dataset Open

1. MIT Computer Science and Artificial Intelligence Laboratory
2. Department of Modern and Classical Languages and Literatures, University of Rhode Island
3. Università di Bologna
4. Department of Hebrew Literature, Bar-Ilan University
5. Department of History, University of Vienna

Arabic is a widely spoken language with a long and rich history, but existing corpora and language technology focus mostly on modern Arabic and its varieties. OpenITI-proc is a large-scale historical corpus of written Arabic spanning 1400 years. It has been processed with the Farasa Arabic NLP toolkit and can be used to study the history of the Arabic language.

Files

Files (16.4 GB)

Name	MD5	Size
OpenITI.lemmas.tar.bz2	2e96a981346bce858603749f69b40298	1.9 GB
OpenITI.parsetree.tar.bz2	5f785bb54bd41aa45bbc18114821fb27	2.5 GB
OpenITI.plain.tar.bz2	c3a019a9eb86851a5a0f4c426d1838b2	2.5 GB
OpenITI.pos.tar.bz2	ff42ce52052b79d409c4fa24554cf5ca	2.6 GB
OpenITI.segmentation.tar.bz2	fd44907d385225f565546a7d106f1992	4.6 GB
OpenITI.sentences.tar.bz2	b5f9b9bfc5ccfd4037fec0106004620e	2.4 GB
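
Because the archives are large, it may be worth checking each download against the MD5 checksum listed above before unpacking it. The following is a minimal sketch in Python, assuming the archives were saved under their original names in a single directory. The helper names `md5_of_file` and `verify_downloads` are illustrative and are not part of any tooling distributed with the corpus.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "OpenITI.lemmas.tar.bz2": "2e96a981346bce858603749f69b40298",
    "OpenITI.parsetree.tar.bz2": "5f785bb54bd41aa45bbc18114821fb27",
    "OpenITI.plain.tar.bz2": "c3a019a9eb86851a5a0f4c426d1838b2",
    "OpenITI.pos.tar.bz2": "ff42ce52052b79d409c4fa24554cf5ca",
    "OpenITI.segmentation.tar.bz2": "fd44907d385225f565546a7d106f1992",
    "OpenITI.sentences.tar.bz2": "b5f9b9bfc5ccfd4037fec0106004620e",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each archive name to True (match), False (mismatch), or None (not found).

    The directory layout is an assumption: archives are expected to sit
    directly inside `directory` under their original file names.
    """
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of_file(path) == expected) if path.exists() else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "missing"}
    for name, status in verify_downloads().items():
        print(f"{name}: {labels[status]}")
```

Reading in chunks keeps memory use constant regardless of archive size, which matters for the multi-gigabyte files listed here.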

Additional details

Is documented by: https://link.springer.com/article/10.1007/s10579-019-09460-w; https://arxiv.org/abs/1809.03891
Is supplement to: https://openiti.github.io/

Belinkov, Y., Magidow, A., Barrón-Cedeño, A., Shmidman, A., and Romanov, M. Studying the History of the Arabic Language: Language Technology and a Large-Scale Historical Corpus.
