Published January 8, 2019 | Version v1.0
Dataset
Open
OpenITI-proc corpus
Creators
- 1. MIT Computer Science and Artificial Intelligence Laboratory
- 2. Department of Modern and Classical Languages and Literatures, University of Rhode Island
- 3. Università di Bologna
- 4. Department of Hebrew Literature, Bar-Ilan University
- 5. Department of History, University of Vienna
Description
Arabic is a widely spoken language with a long and rich history, but existing corpora and language technology focus mostly on modern Arabic and its varieties. OpenITI-proc is a large-scale historical corpus of written Arabic spanning 1,400 years. The corpus has been processed with the Farasa Arabic NLP toolkit and can be used to study the history of the Arabic language.
Files
(16.4 GB)
Checksum | Size |
---|---|
md5:2e96a981346bce858603749f69b40298 | 1.9 GB |
md5:5f785bb54bd41aa45bbc18114821fb27 | 2.5 GB |
md5:c3a019a9eb86851a5a0f4c426d1838b2 | 2.5 GB |
md5:ff42ce52052b79d409c4fa24554cf5ca | 2.6 GB |
md5:fd44907d385225f565546a7d106f1992 | 4.6 GB |
md5:b5f9b9bfc5ccfd4037fec0106004620e | 2.4 GB |
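Because each file is several gigabytes, it may be worth checking a download against the MD5 checksum listed for it before use. The following is a minimal sketch, not part of the dataset's own tooling: the function names `md5_of_file` and `verify_download` are hypothetical, and the archive file names are not given in this record, so the path in the usage note is a placeholder.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected):
    """Compare a file's MD5 digest with a listed checksum.

    `expected` may be written with or without the "md5:" prefix
    used in the file listing.
    """
    expected_hex = expected.strip().lower().split(":")[-1]
    return md5_of_file(path) == expected_hex
```

For example, `verify_download("path/to/downloaded_file", "md5:2e96a981346bce858603749f69b40298")` would return `True` only if the file at that placeholder path matches the first checksum in the listing. MD5 is suitable here for detecting corrupted or truncated downloads, not as a security guarantee.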
Additional details
Related works
- Is documented by
- https://link.springer.com/article/10.1007/s10579-019-09460-w (URL)
- https://arxiv.org/abs/1809.03891 (URL)
- Is supplement to
- https://openiti.github.io/ (URL)
References
- Belinkov, Magidow, Barrón-Cedeño, Shmidman, and Romanov. Studying the History of the Arabic Language: Language Technology and a Large-Scale Historical Corpus.