Published January 8, 2019 | Version v1.0
Dataset Open

OpenITI-proc corpus

  • 1. MIT Computer Science and Artificial Intelligence Laboratory
  • 2. Department of Modern and Classical Languages and Literatures, University of Rhode Island
  • 3. Università di Bologna
  • 4. Department of Hebrew Literature, Bar-Ilan University
  • 5. Department of History, University of Vienna

Description

Arabic is a widely spoken language with a long and rich history, but existing corpora and language technology focus mostly on modern Arabic and its varieties. OpenITI-proc is a large-scale historical corpus of written Arabic spanning 1,400 years. It has been processed with the Farasa Arabic NLP toolkit and can be used to study the history of the Arabic language.

Files (16.4 GB)

Checksum | Size
md5:2e96a981346bce858603749f69b40298 | 1.9 GB
md5:5f785bb54bd41aa45bbc18114821fb27 | 2.5 GB
md5:c3a019a9eb86851a5a0f4c426d1838b2 | 2.5 GB
md5:ff42ce52052b79d409c4fa24554cf5ca | 2.6 GB
md5:fd44907d385225f565546a7d106f1992 | 4.6 GB
md5:b5f9b9bfc5ccfd4037fec0106004620e | 2.4 GB
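
Because the files are large, it may be worth checking each download against the MD5 checksums in the listing above. The following is a minimal sketch, not an official tool: the helper names `md5_of_file` and `is_listed_archive` are illustrative, and since the file names are not shown in the listing, the path in the usage comment is a hypothetical placeholder.

```python
import hashlib

# MD5 digests copied from the file listing (the "md5:" prefix removed).
KNOWN_MD5 = {
    "2e96a981346bce858603749f69b40298",
    "5f785bb54bd41aa45bbc18114821fb27",
    "c3a019a9eb86851a5a0f4c426d1838b2",
    "ff42ce52052b79d409c4fa24554cf5ca",
    "fd44907d385225f565546a7d106f1992",
    "b5f9b9bfc5ccfd4037fec0106004620e",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_listed_archive(path):
    """True if the file's MD5 matches one of the digests in the listing."""
    return md5_of_file(path) in KNOWN_MD5


# Hypothetical usage; replace the path with the name of a downloaded file:
# print(is_listed_archive("downloaded_part.zip"))
```

A mismatch most often indicates an incomplete or corrupted download, in which case fetching the file again is the usual remedy.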

Additional details

References

  • Belinkov, Magidow, Barrón-Cedeño, Shmidman, and Romanov. Studying the History of the Arabic Language: Language Technology and a Large-Scale Historical Corpus.