Published December 30, 2025 | Version 2025.1.9
Dataset Open

OpenITI: a Machine-Readable Corpus of Islamicate Texts

  • 1. The Aga Khan University (International) in the United Kingdom
  • 2. Universität Hamburg

Description

Co-PIs: Matthew Thomas Miller (University of Maryland, College Park), Maxim G. Romanov (University of Hamburg), Sarah Bowen Savant (Aga Khan University-ISMC, London).

The OpenITI corpus is a multi-lingual, machine-actionable and scholarly corpus of Islamicate texts that aims to provide a key component for rendering the Islamicate textual tradition accessible to computational analysis and to new forms of digital scholarship.

As indicated by its name, the OpenITI corpus forms part of the broader, multi-institutional efforts of the Open Islamicate Texts Initiative (OpenITI). The texts were first assembled into a corpus within the OpenArabic project, which was developed first at Tufts University (at The Perseus Project, 2013–2015) and then at Leipzig University (at the Alexander von Humboldt Chair for Digital Humanities, 2015–2017), in both cases with the support and under the patronage of Prof. Gregory Crane.

OpenITI aims to develop the digital infrastructure for the study of Islamicate cultures and to foster synergies and collaboration between computer science, digital humanities, and fields such as Islamic, Arabic, Persian, and Ottoman studies. It is led by researchers at the Aga Khan University’s Institute for the Study of Muslim Civilisations (AKU-ISMC) in London, the University of Hamburg (UHH), and the Roshan Institute for Persian Studies at the University of Maryland (College Park), together with an interdisciplinary advisory board of leading scholars from relevant fields. It has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme, awarded to the KITAB (Grant Agreement No. 772989, PI Sarah Bowen Savant) and KITAB-Transform (Grant Agreement No. 101199672, PI Sarah Bowen Savant) projects, as well as from the Mellon Foundation and the Qatar National Library, in addition to support from the researchers’ home institutions.

The present release of the OpenITI corpus contains all digital versions of each text that are available in the corpus (also available at https://github.com/OpenITI/RELEASE). When more than one version of a text exists, one of them is singled out as the primary version (marked 'PRI' in the metadata), while all the others are categorised as secondary versions (marked 'SEC' in the metadata). OpenITI also releases a primary-only edition of the corpus, consisting solely of the primary versions, which may be more convenient for most use cases.
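For readers who work with the full release but only need one version per text, the PRI/SEC distinction can be applied as a simple filter on the metadata. The sketch below is illustrative only: it assumes the metadata is a tab-separated table with a column named `status` holding the PRI/SEC marker and a column named `version_uri` identifying each version. Both column names and all sample rows are hypothetical and should be checked against the metadata file shipped with the release.

```python
import csv
import io

# Hypothetical sample rows; the real metadata file has its own columns and values.
SAMPLE_METADATA = (
    "version_uri\tstatus\n"
    "0001Author.TitleA.Source001-ara1\tpri\n"
    "0001Author.TitleA.Source002-ara1\tsec\n"
    "0002Author.TitleB.Source003-per1\tPRI\n"
)


def primary_versions(tsv_text, status_column="status"):
    """Return the metadata rows whose status marks them as primary versions.

    `status_column` is an assumed column name, not a documented one.
    The comparison ignores case, so 'PRI' and 'pri' are treated alike.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row[status_column].strip().lower() == "pri"]


if __name__ == "__main__":
    for row in primary_versions(SAMPLE_METADATA):
        print(row["version_uri"])
```

For a real metadata file, the text would be read from disk (for example with `open(path, encoding="utf-8").read()`) and the column name adjusted as needed.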

Note on Release Numbering: Releases are numbered in the form 2019.1.1. Here 2019 is the year of the release, the first dotted number (.1) is the ordinal number of the release within that year, and the second dotted number (.1) is the overall release number. The first dotted number resets every year, while the second keeps increasing.
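The numbering scheme can be made concrete with a small parser. The following is a minimal sketch, not part of any OpenITI tooling; the names `ReleaseVersion` and `parse_release_version` are invented here for illustration. It splits a version string into the three components described in the note.

```python
import re
from typing import NamedTuple


class ReleaseVersion(NamedTuple):
    year: int             # year of the release
    release_in_year: int  # ordinal release number within that year (resets yearly)
    overall_release: int  # overall release number (keeps increasing)


# Three dot-separated groups of ASCII digits, e.g. "2025.1.9".
_VERSION_PATTERN = re.compile(r"(\d+)\.(\d+)\.(\d+)", re.ASCII)


def parse_release_version(version):
    """Parse a 'YEAR.N.M' version string into its three numeric components."""
    match = _VERSION_PATTERN.fullmatch(version.strip())
    if match is None:
        raise ValueError(f"not a YEAR.N.M version string: {version!r}")
    year, in_year, overall = (int(group) for group in match.groups())
    return ReleaseVersion(year, in_year, overall)


if __name__ == "__main__":
    print(parse_release_version("2025.1.9"))
```

Read this way, the present version 2025.1.9 is the first release of 2025 and the ninth release overall. Because the last component never resets, it alone is enough to order releases chronologically.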

For more details on this specific version see the release notes.

Note: If the built-in Windows utilities have trouble unzipping the files, please use free software such as WinRAR or 7-Zip.

Files

OpenITI_data_2025-1-9.zip

Files (5.9 GB)

Size      Checksum
5.9 GB    md5:95cf19a9320fee6c37c4c26c9fa860b1
12.1 MB   md5:cb2226f64264efa964df9ef659d40199
204.0 kB  md5:e127a3fccd2df033a6462542820f55a0
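The MD5 checksums listed above can be used to confirm that a download completed without corruption, which is worth doing for a multi-gigabyte archive. The sketch below uses only Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_checksum` are invented for this example, and which checksum belongs to which file should be read off the record page itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks.

    Chunked reading keeps memory use constant even for very large archives.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 with a listed value such as 'md5:95cf...'.

    The optional 'md5:' prefix is stripped and the comparison ignores case.
    """
    expected_hex = expected.strip().lower()
    if expected_hex.startswith("md5:"):
        expected_hex = expected_hex[len("md5:"):]
    return md5_of_file(path) == expected_hex
```

A call would look like `matches_checksum(local_path, "md5:...")`, with `local_path` pointing at the downloaded file and the checksum copied from the listing. A result of `False` usually means the download is incomplete and should be repeated.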

Additional details

Funding

European Commission
KITAB - Exploring Cultural Memory in the Pre-Modern Islamic World (700–1500): Knowledge, Information Technology, and the Arabic Book (Grant Agreement No. 772989)