Published April 7, 2026 | Version v5
Dataset
Open
Latin-transliterated Ottoman Turkish Corpus (LATOC)
Description
LATOC Corpus
LATOC (Latin-transliterated Ottoman Turkish Corpus) includes 143 Ottoman Turkish books, 13,252,350 words, written between the 15th and 20th centuries. The books were transliterated by domain experts and publicly shared on the Internet. The books in the corpus were automatically structured via a rule‑based approach and manually checked.
Due to copyright restrictions, this repository does not contain any raw text; instead, it guides you through downloading the files, converting them into structured XML files, and processing them on your computer to obtain LATOC. The pipeline provided here can be extended to new sources by acquiring the PDF documents and, if necessary, updating the `CERTAIN_RULES` variable in 2_preprocessing.py and the files `controversy_cache.json` and `exclude_pages.txt`.
Corpus Overview
The corpus has more than 13 million words from 143 works written between the 15th and 20th centuries.
While the pipeline standardizes these texts into the IJMES format, some inconsistencies, such as spelling variation, may persist. These are primarily inherited from the transliterations provided by the domain experts; this work does not apply any spelling normalization.
Arabic/Persian characters in the documents are filtered out, and only the Latin-transliterated text is preserved.
Each document is split into pages. Each page is divided into three segment types: paragraph (the main text), title, and footnote.
The final XML files also provide the coordinates of each region on the page, e.g., `bbox="277.0,711.1,651.2,765.0"`.
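As a rough illustration of how the region coordinates could be read, the sketch below parses a `bbox` attribute into four floats. The XML fragment, including the `page` element and its `number` attribute, is hypothetical; only the three segment names and the `bbox` string format come from the description above. The meaning and order of the four coordinates is not stated here, so the code does not interpret them.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment: the element layout is assumed for illustration;
# only the segment names and the bbox string format come from the description.
SAMPLE = """<page number="1">
  <title bbox="277.0,711.1,651.2,765.0">Example title</title>
</page>"""


def parse_bbox(value: str) -> tuple[float, float, float, float]:
    """Split a comma-separated bbox string into four floats."""
    parts = [float(p) for p in value.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 coordinates, got {len(parts)}")
    return (parts[0], parts[1], parts[2], parts[3])


page = ET.fromstring(SAMPLE)
# Collect (segment type, coordinates, text) for every child carrying a bbox.
regions = [
    (el.tag, parse_bbox(el.get("bbox")), el.text)
    for el in page
    if el.get("bbox")
]
```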
Data Files
Work‑level metadata (`LATOC_metadata_sample.csv`)
This is a sample of the metadata with further information. It provides metadata for 36 Dîvân works.
Note that this metadata is from the previous version of LATOC, which is the reason why it has only 36 works.
In the future, this scheme will be expanded to all works in LATOC.
Each Dîvân work is accompanied by:
- `file_name`
- `work_name` (title of the Dîvân)
- `pen_name` (mahlas)
- `real_name`
- `viaf`
- `century`
- `gender`
- `rank` (e.g., “Sultan,” “Judiciary & Religious Office,” “High Bureaucracy/Military,” “Scholars & Sufi Orders,” “Civil Bureaucracy,” “Lay/Non‑official”)
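The columns listed above lend themselves to simple filtering. The following is a minimal sketch using only the standard library; the sample rows and the way values are encoded (for example `16` for the century or `male` for gender) are invented for illustration and may not match the real `LATOC_metadata_sample.csv`.

```python
import csv
import io

# Hypothetical rows for illustration only; real values come from
# LATOC_metadata_sample.csv and may be encoded differently.
SAMPLE_CSV = """file_name,work_name,pen_name,real_name,viaf,century,gender,rank
a.xml,Work A,Poet A,Name A,0,16,male,Sultan
b.xml,Work B,Poet B,Name B,0,17,female,Lay/Non-official
c.xml,Work C,Poet C,Name C,0,16,male,Civil Bureaucracy
"""


def filter_works(rows, **criteria):
    """Keep rows whose columns equal every given criterion."""
    return [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
sultans = filter_works(rows, rank="Sultan")
sixteenth = filter_works(rows, century="16")
```

To run this against the real file, replace the `io.StringIO(...)` source with `open("LATOC_metadata_sample.csv", newline="", encoding="utf-8")` (the encoding is an assumption).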
Overall data statistics (`data_statistics.csv`)
This file includes basic statistics such as the word count per document. It also provides download links for some of the documents. Note that some works lack a download URL here; the document names in the `file` column should be enough for users to find the document.
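One way to see which documents must be located by name is to list the rows without a link. In this sketch only the `file` column is taken from the description; the `words` and `url` column names and the sample rows are assumptions and should be adjusted to the actual header of `data_statistics.csv`.

```python
import csv
import io

# Hypothetical content: only the `file` column name is documented;
# `words` and `url` are assumed names for the word count and the link.
SAMPLE_STATS = """file,words,url
doc_a.pdf,1200,https://example.org/doc_a.pdf
doc_b.pdf,3400,
"""

stats = list(csv.DictReader(io.StringIO(SAMPLE_STATS)))

# Total word count across all listed documents.
total_words = sum(int(r["words"]) for r in stats)

# Documents that have to be found by name because no link is given.
missing_url = [r["file"] for r in stats if not r["url"].strip()]
```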
Book‑level data
Since the data is under copyright, this repository does not host it directly. However, you can download the data from Yazma Eserler via either here or this webpage and then run the Python scripts as explained in this document to obtain the processed data on your device.
Supplementary material (`controversy_cache.json` and `exclude_pages.txt`)
`controversy_cache.json` includes the conversion rules for problematic characters that might map to more than one character in the IJMES chart. Since each document behaves differently, the rules are defined per document. You can add rules here for new documents.
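The idea of per-document conversion rules can be sketched as a lookup from document name to a character mapping. The JSON layout, the document names, and the character pairs below are all hypothetical; the real schema of `controversy_cache.json` is not described here and may differ.

```python
import json

# Hypothetical cache content; the real schema of controversy_cache.json
# may differ. Each document maps an ambiguous character to its replacement.
CACHE_JSON = '{"doc_a.pdf": {"ý": "ı", "þ": "ş"}, "doc_b.pdf": {"ý": "y"}}'


def convert(text: str, document: str, cache: dict[str, dict[str, str]]) -> str:
    """Apply the single-character replacements registered for one document."""
    rules = cache.get(document, {})
    # str.maketrans accepts a dict of single-character keys.
    return text.translate(str.maketrans(rules))


cache = json.loads(CACHE_JSON)
```

For example, `convert("kýþ", "doc_a.pdf", cache)` and `convert("kýþ", "doc_b.pdf", cache)` resolve the same character differently, and a document with no entry is returned unchanged.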
`exclude_pages.txt` lists the page ranges to delete in order to remove the editorial preface, table of contents, references, etc. You can extend this file if you add a new source to the data.
Processing the data with Python scripts
After downloading the files and storing them in a single folder, run the Python scripts 1_pdf_extractor.py, 2_preprocessing.py, and 3_xml_cleaner.py in that order. 2_preprocessing.py requires the supplementary file `controversy_cache.json`, and 3_xml_cleaner.py requires the supplementary file `exclude_pages.txt`. The author prepared these files for the 143 works presented in this dataset.
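The run order described above can be wrapped in a small driver. This is a sketch under two assumptions that are not confirmed here: that the scripts and supplementary files sit in the same folder as the downloaded documents, and that each script runs without command-line arguments. The helper names `missing_inputs` and `run_pipeline` are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# Script order and the supplementary file each step requires.
STEPS = [
    ("1_pdf_extractor.py", []),
    ("2_preprocessing.py", ["controversy_cache.json"]),
    ("3_xml_cleaner.py", ["exclude_pages.txt"]),
]


def missing_inputs(present: set[str]) -> list[str]:
    """Return scripts and supplementary files absent from `present`, in pipeline order."""
    needed = [name for script, extras in STEPS for name in (script, *extras)]
    return [name for name in needed if name not in present]


def run_pipeline(workdir: Path) -> None:
    """Run the three scripts in order, stopping at the first failure."""
    missing = missing_inputs({p.name for p in workdir.iterdir()})
    if missing:
        raise FileNotFoundError(f"missing from {workdir}: {', '.join(missing)}")
    for script, _ in STEPS:
        # Assumption: each script runs without arguments from the data folder.
        subprocess.run([sys.executable, script], cwd=workdir, check=True)
```

Calling `run_pipeline(Path("path/to/folder"))` checks the folder first, so a missing supplementary file is reported before any script starts.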
Usage Notes
- The corpus can be utilized for **diachronic studies**; Yılandiloğlu (2025) demonstrated that poets adhered more accurately to the aruz meter over the centuries, reflected in rising conformity rates.
- The sample metadata allows you to focus on specific ranks (e.g., “Sultan”) or gender.
- Current work is focused on standardizing transliteration to the IJMES system and expanding the corpus further.
Impact and Downstream Tasks
This corpus was specifically curated and structured to support the development of Ottoman Turkish NLP resources. It has been used for:
* **Large Language Models:** The structured data was used to train models, including:
  * Masked language model: [ota-roberta-base](https://huggingface.co/enesyila/ota-roberta-base)
  * State-of-the-art Named Entity Recognition model for Ottoman Turkish: [ota-roberta-base-ner](https://huggingface.co/enesyila/ota-roberta-base-ner)
  * A Universal Dependencies (UD) parser that tags with 91% accuracy and lemmatizes with 86% accuracy: [ota-ud-style](https://huggingface.co/enesyila/ota-ud-style)
* **Annotated Treebank:** The dataset serves as the basis for [UD_Ottoman_Turkish-DUDU](https://github.com/UniversalDependencies/UD_Ottoman_Turkish-DUDU), currently the largest Ottoman Turkish corpus in the Universal Dependencies.
Files
Total size: 93.7 kB

| Checksum | Size |
|---|---|
| md5:946ec18d3daf0f66b961a58e859cc297 | 4.6 kB |
| md5:963858de64d9a9adb72a29b93f740ff4 | 16.0 kB |
| md5:3d71ef8a9451fe98a10789a5a8e0ac8d | 10.9 kB |
| md5:204d883abd74592402e6a2da21a2e205 | 29.1 kB |
| md5:5f80b6870a1169549592ef5c8564ee97 | 16.5 kB |
| md5:c54b87f5c98c65d8dbcdefbccdc36c69 | 6.3 kB |
| md5:3cda2937f8b53771ab64a150bf5a8c52 | 4.9 kB |
| md5:b99eb3f4b9b13c3c9c9326b4a1a62685 | 5.4 kB |
Additional details
Dates
- Updated: 2026-04-07
Software
- Development Status: Active