Latin-transliterated Ottoman Turkish Corpus (LATOC)

Yılandiloğlu, Enes

doi:10.5281/zenodo.19445113

Published April 7, 2026 | Version v5

Dataset Open

Yılandiloğlu, Enes (Other)¹

1. University of Helsinki

LATOC Corpus

LATOC (Latin-transliterated Ottoman Turkish Corpus) comprises 143 Ottoman Turkish books (13,252,350 words) written between the 15th and 20th centuries. The books were transliterated by domain experts and publicly shared on the Internet. The books in the corpus were automatically structured via a rule‑based approach and then manually checked.

Because of copyright restrictions, this repository does not contain any raw text. Instead, it explains how to download the files, convert them into structured XML files, and process them on your computer to obtain LATOC. The pipeline provided here can be extended to new sources by acquiring the PDF documents and, if necessary, updating the `CERTAIN_RULES` variable in `2_preprocessing.py` and the files `controversy_cache.json` and `exclude_pages.txt`.

Corpus Overview

The corpus has more than 13 million words from 143 works written between the 15th and 20th centuries.

While the pipeline standardizes these texts into the IJMES format, some inconsistencies, such as variant spellings, may persist. These are primarily inherited from the transliterations provided by the domain experts; this work does not apply any normalization to the spelling.

Arabic/Persian characters in the documents are filtered out, and only the Latin-transliterated text is preserved.

Each document is split into pages. Each page is divided into three types of segment: paragraph (the main text), title, and footnote.

The final XML files also provide the coordinates of each region on the page, for example `bbox="277.0,711.1,651.2,765.0"`.
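As a rough illustration of how the `bbox` attribute might be consumed, the sketch below parses it into four floats. The element names (`page`, `title`, `paragraph`), the `number` attribute, and the reading of the four values as `x0, y0, x1, y1` are assumptions made for this example; only the `bbox` attribute itself is documented here, so check them against the XML that the pipeline produces.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple


class BBox(NamedTuple):
    # Assumed coordinate order: x0, y0, x1, y1 (not confirmed by the README).
    x0: float
    y0: float
    x1: float
    y1: float


def parse_bbox(value: str) -> BBox:
    """Turn a string such as '277.0,711.1,651.2,765.0' into four floats."""
    parts = [float(p) for p in value.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 coordinates, got {len(parts)}")
    return BBox(*parts)


# Hypothetical fragment; the real element and attribute layout may differ.
SAMPLE_XML = """
<page number="1">
  <title bbox="100.0,50.0,500.0,80.0">title text</title>
  <paragraph bbox="277.0,711.1,651.2,765.0">main text</paragraph>
</page>
"""

root = ET.fromstring(SAMPLE_XML.strip())
# Map each child element's tag to its parsed bounding box.
boxes = {el.tag: parse_bbox(el.get("bbox")) for el in root if el.get("bbox")}
```

Keeping the coordinates in a named tuple makes later filtering (for instance, selecting regions in the lower part of a page) easier to read than indexing into a bare list.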

Data Files

Work‑level metadata (`LATOC_metadata_sample.csv`)

This is a sample of the metadata with further information. It provides metadata for 36 Dîvân works.

Note that this metadata is from the previous version of LATOC, which is the reason why it has only 36 works.

In the future, this scheme will be expanded to all works in LATOC.

Each Dîvân work is accompanied by:

- `file_name`

- `work_name` (title of the Dîvân)

- `pen_name` (mahlas)

- `real_name`

- `viaf`

- `century`

- `gender`

- `rank` (e.g., “Sultan,” “Judiciary & Religious Office,” “High Bureaucracy/Military,” “Scholars & Sufi Orders,” “Civil Bureaucracy,” “Lay/Non‑official”)
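The columns listed above can be filtered with the standard `csv` module. The following is a minimal sketch: the two rows are made-up placeholders that only follow the documented column names, and the value formats for `century` and `gender` are assumptions, not taken from `LATOC_metadata_sample.csv`.

```python
import csv
import io

# Placeholder rows using the documented column names; the values are
# invented for illustration and are not real LATOC metadata.
SAMPLE_CSV = """file_name,work_name,pen_name,real_name,viaf,century,gender,rank
a.pdf,Work A,Poet A,Name A,000,16,male,Sultan
b.pdf,Work B,Poet B,Name B,111,18,female,Lay/Non-official
"""


def filter_rows(rows, **criteria):
    """Keep rows whose columns equal every given value."""
    return [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
sultans = filter_rows(rows, rank="Sultan")
```

To run this against the real file, replace `io.StringIO(SAMPLE_CSV)` with `open("LATOC_metadata_sample.csv", newline="", encoding="utf-8")` and adjust the filter values to those actually present in the file.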

Overall data statistics (`data_statistics.csv`)

This file includes basic statistics such as the word count per document. It also provides download links for some of the documents. Note that some works lack a URL here; in those cases, the document names in the `file` column should be enough for users to find the document.
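A short sketch of how these statistics might be summarized is given below. Only the `file` column name comes from this description; `word_count` and `url` are hypothetical column names, and the rows are placeholders, so adjust them to the actual header of `data_statistics.csv`.

```python
import csv
import io

# Placeholder rows. 'file' is the documented column; 'word_count' and
# 'url' are hypothetical names chosen for this example.
SAMPLE = """file,word_count,url
doc_a.pdf,1200,https://example.org/doc_a
doc_b.pdf,800,
"""


def summarize(rows, count_col="word_count", url_col="url"):
    """Return the total word count and the files that have no download URL."""
    total = sum(int(r[count_col]) for r in rows)
    missing_url = [r["file"] for r in rows if not (r.get(url_col) or "").strip()]
    return total, missing_url


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
total_words, files_without_url = summarize(rows)
```

The list of files without a URL is a convenient starting point for locating the remaining documents by name.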

Book‑level data

Because the data is under copyright, this repository does not include it directly. However, you can download the data from Yazma Eserler via either here or this webpage and then run the Python scripts as explained in this document to obtain the processed data on your device.

Supplementary material (`controversy_cache.json` and `exclude_pages.txt`)

`controversy_cache.json` contains the conversion rules for problematic characters, that is, characters that may map to more than one character in the IJMES chart. Because each document behaves differently, the rules are specified per document. You can add rules here for new documents.

`exclude_pages.txt` lists the page boundaries to delete in order to remove editorial prefaces, tables of contents, references, and similar material. You can extend this file if you add a new source to the data.
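To make the idea of per-document conversion rules concrete, here is a minimal sketch. The JSON layout (document name mapped to a character-to-character table), the placeholder character `#`, and the document names are all assumptions for illustration; the real `controversy_cache.json` may be organised differently.

```python
import json

# Hypothetical layout: {document name: {ambiguous character: replacement}}.
# The same source character resolves differently depending on the document.
CACHE_JSON = '{"doc_a.pdf": {"#": "ḥ"}, "doc_b.pdf": {"#": "ḫ"}}'


def convert(text: str, document: str, cache: dict) -> str:
    """Apply the rules registered for one document; unknown documents pass through."""
    rules = cache.get(document, {})
    # str.maketrans accepts a dict whose keys are single characters.
    return text.translate(str.maketrans(rules))


cache = json.loads(CACHE_JSON)
```

With this layout, supporting a new document would amount to adding one more top-level key with its own table.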

Processing the data with Python scripts

After downloading the files and storing them in a single folder, run the Python scripts `1_pdf_extractor.py`, `2_preprocessing.py`, and `3_xml_cleaner.py` in that order. `2_preprocessing.py` requires the supplementary file `controversy_cache.json`, and `3_xml_cleaner.py` requires the supplementary file `exclude_pages.txt`. The author has prepared these files for the 143 works presented in this dataset.
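The three steps can be chained from a small driver script. The sketch below assumes that each script runs without command-line arguments and reads its inputs from the working directory, which this description does not confirm; the helper names `missing_inputs`, `build_commands`, and `run_pipeline` are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

SCRIPTS = ["1_pdf_extractor.py", "2_preprocessing.py", "3_xml_cleaner.py"]
# Supplementary files each step needs, as described above.
REQUIRED = {
    "2_preprocessing.py": ["controversy_cache.json"],
    "3_xml_cleaner.py": ["exclude_pages.txt"],
}


def missing_inputs(folder: Path) -> list[str]:
    """List scripts and supplementary files that are absent from the folder."""
    needed = SCRIPTS + [f for files in REQUIRED.values() for f in files]
    return [name for name in needed if not (folder / name).exists()]


def build_commands(python: str = sys.executable) -> list[list[str]]:
    # Assumption: the scripts take no command-line arguments.
    return [[python, script] for script in SCRIPTS]


def run_pipeline(folder: Path) -> None:
    """Run the three steps in order inside the given folder."""
    missing = missing_inputs(folder)
    if missing:
        raise FileNotFoundError(f"missing in {folder}: {', '.join(missing)}")
    for cmd in build_commands():
        # check=True stops the run as soon as one step fails.
        subprocess.run(cmd, cwd=folder, check=True)
```

Calling `run_pipeline(Path("path/to/folder"))` would then execute the steps sequentially; checking for missing files first avoids a failure halfway through a long extraction.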

Usage Notes

- The corpus can be utilized for **diachronic studies**; Yılandiloğlu (2025) demonstrated that poets adhered more accurately to the aruz meter over the centuries, reflected in rising conformity rates.

- The sample metadata allows you to focus on specific ranks (e.g., “Sultan”) or gender.

- Current work is focused on standardizing transliteration to the IJMES system and expanding the corpus further.

Impact and Downstream Tasks

This corpus was specifically curated and structured to support the development of Ottoman Turkish NLP resources. It has been used for:

* **Large Language Models:** The structured data was used to train models, including:

  * Masked language model: [ota-roberta-base](https://huggingface.co/enesyila/ota-roberta-base)

  * State-of-the-art Named Entity Recognition model for Ottoman Turkish: [ota-roberta-base-ner](https://huggingface.co/enesyila/ota-roberta-base-ner)

  * A Universal Dependencies (UD) parser that tags with 91% accuracy and lemmatizes with 86% accuracy: [ota-ud-style](https://huggingface.co/enesyila/ota-ud-style)

* **Annotated Treebank:** The dataset serves as the basis for [UD_Ottoman_Turkish-DUDU](https://github.com/UniversalDependencies/UD_Ottoman_Turkish-DUDU), currently the largest Ottoman Turkish corpus in Universal Dependencies.

Files

Files (93.7 kB)

Name	MD5	Size
1_pdf_extractor.py	946ec18d3daf0f66b961a58e859cc297	4.6 kB
2_preprocessing.py	963858de64d9a9adb72a29b93f740ff4	16.0 kB
3_xml_cleaner.py	3d71ef8a9451fe98a10789a5a8e0ac8d	10.9 kB
controversy_cache.json	204d883abd74592402e6a2da21a2e205	29.1 kB
data_statistics.csv	5f80b6870a1169549592ef5c8564ee97	16.5 kB
exclude_pages.txt	c54b87f5c98c65d8dbcdefbccdc36c69	6.3 kB
LATOC_metadata_sample.csv	3cda2937f8b53771ab64a150bf5a8c52	4.9 kB
README.md	b99eb3f4b9b13c3c9c9326b4a1a62685	5.4 kB
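
The MD5 checksums in the listing above can be used to confirm that the downloaded files are intact. The following sketch uses `hashlib` from the standard library; the function names `md5_of` and `verify` are hypothetical, and the checksums are copied from the file listing.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "1_pdf_extractor.py": "946ec18d3daf0f66b961a58e859cc297",
    "2_preprocessing.py": "963858de64d9a9adb72a29b93f740ff4",
    "3_xml_cleaner.py": "3d71ef8a9451fe98a10789a5a8e0ac8d",
    "controversy_cache.json": "204d883abd74592402e6a2da21a2e205",
    "data_statistics.csv": "5f80b6870a1169549592ef5c8564ee97",
    "exclude_pages.txt": "c54b87f5c98c65d8dbcdefbccdc36c69",
    "LATOC_metadata_sample.csv": "3cda2937f8b53771ab64a150bf5a8c52",
    "README.md": "b99eb3f4b9b13c3c9c9326b4a1a62685",
}


def md5_of(path: Path, chunk_size: int = 1 << 16) -> str:
    """Hash a file in chunks so large files are not loaded into memory at once."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(folder: Path) -> dict[str, bool]:
    """Map each expected file name to True if it exists and its checksum matches."""
    return {
        name: (folder / name).is_file() and md5_of(folder / name) == md5
        for name, md5 in EXPECTED_MD5.items()
    }
```

A file reported as `False` is either missing from the folder or differs from the published version, for example because a different version of the record was downloaded.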

Additional details

Updated: 2026-04-07

Development Status: Active

