Synthetic dataset for multi-script text line recognition

NAJEM-MEYER, SVEN

doi:10.5281/zenodo.14840349

Published February 9, 2025 | Version v1

Dataset Open

Contributors

Project leader:

Romanello, Matteo

Supervisor:

Kaplan, Frederic¹

1. Ecole Polytechnique Federale de Lausanne Lemaitre Lab

Optical Character Recognition (OCR) systems frequently struggle with rare or ancient scripts, especially in historical documents that mix several writing systems. These difficulties often force researchers to fine-tune existing OCR models or to train new ones tailored to their specific needs. To support these efforts, we introduce a synthetic dataset of 6.2 million lines, specifically geared towards mixed polytonic Greek and Latin scripts. Augmented with artificially degraded lines, the dataset yields strong results when used to train historical OCR models. The resource can be used for both training and testing, and is particularly valuable for researchers working with ancient Greek and with limited annotated data. The software used to generate this dataset is available in the Git repository linked below. The file provided here is a sample; please contact us if you would like access to the full dataset.

Files

Files (1.3 GB)

Name	MD5	Size
OCR_artificial_data_sample.tar.gz	b1c322ce5a286b3d14740d9afbe9f3ab	1.3 GB
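Since the record lists an MD5 checksum for the archive, a download can be checked before it is unpacked. The following is a minimal sketch, assuming the file has been saved under its original name in the current working directory. The helper names `md5_of_file` and `first_members` are illustrative and are not part of the ajmc-pipeline software. The internal layout of the archive is not described on this page, so the sketch only lists the first few member names rather than assuming any folder structure.

```python
import hashlib
import tarfile
from pathlib import Path

# Checksum as listed for the sample archive on this record.
EXPECTED_MD5 = "b1c322ce5a286b3d14740d9afbe9f3ab"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_members(archive_path, limit=10):
    """Return the names of the first `limit` members of a gzip-compressed tar archive."""
    names = []
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            names.append(member.name)
            if len(names) >= limit:
                break
    return names


if __name__ == "__main__":
    # Assumed local download location (hypothetical; adjust as needed).
    archive_path = Path("OCR_artificial_data_sample.tar.gz")
    if not archive_path.exists():
        print(f"{archive_path} not found; download it first")
    elif md5_of_file(archive_path) != EXPECTED_MD5:
        print("MD5 mismatch: the download may be incomplete or corrupted")
    else:
        for name in first_members(archive_path):
            print(name)
```

Reading the file in chunks keeps memory use low for a 1.3 GB archive, and iterating over the tar members lazily avoids scanning the whole archive just to preview its contents.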

Additional details

Funding: Swiss National Science Foundation, grant 186033 (How does a classical hero die in the digital age? Using Sophocles’ Ajax to create a commentary on commentaries)

Repository URL: https://github.com/AjaxMultiCommentary/ajmc-pipeline
Programming language: Python
Development Status: Work in progress
