OpenHand-Synth: A Large-Scale Synthetic Handwriting Dataset for Multimodal Language Models

Beerten, Toon

doi:10.5281/zenodo.18759951

Published February 24, 2026 | Version v1

Dataset Open

Beerten, Toon (Researcher)

OpenHand-Synth is a large-scale synthetic handwriting dataset containing 68,077 high-resolution images of handwritten text across 15 languages, including English, Dutch, French, German, Spanish, Italian, Portuguese, Danish, Swedish, Norwegian Bokmål, Romanian, Indonesian, Malay, and Tagalog. The dataset features 220 distinct writer styles, variable ink colors (black, blue, green, red), and realistic noise augmentations simulating real-world document conditions.

Images are generated using a neural handwriting synthesis model and paired with rich metadata including ground truth text, writer style ID, neatness score, ink color, OCR validation results, Character Error Rate (CER), and Jaro-Winkler similarity. All images were quality-validated using the Qwen2.5-VL-72B-Instruct vision-language model. The final dataset achieves a mean CER of ~0.03, with approximately 54% of samples reaching perfect recognition (CER = 0.0). Text content is drawn from three categories: factual prose (Simple English Wikipedia), conversational sentences (Tatoeba), and structured data (locale-aware dates, names, and numbers via Faker).
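As an illustration of the CER figures quoted above, here is a minimal sketch of how a Character Error Rate is commonly computed: the Levenshtein edit distance between the OCR output and the ground truth, divided by the ground-truth length. The record does not state which implementation or text normalization was used for validation, so the function names and the handling of empty references below are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length (assumed definition)."""
    if not reference:
        # Assumption: an empty reference scores 0.0 only if the hypothesis is empty too.
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)
```

Under this definition, a sample with CER = 0.0 is one whose recognized text matches the ground truth character for character.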

The dataset is split 80/10/10 (train/validation/test) with stratification by writer, source, and language to ensure balanced representation across all splits. It supports multiple downstream tasks including optical character recognition (OCR), visual question answering (VQA), writer identification, style transfer, and multimodal document understanding.
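The stratified split described above can be sketched as follows. This is not the author's splitting code: the metadata field names (`writer_id`, `source`, `language`), the seed, and the rounding policy for small groups are all hypothetical. The idea is to group samples by the combination of writer, source, and language, then cut each group 80/10/10.

```python
import random
from collections import defaultdict


def stratified_split(records, keys=("writer_id", "source", "language"),
                     fractions=(0.8, 0.1, 0.1), seed=0):
    """Split records into train/validation/test within each stratum.

    `keys` are hypothetical metadata field names; the last fraction is
    implied, since the test split receives whatever remains.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[tuple(record[k] for k in keys)].append(record)

    splits = {"train": [], "validation": [], "test": []}
    for key in sorted(strata):  # sorted for reproducibility
        group = strata[key]
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_val = min(round(len(group) * fractions[1]), len(group) - n_train)
        splits["train"].extend(group[:n_train])
        splits["validation"].extend(group[n_train:n_train + n_val])
        splits["test"].extend(group[n_train + n_val:])  # remainder
    return splits
```

With very small strata the rounding pushes most samples into the training split, so a real pipeline would need a policy for groups too small to divide three ways.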

The dataset is released under CC BY 4.0.

The dataset is also available on Hugging Face: https://huggingface.co/datasets/to-be/OpenHand-Synth

Files (8.6 GB)

Name	MD5	Size
beerten2025_openhand-synth.pdf	8ebe3e737100c6add8ecaad980a4720c	3.1 MB
hf-dataset.zip	121480ffda7e4282f51e02601cf6c99f	8.5 GB
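The MD5 checksums listed for the two files can be used to check a download for corruption. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify_download` are illustrative, and it assumes the files keep their listed names after download.

```python
import hashlib
from pathlib import Path

# Checksums as listed in the file table of this record.
EXPECTED_MD5 = {
    "beerten2025_openhand-synth.pdf": "8ebe3e737100c6add8ecaad980a4720c",
    "hf-dataset.zip": "121480ffda7e4282f51e02601cf6c99f",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so the 8.5 GB archive never sits in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the checksum listed for its name."""
    path = Path(path)
    return md5_of_file(path) == expected[path.name]
```

For example, `verify_download("hf-dataset.zip")` returns `True` only if the local archive hashes to the listed value; MD5 here guards against transfer errors, not tampering.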

