Published February 24, 2026 | Version v1
Dataset Open

OpenHand-Synth: A Large-Scale Synthetic Handwriting Dataset for Multimodal Language Models

Authors/Creators

Description

OpenHand-Synth is a large-scale synthetic handwriting dataset containing 68,077 high-resolution images of handwritten text across 15 languages (English, Dutch, French, German, Spanish, French, Italian, Portuguese, Danish, Swedish, Norwegian Bokmål, Romanian, Indonesian, Malay, and Tagalog). The dataset features 220 distinct writer styles, variable ink colors (black, blue, green, red), and realistic noise augmentations simulating real-world document conditions.

Images are generated using a neural handwriting synthesis model and paired with rich metadata including ground truth text, writer style ID, neatness score, ink color, OCR validation results, Character Error Rate (CER), and Jaro-Winkler similarity. All images were quality-validated using the Qwen2.5-VL-72B-Instruct vision-language model. The final dataset achieves a mean CER of ~0.03, with approximately 54% of samples reaching perfect recognition (CER = 0.0). Text content is drawn from three categories: factual prose (Simple English Wikipedia), conversational sentences (Tatoeba), and structured data (locale-aware dates, names, and numbers via Faker).
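For readers who want to recompute or sanity-check the CER field, the sketch below uses the common definition: character-level edit distance divided by the length of the ground truth text. The record does not state which normalization (case folding, whitespace handling) was applied before scoring, so the function names and the empty-reference convention here are assumptions, not the dataset's actual validation code.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions
    needed to turn `ref` into `hyp` (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance normalized by reference length.

    Assumed convention: an empty reference scores 0.0 if the hypothesis is
    also empty, otherwise 1.0.
    """
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One substituted character out of five.
    print(cer("hello", "hallo"))
```

Under this definition a sample with CER = 0.0 is one whose OCR output matches the ground truth character for character.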

The dataset is split 80/10/10 (train/validation/test) with stratification by writer, source, and language to ensure balanced representation across all splits. It supports multiple downstream tasks including optical character recognition (OCR), visual question answering (VQA), writer identification, style transfer, and multimodal document understanding.
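One way to picture a stratified 80/10/10 split is to group samples by the (writer, source, language) combination and divide each group separately. The sketch below is illustrative only: the field names `writer_id`, `source`, and `language`, the seed, and the rounding rule are hypothetical and are not taken from the dataset's own tooling.

```python
import random
from collections import defaultdict


def stratified_split(records, keys=("writer_id", "source", "language"),
                     ratios=(0.8, 0.1, 0.1), seed=0):
    """Split `records` (a list of dicts) into train/validation/test so that
    each stratum, defined by the values of `keys`, is divided by `ratios`."""
    rng = random.Random(seed)

    # Group records by their stratum key.
    strata = defaultdict(list)
    for rec in records:
        strata[tuple(rec[k] for k in keys)].append(rec)

    splits = {"train": [], "validation": [], "test": []}
    for key in sorted(strata):  # sorted for reproducibility
        group = strata[key]
        rng.shuffle(group)
        n = len(group)
        n_train = min(n, round(n * ratios[0]))
        n_val = min(n - n_train, round(n * ratios[1]))
        splits["train"].extend(group[:n_train])
        splits["validation"].extend(group[n_train:n_train + n_val])
        splits["test"].extend(group[n_train + n_val:])  # remainder
    return splits


if __name__ == "__main__":
    # Toy data: 2 writers x 2 languages x 1 source, 10 samples per stratum.
    data = [{"writer_id": w, "source": "tatoeba", "language": lang, "idx": i}
            for w in (1, 2) for lang in ("en", "nl") for i in range(10)]
    parts = stratified_split(data)
    print({name: len(items) for name, items in parts.items()})
```

Very small strata cannot be divided exactly 80/10/10, so a real pipeline would need an explicit policy for them. In this sketch any remainder simply goes to the test split.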

The dataset is released under CC BY 4.0.

The dataset is also available on Hugging Face: https://huggingface.co/datasets/to-be/OpenHand-Synth

Files

beerten2025_openhand-synth.pdf

Files (8.6 GB)

Checksum                              Size
md5:8ebe3e737100c6add8ecaad980a4720c  3.1 MB
md5:121480ffda7e4282f51e02601cf6c99f  8.5 GB
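After downloading, a file can be checked against the MD5 checksums listed above. The sketch below uses Python's standard `hashlib` and reads the file in chunks, so the 8.5 GB archive never has to fit in memory. The helper names and the example path are hypothetical; the file listing does not give the archive's file name.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Usage (hypothetical local path):
# matches_checksum("downloads/archive_file", "md5:121480ffda7e4282f51e02601cf6c99f")
```

A mismatch usually means a truncated or corrupted download, in which case the file should be fetched again.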