DigitConfuse-23k: A Synthetic Dataset of Digit Confusion Patterns
Description
DigitConfuse-23k is a synthetic dataset of 23,000 images of digit pairs designed to capture visual anomalies and confusion cases commonly encountered in OCR, CAPTCHA recognition, optical illusions, and human digit interpretation tasks.
Each image contains two-digit numbers rendered in the Humor-Sans font (font_size=32, cell_w=60, cell_h=40). Each confusion category contains roughly 1,000 images.
🔢 Categories of Digit Anomalies
🔸 Digit shape confusion (similar glyphs) → 11 ↔ 17, 21 ↔ 27, 71 ↔ 77
🔄 Mirror / rotation confusion → 69 ↔ 96, 68 ↔ 86, 89 ↔ 98, 26 ↔ 62
🎯 One-pixel stroke differences → 33 ↔ 38, 35 ↔ 36, 53 ↔ 58, 39 ↔ 89
🌀 Closed vs. open loop confusion → 38 ↔ 88, 98 ↔ 99, 18 ↔ 19, 56 ↔ 58, 28 ↔ 88
➿ Nearly identical when repeated → 88 ↔ 89, 11 ↔ 12, 55 ↔ 56
👀 Human OCR-like errors (CAPTCHA/OCR cases) → 47 ↔ 17, 57 ↔ 37, 12 ↔ 72, 14 ↔ 74
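The category list above can be expressed programmatically, e.g. for sampling or filtering by category. A minimal sketch follows; the pairs are transcribed verbatim from the list, but the dictionary keys are illustrative names of our own, not official labels from the dataset.

```python
# Confusion pairs transcribed from the category list above.
# Category keys are illustrative names, not official dataset labels.
CONFUSION_PAIRS = {
    "shape": [(11, 17), (21, 27), (71, 77)],
    "mirror_rotation": [(69, 96), (68, 86), (89, 98), (26, 62)],
    "one_pixel_stroke": [(33, 38), (35, 36), (53, 58), (39, 89)],
    "loop_closure": [(38, 88), (98, 99), (18, 19), (56, 58), (28, 88)],
    "repeated": [(88, 89), (11, 12), (55, 56)],
    "ocr_like": [(47, 17), (57, 37), (12, 72), (14, 74)],
}

total = sum(len(pairs) for pairs in CONFUSION_PAIRS.values())
print(total)  # 23, matching the "23 confusion pairs" figure below
```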
🎯 Applications
🧪 Benchmarking OCR systems
🛡 Studying digit recognition robustness
🔑 Training models for noisy / CAPTCHA-like digits
🚨 Anomaly detection in digit datasets
⚙️ Technical Details
📂 Total images: 23,000
📑 Categories: 23 confusion pairs
✍️ Font: Humor-Sans.ttf
🔠 Font size: 32
📏 Image cell size: 60 × 40 pixels; image resolution: 2400 × 1000 pixels
👉 This dataset provides a controlled testbed for studying digit misclassification under visually ambiguous conditions.
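The stated sizes imply a grid layout: a 2400 × 1000 sheet tiled with 60 × 40 cells gives 40 columns × 25 rows = 1,000 cells, which lines up with the ~1,000 images per category. The sketch below maps a grid cell to its pixel bounding box; note the tiling is our inference from the stated dimensions, not a documented layout.

```python
# Implied grid geometry (inferred from the stated sizes, not documented):
# a 2400x1000 sheet tiled with 60x40 cells.
IMG_W, IMG_H = 2400, 1000
CELL_W, CELL_H = 60, 40

COLS = IMG_W // CELL_W  # 40
ROWS = IMG_H // CELL_H  # 25

def cell_to_pixels(row: int, col: int) -> tuple:
    """Return the (left, top, right, bottom) pixel box of grid cell (row, col)."""
    left, top = col * CELL_W, row * CELL_H
    return (left, top, left + CELL_W, top + CELL_H)

print(COLS * ROWS)           # 1000 cells per sheet
print(cell_to_pixels(0, 1))  # (60, 0, 120, 40)
```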
📦 How to Use
The dataset consists of three files: a zip archive containing all the images, and two metadata files.
1. JSONL format (VQA-style for VLM testing)
Each entry includes:
🖼 image → file path to the digit image
❓ question → natural language query
✅ answer → ground truth numbers
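A minimal loader for the JSONL file can look like the sketch below. The field names (`image`, `question`, `answer`) follow the card's description; the actual file may differ, so adjust accordingly.

```python
import json

def load_vqa_entries(path: str) -> list:
    """Load VQA-style entries, one JSON object per line.

    Field names ("image", "question", "answer") follow the dataset card's
    description and may need adjusting to the real file.
    """
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                entries.append(json.loads(line))
    return entries

# Each line is expected to be one JSON object, e.g.:
# {"image": "images/pair_0001.png", "question": "...", "answer": "69"}
```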
2. CSV format (digit confusion localization)
The .csv file provides metadata about anomaly location:
🖼 image → file path
📌 location → anomaly position (row, col)
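Reading the localization metadata is a one-liner with the standard library. In this sketch the column names (`image`, `location`) follow the card's description, and the location is kept as the raw string; the exact `(row, col)` encoding in the real CSV is an assumption and may need parsing differently.

```python
import csv

def load_locations(path: str) -> list:
    """Read (image, location) rows from the localization CSV.

    Column names ("image", "location") follow the dataset card; the
    location value is returned as the raw string, since its exact
    "(row, col)" encoding is assumed rather than documented.
    """
    with open(path, newline="", encoding="utf-8") as f:
        return [(row["image"], row["location"]) for row in csv.DictReader(f)]
```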
🚀 Suggested Use Cases
🤖 VLM evaluation → Test Qwen-VL, InternVL, LLaVA on fine-grained OCR tasks
📊 OCR benchmarking → Compare CNN-based OCR vs. multimodal LLMs
🔄 Data augmentation research → Train models to handle ambiguity
🕵️ Anomaly detection → Use confusion pairs as “hard negatives” for OCR
🧪 Real-World Testing with Ovis 2.5-9B (Latest Release)
We evaluated a subset of images using Ovis 2.5-9B (released Aug 2025).
🖼 Native-resolution ViT (NaViT) → preserves fine details for loop/stroke differences
🔎 Reflective inference mode → improves reasoning under ambiguous digit confusions
🏆 Benchmark leader → achieves a 78.3 average score on OpenCompass (best among <40B-parameter open-source models)
📌 Observation: Ovis 2.5-9B performed robustly across one-pixel stroke, mirror/rotation, and loop-closure confusions, demonstrating this dataset's value for fine-grained OCR evaluation with VLMs.
In the same way, any VLM can be tested for its fine-grained classification of confusable digits.
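A simple exact-match accuracy harness for such a test might look like the sketch below. The `ask_model` callable is a placeholder you supply (any VLM wrapper that takes an image path and a question and returns a string); the entry field names follow the JSONL description above.

```python
def evaluate(entries: list, ask_model) -> float:
    """Score exact-match accuracy of a VLM on VQA-style entries.

    `ask_model(image, question)` is a user-supplied placeholder wrapping
    any VLM; entry field names follow the dataset card's JSONL schema.
    """
    if not entries:
        return 0.0
    correct = 0
    for entry in entries:
        prediction = ask_model(entry["image"], entry["question"])
        if prediction.strip() == str(entry["answer"]).strip():
            correct += 1
    return correct / len(entries)

# Usage with a stub model that always answers "69":
entries = [
    {"image": "a.png", "question": "What digits do you see?", "answer": "69"},
    {"image": "b.png", "question": "What digits do you see?", "answer": "96"},
]
print(evaluate(entries, lambda img, q: "69"))  # 0.5
```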
Files
merged_puzzles.csv
Additional details
Dates
- Submitted: 2025-08-22 (First Version, Two Digits)
References
- @article{lu2025ovis25technicalreport, title={Ovis2.5 Technical Report}, author={Shiyin Lu and Yang Li and Yu Xia and Yuwei Hu and Shanshan Zhao and Yanqing Ma and Zhichao Wei and Yinglun Li and Lunhao Duan and Jianshan Zhao and Yuxuan Han and Haijun Li and Wanying Chen and Junke Tang and Chengkun Hou and Zhixing Du and Tianli Zhou and Wenjie Zhang and Huping Ding and Jiahe Li and Wen Li and Gui Hu and Yiliang Gu and Siran Yang and Jiamang Wang and Hailong Sun and Yibo Wang and Hui Sun and Jinlong Huang and Yuping He and Shengze Shi and Weihong Zhang and Guodong Zheng and Junpeng Jiang and Sensen Gao and Yi-Feng Wu and Sijia Chen and Yuhui Chen and Qing-Guo Chen and Zhao Xu and Weihua Luo and Kaifu Zhang}, year={2025}, journal={arXiv:2508.11737}}
- @article{lu2024ovis, title={Ovis: Structural Embedding Alignment for Multimodal Large Language Model}, author={Shiyin Lu and Yang Li and Qing-Guo Chen and Zhao Xu and Weihua Luo and Kaifu Zhang and Han-Jia Ye}, year={2024}, journal={arXiv:2405.20797}}