Published September 4, 2025 | Version v1
Dataset Open

CM1 Dataset

  • 1. TU Dortmund University
  • 2. Osnabrück University

Description

This is the CM1 dataset, designed for evaluating information extraction from historical documents with Large Vision Language Models.

Paper: https://arxiv.org/abs/2505.04214

GitHub: https://github.com/fabiwo6/cm1

Abstract

The automatic extraction of key-value information from handwritten documents is a key challenge in document analysis. Reliable extraction is a prerequisite for the mass digitization efforts of many archives. Large Vision Language Models (LVLMs) are a promising technology to tackle this problem, especially in scenarios where little annotated training data is available. In this work, we present a novel dataset specifically designed to evaluate the few-shot capabilities of LVLMs. The CM1 documents are a historic collection of forms with handwritten entries created in Europe to administer the Care and Maintenance program after World War Two. The dataset establishes three benchmarks on extracting name and birthdate information and, furthermore, considers different training set sizes. We provide baseline results for two different LVLMs and compare their performance to an established full-page extraction model. While the traditional full-page model achieves highly competitive performance, our experiments show that, when only a few training samples are available, the considered LVLMs benefit from their size and heavy pretraining and outperform the classical approach.

Annotations

  • cm1_cover_*.json:
    "document_id": [{"Name": "last_name_person_1",
                     "Vorname": "first_name_person_1",
                     "Geb-Dat": "birth_date_person_1"},
                    {"Name": "last_name_person_2",
                     "Vorname": "first_name_person_2",
                     "Geb-Dat": "birth_date_person_2"}],
  • cm1_namedate_*.txt:
    cluster_id/document_id.jpg first_name last_name birth_date
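
The two annotation layouts above could be read along the following lines. This is a minimal sketch, not the authors' loader. It assumes that each cm1_cover_*.json file is a single JSON object mapping document IDs to lists of person entries, and that each line of a cm1_namedate_*.txt file is whitespace-separated, with the image path first and the birth date last. The names Person, NameDateRecord, parse_cover_annotations and parse_namedate_line are hypothetical.

```python
import json
from typing import Dict, List, NamedTuple


class Person(NamedTuple):
    last_name: str   # JSON key "Name"
    first_name: str  # JSON key "Vorname"
    birth_date: str  # JSON key "Geb-Dat"


class NameDateRecord(NamedTuple):
    cluster_id: str
    document_id: str
    first_name: str
    last_name: str
    birth_date: str


def parse_cover_annotations(text: str) -> Dict[str, List[Person]]:
    """Parse the text of a cm1_cover_*.json file.

    Assumption: the top level is an object keyed by document ID.
    """
    raw = json.loads(text)
    return {
        doc_id: [Person(p["Name"], p["Vorname"], p["Geb-Dat"]) for p in persons]
        for doc_id, persons in raw.items()
    }


def parse_namedate_line(line: str) -> NameDateRecord:
    """Parse one line of a cm1_namedate_*.txt file.

    Assumption: fields are whitespace-separated; if a line has more than
    four tokens, the extra tokens are treated as part of the last name.
    """
    path, first_name, *last_parts, birth_date = line.split()
    cluster_id, _, filename = path.partition("/")
    document_id = filename.rsplit(".", 1)[0]  # drop the .jpg extension
    return NameDateRecord(
        cluster_id, document_id, first_name, " ".join(last_parts), birth_date
    )
```

The field values are kept as raw strings, since the date format used in the files is not specified here.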

 

Files

cm1_cover.zip

Files (44.1 GB)

Checksum                                Size
md5:b367a8f7f3cebae0bd63443f1ad63ace    23.4 GB
md5:d8914a01cf127cb36e7de9fa0d0fcd78    1.1 MB
md5:7a2dc1f58708aacde5a43332e0eba3d7    13.8 MB
md5:84971b8e37f0e159a141888404b30f2a    578.7 kB
md5:ac6a55f3292eb34e25d908391518fe39    20.7 GB
md5:f85cf5a413bb4a22d518aeb258e62c8c    423.0 kB
md5:cd87079242169aa3e050db42bfdf015c    3.8 MB
md5:1b3e558f8b8cca8635469de37d5d1170    211.4 kB
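
Since two of the files exceed 20 GB, it may be worth checking a download against its listed checksum. The following is a rough sketch using Python's standard hashlib module. The helper names md5_of_file and matches_listed_checksum are hypothetical, and the file name passed in is up to the user, since the listing above does not pair checksums with file names.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path: str, listed: str) -> bool:
    """Compare a file against a listed value such as 'md5:b367a8f7...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, matches_listed_checksum("cm1_cover.zip", "md5:...") would be called with the checksum of whichever entry corresponds to that archive.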

Additional details

Related works

Is described by
Publication: arXiv:2505.04214 (arXiv)
