Published January 20, 2025 | Version v3
Dataset Open

IGN Synthetic Train Data for ICDAR'25 MapText Competition

  • 1. Université Gustave Eiffel
  • 2. Institut national de l'information géographique et forestière
  • 3. EPITA

Description

Data set of 2Kx2K synthetic image tiles for the ICDAR'25 Competition on Historical Map Text Detection, Recognition, and Linking.

Annotations and images follow the format described at the competition website and can be evaluated using the official evaluation repository script.

This synthetic dataset is supplementary to the dataset of real tiles IGN Train and Validation Data for ICDAR'25 MapText Competition.

This synthetic training set mimics the style (background and fonts) of the original maps, and leverages the actual, modern land use database from the French government to generate realistic geometries and names from similar geographic areas (both in terms of vocabulary and urban density). This synthetic data is meant to be used as a supplementary training set, and is organized as such.

We also provide a sample for fast download and code testing, containing only the images and ground truth for the first 10 images of the dataset.

                 Synthetic Train               Sample (in sample.zip)
Annotations      ign25synth_train.json         (same)
Images           synthtrain.zip                (same)
Files            ign25synth/train/*.jpg        (same)
Tiles            18,073                        10
Map Sheets       a dozen different styles      1 style
Words            1,622,398                     114
Label Groups     1,489,072                     91
Illegible Words  33                            0
Truncated Words  79,972                        2
Valid Words      1,542,426                     112
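
The counts in the table can be recomputed from the annotation file. The sketch below is a minimal illustration, not the official tooling. It assumes the competition format is a JSON list of tiles, each with an "image" path and a "groups" list, where every group is a list of words carrying "vertices", "text", "illegible" and "truncated" fields. These field names, and the definition of a valid word as one that is neither illegible nor truncated, are assumptions to check against the competition website.

```python
import json
from collections import Counter


def load_annotations(path):
    """Load an annotation file (assumed to be a JSON list of tiles)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def annotation_stats(tiles):
    """Count tiles, label groups and word flags.

    Assumed layout: each tile has a "groups" list, each group is a list of
    word dicts with boolean "illegible" and "truncated" fields.
    """
    stats = Counter()
    for tile in tiles:
        stats["tiles"] += 1
        for group in tile.get("groups", []):
            stats["label_groups"] += 1
            for word in group:
                illegible = bool(word.get("illegible", False))
                truncated = bool(word.get("truncated", False))
                stats["words"] += 1
                stats["illegible_words"] += illegible
                stats["truncated_words"] += truncated
                # Assumption: "valid" means neither illegible nor truncated.
                stats["valid_words"] += not (illegible or truncated)
    return dict(stats)


# Tiny in-memory example with a made-up tile (not taken from the dataset).
demo_tiles = [
    {
        "image": "ign25synth/train/example.jpg",
        "groups": [
            [
                {"vertices": [[0, 0], [40, 0], [40, 10], [0, 10]],
                 "text": "Rue", "illegible": False, "truncated": False},
                {"vertices": [[45, 0], [90, 0], [90, 10], [45, 10]],
                 "text": "Neuve", "illegible": False, "truncated": True},
            ],
            [
                {"vertices": [[0, 50], [60, 50], [60, 65], [0, 65]],
                 "text": "Moulin", "illegible": False, "truncated": False},
            ],
        ],
    }
]
demo_stats = annotation_stats(demo_tiles)
```

On the real data, `annotation_stats(load_annotations("ign25synth_train.json"))` would be the corresponding call, assuming the file has been extracted to the working directory.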

 All data used to generate this dataset is public domain.

Finally, a style_sample.zip file provides some examples for each rendering style. The images it contains are extracted from the main dataset and should not be added to it.

 

ℹ️ Version 3 includes several improvements that affect all ZIP files:

  1. Some text regions were rendered but not added to the ground truth; this is now fixed.
  2. Truncation detection was improved, but this should not change the actual content.
  3. Some images were generated with wrong shapes, leading to them being discarded in the final dataset. They are now exported correctly and included in the new version. As a result, the new dataset is larger.
  4. Finally, we added some sample images for each style in a new style_sample.zip file.

ℹ️ Version 2 fixed the ign25synth_train.json file by removing very small regions (<1 square pixel) to mitigate evaluation issues. This reduced the total number of words and groups, but the number of valid words is unchanged from version 1. The sample in sample.zip and the images in ign25synth_train.zip were not changed and are identical to version 1.
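
To illustrate the kind of filtering described for version 2, the sketch below drops word regions whose polygon area is under 1 square pixel, using the shoelace formula. It is not the procedure actually used to produce the dataset. It assumes the same hypothetical layout as the competition format (tiles with "groups", words with "vertices" given as [x, y] pairs), and the choice to drop a group once all of its words are removed is also an assumption.

```python
def polygon_area(vertices):
    """Area of a simple polygon from its [x, y] vertices (shoelace formula)."""
    twice_area = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def drop_tiny_regions(tiles, min_area=1.0):
    """Return a copy of the tiles without words smaller than min_area.

    Assumption: a group left with no words is dropped as well.
    """
    cleaned = []
    for tile in tiles:
        groups = []
        for group in tile.get("groups", []):
            kept = [w for w in group if polygon_area(w["vertices"]) >= min_area]
            if kept:
                groups.append(kept)
        cleaned.append({**tile, "groups": groups})
    return cleaned


# Made-up example: one normal word and one degenerate region.
example_tiles = [
    {
        "image": "ign25synth/train/example.jpg",
        "groups": [
            [{"vertices": [[0, 0], [2, 0], [2, 2], [0, 2]], "text": "Bois"}],
            [{"vertices": [[0, 0], [0.5, 0], [0.5, 0.5], [0, 0.5]], "text": "x"}],
        ],
    }
]
filtered_tiles = drop_tiny_regions(example_tiles)
```

Removing a degenerate word lowers the word count, and removing a group that becomes empty lowers the group count, which is consistent with the smaller totals reported for version 2.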

Files

ign25synth_train.json.zip

Files (3.4 GB)

Checksum                              Size
md5:13aff0c1a1f804bca19fb29975333c57  141.7 MB
md5:11ae81621f3941ba069452d4f1a089c8  3.2 GB
md5:668da8c9f77ea6f6b5b0c3a5077aa02a  653.9 kB
md5:dcaeeb3db02f70a3ec8d17932935c6f1  86.2 MB
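
The MD5 checksums listed above can be used to check a download for corruption. The following is a minimal sketch using Python's standard hashlib module. The helper names and the example path are hypothetical, and the listing here does not say which archive each checksum belongs to, so the pairing of file and checksum has to be taken from the record page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Hypothetical usage, with a placeholder path for a downloaded archive:
# ok = matches_checksum("path/to/archive.zip",
#                       "md5:13aff0c1a1f804bca19fb29975333c57")
```

Reading in chunks keeps memory use flat, which matters for the multi-gigabyte image archive.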

Additional details

Related works

Is described by
Publication: https://rrc.cvc.uab.es/?ch=32&com=tasks (URL)
Is supplement to
Dataset: 10.5281/zenodo.14392548 (DOI)
Is supplemented by
Software: https://github.com/icdar-maptext/evaluation (URL)