IGN Synthetic Train Data for ICDAR'25 MapText Competition

Tual, Solenn; Abadie, Nathalie; Duménieu, Bertrand; Chazalon, Joseph; Perret, Julien

doi:10.5281/zenodo.14704475

Published January 20, 2025 | Version v3

Dataset Open

1. Université Gustave Eiffel
2. Institut national de l'information géographique et forestière
3. EPITA

Data set of 2Kx2K synthetic image tiles for the ICDAR'25 Competition on Historical Map Text Detection, Recognition, and Linking.

Annotations and images follow the format described at the competition website and can be evaluated using the official evaluation repository script.

This synthetic dataset supplements the dataset of real tiles, "IGN Train and Validation Data for ICDAR'25 MapText Competition" (doi:10.5281/zenodo.14392548).

This synthetic training set mimics the style (background and fonts) of the original maps, and leverages the actual, modern land use database from the French government to generate realistic geometries and names from similar geographic areas (both in terms of vocabulary and urban density). This synthetic data is meant to be used as a supplementary training set, and is organized as such.

We also provide a sample for fast download and code testing, containing only the images and ground truth for the first 10 images of the dataset.

	Synthetic Train	Sample (in `sample.zip`)
Annotations	`ign25synth_train.json`	(same)
Images	`ign25synth_train.zip`	(same)
Files	`ign25synth/train/*.jpg`	(same)
Tiles	18,073	10
Map Sheets	about a dozen different styles	1 style
Words	1,622,398	114
Label Groups	1,489,072	91
Illegible Words	33	0
Truncated Words	79,972	2
Valid Words	1,542,426	112
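
The counts in the table above can be recomputed from the annotation file. The following is a minimal sketch, not the authors' tooling. It assumes the MapText annotation layout (a JSON list of images, each with an `image` path and a `groups` list, where each group is a list of words carrying `text`, `vertices`, `illegible`, and `truncated` fields); the competition website remains the authoritative description. It also assumes that a word is "valid" when it is neither illegible nor truncated, which may differ from the definition used for the table. The names `load_annotations` and `summarize` are hypothetical.

```python
import json


def load_annotations(path):
    """Read an annotation file such as ign25synth_train.json (path is up to the caller)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize(annotations):
    """Count tiles, label groups, and words in a MapText-style annotation list."""
    stats = {
        "tiles": 0,
        "label_groups": 0,
        "words": 0,
        "illegible": 0,
        "truncated": 0,
        "valid": 0,
    }
    for entry in annotations:
        stats["tiles"] += 1
        for group in entry.get("groups", []):
            stats["label_groups"] += 1
            for word in group:
                illegible = bool(word.get("illegible", False))
                truncated = bool(word.get("truncated", False))
                stats["words"] += 1
                stats["illegible"] += int(illegible)
                stats["truncated"] += int(truncated)
                # Assumption: a valid word is neither illegible nor truncated.
                stats["valid"] += int(not (illegible or truncated))
    return stats
```

Running `summarize(load_annotations("ign25synth_train.json"))` on the extracted annotation file is a quick sanity check after download; the sample archive is a cheaper first target.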

All data used to generate this dataset is public domain.

Finally, a style_sample.zip file provides some examples for each rendering style. The images it contains are extracted from the main dataset and should not be added to it.

ℹ️ This version 3 features some improvements which impact all ZIP files:

Some text regions were rendered but not added to the ground truth; this is now fixed.
Truncation detection was improved, but this should not change the actual content.
Some images were generated with wrong shapes and were therefore discarded from the final dataset. They are now exported correctly and included in the new version. As a result, the new dataset is larger.
Finally, we added some sample images for each style in a new style_sample.zip file.

ℹ️ Version 2 fixed the ign25synth_train.json file by removing very small regions (<1 square pixel) to mitigate evaluation issues. This reduced the total number of words and groups, but the number of valid words stayed the same as in version 1. The sample in sample.zip and the images in ign25synth_train.zip were not changed and are identical to version 1.
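
The kind of filtering described for version 2 can be illustrated with the shoelace formula for polygon area. This is only a sketch under assumptions: the authors' actual filtering code is not shown here, each word is assumed to carry a `vertices` list of `[x, y]` pairs, and the helper names `polygon_area` and `drop_tiny_regions` are hypothetical.

```python
def polygon_area(vertices):
    """Absolute area of a simple polygon given as [[x, y], ...] (shoelace formula)."""
    n = len(vertices)
    twice_area = 0.0
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the polygon
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0


def drop_tiny_regions(groups, min_area=1.0):
    """Remove words whose polygon area is below min_area, then drop emptied groups."""
    kept = []
    for group in groups:
        words = [w for w in group if polygon_area(w["vertices"]) >= min_area]
        if words:
            kept.append(words)
    return kept
```

Dropping groups that end up empty matches the note above that both the word count and the group count decreased, although whether the authors handled empty groups this way is an assumption.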

Files

Files (3.4 GB)

Name	MD5	Size
ign25synth_train.json.zip	13aff0c1a1f804bca19fb29975333c57	141.7 MB
ign25synth_train.zip	11ae81621f3941ba069452d4f1a089c8	3.2 GB
sample.zip	668da8c9f77ea6f6b5b0c3a5077aa02a	653.9 kB
style_sample.zip	dcaeeb3db02f70a3ec8d17932935c6f1	86.2 MB
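
Since the record lists an MD5 checksum for each archive, a download can be checked before extraction. The sketch below uses only Python's standard `hashlib`; the checksums are copied from the file list above, while the helper names `md5_of_file` and `verify_download` and the assumption that the archives sit in a local directory are illustrative.

```python
import hashlib
import os

# MD5 checksums as listed on the record page for version 3.
EXPECTED_MD5 = {
    "ign25synth_train.json.zip": "13aff0c1a1f804bca19fb29975333c57",
    "ign25synth_train.zip": "11ae81621f3941ba069452d4f1a089c8",
    "sample.zip": "668da8c9f77ea6f6b5b0c3a5077aa02a",
    "style_sample.zip": "dcaeeb3db02f70a3ec8d17932935c6f1",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(directory, name):
    """Return True if the file `name` in `directory` matches its listed checksum."""
    expected = EXPECTED_MD5[name]
    return md5_of_file(os.path.join(directory, name)) == expected
```

For example, `verify_download(".", "sample.zip")` should return `True` for an intact copy of the sample archive in the current directory.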

Additional details

Is described by: Publication: https://rrc.cvc.uab.es/?ch=32&com=tasks (URL)
Is supplement to: Dataset: 10.5281/zenodo.14392548 (DOI)
Is supplemented by: Software: https://github.com/icdar-maptext/evaluation (URL)
