Paris and Jerusalem Maps Text Dataset
Description
This dataset contains 82 annotated map samples from diverse historical city maps of Jerusalem and Paris, suitable for map text detection, recognition, and sequencing.
Organization of the data
The data in maptext_format.json is organized in the same way as the General Data from the David Rumsey Collection in the ICDAR 2024 Competition on Historical Map Text Detection, Recognition, and Linking [1].
The data is structured by image, each with a list of sequences (groups). The boolean attributes illegible and truncated provide additional insight into data quality.
Our interpretation of the truncated and illegible tags is the following:
- truncated refers to the case where part of a word is located outside the image crop, and is thus missing. In that case, the transcription stops at the image border, focusing only on the visible part of the word.
- illegible is a subjective indication of (un)certainty in the transcription provided. Whenever possible, a best-guess transcription is provided. Otherwise, the illegible letters are filled with blank spaces.
The text corresponds to the diplomatic transcription, i.e. as it appears on the document. Texts are transcribed using Latin characters, preserving case, diacritics (e.g. ö, ḡ) and digraphs (e.g. Œ).
Each word polygon consists of an even number of vertices arranged in clockwise order, starting from the top-left point. The first n/2 vertices represent the upper boundary line following the reading direction, while the second half represents the lower boundary line in the reverse direction. An illustration is provided in example_annotation.png.
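The vertex convention described above can be unpacked with a few lines of Python. This is a minimal sketch, not part of the dataset tooling: the function names `split_boundaries` and `signed_area` are hypothetical, and the orientation check assumes image coordinates with the y axis pointing down.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def split_boundaries(vertices: Sequence[Sequence[float]]) -> Tuple[List[Point], List[Point]]:
    """Split a word polygon into its upper and lower boundary lines.

    Both returned lines follow the reading direction.
    """
    n = len(vertices)
    if n < 4 or n % 2 != 0:
        raise ValueError("expected an even number of vertices (at least 4)")
    half = n // 2
    upper = [(p[0], p[1]) for p in vertices[:half]]
    # The second half is stored in the reverse direction, so flip it back.
    lower = [(p[0], p[1]) for p in reversed(vertices[half:])]
    return upper, lower


def signed_area(vertices: Sequence[Sequence[float]]) -> float:
    """Shoelace formula.

    Assumption: image coordinates (y pointing down). Under that assumption,
    a polygon that appears clockwise on screen has a positive signed area.
    """
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return total / 2.0


# Illustrative rectangle: top-left, top-right, bottom-right, bottom-left.
box = [[0, 0], [10, 0], [10, 5], [0, 5]]
upper, lower = split_boundaries(box)
```

For the rectangle above, `upper` runs from the top-left to the top-right corner, and `lower` runs from the bottom-left to the bottom-right corner, so the two lines can be paired point by point.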
Example structure
[
  {
    "image": "map_image_1.jpg",
    # Here groups are what we call sequences.
    "groups": [
      {
        "vertices": [[x1, y1], [x2, y2], ...],
        "text": "Champs",
        "illegible": "false",
        "truncated": "false"
      },
      {
        "vertices": [[x1, y1], [x2, y2], ...],
        "text": "Elysées",
        "illegible": "false",
        "truncated": "false"
      }
    ]
  }
]
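The structure above can be read with the standard json module. The following is a minimal sketch under stated assumptions: the helper names (`as_bool`, `iter_sequences`, `sequence_text`, `load_annotations`) are hypothetical, each group is assumed to be a list of word dictionaries as in the ICDAR MapText format [1] (a bare dictionary is treated as a single-word sequence), and the flags are accepted either as JSON booleans or as the strings "true"/"false".

```python
import json
from typing import Any, Dict, Iterator, List


def as_bool(value: Any) -> bool:
    """Accept real booleans as well as the strings "true"/"false"."""
    if isinstance(value, str):
        return value.strip().lower() == "true"
    return bool(value)


def iter_sequences(entry: Dict[str, Any]) -> Iterator[List[Dict[str, Any]]]:
    """Yield each sequence of an image entry as a list of word dicts."""
    for group in entry.get("groups", []):
        # Assumption: a group is a list of words; a bare dict is one word.
        yield [group] if isinstance(group, dict) else list(group)


def sequence_text(words: List[Dict[str, Any]], skip_illegible: bool = False) -> str:
    """Join the word transcriptions of a sequence in reading order."""
    kept = [
        w["text"]
        for w in words
        if not (skip_illegible and as_bool(w.get("illegible", False)))
    ]
    return " ".join(kept)


def load_annotations(path: str) -> List[Dict[str, Any]]:
    """Load the annotation file, e.g. maptext_format.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# In-memory example mirroring the structure shown above.
entry = {
    "image": "map_image_1.jpg",
    "groups": [[
        {"vertices": [], "text": "Champs", "illegible": "false", "truncated": "false"},
        {"vertices": [], "text": "Elysées", "illegible": False, "truncated": False},
    ]],
}
sequences = list(iter_sequences(entry))
```

With the real file, the same loop would run over `load_annotations("maptext_format.json")`, calling `iter_sequences` on each image entry.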
The file pandas_format.pkl contains the same data. It is only provided for convenience.
Image documents
The maps of Paris were taken from the Historical City Maps Semantic Segmentation Dataset [2]. The original documents were digitized by the Bibliothèque nationale de France (BnF), and the Bibliothèque Historique de la Ville de Paris (BHVP).
The maps of Jerusalem were curated from the collections of the National Library of Israel (NLI), and Wikimedia Commons.
Descriptive statistics
Number of words: 7528
Number of single-word sequences: 1757
Number of multi-word sequences: 1969
Statistics of multi-word sequences length:
mean: 2.93 words
std: 1.25 words
min: 2.00 words
med: 3.00 words
max: 15.00 words
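Statistics of this kind can be recomputed from the list of sequence lengths (number of words per sequence). The sketch below is illustrative only: the function name `sequence_length_stats` is hypothetical, and it uses the sample standard deviation, which is an assumption since the text does not say which estimator was used.

```python
import statistics
from typing import Dict, Sequence


def sequence_length_stats(lengths: Sequence[int]) -> Dict[str, float]:
    """Summarize sequence lengths; needs at least two multi-word sequences."""
    multi = [n for n in lengths if n >= 2]
    return {
        "words": sum(lengths),
        "single_word_sequences": sum(1 for n in lengths if n == 1),
        "multi_word_sequences": len(multi),
        "mean": statistics.mean(multi),
        # Assumption: sample standard deviation (n - 1 in the denominator).
        "std": statistics.stdev(multi),
        "min": min(multi),
        "med": statistics.median(multi),
        "max": max(multi),
    }


# Toy input, not taken from the dataset.
stats = sequence_length_stats([1, 1, 2, 3, 4])
```

As a consistency check on the reported figures, 1757 + 1969 × 2.93 ≈ 7526, which agrees with the 7528 words once the rounding of the mean is taken into account.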
Future updates
The transcribed text corresponds to the diplomatic transcription, suitable for text recognition tasks. In future updates, we hope to complement it with an additional normalization attribute, which could expand abbreviations (e.g. "bvd." => "boulevard") and normalize transcriptions (e.g. "QVARTER" => "QUARTER").
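Until such an attribute exists, a user-side normalization could be approximated with a lookup table. This is only a sketch of one possible approach, not the authors' planned implementation: the table, the function name `add_normalization`, and the output key are hypothetical, and the two entries are simply the examples quoted above.

```python
from typing import Any, Dict

# Hypothetical lookup table; the entries come from the examples in the text.
NORMALIZATION_TABLE: Dict[str, str] = {
    "bvd.": "boulevard",
    "QVARTER": "QUARTER",
}


def add_normalization(
    word: Dict[str, Any],
    table: Dict[str, str] = NORMALIZATION_TABLE,
) -> Dict[str, Any]:
    """Return a copy of a word annotation with a 'normalization' attribute.

    Words missing from the table keep their diplomatic transcription.
    """
    enriched = dict(word)
    enriched["normalization"] = table.get(word["text"], word["text"])
    return enriched


normalized = add_normalization({"text": "bvd."})
```

The diplomatic transcription in `text` is left untouched, so recognition models can still be trained on what actually appears on the map.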
Use and Citation
For any mention of this dataset, please cite:
@misc{paris_jerusalem_dataset_2025,
author = {Dai, Tianhao and Johnson, Kaede and Petitpierre, R{\'{e}}mi and Vaienti, Beatrice and di Lenardo, Isabella},
title = {{Paris and Jerusalem City Maps Text Dataset}},
year = {2025},
publisher = {Zenodo},
url = {https://doi.org/10.5281/zenodo.14982662}}

@article{recognizing_sequencing_2025,
author = {Zou, Mengjie and Dai, Tianhao and Petitpierre, R{\'{e}}mi and Vaienti, Beatrice and di Lenardo, Isabella},
title = {{Recognizing and Sequencing Multi-word Texts in Maps Using an Attentive Pointer}},
year = {2025}}
Corresponding author
Rémi PETITPIERRE - remi.petitpierre@epfl.ch
Work ethics
The data were annotated by two master's students from EPFL, Switzerland. The students were paid for their work using public funding, and were offered the possibility to be associated with the publication of the data.
License
This project is licensed under the CC BY 4.0 License.
Liability
We do not assume any liability for the use of this dataset.
References
[1] Li Z., Lin Y., Chiang Y.-Y., Weinman J., Tual S., Chazalon J., Perret J., Duménieu B., Abadie N. (2024). ICDAR 2024 competition on historical map text detection, recognition, and linking. In Document Analysis and Recognition - ICDAR 2024: 18th International Conference, Athens, Greece, August 30–September 4, 2024, Proceedings, Part VI. Springer-Verlag, Berlin, Heidelberg, 363–380. https://doi.org/10.1007/978-3-031-70552-6_22
[2] Petitpierre, R. (2021). Historical City Maps Semantic Segmentation Dataset. V1.0. https://doi.org/10.5281/zenodo.5513639