Published March 6, 2025 | Version 1.0.0
Dataset Open

Paris and Jerusalem Maps Text Dataset

Description

This dataset contains 82 annotated map samples from diverse historical city maps of Jerusalem and Paris, suitable for map text detection, recognition, and sequencing.

Organization of the data

The data in maptext_format.json is organized in the same way as the General Data from the David Rumsey Collection in the ICDAR 2024 Competition on Historical Map Text Detection, Recognition, and Linking [1].

The data is structured by image; each image holds a list of sequences (groups), and each sequence holds its words. The boolean attributes illegible and truncated give additional information on data quality.

Our interpretation of the truncated and illegible tags is the following:

  • truncated refers to the case where part of a word is located outside the image crop, and is thus missing. In that case, the transcription stops at the image border and covers only the visible part of the word.
  • illegible is a subjective indication of (un)certainty in the transcription provided. Whenever possible, a best-guess transcription is provided. Otherwise, the illegible letters are replaced with blank spaces.
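As a minimal sketch of how these two flags could be used, the snippet below separates words with a complete, confident transcription from words flagged as truncated or illegible. Setting flagged words aside is only one possible policy, not something the dataset prescribes. The helper names as_bool and split_by_quality are hypothetical, and the sample words are made up for illustration. The helper also accepts the strings "true"/"false", on the assumption that flags may be stored as strings rather than JSON booleans.

```python
def as_bool(value):
    """Accept real booleans as well as the strings "true"/"false" (assumption)."""
    if isinstance(value, str):
        return value.strip().lower() == "true"
    return bool(value)


def split_by_quality(words):
    """Split word annotations into (clean, flagged) lists.

    A word is flagged when it is marked truncated or illegible.
    """
    clean, flagged = [], []
    for word in words:
        is_flagged = as_bool(word.get("illegible", False)) or as_bool(
            word.get("truncated", False)
        )
        (flagged if is_flagged else clean).append(word)
    return clean, flagged


# Made-up sample words, not taken from the dataset.
words = [
    {"text": "Champs", "illegible": False, "truncated": False},
    {"text": "Elys", "illegible": False, "truncated": True},  # cut at the crop border
    {"text": "R e", "illegible": "true", "truncated": "false"},  # blank for an unreadable letter
]

clean, flagged = split_by_quality(words)
print([w["text"] for w in clean])
print([w["text"] for w in flagged])
```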

The text corresponds to the diplomatic transcription, i.e. as it appears on the document. Text is transcribed with all Latin characters, preserving case, diacritics (e.g. ö, ḡ) and digraphs (e.g. Œ).

Each word polygon consists of an even number of vertices, n, arranged in clockwise order starting from the top-left point. The first n/2 vertices trace the upper boundary of the word in the reading direction, while the remaining n/2 vertices trace the lower boundary in the reverse direction. The file example_annotation.png illustrates this convention.
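The vertex convention described above can be sketched in a few lines of Python. The function names split_boundaries and signed_area are hypothetical, and the polygon is a made-up example. The orientation check assumes image coordinates with the y axis pointing down, in which case a polygon that looks clockwise on screen has a positive shoelace area.

```python
def split_boundaries(vertices):
    """Return (upper, lower) boundary lines, both in reading direction."""
    n = len(vertices)
    if n < 4 or n % 2 != 0:
        raise ValueError("expected an even number of vertices (at least 4)")
    upper = vertices[: n // 2]
    # The second half is stored in reverse direction; flip it back.
    lower = vertices[n // 2 :][::-1]
    return upper, lower


def signed_area(vertices):
    """Shoelace formula; positive for clockwise polygons when y points down."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


# Made-up 6-vertex polygon: 3 points on the top edge, 3 on the bottom edge.
polygon = [[0, 0], [5, 0], [10, 0], [10, 4], [5, 4], [0, 4]]

upper, lower = split_boundaries(polygon)
print(upper)  # [[0, 0], [5, 0], [10, 0]]
print(lower)  # [[0, 4], [5, 4], [10, 4]]
print(signed_area(polygon) > 0)  # True: clockwise in image coordinates
```

Pairing upper[i] with lower[i] gives matching points on both boundaries, which can be convenient for rectifying curved text.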

Example structure

[                                                 
  {                                               
    "image": "map_image_1.jpg",                   
                                                  
    # Here groups are what we call sequences.     
    "groups": [                                   
      {                                           
        "vertices": [[x1, y1], [x2, y2], ...],    
        "text": "Champs",                         
        "illegible": "false",                     
        "truncated": "false"                      
      },                                          
      {                                           
        "vertices": [[x1, y1], [x2, y2], ...],    
        "text": "Elysées",                        
        "illegible": "false",                     
        "truncated": "false"                      
      }                                           
    ]                                             
  }                                               
]                                                 
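A minimal sketch of reading this structure in Python follows. It assumes that each entry of "groups" is a list of word dictionaries, as in the ICDAR 2024 format [1], and falls back to treating a bare word dictionary as a one-word sequence. The function name iter_sequences is hypothetical, and the vertex coordinates in the sample are invented for illustration.

```python
import json


def iter_sequences(annotations):
    """Yield (image_name, words) for every sequence in the annotation list."""
    for entry in annotations:
        for group in entry["groups"]:
            # Assumption: a group is a list of word dicts; a bare dict is
            # treated as a sequence containing a single word.
            words = group if isinstance(group, list) else [group]
            yield entry["image"], words


# Inline sample mirroring the example structure (coordinates are made up).
sample = json.loads("""
[
  {
    "image": "map_image_1.jpg",
    "groups": [
      [
        {"vertices": [[0, 0], [10, 0], [10, 5], [0, 5]],
         "text": "Champs", "illegible": false, "truncated": false},
        {"vertices": [[12, 0], [25, 0], [25, 5], [12, 5]],
         "text": "Elysées", "illegible": false, "truncated": false}
      ]
    ]
  }
]
""")

sequences = [
    (image, " ".join(word["text"] for word in words))
    for image, words in iter_sequences(sample)
]
print(sequences)  # [('map_image_1.jpg', 'Champs Elysées')]
```

For the actual file, the sample can be replaced by the result of json.load on maptext_format.json opened with UTF-8 encoding.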

The file pandas_format.pkl contains the same data. It is only provided for convenience.

Image documents

The maps of Paris were taken from the Historical City Maps Semantic Segmentation Dataset [2]. The original documents were digitized by the Bibliothèque nationale de France (BnF), and the Bibliothèque Historique de la Ville de Paris (BHVP). 

The maps of Jerusalem were curated from the collections of the National Library of Israel (NLI), and Wikimedia Commons.

Descriptive statistics

Number of words: 7528
Number of single-word sequences: 1757
Number of multi-word sequences: 1969
Statistics of multi-word sequences length:
    mean: 2.93 words
    std: 1.25 words
    min: 2.00 words
    med: 3.00 words
    max: 15.00 words
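The statistics above could be recomputed from the list of sequence lengths along the following lines. This is a sketch under assumptions: the function name sequence_stats is hypothetical, the toy lengths are invented, and the population standard deviation is used because the source does not say which estimator was applied. The last lines check that the reported counts are consistent with the reported mean.

```python
import statistics


def sequence_stats(lengths):
    """Summarise sequence lengths (number of words per sequence)."""
    multi = [n for n in lengths if n >= 2]
    return {
        "words": sum(lengths),
        "single_word_sequences": sum(1 for n in lengths if n == 1),
        "multi_word_sequences": len(multi),
        "mean": statistics.mean(multi),
        "std": statistics.pstdev(multi),  # assumption: population std
        "min": min(multi),
        "med": statistics.median(multi),
        "max": max(multi),
    }


# Toy input: two single-word sequences and three multi-word sequences.
stats = sequence_stats([1, 1, 2, 3, 4])
print(stats["words"], stats["single_word_sequences"], stats["multi_word_sequences"])

# Consistency check on the reported figures: words in multi-word sequences
# divided by the number of multi-word sequences should match the mean.
reported_mean = (7528 - 1757) / 1969
print(round(reported_mean, 2))  # 2.93
```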

Future updates

The transcribed text corresponds to the diplomatic transcription, which is suitable for text recognition tasks. In future updates, we hope to complement it with an additional normalization attribute, which could expand abbreviations (e.g. "bvd." => "boulevard") and normalize transcriptions (e.g. "QVARTER" => "QUARTER").
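To make the idea concrete, here is a purely illustrative sketch of what such a normalization step could look like. The current dataset does not include this attribute. The lookup table contains only the two examples given above, and the names NORMALIZATION and normalize are hypothetical.

```python
# Hypothetical lookup table built from the two examples in the text.
NORMALIZATION = {
    "bvd.": "boulevard",   # abbreviation expansion
    "QVARTER": "QUARTER",  # spelling normalization
}


def normalize(text, table=NORMALIZATION):
    """Return the normalized form of a word, or the word itself if unknown."""
    return table.get(text, text)


print(normalize("bvd."))     # boulevard
print(normalize("QVARTER"))  # QUARTER
print(normalize("Champs"))   # Champs (left unchanged)
```

Keeping the diplomatic transcription in the text field and storing the normalized form separately would leave the original annotation intact for recognition tasks.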

Use and Citation

For any mention of this dataset, please cite:

@misc{paris_jerusalem_dataset_2025,
  author = {Dai, Tianhao and Johnson, Kaede and Petitpierre, R{\'{e}}mi and Vaienti, Beatrice and di Lenardo, Isabella},
  title = {{Paris and Jerusalem City Maps Text Dataset}},
  year = {2025},
  publisher = {Zenodo},
  url = {https://doi.org/10.5281/zenodo.14982662}}


@article{recognizing_sequencing_2025,
  author = {Zou, Mengjie and Dai, Tianhao and Petitpierre, R{\'{e}}mi and Vaienti, Beatrice and di Lenardo, Isabella},
  title = {{Recognizing and Sequencing Multi-word Texts in Maps Using an Attentive Pointer}},
  year = {2025}}

Corresponding author

Rémi PETITPIERRE - remi.petitpierre@epfl.ch

Work ethics

The data were annotated by two master's students from EPFL, Switzerland. The students were paid for their work using public funding, and were offered the possibility to be associated with the publication of the data.

License

This project is licensed under the CC BY 4.0 License. 

Liability

We do not assume any liability for the use of this dataset.

References

  1. Li Z., Lin Y., Chiang Y.-Y., Weinman J., Tual S., Chazalon J., Perret J., Duménieu B., Abadie N. (2024). ICDAR 2024 competition on historical map text detection, recognition, and linking. In Document Analysis and Recognition - ICDAR 2024: 18th International Conference, Athens, Greece, August 30–September 4, 2024, Proceedings, Part VI. Springer-Verlag, Berlin, Heidelberg, 363–380. https://doi.org/10.1007/978-3-031-70552-6_22
  2. Petitpierre, R. (2021). Historical City Maps Semantic Segmentation Dataset. V1.0. https://doi.org/10.5281/zenodo.5513639

