Published January 30, 2021 | Version 0.1
Dataset Open

Datasets and Models for Historical Newspaper Article Segmentation

  • 1. EPFL
  • 2. University of Zurich
  • 3. Sofia

Description

This record contains the datasets and models used and produced for the work reported in the paper "Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers" (https://jdmdh.episciences.org/7097).

Please cite this paper if you use the models or datasets, or if you find it relevant to your research:

@article{barman_combining_2020,
    title = {{Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers}},
    author = {Raphaël Barman and Maud Ehrmann and Simon Clematide and Sofia Ares Oliveira and Frédéric Kaplan},
    journal= {Journal of Data Mining \& Digital Humanities},
    volume= {HistoInformatics},
    DOI = {10.5281/zenodo.4065271},
    year = {2021},
    url = {https://jdmdh.episciences.org/7097},
}


Please note that this record contains data under different licenses.

1. DATA

  • Annotations (JSON files): The JSON files contain the image annotations, with one file per newspaper holding region annotations (label and coordinates) in VIA format. The following licenses apply:
    • luxwort.json: these annotations are under a CC0 1.0 license. Please refer to the rights statement specified for each image in the file.
    • GDL.json, IMP.json and JDG.json: these annotations are under a CC BY-SA 4.0 license.

 

  • Image files: The archive images.zip contains the image files of the Swiss titles (GDL, IMP, JDG) used for the experiments described in the paper. These images are under copyright (property of the journal Le Temps and of ArcInfo) and can be used for academic research or educational purposes only. Redistribution, publication and commercial use are not permitted. These terms of use are similar to the following rights statement: http://rightsstatements.org/vocab/InC-EDU/1.0/
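
As an illustration of how the VIA annotation files described above might be read, the following is a minimal Python sketch. It assumes the usual VIA JSON layout, in which each image entry carries a "filename" and a "regions" collection whose items have "shape_attributes" (coordinates) and "region_attributes" (labels). The attribute name "label" and the sample entry are assumptions made for the example; the actual attribute name should be checked in the released files.

```python
import json
from collections import Counter


def load_via(path):
    """Load a VIA annotation file (e.g. GDL.json) into a Python dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def iter_regions(via_data):
    """Yield (filename, shape_attributes, region_attributes) for every region."""
    # A VIA project export nests the per-image entries under "_via_img_metadata";
    # a plain annotation export is the per-image mapping itself.
    entries = via_data.get("_via_img_metadata", via_data)
    for entry in entries.values():
        regions = entry.get("regions", [])
        # VIA 1.x stores regions as a dict keyed by index, VIA 2.x as a list.
        if isinstance(regions, dict):
            regions = list(regions.values())
        for region in regions:
            yield (
                entry.get("filename"),
                region.get("shape_attributes", {}),
                region.get("region_attributes", {}),
            )


def count_labels(via_data, label_key="label"):
    """Count regions per label; label_key is an assumed attribute name."""
    return Counter(attrs.get(label_key) for _, _, attrs in iter_regions(via_data))


# Hypothetical in-memory sample in VIA 2.x layout, not taken from the dataset.
sample = {
    "page1.png123": {
        "filename": "page1.png",
        "size": 123,
        "regions": [
            {
                "shape_attributes": {"name": "rect", "x": 10, "y": 20,
                                     "width": 100, "height": 50},
                "region_attributes": {"label": "Weather"},
            }
        ],
        "file_attributes": {},
    }
}

print(count_labels(sample))
```

For a downloaded file, `count_labels(load_via("GDL.json"))` would give the same kind of summary, provided the label attribute name matches.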

 

2. MODELS

Some of the best models are released under a CC BY-SA 4.0 license (they are also available as assets of the current GitHub release).

  • JDG_flair-FT: this model was trained on JDG using French Flair and FastText embeddings. It predicts the four classes presented in the paper (Serial, Weather, Death notice and Stocks).
  • Luxwort_obituary_flair-bpemb: this model was trained on Luxwort using multilingual Flair and Byte-pair embeddings. It predicts the Death notice class.
  • Luxwort_obituary_flair-FT_indomain: this model was trained on Luxwort using in-domain Flair and FastText embeddings (trained on Luxwort data). It also predicts the Death notice class.

These models can be used to predict probabilities on new images with the same code as in the original dhSegment repository. Three parameters of the predict function need to be set: 1) embeddings_path (the path to the embeddings list), 2) embeddings_map_path (the path to the compressed embedding map), and 3) embeddings_dim (the size of the embeddings).
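
The three embedding parameters can be collected in one place before prediction, as in the rough Python sketch below. Only the parameter names embeddings_path, embeddings_map_path and embeddings_dim come from this record. The helper name, the file paths and the dimension value are hypothetical placeholders, and the exact call signature of the predict function should be taken from the code repository.

```python
from pathlib import Path


def embedding_predict_kwargs(embeddings_path, embeddings_map_path, embeddings_dim):
    """Bundle the three embedding settings named in this record (hypothetical helper)."""
    dim = int(embeddings_dim)
    if dim <= 0:
        raise ValueError("embeddings_dim must be a positive integer")
    return {
        "embeddings_path": str(Path(embeddings_path)),          # embeddings list
        "embeddings_map_path": str(Path(embeddings_map_path)),  # compressed embedding map
        "embeddings_dim": dim,                                   # size of the embeddings
    }


# Placeholder values for illustration only; use the files and size of your model.
kwargs = embedding_predict_kwargs(
    "embeddings/embeddings_list", "embeddings/embeddings_map", 300
)
print(sorted(kwargs))
```

The resulting dictionary would then be passed on to the predict function, for instance as keyword arguments, if that matches how the repository exposes it.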

Please refer to the paper for further information or contact us.

 

3. CODE

https://github.com/dhlab-epfl/dhSegment-text


4. ACKNOWLEDGEMENTS
We warmly thank the journal Le Temps (owner of La Gazette de Lausanne and the Journal de Genève) and the group ArcInfo (owner of L'Impartial) for agreeing to share the related datasets for academic purposes. We also thank the National Library of Luxembourg for its support with all steps related to the Luxemburger Wort annotation release.
This work was carried out in the context of the impresso - Media Monitoring of the Past project and supported by the Swiss National Science Foundation under grant CRSII5_173719.

5. CONTACT
Maud Ehrmann (EPFL-DHLAB)
Simon Clematide (UZH)

Files (6.0 GB)

Checksum Size
md5:81bb16182ceddba0701d6abd7f469aac 349.8 kB
md5:413f44ad1f3f3e44f6f6f94e6f5d1a10 5.5 GB
md5:47edc76853bb2c28d181593038c3c536 454.1 kB
md5:0c68f4ea6be768fa3af2786123531755 691.4 kB
md5:3a2637d548027e5ea31f179eaada8d34 154.1 MB
md5:552f7454df6ba86e51bacd97fa07ee74 15.8 MB
md5:befd4ecb2ca38cdc5626b88d4662cf61 162.3 MB
md5:213b69d3d5cd131657b3ba39eefbe2fa 162.7 MB
md5:704ebed95c831ae28bff1f4f63d9d18d 1.3 kB
md5:8c996f387b514278eb45900354d437ee 1.3 kB
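
The MD5 checksums listed above can be used to check that a download is complete. The following is a minimal Python sketch using only the standard library. The function names are chosen for this example, and which checksum belongs to which file has to be read from the record itself, since the listing here does not pair them.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use low for multi-gigabyte archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

A call such as `matches_checksum("images.zip", "md5:<checksum from the record>")` returns True when the local file matches the published checksum.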

Additional details

Related works

Is supplement to
Journal article: https://zenodo.org/record/4065271 (URL)

Funding

Media Monitoring of the Past CRSII5_173719
Swiss National Science Foundation