Datasets and Models for Historical Newspaper Article Segmentation
- 1. EPFL
- 2. University of Zurich
- 3. Sofia
Description
This record contains the datasets and models used and produced for the work reported in the paper "Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers".
Please cite this paper if you use the models or datasets, or if you find it relevant to your research:
@article{barman_combining_2020,
title = {{Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers}},
author = {Raphaël Barman and Maud Ehrmann and Simon Clematide and Sofia Ares Oliveira and Frédéric Kaplan},
journal= {Journal of Data Mining \& Digital Humanities},
volume= {HistoInformatics},
DOI = {10.5281/zenodo.4065271},
year = {2021},
url = {https://jdmdh.episciences.org/7097},
}
Please note that this record contains data under different licenses.
1. DATA
- Annotations (JSON files): the JSON files contain image annotations, with one file per newspaper containing region annotations (label and coordinates) in VIA format. The following licenses apply:
  - luxwort.json: these annotations are under a CC0 1.0 license. Please refer to the rights statement specified for each image in the file.
  - GDL.json, IMP.json and JDG.json: these annotations are under a CC BY-SA 4.0 license.
- Image files: the archive images.zip contains the image files of the Swiss titles (GDL, IMP, JDG) used for the experiments described in the paper. These images are under copyright (property of the journal Le Temps and of ArcInfo) and can be used for academic research or educational purposes only. Redistribution, publication or commercial use are not permitted. These terms of use are similar to the following rights statement: http://rightsstatements.org/vocab/InC-EDU/1.0/
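As an illustration of how the annotation files might be read, here is a minimal sketch that walks the regions of a VIA-style JSON file. It assumes the usual layout produced by the VGG Image Annotator (one entry per image with a `regions` collection, each region carrying `shape_attributes` and `region_attributes`); the exact structure of the files in this record, and the attribute name used for the label (`label` in the sample below), are assumptions and should be checked against the files themselves. The function names are hypothetical.

```python
import json
from typing import Any, Dict, Iterator, Tuple


def iter_via_regions(
    via: Dict[str, Any],
) -> Iterator[Tuple[str, Dict[str, Any], Dict[str, Any]]]:
    """Yield (filename, shape_attributes, region_attributes) for every region."""
    # VIA project files wrap the per-image entries in "_via_img_metadata";
    # plain annotation exports are the per-image mapping itself.
    images = via.get("_via_img_metadata", via)
    for entry in images.values():
        if not isinstance(entry, dict) or "regions" not in entry:
            continue
        regions = entry["regions"]
        # VIA 1.x stores regions as a dict keyed by index, VIA 2.x as a list.
        if isinstance(regions, dict):
            regions = list(regions.values())
        for region in regions:
            yield (
                entry.get("filename", ""),
                region.get("shape_attributes", {}),
                region.get("region_attributes", {}),
            )


def load_via(path: str) -> Dict[str, Any]:
    """Load a VIA JSON file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Small in-memory sample; the "label" attribute name is an assumption.
sample = {
    "page_001.png123456": {
        "filename": "page_001.png",
        "size": 123456,
        "regions": [
            {
                "shape_attributes": {
                    "name": "rect", "x": 10, "y": 20, "width": 300, "height": 150,
                },
                "region_attributes": {"label": "Death notice"},
            }
        ],
    }
}
rows = list(iter_via_regions(sample))
```

With a downloaded file, `iter_via_regions(load_via("JDG.json"))` would yield one tuple per annotated region, provided the file follows the layout assumed here.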
2. MODELS
Some of the best models are released under a CC BY-SA 4.0 license (they are also available as assets of the current GitHub release).
- JDG_flair-FT: this model was trained on JDG using French Flair and FastText embeddings. It is able to predict the four classes presented in the paper (Serial, Weather, Death notice and Stocks).
- Luxwort_obituary_flair-bpemb: this model was trained on Luxwort using multilingual Flair and Byte-pair embeddings. It is able to predict the Death notice class.
- Luxwort_obituary_flair-FT_indomain: this model was trained on Luxwort using in-domain Flair and FastText embeddings (trained on Luxwort data). It is also able to predict the Death notice class.
These models can be used to predict probabilities on new images using the same code as in the original dhSegment repository. One needs to set three parameters of the predict function: 1) embeddings_path (the path to the embeddings list), 2) embeddings_map_path (the path to the compressed embedding map), and 3) embeddings_dim (the size of the embeddings).
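As a rough sketch, the three embedding parameters could be collected and sanity-checked before being handed to the predict function. The helper below is hypothetical and not part of dhSegment; only the three parameter names come from the description above. How the predict function actually receives them (keyword arguments or otherwise) is an assumption that should be verified in the code repository.

```python
from pathlib import Path
from typing import Any, Dict


def embedding_predict_kwargs(
    embeddings_path: str, embeddings_map_path: str, embeddings_dim: int
) -> Dict[str, Any]:
    """Collect the three embedding parameters after basic checks.

    Hypothetical helper: it only validates the inputs and returns them
    under the parameter names listed in the record description.
    """
    for label, value in (
        ("embeddings_path", embeddings_path),
        ("embeddings_map_path", embeddings_map_path),
    ):
        if not Path(value).exists():
            raise FileNotFoundError(f"{label} does not exist: {value}")
    if embeddings_dim <= 0:
        raise ValueError("embeddings_dim must be a positive integer")
    return {
        "embeddings_path": embeddings_path,
        "embeddings_map_path": embeddings_map_path,
        "embeddings_dim": embeddings_dim,
    }
```

The returned dictionary could then be unpacked into the prediction call, assuming the function accepts these names as keyword arguments.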
Please refer to the paper for further information or contact us.
3. CODE
https://github.com/dhlab-epfl/dhSegment-text
4. ACKNOWLEDGEMENTS
We warmly thank the journal Le Temps (owner of La Gazette de Lausanne and the Journal de Genève) and the group ArcInfo (owner of L'Impartial) for agreeing to share the related datasets for academic purposes. We also thank the National Library of Luxembourg for its support with all steps related to the Luxemburger Wort annotation release.
This work was carried out in the context of the impresso - Media Monitoring of the Past project and supported by the Swiss National Science Foundation under grant CRSII5_173719.
5. CONTACT
Maud Ehrmann (EPFL-DHLAB)
Simon Clematide (UZH)
Files
(6.0 GB)
Checksum | Size |
---|---|
md5:81bb16182ceddba0701d6abd7f469aac | 349.8 kB |
md5:413f44ad1f3f3e44f6f6f94e6f5d1a10 | 5.5 GB |
md5:47edc76853bb2c28d181593038c3c536 | 454.1 kB |
md5:0c68f4ea6be768fa3af2786123531755 | 691.4 kB |
md5:3a2637d548027e5ea31f179eaada8d34 | 154.1 MB |
md5:552f7454df6ba86e51bacd97fa07ee74 | 15.8 MB |
md5:befd4ecb2ca38cdc5626b88d4662cf61 | 162.3 MB |
md5:213b69d3d5cd131657b3ba39eefbe2fa | 162.7 MB |
md5:704ebed95c831ae28bff1f4f63d9d18d | 1.3 kB |
md5:8c996f387b514278eb45900354d437ee | 1.3 kB |
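The MD5 checksums listed for the files can be used to verify a download. The following is a minimal sketch using Python's standard `hashlib` module; the function names are illustrative, and since the listing does not pair checksums with file names, which checksum belongs to which file has to be taken from the record page.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks avoids loading multi-gigabyte archives into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare a file's MD5 digest with a listed value such as 'md5:81bb...'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_checksum("path/to/downloaded_file", "md5:413f44ad1f3f3e44f6f6f94e6f5d1a10")` returns `True` only if the downloaded file (placeholder path) has exactly that digest.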
Additional details
Related works
- Is supplement to
- Journal article: https://zenodo.org/record/4065271 (URL)
Funding
- Swiss National Science Foundation
- Media Monitoring of the Past CRSII5_173719