Published April 27, 2021 | Version 1.0
Dataset Open

Datasets and code for "Mapping the plague through natural language processing"

  • 1. Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, NO-0316 Oslo, Norway and Centre for Mathematical Modelling of Infectious Diseases, London School of Hygiene & Tropical Medicine, London, UK
  • 2. Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, NO-0316 Oslo, Norway

Description

This project investigates the performance of various NLP libraries and geocoding services for the semi-automated generation of quantitative datasets from narrative texts. We provide the original files, several intermediate data products, and the final plague datasets. Please note that some steps in this process were done manually, so some of the scripts cannot be run from start to finish.

This work is based on two plague treatises:

- Sticker, G. 1908 Abhandlungen aus der Seuchengeschichte und Seuchenlehre. Band 1: Die Pest. Giessen, A. Töpelmann.

- Biraben, J.-N. 1975 Les hommes et la peste en France et dans les pays européens et méditerranéens. Paris, Mouton.

The final geocoded plague datasets are:

- plague_sticker_v1.csv

- plague_biraben_v1.csv

A data dictionary is available as

plague_datadict.xlsx
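
As a first look at the two final datasets, the following minimal Python sketch lists the column names and counts the rows of each file. It assumes the files are comma-separated and UTF-8 encoded, which this record does not state; the helper name `summarize_csv` is hypothetical. No column names are assumed here, since they are documented only in plague_datadict.xlsx.

```python
import csv
from pathlib import Path


def summarize_csv(path, encoding="utf-8"):
    """Return (column names, number of data rows) for a comma-separated file.

    Assumption: comma delimiter and UTF-8 encoding; adjust if the files differ.
    """
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.DictReader(handle)
        n_rows = sum(1 for _ in reader)  # iterating also reads the header row
        return list(reader.fieldnames or []), n_rows


if __name__ == "__main__":
    # File names as listed in this record; files are skipped if not present.
    for name in ("plague_sticker_v1.csv", "plague_biraben_v1.csv"):
        if Path(name).exists():
            columns, n_rows = summarize_csv(name)
            print(f"{name}: {n_rows} rows, columns: {columns}")
```

The meaning of each column should be taken from the data dictionary rather than inferred from the header.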

Other files:

File name | Content
sticker_OCR_orig.txt | Original OCR text
sticker_OCR.txt | Original OCR text without parentheses (author names)
sticker_textprep.rds | Original OCR text with further preparations
sticker_goldstandard_annotated_1.tsv | Manual annotations, file 1
sticker_goldstandard_annotated_2.tsv | Manual annotations, file 2
sticker_goldstandard_annotated_consensus.tsv | Consensus annotation file
sticker_standard_toponyms.csv | Gold standard for toponym recognition. Contains the tokenization, the start/end character positions relative to the OCR text (original and without parentheses), and whether each token is a location or other
sticker_comparison_NER.rds | Comparison of NER performance
sticker_comparison_geocoding.rds | Comparison of geocoding performance
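
Because the toponym gold standard stores start/end character positions and a location/other label, an NER tool's output can be scored against it by comparing character spans. The sketch below shows one common way to do this, exact-match precision, recall and F1. It is an illustration only: this record does not say which metric or matching rule the authors used, and the spans in the example are made up rather than taken from the files.

```python
def span_scores(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end) character spans.

    Assumption: a predicted location counts as correct only if both offsets
    match a gold-standard location exactly; partial overlaps count as errors.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up spans for illustration; real spans would come from the gold
# standard file and from the output of the NER library under test.
gold_spans = [(0, 6), (20, 26), (40, 45)]
predicted_spans = [(0, 6), (20, 26), (50, 55), (60, 66)]

precision, recall, f1 = span_scores(gold_spans, predicted_spans)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

A more lenient scheme that accepts overlapping spans would give higher scores for tools that disagree with the annotators only on token boundaries.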

 

Notes

This work was supported by funding from the Centre for Ecological and Evolutionary Synthesis (CEES), University of Oslo, and the Research Council of Norway (FRIMEDBIO project 288551).

Files

plague.zip (24.4 MB)
md5:cee444d05ac60ccf95b35124d95aa616
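
The MD5 checksum listed for plague.zip can be used to confirm that a download is complete and uncorrupted. A minimal Python sketch, assuming the archive has been saved as plague.zip in the current directory (the helper name `md5_of_file` is hypothetical):

```python
import hashlib
import os

# Checksum as listed in this record for plague.zip.
EXPECTED_MD5 = "cee444d05ac60ccf95b35124d95aa616"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    archive = "plague.zip"  # assumed local file name
    if os.path.exists(archive):
        actual = md5_of_file(archive)
        status = "OK" if actual == EXPECTED_MD5 else "MISMATCH"
        print(f"{archive}: {actual} ({status})")
    else:
        print(f"{archive} not found in the current directory")
```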