Datasets and code for "Mapping the plague through natural language processing"
Creators
- 1. Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, NO-0316 Oslo, Norway and Centre for Mathematical Modelling of Infectious Diseases, London School of Hygiene & Tropical Medicine, London, UK
- 2. Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, NO-0316 Oslo, Norway
Description
This project investigates the performance of several NLP libraries and geocoding services for the semi-automated generation of quantitative datasets from narrative texts. We provide the original files, several intermediate data products, and the final plague datasets. Please note that some steps in this process were done manually, so some of the scripts cannot be run from start to finish.
This work is based on two plague treatises:
- Sticker, G. 1908 Abhandlungen aus der Seuchengeschichte und Seuchenlehre. Band 1: Die Pest. Giessen, A. Töpelmann.
- Biraben, J.-N. 1975 Les hommes et la peste en France et dans les pays européens et méditerranéens. Paris, Mouton.
The final geocoded plague datasets are:
- plague_sticker_v1.csv
- plague_biraben_v1.csv
A data dictionary is available as plague_datadict.xlsx.
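The final datasets listed above can be given a quick first inspection before consulting the data dictionary. The following is a minimal sketch that reports the column names and row count of a CSV file using only the Python standard library. It assumes the files are comma-delimited and UTF-8 encoded, which this record does not state; the helper name `summarize_csv` is hypothetical, and no column names are assumed.

```python
import csv
import os


def summarize_csv(path, encoding="utf-8"):
    """Return (column_names, row_count) for a comma-delimited file.

    The encoding default is an assumption; adjust it if the file
    was exported with a different one.
    """
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.DictReader(handle)
        # Count data rows; the header line is consumed as field names.
        row_count = sum(1 for _ in reader)
        return list(reader.fieldnames or []), row_count


if __name__ == "__main__":
    # File names as listed in this record; skipped if not present locally.
    for name in ("plague_sticker_v1.csv", "plague_biraben_v1.csv"):
        if os.path.exists(name):
            columns, rows = summarize_csv(name)
            print(name, rows, "rows; columns:", columns)
```

The meaning of each column should be taken from plague_datadict.xlsx rather than inferred from the header alone.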
Other files:
File name | Content |
---|---|
sticker_OCR_orig.txt | Original OCR text |
sticker_OCR.txt | OCR text with parenthetical author names removed |
sticker_textprep.rds | OCR text after further preparation |
sticker_goldstandard_annotated_1.tsv | Manual annotations, file 1 |
sticker_goldstandard_annotated_2.tsv | Manual annotations, file 2 |
sticker_goldstandard_annotated_consensus.tsv | Consensus annotation file |
sticker_standard_toponyms.csv | Gold standard for toponym recognition. Contains the tokenization, the start/end character offsets relative to the OCR text (original and without parentheses), and whether each token is a location or other |
sticker_comparison_NER.rds | Comparison of NER performance |
sticker_comparison_geocoding.rds | Comparison of geocoding performance |
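Because the toponym gold standard records start/end character positions for each token, one common way to score a toponym recognizer against it is exact-match precision, recall, and F1 over character spans. The sketch below illustrates that general technique only; it is not necessarily the scoring used to produce the comparison files in this record. The function name `span_prf` and the representation of spans as `(start, end)` tuples are assumptions made for illustration.

```python
def span_prf(gold_spans, predicted_spans):
    """Exact-match precision, recall and F1 over (start, end) spans.

    A predicted span counts as correct only if both offsets match a
    gold span exactly. Partial overlaps are treated as errors, which
    is a simplifying assumption.
    """
    gold = set(gold_spans)
    predicted = set(predicted_spans)
    true_positives = len(gold & predicted)

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy spans, not taken from the dataset.
    gold = [(0, 4), (10, 15), (20, 26)]
    predicted = [(0, 4), (10, 15), (30, 33)]
    print(span_prf(gold, predicted))
```

Exact matching is strict: a recognizer that finds the right place name but with a slightly different boundary is penalized twice, once as a false positive and once as a false negative. A relaxed, overlap-based variant may be preferable depending on the use case.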
Files
Name | Size | Checksum |
---|---|---|
plague.zip | 24.4 MB | md5:cee444d05ac60ccf95b35124d95aa616 |
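After downloading, the archive can be checked against the MD5 checksum listed for plague.zip. The following is a minimal sketch using Python's standard `hashlib` module; the helper name `md5_of_file` and the local path `plague.zip` are assumptions about where the download was saved.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives need little memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    expected = "cee444d05ac60ccf95b35124d95aa616"  # checksum from this record
    archive = "plague.zip"  # assumed local file name
    if os.path.exists(archive):
        actual = md5_of_file(archive)
        print("checksum OK" if actual == expected else "checksum MISMATCH")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the archive should be fetched again.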