Lessons from Crises and Disasters: An Experimental Study of Data Collection Methods

Błoch, Agata; Fusco, Idamaria; Avallone, Paola; Cecchini, Isabella; Horeczy, Anna; Hussain, Saddam; Macrì, Geltrude; Rabà, Michele; Rzońca, Apolinary; Salvemini, Raffaella

doi:10.5281/zenodo.20529206

Published June 3, 2026 | Version v1

Dataset Open

The LFCD (Lessons from Crises and Disasters) dataset was developed as part of an experimental study on data collection methods conducted within the research project "Lessons from Crises and Disasters: The Role of Information Flows between Individuals, Communities, and Institutions. Comparative Case Studies between Italy and Poland (16th–19th Century)." The project is carried out under the Joint Bilateral Agreement between the National Research Council of Italy (CNR) and the Polish Academy of Sciences (PAN).

The purpose of the experimental study was to evaluate and refine methods for collecting, coding, and analyzing heterogeneous historical data on crises and disasters. The resulting dataset compiles information from selected case studies.

The attached materials include 1) the LFCD dataset, which contains the records collected and structured during the experimental phase, and 2) a methodological report describing the research design, data collection procedures, source selection criteria, data structure, quality assurance processes, and limitations of the dataset.

Description: The LFCD (Lessons from Crises and Disasters) dataset integrates heterogeneous historical sources – manuscript, printed, and structured – within a single analytical infrastructure. The project unites archives from Palermo, Venice, Naples, Lisbon, Kraków, Lublin, and Portugal, treating them as a unified corpus for comparative study of epidemics and administration in the early modern period.

The objective is to combine diverse archival sources – handwritten images, printed or typed PDFs, and individual spreadsheets – into one navigable corpus for analysis. The material is made accessible to a global readership by pairing the original text with an English translation and by adding per-entry topic tags and higher-level groupings, so that researchers can search, navigate, and catalog consistently across collections.

Methodology: Handwritten images are transcribed with a vision LLM (ChatGPT), then processed by a structuring LLM to extract fields (Date, Location when present, Entry Text), preserving on-page numbering as separate entries. Printed or typed PDFs are processed with pdfplumber and light NLP (line joining, de-hyphenation, header/footer removal) and split into numbered entries. Spreadsheets from collaborators are mapped to a canonical schema through header normalisation and date parsing, while retaining all original columns for provenance. Across all streams, original date strings and orthography are preserved, source text is paired with an English gloss, overlaps are de-duplicated, and concise topic labels are assigned using Google Gemini 2.5 Pro (Reasoning); topics are then aggregated into parent–child families.
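The light NLP step for printed or typed PDFs (header/footer removal, line joining, de-hyphenation, splitting into numbered entries) could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's actual code: the function names, the "N. text" entry pattern, and the sample lines are hypothetical, and the input is assumed to be a list of text lines such as those obtained from pdfplumber via page.extract_text().splitlines().

```python
import re

# Assumption: an entry starts with a number, a full stop, and a space ("12. ...").
ENTRY_START = re.compile(r"^(\d+)\.\s+(.*)$")


def clean_lines(raw_lines, boilerplate=()):
    """Strip whitespace, drop blank lines and known header/footer strings."""
    cleaned = []
    for line in raw_lines:
        line = line.strip()
        if not line or line in boilerplate:
            continue
        cleaned.append(line)
    return cleaned


def split_entries(lines):
    """Group cleaned lines into numbered entries.

    Continuation lines are joined to the current entry; a trailing hyphen
    on the previous line is treated as end-of-line hyphenation and removed.
    Lines that appear before the first numbered entry are ignored.
    """
    entries = []
    for line in lines:
        match = ENTRY_START.match(line)
        if match:
            entries.append({"number": int(match.group(1)), "text": match.group(2)})
        elif entries:
            previous = entries[-1]["text"]
            if previous.endswith("-"):
                entries[-1]["text"] = previous[:-1] + line  # de-hyphenation
            else:
                entries[-1]["text"] = previous + " " + line  # line joining
    return entries


# Invented sample lines for illustration only (not taken from the dataset).
sample = [
    "Running header",
    "1. First entry be-",
    "gins here.",
    "",
    "2. Second entry",
    "continues.",
    "Running header",
]
entries = split_entries(clean_lines(sample, boilerplate={"Running header"}))
```

This naive de-hyphenation would also merge genuinely hyphenated compounds that happen to break at a line end. A pipeline that preserves original orthography, as described above, would presumably need a more careful rule.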

Files

Files (5.3 MB)

Name	Checksum	Size
LFCD dataset - report.pdf	md5:0ffb405df4c77f7810f28ad2c158d447	2.0 MB
LFCD dataset.csv	md5:2f8cdf9d88becf973190d4c7dca1f2bf	2.5 MB
LFCD dataset.xlsx	md5:47d53ba10d5a9517f0d1eeb44c250229	893.3 kB

Additional details

Created: 2026-06-03

LFCD dataset

	All versions	This version
Views	35	35
Downloads	30	30
Data volume	76.2 MB	76.2 MB
