Published June 3, 2026 | Version v1
Dataset Open

Lessons from Crises and Disasters: An Experimental Study of Data Collection Methods

Description

The LFCD (Lessons from Crises and Disasters) dataset was developed as part of an experimental study on data collection methods conducted within the research project "Lessons from Crises and Disasters: The Role of Information Flows between Individuals, Communities, and Institutions. Comparative Case Studies between Italy and Poland (16th–19th Century)." The project is carried out under the Joint Bilateral Agreement between the National Research Council of Italy (CNR) and the Polish Academy of Sciences (PAN).

The purpose of the experimental study was to evaluate and refine methods for collecting, coding, and analyzing heterogeneous historical data on crises and disasters. The resulting dataset compiles information from selected case studies.

The attached materials include (1) the LFCD dataset, which contains the records collected and structured during the experimental phase, and (2) a methodological report describing the research design, data collection procedures, source selection criteria, data structure, quality assurance processes, and limitations of the dataset.

Description: The LFCD (Lessons from Crises and Disasters) dataset integrates heterogeneous historical sources – manuscript, printed, and structured – within a single analytical infrastructure. The project unites archives from Palermo, Venice, Naples, Lisbon, Kraków, Lublin, and Portugal, treating them as a unified corpus for the comparative study of epidemics and administration in the early modern period.

The objective is to combine diverse archival sources – handwritten images, printed or typed PDFs, and individual spreadsheets – into one navigable corpus for analysis. The material is made accessible to a global readership by pairing the original text with an English translation and by adding per-entry topic tags and higher-level groupings, so that researchers can search, navigate, and catalog consistently across collections.

Methodology: Handwritten images are transcribed with a vision LLM (ChatGPT), then processed by a structuring LLM to extract fields (Date, Location when present, Entry Text), preserving on-page numbering as separate entries. Printed or typed PDFs are processed with pdfplumber and light NLP (line joining, de-hyphenation, header/footer removal) and split into numbered entries. Spreadsheets from collaborators are mapped to a canonical schema through header normalisation and date parsing, while retaining all original columns for provenance. Across all streams, original date strings and orthography are preserved, source text is paired with an English gloss, overlaps are de-duplicated, and concise topic labels are assigned using Google Gemini 2.5 Pro (Reasoning); topics are then aggregated into parent–child families.
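As an illustration of the printed-PDF stream, the following is a minimal sketch of the light text cleanup described above (header/footer removal, de-hyphenation, line joining, and splitting into numbered entries). It is not the project's actual code. It assumes the page text has already been extracted, for example with pdfplumber's `page.extract_text()`, and that entries begin with a number followed by a full stop; the function names, the numbering pattern, the 80% repetition threshold, and the sample pages are all hypothetical.

```python
import re

# Assumed numbering style for entries: "12. text". Adjust to the actual source.
ENTRY_START = re.compile(r"^(\d+)\.\s+(.*)$")


def find_repeated_lines(pages, min_share=0.8):
    """Return lines that recur on at least min_share of the pages.

    Such lines are treated as running headers or footers (a heuristic).
    """
    counts = {}
    for page in pages:
        unique_lines = {ln.strip() for ln in page.splitlines() if ln.strip()}
        for line in unique_lines:
            counts[line] = counts.get(line, 0) + 1
    if len(pages) < 2:
        return set()  # repetition cannot be judged from a single page
    return {line for line, n in counts.items() if n / len(pages) >= min_share}


def join_line(buffer, line):
    """Append a wrapped line, removing a trailing line-break hyphen if present.

    Naive: a genuine compound hyphen at the end of a line is also removed.
    """
    if buffer.endswith("-"):
        return buffer[:-1] + line
    return f"{buffer} {line}" if buffer else line


def split_entries(pages):
    """Turn a list of page texts into a list of numbered entries."""
    boilerplate = find_repeated_lines(pages)
    entries = []
    for page in pages:
        for raw in page.splitlines():
            line = raw.strip()
            if not line or line in boilerplate:
                continue  # skip blank lines and headers/footers
            match = ENTRY_START.match(line)
            if match:
                entries.append({"number": int(match.group(1)),
                                "text": match.group(2)})
            elif entries:
                # Continuation of the previous entry (wrapped line).
                entries[-1]["text"] = join_line(entries[-1]["text"], line)
    return entries


# Invented sample pages, for illustration only.
sample_pages = [
    "SAMPLE HEADER\n1. Plague reported in the har-\nbour district.\n"
    "2. Council orders\nquarantine.",
    "SAMPLE HEADER\n3. Gates closed.",
]

for entry in split_entries(sample_pages):
    print(entry["number"], entry["text"])
```

Detecting entry starts before joining lines keeps each on-page number as its own entry, which mirrors the stated goal of preserving the source numbering. Text that appears before the first numbered entry is discarded in this sketch.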

Files

LFCD dataset - report.pdf

Files (5.3 MB)

md5:0ffb405df4c77f7810f28ad2c158d447, 2.0 MB
md5:2f8cdf9d88becf973190d4c7dca1f2bf, 2.5 MB
md5:47d53ba10d5a9517f0d1eeb44c250229, 893.3 kB

Additional details

Dates

Created
2026-06-03
LFCD dataset