Extracting Provenance Events from TEI-encoded Historical Correspondence
Description
Provenance modelling standards are built around completed events: acquisitions that happened, custody transfers that were finalised. This assumption does not hold for source documents that record events as they unfold. Three questions drive this research: how to extract provenance events from historical correspondence reproducibly; how to represent the full event lifecycle, including requests, negotiations, and failed transfers, using existing semantic standards; and what gaps those standards present for modelling events before their resolution.
The case study is the Canneti-Fiacchi correspondence (1711–1730), a 600+ letter exchange between two Camaldolese monks, Pietro Canneti and Mariangelo Fiacchi, documenting the formation of the Biblioteca Classense in Ravenna. Of the full corpus, 133 letters are available in TEI/XML encoding. In this corpus, book transfers are rarely recorded as concluded facts: a letter more often captures a request sent, a price under negotiation, or a deal that may never close. Reconstructing provenance from such sources requires treating incomplete events as primary data, a dimension absent from existing LOD approaches to correspondence and object provenance.
A pilot extraction on 60 letters produced 95 validated provenance events. Two minimum thresholds govern inclusion: an item must meet at least 2 of 3 criteria (work identification, quantity, physical characteristics), and an event must record at least 1 of 4 provenance endpoints (origin or destination, agent or location). Events are mapped to a three-layer representational framework: CER-Ontology for epistolary structure (Seibold et al.), CIDOC-CRM for provenance events, and LRMoo for bibliographic objects. Of the 95 events, 36 (38%) are completed and 59 (62%) are planned or under negotiation, so the majority document intermediate phases rather than concluded transfers.
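The two inclusion thresholds can be read as simple counting rules. The following is a minimal sketch, not the project's extraction code: the class and field names (`Item`, `Event`, `origin_agent`, and so on) are hypothetical, and reading the four endpoints as origin agent, origin location, destination agent, and destination location is an assumption about how the "origin or destination, agent or location" pairs combine.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Item:
    """Hypothetical record for a transferred item (field names are assumptions)."""
    work: Optional[str] = None       # work identification
    quantity: Optional[int] = None   # number of copies or volumes
    physical: Optional[str] = None   # physical characteristics


@dataclass
class Event:
    """Hypothetical record for a provenance event (field names are assumptions)."""
    origin_agent: Optional[str] = None
    origin_location: Optional[str] = None
    destination_agent: Optional[str] = None
    destination_location: Optional[str] = None


def count_present(record) -> int:
    """Count dataclass fields that hold a value (None and empty strings are absent)."""
    values = (getattr(record, f.name) for f in fields(record))
    return sum(1 for v in values if v is not None and v != "")


def item_meets_threshold(item: Item) -> bool:
    """An item is included when at least 2 of its 3 criteria are present."""
    return count_present(item) >= 2


def event_meets_threshold(event: Event) -> bool:
    """An event is included when at least 1 of its 4 endpoints is present."""
    return count_present(event) >= 1
```

Under these assumptions, an item known only by its title is excluded, while an event with nothing more than a named recipient is kept, which matches the stated intent of retaining incomplete events as primary data.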
The primary representational gap is temporal: CIDOC-CRM provides E8_Acquisition for ownership transfers and E10_Transfer_of_Custody for custody changes, but both classes describe events that have already occurred. Linked Art acknowledges the limit directly: "the model currently cannot describe the future planned event." The proposed approach encodes lifecycle states via P2_has_type within the existing E8/E10 framework. Three further gaps emerged from the corpus (chains of intermediary agents, non-monetary payments such as masses and services, and approximate price qualifiers such as circa and al massimo), for which three minimal ontology extensions are proposed. Whether P2_has_type suffices for lifecycle states or new subclasses are required will be tested when extraction scales to the full 133-letter corpus.
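As an illustration of encoding lifecycle state through P2_has_type, the sketch below builds the triples for a single event using only the Python standard library. The class and property names (E8_Acquisition, E10_Transfer_of_Custody, P2_has_type, E55_Type) come from CIDOC-CRM; the `https://example.org/` namespace, the state vocabulary, and the function names are hypothetical placeholders and not the project's actual identifiers.

```python
CRM = "http://www.cidoc-crm.org/cidoc-crm/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "https://example.org/provenance/"  # hypothetical namespace

# Hypothetical lifecycle vocabulary; the actual set of states is an assumption.
LIFECYCLE_STATES = {"requested", "under_negotiation", "planned", "completed", "failed"}
EVENT_CLASSES = {"E8_Acquisition", "E10_Transfer_of_Custody"}


def lifecycle_triples(event_id: str, event_class: str, state: str):
    """Return (subject, predicate, object) IRI triples typing an event by lifecycle state."""
    if event_class not in EVENT_CLASSES:
        raise ValueError(f"unsupported event class: {event_class}")
    if state not in LIFECYCLE_STATES:
        raise ValueError(f"unknown lifecycle state: {state}")
    event = EX + "event/" + event_id
    state_type = EX + "lifecycle/" + state
    return [
        (event, RDF_TYPE, CRM + event_class),        # the event keeps its E8/E10 class
        (event, CRM + "P2_has_type", state_type),    # the lifecycle state is attached as a type
        (state_type, RDF_TYPE, CRM + "E55_Type"),    # the state itself is an E55_Type
    ]


def to_ntriples(triples) -> str:
    """Serialize IRI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


if __name__ == "__main__":
    print(to_ntriples(lifecycle_triples("event-001", "E8_Acquisition", "under_negotiation")))
```

The design keeps the event inside the existing E8/E10 classes and moves the unresolved status into a controlled type, so a later change of state only replaces one P2_has_type value. Whether this is sufficient, or whether dedicated subclasses are needed, is the open question the description raises.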
Files
extracting-prov-events-TEI-historical-correspondence.pdf (488.3 kB)
md5:d8222cea0ab50ec3c272c7aa5c5d7325