Conference paper Open Access
In the NWO REPUBLIC project, we are creating digital access to the corpus of the Resolutions of the States General of the Dutch Republic (1576-1796). This corpus contains the decisions made in the States General each day over a 220-year period. The resolutions were recorded using a standard structure and contain many standard formulations for aspects of the decision-making process, including the source of the topic that was decided on (a formal request, a missive, etc.), whether a decision was reached, and what that decision was.
We discuss the different techniques we use to identify formulaic expressions and how we iteratively build a corpus-specific phrase model with which we can identify 1) the dates and attendees of each meeting, which are followed by all the resolutions of that day, 2) resolution boundaries, i.e. where resolutions start and stop in the running text, so we know which text belongs to which resolution, 3) different types of opening phrases that correspond to different types of sources (requests, missives, reports, etc.), and 4) the decision paragraphs that state what decision, if any, was reached.
We discuss how we built ground truth to evaluate the phrase model and the fuzzy searching and extraction process. Finally, we discuss how this approach generalised to other corpora and text genres.
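To make the idea of fuzzy searching with a phrase model concrete, the following is a minimal sketch and not the project's actual implementation. It assumes a phrase model can be represented as a mapping from a label to a formulaic phrase, and it swaps in a simple sliding-window comparison based on Python's standard `difflib.SequenceMatcher` for whatever matcher the project uses. The labels, the Dutch phrases, the names `PHRASE_MODEL`, `fuzzy_find` and `tag_paragraph`, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical phrase model: label -> formulaic phrase.
# These phrases are illustrative placeholders, not taken from the project.
PHRASE_MODEL = {
    "opening:request": "Op de requeste van",
    "opening:missive": "Ontfangen een Missive van",
    "decision": "is goedgevonden ende verstaen",
}


def fuzzy_find(phrase, text, threshold=0.8):
    """Slide a window of len(phrase) characters over text and return
    (start, end, score) for the best-scoring window, or None if no
    window reaches the similarity threshold."""
    phrase_l = phrase.lower()
    text_l = text.lower()
    width = len(phrase_l)
    best = None
    # If the text is shorter than the phrase, compare against the whole text once.
    for start in range(max(1, len(text_l) - width + 1)):
        window = text_l[start:start + width]
        score = SequenceMatcher(None, phrase_l, window, autojunk=False).ratio()
        if score >= threshold and (best is None or score > best[2]):
            best = (start, start + width, score)
    return best


def tag_paragraph(text, model=PHRASE_MODEL, threshold=0.8):
    """Return (label, start, end, score) for every phrase in the model that
    fuzzily occurs in text, ordered by position in the text."""
    matches = []
    for label, phrase in model.items():
        hit = fuzzy_find(phrase, text, threshold)
        if hit is not None:
            matches.append((label, hit[0], hit[1], hit[2]))
    return sorted(matches, key=lambda m: m[1])


if __name__ == "__main__":
    # Invented example line with OCR-style errors ("Miffive", "verftaen").
    line = ("Ontfangen een Miffive van den Resident, "
            "waer op is goetgevonden ende verftaen dat")
    for label, start, end, score in tag_paragraph(line):
        print(label, repr(line[start:end]), round(score, 2))
```

In this sketch, an opening-phrase match would mark a candidate resolution start and a decision-phrase match would mark a candidate decision paragraph. A window-by-window scan is quadratic in the worst case, so a corpus of this size would likely need a more efficient candidate-selection step; that step is left out here.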