Published September 19, 2019 | Version v1
Presentation Open

Parla-CLARIN: TEI guidelines for corpora of parliamentary proceedings

  • 1. Jožef Stefan Institute
  • 2. Institute for Contemporary History

Description

Parliamentary proceedings (PP) are a rich source of data for scholars in, e.g., historiography, sociology, political science, linguistics, economics and economic history. Unlike the sources of most other language corpora, PP are not subject to copyright or personal privacy protections and are typically available online, which makes them ideal for compilation into corpora and open distribution. For these reasons many countries have already produced PP corpora, but each typically in its own encoding, which limits their comparability and their use in a multilingual setting.

The talk gives an overview of current approaches to encoding PP, focusing on TEI and TEI-like encodings; on Akoma Ntoso, a standard designed specifically for encoding PP and other legislative documents; and on RDF, another common approach to encoding PP. We then motivate and propose a TEI ODD (that is, a schema parametrisation together with guidelines) for such corpora, based on the TEI module for Transcriptions of Speech. Work on this Parla-CLARIN recommendation started with the “CLARIN ParlaFormat” workshop (cf. https://www.clarin.eu/blog/clarin-parlaformat-workshop), where selected participants presented their own experiences with encoding parliamentary corpora and commented on the authors' draft proposal.

These comments have been largely taken into account, and the current Parla-CLARIN recommendation is available at https://github.com/clarin-eric/parla-clarin. The Git repository contains the ODD, the derived HTML guidelines and XML schemas, and example documents. The recommendation presents the encoding of PP metadata, including the speakers and political parties, the structure of the corpus, the encoding of the speeches and notes, linguistic annotation and multimedia.
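
To make the idea of encoding speeches with the TEI module for Transcriptions of Speech more concrete, the following is a minimal sketch, not taken from the Parla-CLARIN repository. It assumes that each speech is a TEI `u` (utterance) element whose `who` attribute points to a speaker, that the speech is divided into `seg` elements, and that editorial remarks are held in `note` elements. The sample fragment, the speaker identifiers and the helper `speeches` are hypothetical and serve only as an illustration; the actual recommendation should be consulted for the prescribed structure.

```python
import xml.etree.ElementTree as ET

# The TEI namespace is standard; everything in SAMPLE is invented for illustration.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div>
        <note>The Chair opens the sitting.</note>
        <u who="#chair">
          <seg>The sitting is open.</seg>
        </u>
        <u who="#memberA">
          <seg>Thank you.</seg>
          <seg>I have one question.</seg>
        </u>
      </div>
    </body>
  </text>
</TEI>"""


def speeches(xml_text):
    """Return (speaker id, speech text) pairs for every utterance, in document order."""
    root = ET.fromstring(xml_text)
    result = []
    for u in root.iterfind(".//tei:u", NS):
        # Assumption: @who is a local pointer such as "#chair".
        speaker = u.get("who", "").lstrip("#")
        # Join the text of all segments of the utterance; notes are skipped.
        text = " ".join(
            "".join(seg.itertext()).strip() for seg in u.iterfind("tei:seg", NS)
        )
        result.append((speaker, text))
    return result


for speaker, text in speeches(SAMPLE):
    print(f"{speaker}: {text}")
```

Keeping the speaker as a pointer rather than a name reflects the separation between speeches and speaker metadata described above: in a full corpus the identifier would be resolved against a list of persons kept elsewhere in the document.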

The talk concludes by discussing further work, especially the provision of a set of example documents, the conversion of Akoma Ntoso and RDF-encoded PP into Parla-CLARIN and vice versa, and other transformation scripts that would operationalise the proposed encoding.

Files

teiParla___TEIC_2019__slides_.pdf

md5:506a9e58a089332e70020cf0ecc466d8
338.4 kB

Additional details

Funding

European Commission: CLARIN – Common Language Resources and Technology Infrastructure (212230)