There is a newer version of the record available.

Published March 29, 2023 | Version v2.0.0-beta.3
Dataset Restricted

GermaParl Corpus of Plenary Protocols

  • 1. University of Duisburg-Essen

Description

The GermaParl Corpus of Parliamentary Protocols covers 72 years of debates in the German Bundestag, from the first meeting on September 7, 1949 up to the last meeting of the 19th legislative period on September 7, 2021. The corpus has been prepared in the PolMine Project (http://polmine.github.io), building on an initial GermaParl version covering the period between 1996 and 2016.

GermaParl v2.0.0-beta.3 prepares the release of GermaParl as a consolidated corpus of parliamentarism in the Federal Republic of Germany envisaged for May 2023. Beta users need to request access to the corpus and are asked to give feedback on any issues they encounter, so that GermaParl v2.0.0 will be a trustworthy, high-quality resource for research on parliamentary proceedings.

The essence of GermaParl is a transformation of the raw material available from the Bundestag (plain text, PDF and XML documents) into an XML data format that approximates standards envisaged by the Text Encoding Initiative (TEI). Yet this is a technically demanding data format: users need to be able to run a text processing pipeline on a considerably large dataset (4340 plenary protocols / 270 million words).

We therefore start the beta testing phase by offering access to the CWB edition of GermaParl: The TEI-style data have been linguistically annotated (using Stanford CoreNLP and further tools) and converted into the binary data format of the Corpus Workbench (CWB). Users benefit from a performant and established machinery for working with structurally and linguistically annotated data, including the rich query syntax of the Corpus Query Processor (CQP). While it is possible to work with tools such as CWB, CQPweb or TXM, we recommend the polmineR R package, which has some specialized functionality for parliamentary data.

Beta users need to pursue the following steps to get started:

Request access: Click the respective button on this page. You do not need a personal invitation to serve as a beta tester to be eligible. A short, informative note on the research interest you associate with GermaParl Beta will help us to make a quick decision.

Confirm email: You will then receive an email from Zenodo asking you to verify your email address. It may take a while (up to an hour) until you receive this message. If you still do not find it in your inbox, check the spam folder of your mail account. Then confirm your email address.

Confirmation of access: We need to confirm data access manually. We consider incoming requests on a continuous basis, but please allow 2-3 days for a response. Our response will also include information on resources you might find useful to get started with GermaParl.

Join GitHub issue tracker: To process feedback systematically, we use the issue tracker of a private GitHub repository. We will invite you to be a collaborator on this repository. To be able to invite you, we will ask you to provide us with the name of your GitHub account. Please consider creating a GitHub account if you do not yet have one.

Download and install corpus: Once we have confirmed data access, Zenodo will send you an email with a download link. Retain this download link. We suggest installing the corpus using functionality newly added to the R package cwbtools. Make sure that you install the latest version of cwbtools (v0.3.8 or later) and insert the download link you have received into the code snippet provided below. A stable internet connection is advisable: The size of the corpus tarball is 2.3 GB.

# insert download link
zenodo_url <- "INSERT-ZENODO-LINK-HERE"

# install cwbtools (v0.3.8 or later is required)
install.packages("cwbtools")

# install corpus
library(cwbtools)
tmp_tarball <- zenodo_get_tarball(url = zenodo_url)
corpus_install(tarball = tmp_tarball)

# install polmineR
install.packages("polmineR") # v0.8.8

# check installation
library(polmineR)
corpus("GERMAPARL2")

If you have not used CWB indexed corpora before, the installation process will suggest and create directories for data storage. This involves defining the environment variable CORPUS_REGISTRY permanently for future R sessions.
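Before or after running the installation, it can be useful to confirm that the prerequisites described here are in place. The following is a minimal sketch in base R, not part of cwbtools or polmineR: the helper name has_min_version is hypothetical and introduced only for illustration. It checks the minimum cwbtools version and inspects the CORPUS_REGISTRY environment variable.

```r
# Illustrative helper (hypothetical name, not provided by cwbtools or polmineR):
# returns TRUE if `pkg` is installed in at least version `min_version`.
has_min_version <- function(pkg, min_version) {
  if (!requireNamespace(pkg, quietly = TRUE)) return(FALSE)
  utils::packageVersion(pkg) >= package_version(min_version)
}

# The corpus installation described above requires cwbtools v0.3.8 or later
has_min_version("cwbtools", "0.3.8")

# An empty string means CORPUS_REGISTRY is not set for this R session
registry <- Sys.getenv("CORPUS_REGISTRY")
if (nzchar(registry) && dir.exists(registry)) {
  # CWB keeps one registry file per corpus, named after the corpus ID in lower case
  print(list.files(registry))
} else {
  message("CORPUS_REGISTRY is not set or does not point to an existing directory.")
}
```

If the environment variable was set during this session's installation, a restart of R may be needed before it is picked up permanently.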

Explore GermaParl and give feedback: We have taken great care to offer a high-quality research resource. But given the size of the data, it is impossible to check all of it manually, and remaining errors are to be expected. Early access for beta users comes with the hope that you will provide feedback. Your feedback will help us to prepare an improved release candidate, and finally the consolidated official release of GermaParl we envisage for May 2023.
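A first exploration might look like the following sketch. It assumes that polmineR is installed and that the corpus has been installed under the ID GERMAPARL2, as in the installation check above. The search term is an arbitrary example, and the names of the available attributes are deliberately listed rather than assumed.

```r
library(polmineR)

# Corpus ID as used in the installation check above
germaparl <- corpus("GERMAPARL2")

# Corpus size in tokens
size(germaparl)

# Structural attributes (metadata) and positional attributes (token-level annotation)
s_attributes(germaparl)
p_attributes(germaparl)

# Keyword-in-context view for a simple CQP query; the search term is only an example
kwic(germaparl, query = '"Integration"', cqp = TRUE)
```

Listing the attributes first is a convenient way to spot metadata that looks implausible, which is the kind of observation the issue tracker is meant to collect.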

Acknowledgements:

  • We gratefully acknowledge funding from the German National Research Data Infrastructure (Nationale Forschungsdaten-Infrastruktur / NFDI). Funding from KonsortSWD has advanced the data preparation tool set to facilitate the robust addition of further annotation layers (such as Named Entities) to large corpora. This is instrumental for linking parliamentary data with other data. Funding from the Text+ consortium is instrumental for updates of the corpus, quality control and keeping data formats in step with current and future developments.
  • The data quality of GermaParl we are able to offer at this stage has benefitted significantly from a cooperation with the SOLDISK project at the University of Hildesheim, and from comprehensive manual quality control of the data carried out by the SOLDISK team. Very special thanks go to Hannes Schammann, Max Kisselew, Franziska Ziegler, Carina Böker, Jennifer Elsner and Carolin McCrea.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Request access

If you would like to request access to these files, please fill out the form below.

You need to satisfy these conditions in order for this request to be accepted:

If you have not been invited to serve as a beta user, please write a short, informative note on the research interest you associate with GermaParl. Access is restricted to academic users. In line with the envisaged Creative Commons licence of the official release (CC BY-SA), any results you report should refer to the PolMine Project and the authors of the corpus. Beta users are not authorized to share the data and are asked to report any issues they encounter using the issue tracker we offer for this purpose.
