There is a newer version of the record available.

Published May 23, 2023 | Version v2.0.0
Dataset Open

GermaParl Corpus of Plenary Protocols

  • 1. University of Duisburg-Essen

Description

The GermaParl Corpus of Parliamentary Protocols covers 72 years of debates in the German Bundestag, from the first meeting on September 7, 1949 to the last meeting of the 19th legislative period on September 7, 2021. The corpus has been prepared in the PolMine Project and builds on an initial GermaParl version that covered the period between 1996 and 2016.

The public release of GermaParl v2.0.0 is available here under a Creative Commons license (CC BY-SA 4.0). To learn more about the data, see the documentation of the corpus available here.

The essence of GermaParl is a transformation of the raw material available from the Bundestag (plain text, PDF and XML documents) into an XML data format that approximates the standards of the Text Encoding Initiative (TEI). In addition to the XML documents provided here (and on GitHub), we provide an indexed and linguistically annotated CWB version of the corpus: starting from the XML files, the parliamentary proceedings have been linguistically annotated (using Stanford CoreNLP and further tools) and converted into the binary data format of the Corpus Workbench (CWB).

The corpus is intended to be a trustworthy, high-quality resource for research on parliamentary proceedings. Although GermaParl v2.0.0 is a large dataset (4341 plenary protocols / 273 million words), using the CWB version in combination with polmineR provides fast and free data access for everybody. All that is required is to install R, the packages polmineR and cwbtools, and the corpus itself.

Download and install corpus: We suggest installing the corpus using the functionality included in the R package cwbtools. Make sure that you have cwbtools v0.3.8 or higher and use the code snippet provided below. A stable internet connection is advisable: the size of the corpus tarball is 2.4 GB.

# install cwbtools and polmineR
install.packages("cwbtools") # v0.3.8 or later
install.packages("polmineR") # v0.8.8

# install GermaParl2
cwbtools::corpus_install(doi = "10.5281/zenodo.7949074")

# check installation
library(polmineR)
corpus("GERMAPARL2")

If you have not used CWB indexed corpora before, the installation process will suggest and create directories for data storage. This involves defining the environment variable CORPUS_REGISTRY permanently for future R sessions.
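To confirm that the setup described above has taken effect, the following minimal sketch checks the installed cwbtools version and whether CORPUS_REGISTRY points to an existing directory. The helper names check_cwbtools() and check_registry() are hypothetical and not part of cwbtools or polmineR; only base R functions are used.

```r
# Hypothetical helpers (not part of cwbtools or polmineR), base R only.

# TRUE if cwbtools is installed in at least the minimum version named above.
check_cwbtools <- function(min_version = "0.3.8") {
  if (!requireNamespace("cwbtools", quietly = TRUE)) {
    message("cwbtools is not installed")
    return(FALSE)
  }
  installed <- utils::packageVersion("cwbtools")
  message("cwbtools version found: ", as.character(installed))
  installed >= min_version
}

# TRUE if the registry path is non-empty and refers to an existing directory.
check_registry <- function(registry = Sys.getenv("CORPUS_REGISTRY")) {
  if (!nzchar(registry)) {
    message("CORPUS_REGISTRY is not set")
    return(FALSE)
  }
  if (!dir.exists(registry)) {
    message("CORPUS_REGISTRY does not point to an existing directory: ", registry)
    return(FALSE)
  }
  message("Registry directory: ", registry)
  TRUE
}

check_cwbtools()
check_registry()
```

If check_registry() returns FALSE in a fresh R session after installation, the environment variable was probably not stored permanently and may need to be set again, for example in the .Renviron file.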

Explore GermaParl and give feedback: We have taken great care to offer a high-quality research resource, but given the size of the data, it is impossible to check all of it manually. We encourage every user of GermaParl2 to contribute to the quality of the resource by reporting bugs and flaws in the corpus. GitHub issues are our primary issue tracker: issues concerning GermaParl v2 are collected in the GermaParl2 GitHub repository (https://github.com/PolMine/GermaParl2). You are also welcome to send us your suggestions via e-mail (stine.ziegler@uni.due.de).
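A first look at the installed corpus might resemble the sketch below. It assumes that polmineR is installed and that the corpus is registered under the ID GERMAPARL2; the wrapper explore_germaparl() and the query term "Bundestag" are illustrative choices rather than part of the official documentation. The names of the structural attributes are deliberately not assumed here: s_attributes() lists whatever the installed corpus provides.

```r
# Hypothetical wrapper for a first inspection of the corpus.
# Returns NULL (invisibly) if polmineR or the corpus is not available.
explore_germaparl <- function(corpus_id = "GERMAPARL2") {
  if (!requireNamespace("polmineR", quietly = TRUE)) {
    message("polmineR is not installed")
    return(invisible(NULL))
  }
  # corpus() without arguments lists the registered corpora; the column
  # name "corpus" is an assumption about the returned data.frame.
  available <- tryCatch(
    as.character(polmineR::corpus()[["corpus"]]),
    error = function(e) character()
  )
  if (!corpus_id %in% available) {
    message("Corpus ", corpus_id, " is not registered")
    return(invisible(NULL))
  }
  list(
    size = polmineR::size(corpus_id),                 # number of tokens
    s_attributes = polmineR::s_attributes(corpus_id), # available metadata
    count = polmineR::count(corpus_id, query = "Bundestag")
  )
}

overview <- explore_germaparl()
if (!is.null(overview)) print(overview)
```

Output of this kind (corpus size, available attributes, the exact query used) is also useful context to include when reporting a suspected flaw in the data.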

Acknowledgements:

  • We gratefully acknowledge funding from the German National Research Data Infrastructure (Nationale Forschungsdateninfrastruktur, NFDI). Funding from KonsortSWD has advanced the data preparation tool set to facilitate the robust annotation of large corpora with additional annotation layers (such as named entities). This is instrumental for linking parliamentary data with other data. KonsortSWD is funded by the German Research Foundation (DFG) as part of the NFDI under project number 442494171.
  • Funding from the Text+ consortium is instrumental for updates of the corpus, quality control and keeping data formats in step with current and future developments. Text+ is funded by the German Research Foundation (DFG) as part of the NFDI under project number 460033370.
  • The data quality of GermaParl we are able to offer at this stage has benefitted significantly from a cooperation with the SOLDISK project at the University of Hildesheim, and comprehensive manual quality control of the data carried out by the SOLDISK team. A very special thanks goes to Hannes Schammann, Max Kisselew, Franziska Ziegler, Carina Böker, Jennifer Elsner and Carolin McCrea.
  • We would also like to thank our beta users, who provided us with invaluable feedback and greatly enhanced the quality of the data over the course of multiple release candidates.

Files

Files (2.8 GB)

Checksum, Size
md5:a0db6f28a363ed23cf505468f0f0e68d, 2.4 GB
md5:ddc0d8446174d31201434900b6f4326b, 386.3 MB
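After a manual download, a file can be compared against the MD5 checksums listed above. The following is a minimal sketch using tools::md5sum() from base R; the helper verify_md5() and the file path in the usage comment are hypothetical, since the file names are not shown in this record.

```r
# Hypothetical helper: compare a file's MD5 digest with an expected value.
# Accepts the expected checksum with or without the "md5:" prefix.
verify_md5 <- function(path, expected) {
  expected <- tolower(sub("^md5:", "", expected))
  actual <- unname(tools::md5sum(path))  # NA if the file does not exist
  identical(actual, expected)
}

# Usage (placeholder path, replace with the location of the downloaded file):
# verify_md5("path/to/downloaded_file", "md5:a0db6f28a363ed23cf505468f0f0e68d")
```

A result of FALSE means that the file is missing or that its digest differs from the expected value, in which case downloading the file again is the usual remedy.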