Published May 9, 2022 | Version 1.1.1
Dataset Open

CROCorp: Corpus of Parliamentary Debates in Croatia

  • 1. University of Luxembourg

Description

The repository contains a cleaned and pre-processed corpus of parliamentary debates from the Croatian Parliament (Sabor). The corpus is accompanied by metadata on elected representatives and their political parties. It covers the period 2003-2020 (five complete terms) and contains over 500 thousand speeches.

If you use the dataset, please cite: Mochtak, Michal, Josip Glaurdić, and Christophe Lesschaeve (2022): CROCorp: Corpus of Parliamentary Debates in Croatia (v1.1.1), https://doi.org/10.5281/zenodo.6521372.

v1.1.1 (latest version)
- added the concept DOI to codebooks (DOI was generated only after the repository was published)

v1.1.0
- improved coding of the dummy variable "moderator" (using a less error-prone algorithm for detecting the moderator role)
- fixed an issue with agenda points that were concatenated while preserving a unique web link
- recoded agenda point tags using a better ML model (transformer architecture)

v1.0.0
- originally posted on the GESIS repository (migrated to Zenodo due to limitations concerning the concept DOI)

Notes

Note: One corpus-specific peculiarity stems from the nature of the data and from how the official database stores them. Some agenda points are discussed together and then split into separate trails, so the recorded transcripts overlap in the parts they share. This is most apparent in the 9th term, when transcripts began to be recorded this way systematically. The official database stores the trails under unique links to preserve the within-case logic, which inevitably duplicates some speeches in the corpus and artificially inflates the speech count with entries that are not unique. Rather than removing the duplicates shared across multiple agenda points right away and losing the direct trace of agenda-focused discussions, the corpus keeps them in their original form. If needed, they can be filtered out based on date and reported agenda number.
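A minimal sketch of one way such filtering might look, using only the Python standard library. The field names `date`, `agenda_no`, `speaker` and `text` are hypothetical placeholders, not the corpus's actual variable names (the codebook lists those), and the rule applied here (drop a speech when the same speaker and text already appeared on the same date under a different agenda number) is only one possible reading of the note above. The rows are made-up toy data.

```python
from typing import Dict, List, Tuple


def drop_shared_speeches(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first copy of a speech that recurs on the same date
    under a different agenda number; keep everything else.

    Field names are hypothetical; adapt them to the codebook.
    """
    # (date, speaker, text) -> agenda number where the speech was first seen
    first_seen: Dict[Tuple[str, str, str], str] = {}
    unique: List[Dict[str, str]] = []
    for row in rows:
        key = (row["date"], row["speaker"], row["text"])
        if key not in first_seen:
            first_seen[key] = row["agenda_no"]
            unique.append(row)
        elif first_seen[key] == row["agenda_no"]:
            # Same agenda point: treated as a genuine repetition, so kept.
            unique.append(row)
        # Otherwise: same speech, same date, different agenda point -> dropped.
    return unique


# Toy rows (invented for illustration only).
speeches = [
    {"date": "2017-03-01", "agenda_no": "12", "speaker": "A", "text": "Hvala."},
    {"date": "2017-03-01", "agenda_no": "13", "speaker": "A", "text": "Hvala."},
    {"date": "2017-03-02", "agenda_no": "14", "speaker": "A", "text": "Hvala."},
]
print(len(drop_shared_speeches(speeches)))
```

Keeping the first occurrence preserves one agenda-linked copy of each shared speech; analyses that need the full agenda-focused trails should work with the unfiltered corpus instead.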

Files

CODEBOOK_CRO_corpus.pdf

Files (394.5 MB)

Checksum  Size
md5:99299c11a65e0a4c28aca0f8e61d17b4  162.8 kB
md5:7b2f74d99b37ea21119fa63fb81f72e9  160.5 kB
md5:fec441be729e846fdc1cd3d8310e8995  159.1 kB
md5:9f64701b5b365f617a5df677bbd66430  76.2 MB
md5:910a26d15b3d8ac7cdfac45dea0a7e41  66.9 MB
md5:fadd0e7f3e40dca5fa3f8000a3abefd2  93.6 MB
md5:48b3d069967a30926697a46c9c2b9a8b  8.9 MB
md5:6b1c3b7e1fb0f850a3eaaf33c7b17648  148.3 MB
md5:f44528cbe4320cde847d1f436d85b4c1  132.9 kB
md5:895d878c4b200a0455b72c4d5b11de53  13.4 kB
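
The listing above gives an MD5 checksum for each file, which can be used to check that a download arrived intact. The following is a small sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_listing` are illustrative and not part of the dataset, and the local file path is whatever name the downloaded file was saved under.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks so that
    large corpus files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path: str, listed: str) -> bool:
    """Compare a local file against a listed value such as 'md5:99299c...'.

    The optional 'md5:' prefix is stripped before comparing.
    """
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listing(local_path, "md5:99299c11a65e0a4c28aca0f8e61d17b4")` would return `True` only if the file saved at `local_path` has that checksum.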

Additional details

Funding

ELWar – Electoral Legacies of War: Political Competition in Postwar Southeast Europe 714589
European Commission