Dataset Open Access

CROCorp: Corpus of Parliamentary Debates in Croatia

Mochtak, Michal; Glaurdić, Josip; Lesschaeve, Christophe

The repository contains a cleaned and pre-processed corpus of parliamentary debates from the Croatian Parliament (Sabor). The corpus is accompanied by metadata on elected representatives and their political parties. It covers the period 2003-2020 (five complete terms) and contains over 500 thousand speeches.

If you use the dataset, please cite: Mochtak, Michal, Josip Glaurdić, and Christophe Lesschaeve (2022): CROCorp: Corpus of Parliamentary Debates in Croatia (v1.1.1), https://doi.org/10.5281/zenodo.6521372.

v1.1.1 (latest version)
- added the concept DOI to codebooks (DOI was generated only after the repository was published)

v1.1.0
- improved coding of the dummy variable "moderator" (using a less error-prone algorithm for detecting the moderator role)
- fixed an issue with agenda points that are concatenated while preserving a unique web link
- recoded agenda point tags using a better ML model (transformer architecture)

v1.0.0
- originally posted on the GESIS repository (migrated to Zenodo due to limitations concerning the concept DOI)

Note: The corpus has one peculiarity that stems from the nature of the data and from how the official database stores them. Some agenda points are discussed together and then split into separate trails, so the recorded transcripts overlap in the parts they share. This is most apparent in the 9th term, when transcripts started to be recorded this way systematically. The official database stores the trails under unique links to preserve the within-case logic, which inevitably duplicates some speeches in the corpus and artificially inflates the speech count with speeches that are not unique. Rather than removing the duplicates shared across multiple agenda points, and thereby losing the direct trace of agenda-focused discussions, the corpus keeps them in their original form. If needed, they can be filtered out based on date and reported agenda number.
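The filtering mentioned in the note could be approached roughly as follows. This is a minimal sketch in plain Python, not the authors' procedure: the field names `date`, `agenda_number`, `speaker`, and `text` are hypothetical (the actual variable names are defined in the codebooks), and it assumes the speeches have already been exported from the RDS files into a list of records. It treats a speech as a duplicate when the same speaker delivers the same text on the same date under more than one agenda point, and keeps the copy with the lowest agenda number.

```python
def drop_shared_speeches(rows):
    """Keep one copy of each speech repeated across agenda points on one date.

    `rows` is a list of dicts with the hypothetical keys
    "date" (ISO string), "agenda_number", "speaker", and "text".
    """
    seen = set()
    unique = []
    # sorted() is stable, so the original speech order inside each
    # (date, agenda_number) group is preserved.
    for row in sorted(rows, key=lambda r: (r["date"], r["agenda_number"])):
        key = (row["date"], row["speaker"], row["text"])
        if key in seen:
            continue  # same speech already kept under a lower agenda number
        seen.add(key)
        unique.append(row)
    return unique


# Toy records (invented for illustration, not taken from the corpus).
rows = [
    {"date": "2019-03-01", "agenda_number": 2, "speaker": "A", "text": "Joint opening remarks."},
    {"date": "2019-03-01", "agenda_number": 1, "speaker": "A", "text": "Joint opening remarks."},
    {"date": "2019-03-01", "agenda_number": 1, "speaker": "B", "text": "Comment on point 1."},
    {"date": "2019-03-01", "agenda_number": 2, "speaker": "C", "text": "Comment on point 2."},
]
unique = drop_shared_speeches(rows)
```

Keeping the lowest agenda number is an arbitrary choice made for this sketch; an analysis focused on a specific agenda point may prefer to keep the copy filed under that point instead.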
Files (394.5 MB)
Name                        Size      Checksum
CODEBOOK_CRO_corpus.pdf     162.8 kB  md5:99299c11a65e0a4c28aca0f8e61d17b4
CODEBOOK_CRO_mps.pdf        160.5 kB  md5:7b2f74d99b37ea21119fa63fb81f72e9
CODEBOOK_CRO_parties.pdf    159.1 kB  md5:fec441be729e846fdc1cd3d8310e8995
CRO_5_term_final.RDS        76.2 MB   md5:9f64701b5b365f617a5df677bbd66430
CRO_6_term_final.RDS        66.9 MB   md5:910a26d15b3d8ac7cdfac45dea0a7e41
CRO_7_term_final.RDS        93.6 MB   md5:fadd0e7f3e40dca5fa3f8000a3abefd2
CRO_8_term_final.RDS        8.9 MB    md5:48b3d069967a30926697a46c9c2b9a8b
CRO_9_term_final.RDS        148.3 MB  md5:6b1c3b7e1fb0f850a3eaaf33c7b17648
Croatia_MPs_final.xlsx      132.9 kB  md5:f44528cbe4320cde847d1f436d85b4c1
Croatia_parties_final.xlsx  13.4 kB   md5:895d878c4b200a0455b72c4d5b11de53
