Published June 9, 2022 | Version v2
Dataset Open

A Greek Parliament Proceedings Dataset for Computational Linguistics and Political Analysis

  • 1. Athens University of Economics & Business, Greece
  • 2. Stockholm University, Sweden

Description

The dataset is a new version of the previous upload and includes the following files:

1. dataset_versions/tell_all.csv: The initial dataset of 1,280,927 extracted speeches, before preprocessing and cleaning. The speeches extend chronologically from July 1989 up to July 2020 and were exported from 5,355 parliamentary sitting record files. The file has a total volume of 2.5 GB and includes the following columns:

  • member_name: the name of the individual who spoke during a sitting.
  • sitting_date: the date the sitting took place.
  • parliamentary_period: the name and/or number of the parliamentary period that the speech took place in. A parliamentary period is defined as the time span between one general election and the next. A parliamentary period includes multiple parliamentary sessions.
  • parliamentary_session: the name and/or number of the parliamentary session that the speech took place in. A session is defined as a time span of usually 10 months within a parliamentary period during which the parliament can convene and function as stipulated by the constitution. A session can fall into the following categories: regular, extraordinary or special. In the intervals between the sessions the parliament is in recess. A parliamentary session includes multiple parliamentary sittings.
  • parliamentary_sitting: the name and/or number of the parliamentary sitting that the speech took place in. A sitting is defined as a meeting of parliament members.
  • political_party: the political party of the speaker.
  • government: the government in force when the speech took place.
  • member_region: the electoral district the speaker belonged to.
  • roles: information about the parliamentary roles and/or government position of the speaker.
  • member_gender: the gender of the speaker.
  • speech: the speech that the individual gave during the parliamentary sitting.
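
As a minimal sketch of how a file with the columns listed above might be consumed, the following streams rows with Python's standard csv module and counts speeches per party. The comma delimiter, the presence of a header row and the sample values are assumptions, not properties confirmed by this description.

```python
import csv
import io
from collections import Counter

# Column names as listed above; delimiter and header row are assumptions.
COLUMNS = [
    "member_name", "sitting_date", "parliamentary_period",
    "parliamentary_session", "parliamentary_sitting", "political_party",
    "government", "member_region", "roles", "member_gender", "speech",
]

def count_speeches_per_party(handle):
    """Stream rows from an open CSV handle and count speeches per party."""
    reader = csv.DictReader(handle)
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    counts = Counter()
    for row in reader:
        counts[row["political_party"]] += 1
    return counts

# Tiny in-memory stand-in for the real file (all values are made up).
sample = io.StringIO(
    ",".join(COLUMNS) + "\n"
    + "A,1989-07-03,p1,s1,1,party_x,gov1,region1,[],male,hello\n"
    + "B,1989-07-03,p1,s1,1,party_y,gov1,region2,[],female,world\n"
    + "A,1989-07-04,p1,s1,2,party_x,gov1,region1,[],male,again\n"
)
party_counts = count_speeches_per_party(sample)
```

For the real 2.5 GB file, the same function could be given a handle from open(..., newline="") so that rows are read one at a time. Very long speeches may exceed the csv module's default field size limit, which can be raised with csv.field_size_limit.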

2. dataset_versions/tell_all_FILLED.csv: An intermediate version of the dataset with improved consistency and completeness, with a total volume of 2.5 GB. It is produced by filling in the names of sitting chairmen that are missing from "tell_all.csv". It includes the same columns as "tell_all.csv".

3. dataset_versions/tell_all_cleaned.csv: This version of the dataset is the result of further cleaning and preprocessing and is used for our word usage change study. It consists of 1,280,918 speech fragments of Greek parliament members in the order of the conversation that took place, with a total volume of 2.12 GB. It includes the same columns as the aforementioned versions. The preprocessing includes the replacement of all references to political parties with the symbol "@" followed by an abbreviation of the party name, using regular expressions that capture different grammatical cases and variations. It also includes the removal of accents, strings with length less than 2 characters, all punctuation except full stops, and the replacement of stopwords with "@sw".
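
The cleaning steps described in item 3 can be sketched roughly as follows. This is an illustration only: the party pattern, the stopword list and the order of the steps are hypothetical stand-ins, since the actual regular expressions and stopwords are not given in this description.

```python
import re
import unicodedata

# Hypothetical examples; the real party patterns and stopword list are not given here.
PARTY_PATTERNS = {
    # Matches two grammatical cases of an (already unaccented) party name.
    "nd": re.compile(r"\bνεας? δημοκρατιας?\b"),
}
STOPWORDS = {"και", "το", "της", "η"}

def strip_accents(text):
    """Remove accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def preprocess(speech):
    text = strip_accents(speech)
    # Drop all punctuation except full stops.
    text = re.sub(r"[^\w\s.]", " ", text)
    # Replace party references with "@" plus an abbreviation (done after the
    # punctuation step so that the "@" symbol survives; this order is an assumption).
    for abbreviation, pattern in PARTY_PATTERNS.items():
        text = pattern.sub("@" + abbreviation, text)
    tokens = []
    for token in text.replace(".", " . ").split():
        if token == ".":
            tokens.append(token)      # full stops are kept
        elif token in STOPWORDS:
            tokens.append("@sw")      # stopwords become a placeholder
        elif len(token) >= 2:
            tokens.append(token)      # strings shorter than 2 characters are dropped
    return " ".join(tokens)

cleaned = preprocess("η νέα δημοκρατία ψήφισε, και ο λαός είδε.")
```

In this sketch full stops end up as separate tokens, which is a side effect of the tokenisation chosen here rather than a documented property of the cleaned file.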

4. wiki_data: A folder of Modern Greek female and male names and surnames and their available grammatical cases, crawled from the entries of the Wiktionary Greek names category (https://en.wiktionary.org/wiki/Category:Greek_names). We generated the missing grammatical cases according to the rules of Greek grammar and saved the resulting files in the same folder, adding the string "_populated.json" to their filenames.

5. parl_members_activity_1989onwards_with_gender.csv: The Greek Parliament website provides a list of all the elected members of parliament since the fall of the military junta in Greece, in 1974. We collected and cleaned the data, added the gender and kept the elected members from 1989 onwards, matching the available parliament proceeding records. This dataset includes the full names of the members, the date range of their service, the political party they served, the electoral district they belonged to and their gender.

6. formatted_roles_gov_members_data.csv: By government members we refer to individuals in ministerial or other government posts, regardless of whether they were elected to the parliament. This information is available on the website of the Secretariat General for Legal and Parliamentary Affairs. The government members dataset includes the full names of the officials, the name of the role they were given, the date range of their service in each specific role and their gender.

7. governments_1989onwards.csv: A dataset of government information including the names of governments since 1989, their start and end dates, and a URL that points to the respective official government web page of each past government. The data is crawled from the website of the Secretariat General for Legal and Parliamentary Affairs.

8. extra_roles_manually_collected.csv: A dataset with manually collected information from Wikipedia about additional government or parliament posts such as Chairman of the Parliament, party leaders, opposition leaders and other information.

9. all_members_activity.csv: A dataset that merges all the information of the aforementioned files 5, 6, 7 and 8. Each row of the file includes the full name of the individual, the start and end date of their term of office, the political party and electoral district they belonged to, their gender, the parliamentary and/or government positions that they held along with start and end dates, and the name of the government that was in power during their term of office. An individual can change political parties or become an independent member of the parliament during a parliamentary period, and can therefore have more than one row in the file.
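
Because one individual can have several rows in the merged file, a lookup by name alone is ambiguous; the date has to be taken into account as well. The sketch below illustrates this with made-up rows. The key names "start" and "end" are hypothetical and may not match the actual column headers.

```python
from datetime import date

# Made-up rows; key names for the date range are hypothetical.
rows = [
    {"member_name": "member a", "political_party": "party_x",
     "start": date(2012, 6, 17), "end": date(2013, 5, 31)},
    {"member_name": "member a", "political_party": "independent",
     "start": date(2013, 6, 1), "end": date(2015, 1, 24)},
]

def party_on(rows, member_name, day):
    """Return the party of a member on a given day, or None if no row covers it."""
    for row in rows:
        if row["member_name"] == member_name and row["start"] <= day <= row["end"]:
            return row["political_party"]
    return None

party_before = party_on(rows, "member a", date(2013, 1, 15))
party_after = party_on(rows, "member a", date(2014, 1, 15))
```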

10. freqs_for_semantic_shift_cleaned_data_decade1990.csv & freqs_for_semantic_shift_cleaned_data_decade2010.csv: Files of frequencies of words in the corpora of the decades 1990-1999 and 2010-2019.

11. compass_top100.csv: Top 100 most changed words between the decades 1990-1999 and 2010-2019, as computed with the use of the Compass tool by Di Carlo et al. [1].

12. compass_fc_top100.csv: Top 100 most changed words between the decades 1990-1999 and 2010-2019, as computed with the use of the Compass tool [1] in combination with the frequency cut-offs of the Gonen et al. approach [3]. For the frequency cut-offs, the frequency files of item 10 are used.
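
One way a frequency cut-off could be combined with a ranking of changed words is sketched below. The threshold of 50, the function name and the in-memory dictionaries are assumptions for illustration; the exact cut-offs used for this file are not stated here.

```python
def apply_frequency_cutoff(ranked_words, freqs_a, freqs_b, min_freq=50, top_n=100):
    """Keep ranked words that occur at least min_freq times in both corpora."""
    kept = [
        word for word in ranked_words
        if freqs_a.get(word, 0) >= min_freq and freqs_b.get(word, 0) >= min_freq
    ]
    return kept[:top_n]

# Toy inputs: a ranking (most changed first) and per-decade word counts.
ranking = ["w1", "w2", "w3", "w4"]
freqs_1990s = {"w1": 500, "w2": 10, "w3": 75, "w4": 60}
freqs_2010s = {"w1": 300, "w2": 900, "w3": 80}

filtered = apply_frequency_cutoff(ranking, freqs_1990s, freqs_2010s)
```

Words that are rare in either decade ("w2") or absent from one of them ("w4") are dropped, while the relative order of the remaining words is preserved.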

13. procrustes_top100.csv: Top 100 most changed words between the decades 1990-1999 and 2010-2019, as computed with the use of the Orthogonal Procrustes approach of Hamilton et al. [2].

14. nn_top100.csv: Top 100 most changed words between the decades 1990-1999 and 2010-2019, as computed with the use of the Gonen et al. approach [3].
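
The core idea of the nearest-neighbour approach [3] is that a word whose top-k neighbours differ between two embedding spaces has probably changed in usage. The following is a simplified sketch of that idea on toy two-dimensional vectors; the function names, the vectors and the choice of k are illustrative and do not reproduce the settings used to compute this file.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_neighbours(word, vectors, k):
    """Return the set of the k words most similar to `word` in one space."""
    others = [w for w in vectors if w != word]
    others.sort(key=lambda w: cosine(vectors[word], vectors[w]), reverse=True)
    return set(others[:k])

def usage_change_ranking(space_a, space_b, k):
    """Rank shared words by neighbour overlap, smallest overlap (most changed) first."""
    shared = sorted(set(space_a) & set(space_b))
    a = {w: space_a[w] for w in shared}
    b = {w: space_b[w] for w in shared}
    overlap = {
        w: len(top_k_neighbours(w, a, k) & top_k_neighbours(w, b, k))
        for w in shared
    }
    return sorted(overlap.items(), key=lambda item: item[1])

# Toy spaces: "bank" moves from the "river" region to the "money" region.
space_a = {"bank": (1.0, 0.0), "river": (0.9, 0.1), "money": (0.1, 0.9), "loan": (0.0, 1.0)}
space_b = {"bank": (0.0, 1.0), "river": (1.0, 0.0), "money": (0.1, 0.9), "loan": (0.2, 1.0)}

ranking = usage_change_ranking(space_a, space_b, k=1)
```

Note that the two spaces never need to be aligned with each other, because only neighbour sets within each space are compared.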

15. second_order_top100.csv: Top 100 most changed words between the decades 1990-1999 and 2010-2019, as computed with the use of the Second-Order Similarity approach by Hamilton et al. [4].

16. top100_minfreq50.xls: An .xls file for convenient viewing of the top 100 most changed words per approach with a minimum frequency of 50 occurrences, produced by merging the aforementioned files 11, 12, 13, 14 and 15.

17. freqs_for_semantic_shift_cleaned_data_period1997_2007.csv & freqs_for_semantic_shift_cleaned_data_period2008_2018.csv: Files of frequencies of words in the corpora of the periods before (1997-2007) and during (2008-2018) the Greek economic crisis.

18. semantic_shifts_dichotomy_crisis_compass_1997_2007_2008_2018_atleast50.csv: A file with the top 100 most changed words between the periods before (1997-2007) and during (2008-2018) the Greek economic crisis. The computations are implemented with the use of the Compass tool.

19. selected_topics_shift_per_period_compass.csv: The usage change of selected topics/words of generic political interest between pairs of consecutive parliamentary periods. The computations are implemented with the use of the Compass tool.

20. semantic_shifts_party_embeddings_per_period_merged_compass.csv: The usage change of selected political party names that have played an important role in recent political history, namely New Democracy (ND), the Panhellenic Socialist Movement (PASOK), the Coalition of the Radical Left - Progressive Alliance (SYRIZA), the Communist Party of Greece (KKE), the Coalition of the Left, of Movements and Ecology (SYN) and Golden Dawn (GD).

-------------

Citations:

[1] Valerio Di Carlo, Federico Bianchi, and Matteo Palmonari. Training Temporal Word Embeddings with a Compass. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, AAAI'19, pages 6326–6334, 2019. doi: 10.1609/aaai.v33i01.33016326.

[2] William L. Hamilton, Jure Leskovec, and Dan Jurafsky. Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2016, pages 1489–1501, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1141. URL https://www.aclweb.org/anthology/P16-1141.

[3] Hila Gonen, Ganesh Jawahar, Djamé Seddah, and Yoav Goldberg. Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, pages 538–555, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.51. URL https://aclanthology.org/2020.acl-main.51.

[4] William L. Hamilton, Jure Leskovec, and Dan Jurafsky. Cultural Shift or Linguistic Drift? Comparing Two Computational Measures of Semantic Change. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, pages 2116–2121, Austin, Texas, November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1229. URL https://www.aclweb.org/anthology/D16-1229.

-------------

Acknowledgments:

This work was supported by the European Union's Horizon 2020 research and innovation program "FASTEN" under grant agreement No 825328 and the non-profit data journalism organization iMEdD.org.

Files

Greek Parliament Proceedings Dataset_Support Files_Word Usage Change Computations.zip

Additional details

Related works

Is new version of
Dataset: 10.5281/zenodo.6626316 (DOI)

Funding

European Commission
FASTEN - Fine-Grained Analysis of Software Ecosystems as Networks 825328