Authdetect: Model for Detecting Authoritarian Discourse in Political Speeches
Description
This is the official replication repository for the paper Chasing the Authoritarian Spectre: Detecting Authoritarian Discourse with Large Language Models, published in the European Journal of Political Research. It contains the raw datasets for training and validating the authdetect model, replication scripts, a quick walkthrough of the model (YT tutorial), and a complete Jupyter notebook for using the model on users' own data in Google Colab.
authdetect is a classification model for detecting authoritarian discourse in political speeches, leveraging a novel approach to studying latent political concepts through language modeling. Rather than relying on predefined rules or rigid definitions of authoritarian discourse, the model operates on the premise that authoritarian leaders naturally exhibit such discourse in their speech patterns. Essentially, the model assumes that "authoritarians talk like authoritarians," allowing it to discern instances of authoritarian rhetoric from speech segments. Structured as a regression problem with weak supervision logic, the model classifies text segments based on their association with either authoritarian or democratic discourse. By training on speeches from both authoritarian and democratic leaders, it learns to distinguish between these two distinct forms of political rhetoric.
The model is fine-tuned on top of the roberta-base model using 77 years of speech data from the UN General Assembly. The training design combines English transcripts of political speeches with a weak supervision setup in which the training data are annotated with the V-Dem polyarchy index (i.e., polyarchic status) as reference labels. The model is trained to predict the index value of a speech, linking the presented narratives with the virtual quality of democracy of the speaker's country (rather than with the speaker personally). The corpus offers robust temporal (1946–2022) and spatial (197 countries) coverage, resulting in a well-balanced training dataset. Although the training data are domain-specific (the UN General Assembly), the model trained on the UNGD corpus appears to be robust across various sub-domains, demonstrating its capacity to scale well across regions and contexts.
Rather than using whole speeches as input for training, the model uses a sliding window of sentence trigrams that splits the raw transcripts into uniform snippets of text mapping the political language of world leaders. Because the goal is to model the varying context of the ideas presented in the analyzed speeches rather than the context of the UN General Assembly debates, the main focus is on the particularities of the language of the reference groups (authoritarian/democratic leaders). The final dataset contains 1 062 286 sentence trigrams annotated with EDI scores (Electoral Democracy Index, i.e., the V-Dem polyarchy index values) inherited from the parent documents (μ = 0.430, 95% CI [0.429, 0.430]).
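The segmentation step described above can be illustrated with a minimal Python sketch. It is not the repository's preprocessing code: the regex-based sentence splitter, the window stride of one sentence, the handling of speeches shorter than three sentences, and the function names `split_sentences` and `sentence_trigrams` are all assumptions made for illustration. What it does show is the general idea of cutting a transcript into overlapping three-sentence snippets, each of which inherits the score of its parent speech.

```python
import re
from typing import List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter (assumption): break on whitespace after ., ! or ?."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_trigrams(text: str, score: float, n: int = 3) -> List[Tuple[str, float]]:
    """Slide a window of n sentences over the text, moving one sentence at a time.

    Every snippet is paired with the score of the parent document
    (the weak-supervision label).
    """
    sentences = split_sentences(text)
    if not sentences:
        return []
    if len(sentences) < n:
        # Assumption: a speech shorter than n sentences is kept as one snippet.
        return [(" ".join(sentences), score)]
    return [
        (" ".join(sentences[i:i + n]), score)
        for i in range(len(sentences) - n + 1)
    ]


# Hypothetical example: a four-sentence speech with an illustrative score of 0.9.
speech = (
    "We defend freedom. We respect the vote. "
    "Courts stay independent. The press is free."
)
for snippet, label in sentence_trigrams(speech, 0.9):
    print(label, "|", snippet)
```

With a stride of one sentence, a speech of k sentences (k ≥ 3) yields k − 2 snippets, so neighbouring snippets share two sentences. A production pipeline would likely use a proper sentence tokenizer instead of the regex used here.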
Video tutorial
The official repository includes a walkthrough tutorial that demonstrates how to use the authdetect model on your own data. After downloading the interactive Jupyter notebook and the sample data (how_to_use_authdetect.ipynb, sample_data.csv), anyone can follow the step-by-step instructions and run the pipeline in Google Colab. The whole process can also be followed in a tutorial video available at: https://www.youtube.com/watch?v=CRy9uxMChoE.
HuggingFace
The model is also available on Hugging Face, where users can download it and rely on the existing tooling for implementation. The Zenodo repository contains the model solely for archival purposes related to the paper. The Hugging Face archive additionally includes a minimal example demonstrating the model's application. For a complete pipeline, users are encouraged to use the interactive notebook and watch the tutorial available on YouTube. You can explore the repository at: https://huggingface.co/mmochtak/authdetect.
If you use the repository, please cite:
@article{mochtak_chasing_2024,
title = {Chasing the authoritarian spectre: {Detecting} authoritarian discourse with large language models},
issn = {1475-6765},
shorttitle = {Chasing the authoritarian spectre},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/1475-6765.12740},
doi = {10.1111/1475-6765.12740},
journal = {European Journal of Political Research},
author = {Mochtak, Michal},
keywords = {authoritarian discourse, deep learning, detecting authoritarianism, model, political discourse},
}
Files
_model_card.pdf
Total size: 1.6 GB
| MD5 checksum | Size |
|---|---|
| a316f8421ce4d2579e142ef25dc2d241 | 108.2 kB |
| f32d822060bc45502d5ab729bdb6dbae | 455.5 MB |
| 2c724aa9ee7259c48dfc13b5c8ccbfe9 | 519.6 kB |
| ab682398d999c219ee196b8082694176 | 20.9 kB |
| cf3a5ea3584b6897c6047db10597d693 | 1.1 GB |
| 3820e7976c929c79aabba9cc5dd07c46 | 2.6 MB |
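The MD5 checksums listed above can be used to confirm that a downloaded file is intact. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify_md5` and the local file path in the usage comment are hypothetical, and the mapping between checksums and file names must be taken from the record itself.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path: str, expected: str) -> bool:
    """Compare a file's MD5 digest with an expected value.

    Accepts the checksum with or without a leading "md5:" prefix,
    in upper or lower case.
    """
    expected_hex = expected.strip().split(":")[-1].lower()
    return md5_of_file(path) == expected_hex


# Usage (hypothetical local path; substitute the file you actually downloaded):
# ok = verify_md5("downloads/some_file", "md5:a316f8421ce4d2579e142ef25dc2d241")
# print("checksum matches" if ok else "checksum mismatch, download again")
```

Reading in chunks matters for the larger archives (455.5 MB and 1.1 GB), which would otherwise be loaded into memory in full.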
Additional details
Additional titles
- Alternative title
- Replication repository for "Chasing the Authoritarian Spectre: Detecting Authoritarian Discourse with Large Language Models"
Related works
- Is published in
- Publication: 10.1111/1475-6765.12740 (DOI)
Funding
- Radboud University Nijmegen
- Radboud Excellence Fellowship 2702184
Dates
- Accepted
- 2024-10-06: Paper accepted for publication
Software
- Repository URL
- https://huggingface.co/mmochtak/authdetect
- Programming language
- R, Python
- Development Status
- Active