There is a newer version of the record available.

Published October 11, 2024 | Version 1.0
Model Open

Authdetect: Model for Detecting Authoritarian Discourse in Political Speeches

Authors/Creators

Description

This is the official replication repository for the paper Chasing the Authoritarian Spectre: Detecting Authoritarian Discourse with Large Language Models, published in the European Journal of Political Research. It contains the raw datasets for training and validating authdetect, replication scripts, a quick walkthrough of the model (YouTube tutorial), and a complete Jupyter notebook for running the model on users' own data in Google Colab.

authdetect is a classification model for detecting authoritarian discourse in political speeches, leveraging a novel approach to studying latent political concepts through language modeling. Rather than relying on predefined rules or rigid definitions of authoritarian discourse, the model operates on the premise that authoritarian leaders naturally exhibit such discourse in their speech patterns. Essentially, the model assumes that "authoritarians talk like authoritarians," allowing it to discern instances of authoritarian rhetoric from speech segments. Structured as a regression problem with weak supervision logic, the model classifies text segments based on their association with either authoritarian or democratic discourse. By training on speeches from both authoritarian and democratic leaders, it learns to distinguish between these two distinct forms of political rhetoric.

The model is fine-tuned on top of the roberta-base model using 77 years of speech data from the UN General Assembly. The training design combines English transcripts of political speeches with a weak supervision setup in which the training data are annotated with the V-Dem polyarchy index (i.e., polyarchic status) as the reference labels. The model is trained to predict the index value of a speech, linking the presented narratives with the virtual quality of democracy of the speaker's country (rather than with the speaker themselves). The corpus ensures robust temporal (1946–2022) and spatial (197 countries) representation, resulting in a well-balanced training dataset. Although the training data are domain-specific (the UN General Assembly), the model trained on the UNGD corpus (the UN General Debate transcripts) appears to be robust across various sub-domains, demonstrating its capacity to scale well across regions and contexts. Rather than using whole speeches as input data for training, the model uses a sliding window of sentence trigrams, splitting the raw transcripts into uniform snippets of text that map the political language of world leaders. As the goal is to model the varying context of the ideas presented in the analyzed speeches rather than the context of the UN General Assembly debates, the main focus is on the particularities of the language of the reference groups (authoritarian/democratic leaders). The final dataset contains 1,062,286 sentence trigrams annotated with EDI scores (V-Dem's Electoral Democracy Index, i.e., the polyarchy index mentioned above) inherited from the parent documents (μ = 0.430, 95% CI [0.429, 0.430]).
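The sliding-window segmentation described above can be sketched in a few lines of Python. This is a minimal illustration, not the replication code: the function names (split_sentences, sentence_trigrams), the regex-based sentence splitter, the stride of one sentence, and the handling of speeches shorter than three sentences are all assumptions made for the example. Each snippet inherits the score of its parent speech, mirroring the weak supervision setup.

```python
import re
from typing import List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter (assumption): break after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def sentence_trigrams(text: str, score: float, size: int = 3) -> List[Tuple[str, float]]:
    """Slide a window of `size` sentences over a speech, one sentence at a time.

    Every snippet is paired with the document-level score of the parent
    speech (e.g. the polyarchy index of the speaker's country).
    """
    sentences = split_sentences(text)
    if not sentences:
        return []
    # Assumption: a speech shorter than the window yields one shorter snippet.
    n_windows = max(len(sentences) - size + 1, 1)
    return [
        (" ".join(sentences[i:i + size]), score)
        for i in range(n_windows)
    ]


if __name__ == "__main__":
    # Hypothetical speech and score, used only to show the output shape.
    speech = "We seek peace. We seek trade. We reject war. We thank you."
    for snippet, label in sentence_trigrams(speech, score=0.43):
        print(label, "|", snippet)
```

A four-sentence speech produces two overlapping snippets, so neighbouring training examples share two of their three sentences.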

Video tutorial

The official repository includes a walkthrough tutorial that demonstrates how to use the authdetect model on users' own data. By downloading the interactive Jupyter notebook and the sample data (how_to_use_authdetect.ipynb, sample_data.csv), anyone can follow the step-by-step instructions and run the full pipeline in Google Colab. The whole process can also be followed in a tutorial video available at: https://www.youtube.com/watch?v=CRy9uxMChoE.

Hugging Face

The model is also available on Hugging Face, where users can download it and rely on the existing library support for implementation. The Zenodo repository contains the model solely for archival purposes related to the paper. The Hugging Face repository also includes a minimal example demonstrating the model's application. For a complete pipeline, users are encouraged to use the interactive notebook and watch the tutorial on YouTube. The repository is available at: https://huggingface.co/mmochtak/authdetect.

If you use the repository, please cite:

@article{mochtak_chasing_2024,
    title = {Chasing the authoritarian spectre: {Detecting} authoritarian discourse with large language models},
    issn = {1475-6765},
    shorttitle = {Chasing the authoritarian spectre},
    url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/1475-6765.12740},
    doi = {10.1111/1475-6765.12740},
    journal = {European Journal of Political Research},
    author = {Mochtak, Michal},
    keywords = {authoritarian discourse, deep learning, detecting authoritarianism, model, political discourse},
}

Files

_model_card.pdf

Files (1.6 GB)

Checksum | Size
md5:a316f8421ce4d2579e142ef25dc2d241 | 108.2 kB
md5:f32d822060bc45502d5ab729bdb6dbae | 455.5 MB
md5:2c724aa9ee7259c48dfc13b5c8ccbfe9 | 519.6 kB
md5:ab682398d999c219ee196b8082694176 | 20.9 kB
md5:cf3a5ea3584b6897c6047db10597d693 | 1.1 GB
md5:3820e7976c929c79aabba9cc5dd07c46 | 2.6 MB
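The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard hashlib module; the helper names (md5_of_file, matches_record) are hypothetical, and the file path is a placeholder because the listing here does not show which file name belongs to which checksum.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks
    so that large archives do not have to fit into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path: str, listed: str) -> bool:
    """Compare a local file against a checksum written as 'md5:<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


if __name__ == "__main__":
    # Placeholder path (assumption): substitute the file you downloaded
    # and the checksum shown next to it on the record page.
    print(matches_record("downloaded_file.bin",
                         "md5:a316f8421ce4d2579e142ef25dc2d241"))
```

A result of False means the local copy differs from the archived file and should be downloaded again.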

Additional details

Additional titles

Alternative title
Replication repository for "Chasing the Authoritarian Spectre: Detecting Authoritarian Discourse with Large Language Models"

Related works

Is published in
Publication: 10.1111/1475-6765.12740 (DOI)

Funding

Radboud University Nijmegen
Radboud Excellence Fellowship 2702184

Dates

Accepted
2024-10-06
Paper accepted for publication

Software

Repository URL
https://huggingface.co/mmochtak/authdetect
Programming language
R, Python
Development Status
Active