Authdetect: Model for Detecting Authoritarian Discourse in Political Speeches

Mochtak, Michal

doi:10.5281/zenodo.13920400

Published October 11, 2024 | Version 1.0

Resource type: Model | Access: Open

This is the official replication repository for the paper "Chasing the Authoritarian Specter: Detecting Authoritarian Discourse with Large Language Models", published in the European Journal of Political Research. It contains the raw datasets used to train and validate authdetect, the replication scripts, a quick walkthrough of the model (YouTube tutorial), and a complete Jupyter notebook for running the model on users' own data in Google Colab.

authdetect is a classification model for detecting authoritarian discourse in political speeches, leveraging a novel approach to studying latent political concepts through language modeling. Rather than relying on predefined rules or rigid definitions of authoritarian discourse, the model operates on the premise that authoritarian leaders naturally exhibit such discourse in their speech patterns. Essentially, the model assumes that "authoritarians talk like authoritarians," allowing it to discern instances of authoritarian rhetoric from speech segments. Structured as a regression problem with weak supervision logic, the model classifies text segments based on their association with either authoritarian or democratic discourse. By training on speeches from both authoritarian and democratic leaders, it learns to distinguish between these two distinct forms of political rhetoric.

The model is fine-tuned on top of the roberta-base model using 77 years of speech data from the UN General Assembly. The training design combines English transcripts of political speeches with a weak supervision setup in which the training data are annotated with the V-Dem polyarchy index (i.e., polyarchic status) as reference labels. The model is trained to predict the index value of a speech, linking the presented narratives with the virtual quality of democracy of the speaker’s country (rather than with the speaker personally). The corpus offers robust temporal (1946–2022) and spatial (197 countries) coverage, resulting in a well-balanced training dataset. Although the training data are domain-specific (the UN General Assembly), the model trained on the UNGD corpus appears to be robust across various sub-domains, demonstrating its capacity to scale well across regions and contexts.

Rather than using whole speeches as training input, the model uses a sliding window of sentence trigrams that splits the raw transcripts into uniform snippets of text mapping the political language of world leaders. Because the goal is to model the varying context of the ideas presented in the speeches rather than the context of the UN General Assembly debates, the main focus is on the particularities of the language of the reference groups (authoritarian/democratic leaders). The final dataset contains 1 062 286 sentence trigrams annotated with EDI (Electoral Democracy Index) scores inherited from the parent documents (μ = 0.430, 95% CI [0.429, 0.430]).
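The segmentation and labelling logic described above can be sketched as follows. This is only an illustration under stated assumptions, not the author's replication code: the regex sentence splitter, the stride of one sentence, the handling of speeches shorter than three sentences, and the function names `split_sentences`, `sentence_trigrams`, and `label_segments` are all hypothetical. The replication scripts in replication_data_code.zip are the authoritative source.

```python
import re


def split_sentences(text):
    """Naive sentence splitter (assumption): break after '.', '!' or '?'
    when followed by whitespace. A real pipeline would likely use a
    proper sentence tokenizer."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def sentence_trigrams(text, window=3):
    """Slide a window of `window` sentences over the text, moving one
    sentence at a time (the stride of one is an assumption)."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    if len(sentences) < window:
        # Assumption: a very short speech is kept as a single segment.
        return [" ".join(sentences)]
    return [
        " ".join(sentences[i:i + window])
        for i in range(len(sentences) - window + 1)
    ]


def label_segments(speeches):
    """Weak supervision: every segment inherits the index score of its
    parent speech. `speeches` is an iterable of (text, score) pairs."""
    rows = []
    for text, score in speeches:
        for segment in sentence_trigrams(text):
            rows.append((segment, score))
    return rows


if __name__ == "__main__":
    # The text and the score 0.25 are made-up illustration values.
    demo = [("First point. Second point! Third point? Fourth point.", 0.25)]
    for segment, score in label_segments(demo):
        print(score, "|", segment)
```

A speech of four sentences yields two overlapping trigrams under these assumptions, and both carry the parent speech's score, which is why the resulting labels are weak: they describe the speaker's country rather than the content of the individual segment.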

Video tutorial

The official repository includes a walkthrough tutorial that demonstrates how to use the authdetect model on new data. After downloading the interactive Jupyter notebook and the sample data (how_to_use_authdetect.ipynb, sample_data.csv), users can follow the step-by-step instructions and run the full pipeline in Google Colab. The whole process is also shown in a tutorial video available at: https://www.youtube.com/watch?v=CRy9uxMChoE.

Hugging Face

The model is also available on Hugging Face, where users can download it and rely on the existing tooling to integrate it into their own workflows. The Zenodo repository contains the model solely for archival purposes related to the paper. The Hugging Face archive also includes a minimal example demonstrating the model's application. For a complete pipeline, users are encouraged to use the interactive notebook and watch the tutorial available on YouTube. The repository is available at: https://huggingface.co/mmochtak/authdetect.

If you use the repository, please cite:

@article{mochtak_chasing_2024,
    title = {Chasing the authoritarian spectre: {Detecting} authoritarian discourse with large language models},
    issn = {1475-6765},
    shorttitle = {Chasing the authoritarian spectre},
    url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/1475-6765.12740},
    doi = {10.1111/1475-6765.12740},
    journal = {European Journal of Political Research},
    author = {Mochtak, Michal},
    keywords = {authoritarian discourse, deep learning, detecting authoritarianism, model, political discourse},
}

Files (1.6 GB)

Name	MD5	Size
_model_card.pdf	a316f8421ce4d2579e142ef25dc2d241	108.2 kB
authdetect.zip	f32d822060bc45502d5ab729bdb6dbae	455.5 MB
how_to_use_authdetect.ipynb	2c724aa9ee7259c48dfc13b5c8ccbfe9	519.6 kB
license_cc-by-nc-sa-40.txt	ab682398d999c219ee196b8082694176	20.9 kB
replication_data_code.zip	cf3a5ea3584b6897c6047db10597d693	1.1 GB
sample_data.csv	3820e7976c929c79aabba9cc5dd07c46	2.6 MB
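Because the file table lists an MD5 checksum for every file, downloads can be checked for corruption before use. The following is a minimal sketch using only the Python standard library. The helper names `md5_of` and `verify_downloads` are hypothetical, and the script assumes the files were saved under their original names in a single directory.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file table of this record.
EXPECTED_MD5 = {
    "_model_card.pdf": "a316f8421ce4d2579e142ef25dc2d241",
    "authdetect.zip": "f32d822060bc45502d5ab729bdb6dbae",
    "how_to_use_authdetect.ipynb": "2c724aa9ee7259c48dfc13b5c8ccbfe9",
    "license_cc-by-nc-sa-40.txt": "ab682398d999c219ee196b8082694176",
    "replication_data_code.zip": "cf3a5ea3584b6897c6047db10597d693",
    "sample_data.csv": "3820e7976c929c79aabba9cc5dd07c46",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks so
    that the large archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Return {filename: status}, where status is 'ok', 'mismatch',
    or 'missing' for each expected file in `directory`."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of(path) == checksum:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


if __name__ == "__main__":
    # Check the current working directory (an assumption about where
    # the files were downloaded).
    for name, status in verify_downloads(".").items():
        print(f"{status:8s} {name}")
```

MD5 is adequate here for detecting incomplete or corrupted downloads; it is not intended as a defence against deliberate tampering.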

Additional details

Alternative title: Replication repository for "Chasing the Authoritarian Specter: Detecting Authoritarian Discourse with Large Language Models"

Is published in: Publication, DOI 10.1111/1475-6765.12740

Funding: Radboud University Nijmegen, Radboud Excellence Fellowship 2702184

Accepted: 2024-10-06 (paper accepted for publication)

Repository URL: https://huggingface.co/mmochtak/authdetect
Programming language: R, Python
Development Status: Active

	All versions	This version
Views	502	331
Downloads	823	386
Data volume	134.1 GB	57.9 GB
