Published March 3, 2026 | Version v1

LLM-Driven Text Augmentation across Media and Languages

  • 1. Jožef Stefan Institute
  • 2. University of Ljubljana
  • 3. Jožef Stefan Institute, Jamova cesta 39

Description

The proliferation of fake news across social media, headlines, and news articles poses major challenges for automated detection, particularly in multilingual and cross-media settings affected by data imbalance. We propose a fake news detection framework based on LLM-driven, feature-guided text augmentation. The method generates realistic synthetic samples across languages, media types, and text granularities while preserving factual structure and stylistic coherence. Experiments with classical and transformer-based models (Random Forest, Logistic Regression, BERT, XLM-R) across social media, headline, and multilingual news datasets show consistent improvements in performance. LLM-based augmentation improves overall accuracy by up to 1.6% over imbalanced baselines and increases minority-class F1-scores by up to 2.4% in low-resource languages such as Swahili. Hybrid fact- and style-based models achieve up to 93.8% accuracy with more balanced class-wise F1-scores and reduced language-related disparities, demonstrating improved robustness and cross-lingual generalization.
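
The abstract reports overall accuracy and minority-class F1 separately because, on imbalanced data, the two can diverge sharply. The following is a minimal sketch of how both metrics are computed. It is not the evaluation code from the repository, and the labels and predictions are a made-up toy example, not values from any of the datasets.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, treating each label one-vs-rest."""
    scores = {}
    for label in set(y_true) | set(y_pred):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


if __name__ == "__main__":
    # Hypothetical imbalanced toy data: 8 "real" items, 2 "fake" items.
    y_true = ["real"] * 8 + ["fake"] * 2
    # The classifier misses one of the two "fake" items.
    y_pred = ["real"] * 9 + ["fake"]

    print("accuracy:", round(accuracy(y_true, y_pred), 3))
    for label, f1 in sorted(per_class_f1(y_true, y_pred).items()):
        print(label, "F1:", round(f1, 3))
```

In this toy case accuracy is 0.90 while the F1 for the minority "fake" class is only about 0.67, which is the kind of gap that balancing the training data through augmentation is meant to narrow.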

FakeNewsNet Headlines Dataset: https://github.com/KaiDMML/FakeNewsNet
Kaggle Fake News Dataset (Politics vs News): https://www.kaggle.com/c/fake-news
Twitter Fake News Dataset: https://figshare.com/articles/dataset/Twitter_dataset/28069163/1
TALLIP Multilingual Fake News Dataset: https://tallip.fake-news-dataset

Files

synthetic_articles.csv

Files (10.4 MB)

Checksum                                 Size
md5:5652a348f1bac3469e8d25af2ee3481e     8.0 MB
md5:c4e2039268253d01f70b146037ed3461     937.7 kB
md5:f83f170c075e674dcdf6e23aa37aa2f8     403.6 kB
md5:e725bb92acd043eb1bd9fcfd251c3882     1.1 MB
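
The MD5 checksums listed above can be used to confirm that a downloaded file arrived intact. A minimal sketch using only Python's standard library follows. The function names are illustrative, and because the record does not show which checksum belongs to which file name, any path you pass in is your own local choice.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Checksums copied from the file listing of this record.
LISTED_CHECKSUMS = [
    "md5:5652a348f1bac3469e8d25af2ee3481e",
    "md5:c4e2039268253d01f70b146037ed3461",
    "md5:f83f170c075e674dcdf6e23aa37aa2f8",
    "md5:e725bb92acd043eb1bd9fcfd251c3882",
]


def find_matching_checksum(path):
    """Return the listed checksum that matches the file, or None."""
    for listed in LISTED_CHECKSUMS:
        if matches_listed_checksum(path, listed):
            return listed
    return None
```

Calling `find_matching_checksum` on a downloaded file returns the matching entry from the listing, or `None` if the file matches none of them, which would suggest a corrupted or incomplete download.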

Additional details

Funding

European Commission
TWON - TWin of Online Social Networks (grant no. 101095095)

Software

Repository URL
https://github.com/abdulsittar/Fairer_Models
Programming language
Python