Published March 16, 2026 | Version v1
Preprint

When Languages Are Invisible to AI: Cross-Lingual Affective State Detection for Low-Resource Languages (Maithili & Bhojpuri)

Description

Over 100 million speakers of Maithili and Bhojpuri — two linguistically rich languages of the Indo-Aryan family spoken across Bihar, Jharkhand, and Uttar Pradesh — remain almost entirely invisible to modern natural language processing (NLP) systems. While transformer-based sentiment analysis has achieved near-human performance in English, we demonstrate that state-of-the-art monolingual English models collapse to random-chance performance (~33% on the three-class task) when applied to Maithili text, not through stochastic misclassification but through a systematic failure mode we term class collapse: the model produces NEUTRAL predictions for every input regardless of true sentiment polarity. Through attention-weight interpretability analysis, we reveal the precise mechanism: English BERT maps Devanagari script entirely to [UNK] tokens, so the classifier receives no semantic signal and defaults to its learned neutral prior.
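The class-collapse mechanism described above can be shown in miniature. The sketch below is illustrative only, not the paper's code: the toy vocabulary and `tokenize` helper are assumptions, and a whole-word lookup stands in for English BERT's WordPiece matching. The point it demonstrates is the one from the abstract: every word absent from an English-only vocabulary degenerates to [UNK], so Devanagari input reaches the classifier carrying no semantic signal at all.

```python
# Toy tokenizer with an English-only vocabulary, illustrating why
# Devanagari text collapses to [UNK] tokens under an English model.

TOY_VOCAB = {"the", "movie", "was", "great", "bad"}

def tokenize(text):
    """Whole-word vocabulary lookup (a stand-in for WordPiece matching).

    Real WordPiece tries subword pieces before giving up; an English-only
    vocabulary has no Devanagari pieces, so Devanagari words fail either way.
    """
    tokens = []
    for word in text.lower().split():
        tokens.append(word if word in TOY_VOCAB else "[UNK]")
    return tokens

print(tokenize("The movie was great"))
# ['the', 'movie', 'was', 'great']
print(tokenize("फिल्म बहुत अच्छी थी"))
# ['[UNK]', '[UNK]', '[UNK]', '[UNK]']
```

With every input reduced to the same [UNK] sequence, a classifier can do no better than emit its prior — which for English BERT is the NEUTRAL class.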
We present the first systematic cross-lingual affective state detection study across English, Hindi, Maithili, and Bhojpuri, introducing two original annotated corpora totalling over 73,000 examples. Our four key findings are: (1) multilingual pre-training (XLM-RoBERTa) recovers 35.3 percentage points over English BERT through script knowledge alone, with zero task-specific data; (2) native fine-tuning on as few as 3,563 carefully curated examples achieves 82.44% accuracy (F1 = 0.825), within 2.14 percentage points of the English ceiling of 84.58%; (3) a previously undocumented asymmetric transfer phenomenon exists between Maithili and Bhojpuri — transfer from Maithili to Bhojpuri (75.00%) substantially exceeds the reverse (47.33%), a 27.67 percentage-point gap attributable to differential orthographic standardisation and code-switching rates; and (4) attention analysis reveals the token-level mechanism of failure, demonstrating that fine-tuned models genuinely attend to negation markers and affect-bearing words rather than memorising surface patterns.
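The asymmetric-transfer finding (3) rests on a simple cross-evaluation protocol: fine-tune on language A, test on language B, and compare the off-diagonal cells of the resulting accuracy matrix. The sketch below illustrates that protocol only; the two-example corpora and the keyword-vote classifier are invented stand-ins for the released datasets and fine-tuned models, and the reported 75.00% vs 47.33% asymmetry comes from the real experiments, not from this toy.

```python
# Cross-evaluation protocol sketch: train on one language, test on
# another, and tabulate an accuracy matrix. Toy data, toy "model".

def train(corpus):
    """Record which labels each word co-occurs with (stand-in for fine-tuning)."""
    cues = {}
    for text, label in corpus:
        for word in text.split():
            cues.setdefault(word, []).append(label)
    return cues

def predict(cues, text):
    """Majority vote over the labels associated with the input's words."""
    votes = [label for w in text.split() for label in cues.get(w, [])]
    return max(set(votes), key=votes.count) if votes else "NEUTRAL"

def accuracy(cues, corpus):
    return sum(predict(cues, t) == y for t, y in corpus) / len(corpus)

# Illustrative two-example corpora (NOT the released datasets).
corpora = {
    "maithili": [("नीक फिलिम", "POSITIVE"), ("खराब फिलिम", "NEGATIVE")],
    "bhojpuri": [("बढ़िया बा", "POSITIVE"), ("खराब फिलिम", "NEGATIVE")],
}

# Diagonal cells = in-language accuracy; off-diagonal cells = transfer.
for src in corpora:
    model = train(corpora[src])
    for tgt in corpora:
        print(f"train={src} test={tgt} acc={accuracy(model, corpora[tgt]):.2f}")
```

In the paper, the two off-diagonal cells of this matrix are where the 27.67-point Maithili-to-Bhojpuri vs Bhojpuri-to-Maithili asymmetry appears.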
All datasets, trained model checkpoints, training notebooks, cross-evaluation scripts, and attention visualizations are publicly released at https://huggingface.co/abhiprd20.

Files

When_Languages_Are_Invisible_to_AI.pdf (2.7 MB)
md5:a6d48f321ae1b3870d9b9a06f364b204

Additional details

Software

Repository URL: https://huggingface.co/abhiprd20
Development Status: Active