Universal CEFR: Enabling Open Multilingual Research on Language Proficiency Assessment
Authors/Creators
- Imperial, Joseph Marvin (1, 2)
- Baraean, Abdullah
- Stodden, Regina
- Wilkens, Rodrigo
- Muñoz Sánchez, Ricardo (3, 4)
- Gao, Lingyun (5)
- R. Toribio, Melissa Esther (1)
- Reynolds, Robert (6)
- Ribeiro, Eugénio (7, 8)
- Saggion, Horacio (9)
- Volodina, Elena (3)
- Vajjala, Sowmya
- François, Thomas (10)
- Alva Manchego, Fernando
- Tayyar Madabushi, Harish
- 1. University of Bath
- 2. National University College of Computer Studies
- 3. University of Gothenburg
- 4. Reduce Soluciones
- 5. UCLouvain Saint-Louis Brussels
- 6. Brigham Young University
- 7. Instituto Superior Técnico
- 8. Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento
- 9. Pompeu Fabra University
- 10. UCLouvain
Description
Abstract
We introduce UNIVERSALCEFR, a large-scale multilingual and multidimensional dataset of texts annotated with CEFR (Common European Framework of Reference) levels in 13 languages. To enable open research in automated readability and language proficiency assessment, UNIVERSALCEFR comprises 505,807 CEFR-labeled texts curated from educational and learner-oriented resources, standardized into a unified data format to support consistent processing, analysis, and modelling across tasks and languages. To demonstrate its utility, we conduct benchmarking experiments using three modelling paradigms: a) linguistic feature-based classification, b) fine-tuning pre-trained LLMs, and c) descriptor-based prompting of instruction-tuned LLMs. Our results support using linguistic features and fine-tuning pre-trained models in multilingual CEFR level assessment. Overall, UNIVERSALCEFR aims to establish best practices in data distribution for language proficiency research by standardising dataset formats and promoting their accessibility to the global research community.
Files
| Name | Size |
|---|---|
| 2025.emnlp-main.491.pdf (md5:27fc87509350267e07354d66e77eb381) | 1.4 MB |