Published November 2025 | Version v1
Conference proceeding Open

Universal CEFR: Enabling Open Multilingual Research on Language Proficiency Assessment

  • 1. University of Bath
  • 2. National University College of Computer Studies
  • 3. University of Gothenburg
  • 4. Reduce Soluciones
  • 5. UCLouvain Saint-Louis Brussels
  • 6. Brigham Young University
  • 7. Instituto Superior Técnico
  • 8. Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento
  • 9. Pompeu Fabra University
  • 10. UCLouvain

Description

Abstract


We introduce UNIVERSALCEFR, a large-scale multilingual and multidimensional dataset of texts annotated with CEFR (Common European Framework of Reference) levels in 13 languages. To enable open research in automated readability and language proficiency assessment, UNIVERSALCEFR comprises 505,807 CEFR-labeled texts curated from educational and learner-oriented resources, standardized into a unified data format to support consistent processing, analysis, and modelling across tasks and languages. To demonstrate its utility, we conduct benchmarking experiments using three modelling paradigms: a) linguistic feature-based classification, b) fine-tuning pre-trained LLMs, and c) descriptor-based prompting of instruction-tuned LLMs. Our results support using linguistic features and fine-tuning pre-trained models in multilingual CEFR level assessment. Overall, UNIVERSALCEFR aims to establish best practices in data distribution for language proficiency research by standardising dataset formats and promoting their accessibility to the global research community.
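
As a rough illustration of what paradigm (a), linguistic feature-based classification, can look like in its simplest form, the sketch below maps each text to a few surface features and assigns the CEFR level whose average feature vector is closest. Everything in it is an assumption made for illustration: the three features, the nearest-centroid rule, the function names, and the toy English sentences are hypothetical and are not the features, classifiers, or data used in the paper.

```python
import re
from collections import defaultdict


def surface_features(text):
    """Return (mean sentence length in words, mean word length, type-token ratio).

    These three features are illustrative stand-ins, not the paper's feature set.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    if not sentences or not words:
        return (0.0, 0.0, 0.0)
    return (
        len(words) / len(sentences),
        sum(len(w) for w in words) / len(words),
        len(set(words)) / len(words),
    )


def fit_centroids(examples):
    """Average the feature vectors of all training texts sharing a CEFR level."""
    grouped = defaultdict(list)
    for text, level in examples:
        grouped[level].append(surface_features(text))
    return {
        level: tuple(sum(column) / len(column) for column in zip(*vectors))
        for level, vectors in grouped.items()
    }


def predict(text, centroids):
    """Return the level whose centroid has the smallest squared Euclidean distance."""
    feats = surface_features(text)
    return min(
        centroids,
        key=lambda level: sum((a - b) ** 2 for a, b in zip(feats, centroids[level])),
    )


# Invented toy examples; real experiments would use the labeled dataset instead.
TRAIN = [
    ("I have a cat. It is big. I like it.", "A1"),
    ("This is my house. It is red. We eat here.", "A1"),
    ("Notwithstanding considerable methodological disagreements, researchers "
     "increasingly acknowledge that proficiency assessment requires "
     "multidimensional evidence.", "C1"),
    ("The committee deliberations, although protracted, ultimately yielded "
     "recommendations emphasising transparency throughout institutional "
     "governance.", "C1"),
]

centroids = fit_centroids(TRAIN)
```

Calling `predict("We go to the park. I see a dog.", centroids)` on this toy data returns the level of the short-sentence examples. The features are left unscaled here, so sentence length dominates the distance; a practical system would normalise the features and use a far richer, language-aware feature set.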

Files

2025.emnlp-main.491.pdf

Files (1.4 MB)

md5:27fc87509350267e07354d66e77eb381
1.4 MB