Published July 23, 2025 | Version v1
Dataset Open

ESQAD: Educational Spanish Question-Answer Dataset

  • 1. Universidad Politécnica de Madrid
  • 2. Universidad Politécnica de Madrid Escuela Universitaria de Informática

Description

Spanish Question Answer Generation Dataset and Code

This repository contains the dataset and source code developed for the article: "ESQAD: An Open Spanish Dataset for Curriculum-Aligned Question-Answer Generation in Educational Settings"

The resources include:
- A Spanish question-answer generation (QAG) dataset aligned with national curricula (EVAU).
- Automatically generated question-answer pairs from literary and legal sources.
- A pilot study subset with questions validated by teachers and students.

Dataset Structure

1. EVAU

- File: `evau/docs/EvAU_QA.csv`
- Description: Manually curated questions and answers aligned with the Spanish *Evaluación para el Acceso a la Universidad (EVAU)*.
- Columns: `question`, `answer`, `subject`, `difficulty`
- Purpose: Benchmark for educational QAG tasks in Spanish.

2. Quijote

- File: `quijote/docs/Quijote_QA.csv`
- Description: Automatically generated question-answer pairs from *Don Quijote de la Mancha*.
- Columns: `question`, `answer`, `chapter`, `difficulty`
- Purpose: Evaluation of QAG performance on literary texts.

3. Legal FAQs

- File: `legal_faqs/docs/Legal_QA.csv`
- Description: Questions and answers extracted and generated from FAQs related to Spanish laws (*Ley 39/2015* and *Ley 40/2015*).
- Columns: `question`, `answer`, `law_reference`
- Purpose: Testing QAG in legal and administrative contexts.
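
The three CSV files above can be checked against their documented columns before use. The following is a minimal sketch, assuming the files are comma-delimited with a header row (the description does not state the delimiter or encoding); the helper name `read_qa_rows` and the sample row are hypothetical and not part of the released code.

```python
import csv
import io

# Column names as listed in the dataset description above.
EXPECTED_COLUMNS = {
    "evau/docs/EvAU_QA.csv": ["question", "answer", "subject", "difficulty"],
    "quijote/docs/Quijote_QA.csv": ["question", "answer", "chapter", "difficulty"],
    "legal_faqs/docs/Legal_QA.csv": ["question", "answer", "law_reference"],
}


def read_qa_rows(handle, expected):
    """Read rows from an open CSV handle and fail early if a documented column is absent."""
    # Assumption: comma-delimited text with a header row.
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in expected if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Made-up sample row for illustration only; it is not taken from the dataset.
sample = io.StringIO(
    "question,answer,subject,difficulty\n"
    "¿Pregunta?,Respuesta,Historia,medium\n"
)
rows = read_qa_rows(sample, EXPECTED_COLUMNS["evau/docs/EvAU_QA.csv"])
```

With the archive extracted, the same helper could be called on a real file opened with `open(path, newline="", encoding="utf-8")`, where UTF-8 is again an assumption.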

4. Exams (Pilot Study)

- File: `exams/exams_QA_validated.json`
- Description: 923 automatically generated question-answer pairs evaluated by teachers and students during a pilot study:
  - Ratings: Clarity, complexity, pedagogical value (1–3 scale).
  - Difficulty: Intended vs. perceived difficulty levels.
  - Comments: Free-text feedback from users.
- Purpose: Benchmark for evaluating QAG quality with human-validated data.
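
As an illustration of how the 1–3 ratings might be summarised, the sketch below averages each rating criterion across records. The JSON schema is not documented here, so the key names (`clarity`, `complexity`, `pedagogical_value`), the flat list-of-objects layout, and the sample values are all assumptions; adjust them to the actual structure of `exams/exams_QA_validated.json`.

```python
import json
from statistics import mean


def mean_ratings(records, rating_keys):
    """Average each rating across records, skipping records that lack the key."""
    summary = {}
    for key in rating_keys:
        values = [rec[key] for rec in records if rec.get(key) is not None]
        for value in values:
            # The description states that ratings use a 1-3 scale.
            if not 1 <= value <= 3:
                raise ValueError(f"{key} rating out of range: {value}")
        summary[key] = mean(values) if values else None
    return summary


# Hypothetical records for illustration only; not taken from the dataset.
sample = json.loads(
    '[{"clarity": 3, "complexity": 2, "pedagogical_value": 3},'
    ' {"clarity": 2, "complexity": 2, "pedagogical_value": 1}]'
)
summary = mean_ratings(sample, ["clarity", "complexity", "pedagogical_value"])
```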

 

Citation

This dataset accompanies the article:

Badenes-Olmedo, C., Eyzaguirre-Barreda, P., Chu-Artzt, N., & Gayoso-Cabada, J. (2025).  
"ESQAD: A Curriculum-Aligned Spanish Dataset for Educational Question Answering."  
Submitted to Computer Speech & Language (Elsevier, 2025).

Please cite this resource using the article (once published), or refer to this Zenodo DOI in the meantime.

License

- Datasets: CC BY 4.0  
- Source Code: MIT License

 

Contact

- Carlos Badenes-Olmedo: carlos.badenes@upm.es  
- Noa Chu-Artzt: noa.chu.artzt@alumnos.upm.es  
- Paul Eyzaguirre-Barreda: paul.eyzaguirre@alumnos.upm.es
- Joaquin Gayoso-Cabada: j.gayoso@upm.es  

Files

QAG_Spanish_ExpertSystems_v1.0.zip (676.1 MB)
md5:ab6d34257ca2f8d2daa6c6dcd6f238a2
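
The MD5 checksum listed above can be used to confirm that the archive downloaded intact. A minimal sketch using only the Python standard library follows; the function names are illustrative, and the archive is assumed to be in the working directory under its listed name.

```python
import hashlib

# Checksum as shown in the file listing above.
EXPECTED_MD5 = "ab6d34257ca2f8d2daa6c6dcd6f238a2"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so the 676.1 MB archive is never fully loaded into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Return True when the file's digest matches the published checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, `archive_is_intact("QAG_Spanish_ExpertSystems_v1.0.zip")` should return `True` for an uncorrupted download.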