Published October 14, 2025 | Version v1
Dataset Open

Cognitive Load Classification and Real-Time Intervention for Enhanced Vocabulary Learning

  • 1. Hasso Plattner Institute

Contributors

Contact person:

  • 1. Hasso Plattner Institute

Description

Dataset Description

The dataset (approximately 40 hours in total) consists of physiological signals from wearable electroencephalography (EEG), electrodermal activity (EDA), photoplethysmography (PPG), acceleration, and temperature sensors, as well as log files from a computerized vocabulary E-Learning application. Data were recorded from 10 fully anonymized participants while they completed computerized vocabulary-learning tasks designed to induce mental workload, using words from four different languages unknown to them. Physiological signals were obtained with the Muse S EEG headband and the Empatica E4 wristband.

Experimental Setup:

Session 1 (Baseline): Physiological data were recorded during tasks designed to induce four labeled states: overload, underload, high interest, and low interest. These labeled data were used to develop a personalized machine learning model for classifying subjective cognitive load.

Session 2 (Intervention): The personalized model classified the participant's subjective cognitive load level in real time. Based on these classifications, the E-Learning application was adjusted to steer the participant's cognitive load level and increase learning performance. To this end, words were added to or removed from the vocabulary list, and either the time each word was shown or the number of repetitions per word was adjusted.

Labels: Self-reported labels were obtained using Likert scales (for subjective cognitive load and stress), NASA-TLX (for overall workload), and PANAS (for affective state), in addition to performance metrics extracted from the log files.

Vocabulary: Six languages were chosen: Esperanto, Hinglish, Nahuatl, Pinyin, Spanish, and Turkish. It was ensured prior to enrollment that participants were unfamiliar with the respective languages. The study was performed in accordance with the local institutional review board's ethical guidelines and the Declaration of Helsinki.
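The Session 2 adaptation described above can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the class LessonSettings, the function adapt, the limits, the step sizes, and the direction of each adjustment are hypothetical and are not taken from the study's code, which is available in the repository listed under Source Code.

```python
from dataclasses import dataclass, replace

# Hypothetical limits; this record does not state the values used in the study.
MIN_WORDS, MAX_WORDS = 4, 20
MIN_SECONDS, MAX_SECONDS = 2.0, 10.0


@dataclass(frozen=True)
class LessonSettings:
    n_words: int             # words currently in the vocabulary list
    seconds_per_word: float  # how long each word is shown
    repetitions: int         # repetitions per word (left unchanged in this sketch)


def adapt(settings: LessonSettings, predicted_state: str) -> LessonSettings:
    """Return adjusted settings for one classifier output (assumed rule)."""
    if predicted_state == "overload":
        # Assumed direction: ease the task with a shorter list and a slower pace.
        return replace(
            settings,
            n_words=max(MIN_WORDS, settings.n_words - 1),
            seconds_per_word=min(MAX_SECONDS, settings.seconds_per_word + 1.0),
        )
    if predicted_state == "underload":
        # Assumed direction: make the task harder with a longer list and a faster pace.
        return replace(
            settings,
            n_words=min(MAX_WORDS, settings.n_words + 1),
            seconds_per_word=max(MIN_SECONDS, settings.seconds_per_word - 1.0),
        )
    # Any other state leaves the lesson as it is.
    return settings
```

Clamping to fixed limits keeps the lesson within a usable range even if the classifier reports the same state many times in a row.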

The fully anonymized dataset is publicly available and is intended for the research community working on mental workload detection with consumer-grade wearable sensors. Among other applications, the data is suitable for developing real-time cognitive load detection methods, researching signal processing techniques, or investigating ML-adjusted E-Learning applications.

The link to the publication will be added here once the manuscript is accepted in the respective journal.

 

Technical Info

The anonymized data is located in the top-level subfolder 'data'. Within this, the subfolders 'P001_1st_session' through 'P010_2nd_session' contain data from individual participants across their respective first and second sessions (i.e., suffixes '_1st_session' and '_2nd_session').

For each participant-session folder, multiple numerically named subfolders (0, 1, ...) exist, representing distinct recording runs in case an application had to be restarted. In these subfolders, a respective 'RawData' folder contains the sensor files. The main log file for a session (e.g., 'p009_2nd_session_anonymized.log'), located in the main session folder, holds the time-aligned labels for all runs.
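The folder layout described above can be walked with a short script. This is a sketch rather than part of the published code: the function name list_runs and the regular expression are assumptions derived from the folder names quoted in this description.

```python
import re
from pathlib import Path

# Pattern derived from the folder names quoted above, e.g. 'P001_1st_session'.
SESSION_DIR = re.compile(r"^P(\d{3})_(1st|2nd)_session$")


def list_runs(data_root):
    """Yield (participant, session, run, run_dir) for every recording run."""
    for session_dir in sorted(Path(data_root).iterdir()):
        match = SESSION_DIR.match(session_dir.name)
        if not (session_dir.is_dir() and match):
            continue  # skip anything that is not a participant-session folder
        participant = int(match.group(1))
        session = 1 if match.group(2) == "1st" else 2
        # Recording runs are the numerically named subfolders (0, 1, ...).
        run_dirs = [d for d in session_dir.iterdir() if d.is_dir() and d.name.isdigit()]
        for run_dir in sorted(run_dirs, key=lambda d: int(d.name)):
            yield participant, session, int(run_dir.name), run_dir
```

Calling list(list_runs("data")) after unpacking the archive should give one tuple per run. Sorting runs by their integer value keeps run 10 after run 9, should that many restarts occur.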

Per recording, the following anonymized files exist with the suffix '_anonymized.csv':

Empatica E4: 'ACC.csv', 'BVP.csv', 'GSR.csv', and 'TEMP.csv'

Muse S: 'ACC.csv', 'EEG.csv', 'GYRO.csv', and 'PPG.csv'
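To locate these files for a single run, a sketch along the following lines may help. It relies only on the 'RawData' folder and the '_anonymized.csv' suffix mentioned above. The helper names find_sensor_files and read_rows are hypothetical, and because this description does not specify the exact file names or the column layout, rows are returned as uninterpreted strings.

```python
import csv
from itertools import islice
from pathlib import Path


def find_sensor_files(run_dir):
    """Map file name to path for every anonymized CSV in <run_dir>/RawData."""
    raw_dir = Path(run_dir) / "RawData"
    return {p.name: p for p in sorted(raw_dir.glob("*_anonymized.csv"))}


def read_rows(csv_path, limit=5):
    """Return the first `limit` rows as lists of strings, without interpreting them."""
    with open(csv_path, newline="") as handle:
        return list(islice(csv.reader(handle), limit))
```

Inspecting the first few rows of each file is a cheap way to check for header lines and sampling-rate metadata before committing to a parser.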

Additionally, the folder 'features_and_labels_pckls' contains pre-processed data, extracted features, and the respective labels for the extracted time-windows, all in .pkl format (e.g., 'P001_S1.pkl', 'P001_S1_with_all_info.pkl').
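The .pkl files can be opened with Python's standard pickle module, as in the sketch below. The internal structure of the pickled objects is not described here, so the hypothetical helper describe only reports what it finds. If the objects were created with third-party libraries such as NumPy or pandas (an assumption), those libraries need to be installed for unpickling to succeed.

```python
import pickle


def load_features(pkl_path):
    """Unpickle one file. Only do this for files you trust: pickle can run code."""
    with open(pkl_path, "rb") as handle:
        return pickle.load(handle)


def describe(obj):
    """Give a short, structure-agnostic summary of an unpickled object."""
    summary = type(obj).__name__
    if isinstance(obj, dict):
        summary += " with keys " + ", ".join(sorted(str(k) for k in obj))
    elif hasattr(obj, "shape"):
        summary += f" with shape {obj.shape}"
    elif hasattr(obj, "__len__"):
        summary += f" of length {len(obj)}"
    return summary
```

For example, print(describe(load_features("features_and_labels_pckls/P001_S1.pkl"))) gives a first impression of the content before any analysis code is written.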

Finally, the folder 'compiled_results_and_interview_answers' contains the anonymized learning outcomes, overall questionnaire results, and interview answers in .csv or .ods format.

 

Source Code

The full source code needed to reproduce the study's results is available at: https://github.com/HPI-CH/cl_intervention_e_learning_2025.

The repository includes Python scripts to load and process data, extract features, and perform the machine learning analysis on the anonymized data. The Anaconda environment used to run the code can be reproduced with the file E_Learning_conda_env.yml.

Core scripts are categorized by function:

Anonymization: anonymize_data.py and anonymize_log_files.py (used to replace sensitive information like absolute timestamps with relative times or placeholders).

Experiment Setup: eye_closing.py, logging_utilities.py, questionnaires.py, stream_reader_chronjob.py, vocabulary.py, and wait_some_time.py (files needed to run the computerized E-Learning application).

ML Analysis: features_and_labels_extractor_for_multivariate_time_series_regression_on_anonymized_data.py, ml_and_preprocessing.py, and the various _classification and _regression files (scripts required for data preprocessing and all machine learning analyses).
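As an illustration of the anonymization step mentioned above, replacing absolute timestamps with relative times can be sketched as follows. This is not the logic of anonymize_data.py: the function name to_relative_times and the assumption that timestamps are numeric seconds in recording order are made for this example only.

```python
def to_relative_times(timestamps):
    """Shift absolute timestamps (seconds) so that the first sample is at 0.0."""
    if not timestamps:
        return []
    start = timestamps[0]
    # Intervals between samples are preserved; only the recording date is lost.
    return [t - start for t in timestamps]
```

Because only a constant offset is removed, sampling intervals and the alignment between signals that share the same start time stay intact.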

 

Contact

Please reach out if you encounter any issues or have open questions regarding this dataset, the experimental paradigm, the source code, or the publication. You can reach the authors via the contact information provided in the publication or via email to 'christoph.anders@hpi.de', 'christoph.anders@hpi.uni-potsdam.de', 'office-arnrich@hpi.uni-potsdam.de', or 'e_learning_2025@hpi.de'.

Files

compiled_results_and_interview_answers.zip

Files (1.1 GB)

Checksum Size
md5:bdd9d9aeb5a89706a9d07cfe19519178 33.5 kB
md5:86bbb021b6781756a57864b49a7b63e4 1.1 GB
md5:1b84a1f1f2c47a9dbff498a6f4a4913f 1.2 MB
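After downloading, each archive can be checked against the md5 values listed above. The sketch below uses Python's hashlib. The function names are hypothetical, and since this listing does not show which checksum belongs to which file name, the pairing has to be taken from the record page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a value written as 'md5:<hex>' or as bare hex."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

Reading in chunks keeps memory use flat, which matters for the 1.1 GB archive.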

Additional details

Dates

Collected
2025

Software

Repository URL
https://github.com/HPI-CH/cl_intervention_e_learning_2025
Programming language
Python
Development Status
Inactive