Cognitive Load Classification and Real-Time Intervention for Enhanced Vocabulary Learning
Description
Dataset Description
The dataset (approximately 40 hours in total) consists of physiological signals from wearable electroencephalography (EEG), electrodermal activity (EDA), photoplethysmography (PPG), acceleration, and temperature sensors, as well as log files from a computerized vocabulary E-Learning application. Data were recorded from 10 fully anonymized participants who performed a computerized E-Learning vocabulary task designed to induce mental workload while they learned words from four different languages unknown to them. Physiological signals were recorded with the Muse S EEG headband and the Empatica E4 wristband.
Experimental Setup:
Session 1 (Baseline): Physiological data were recorded during tasks designed to induce labeled states of [overload, underload, high interest, and low interest]. This labeled data was used to develop a personalized machine learning model for classifying subjective cognitive load.
Session 2 (Intervention): The personalized model classified the participant's subjective cognitive load level in real time. Based on these classifications, the E-Learning application was adjusted to steer the participant's cognitive load level and increase learning performance. Specifically, words were added to or removed from the vocabulary list, and the time each word was shown or the respective number of repetitions was adjusted.
Labels: Self-reported labels were obtained using Likert scales (for subjective cognitive load and stress), NASA-TLX (for overall workload), and PANAS (for affective state), in addition to performance metrics extracted from the log files.
Vocabulary: Six languages were chosen: Esperanto, Hinglish, Nahuatl, Pinyin, Spanish, and Turkish. It was ensured that participants were unfamiliar with the respective language prior to enrollment. The study was performed in accordance with the local institutional review board's ethical guidelines and the Declaration of Helsinki.
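The Session 2 adaptation described above can be illustrated with a minimal sketch. The description does not state the actual adaptation rules, step sizes, or class names used in the study, so everything below (the LessonSettings container, the adapt_lesson function, the direction and size of each adjustment) is a hypothetical assumption chosen only to show the general idea of steering the lesson from a predicted state.

```python
from dataclasses import dataclass


@dataclass
class LessonSettings:
    """Hypothetical container for the adjustable lesson parameters."""
    n_words: int             # size of the current vocabulary list
    seconds_per_word: float  # how long each word is shown
    repetitions: int         # how often each word is repeated


def adapt_lesson(settings: LessonSettings, predicted_state: str) -> LessonSettings:
    """Return adjusted settings for a predicted state.

    Assumed rule (not taken from the study): ease the lesson on
    'overload', make it harder on 'underload', otherwise keep it.
    """
    if predicted_state == "overload":
        return LessonSettings(
            n_words=max(1, settings.n_words - 1),
            seconds_per_word=settings.seconds_per_word + 1.0,
            repetitions=settings.repetitions + 1,
        )
    if predicted_state == "underload":
        return LessonSettings(
            n_words=settings.n_words + 1,
            seconds_per_word=max(1.0, settings.seconds_per_word - 1.0),
            repetitions=max(1, settings.repetitions - 1),
        )
    return settings
```

The lower bounds keep the lesson valid (at least one word, one repetition, one second per word) regardless of how many consecutive adjustments are applied.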
The fully anonymized dataset is publicly available and is aimed at researchers working on mental workload detection with consumer-grade wearable sensors. Among other applications, the data are suitable for developing real-time cognitive load detection methods, researching signal processing techniques, or investigating ML-adjusted E-Learning applications.
The link to the publication will be added here once the manuscript is accepted in the respective journal.
Technical Info
The anonymized data is located in the top-level subfolder 'data'. Within this, the subfolders 'P001_1st_session' through 'P010_2nd_session' contain data from individual participants across their respective first and second sessions (i.e., suffixes '_1st_session' and '_2nd_session').
For each participant-session folder, multiple numerically named subfolders (0, 1, ...) exist, representing distinct recording runs in case an application had to be restarted. In these subfolders, a respective 'RawData' folder contains the sensor files. The main log file for a session (e.g., 'p009_2nd_session_anonymized.log'), located in the main session folder, holds the time-aligned labels for all runs.
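As a rough illustration of the folder layout described above, the following sketch enumerates participant-session folders and their numbered recording runs. The function name list_runs and the regular expression are assumptions made for this example; the sketch relies only on the naming pattern stated in the text ('P001_1st_session', numeric run folders, a 'RawData' subfolder) and is not taken from the repository.

```python
import re
from pathlib import Path

# Folder names such as 'P001_1st_session' or 'P010_2nd_session'.
SESSION_DIR = re.compile(r"^P(\d{3})_(1st|2nd)_session$")


def list_runs(data_root):
    """Yield (participant, session, run, raw_data_dir) for each recording run."""
    for session_dir in sorted(Path(data_root).iterdir()):
        match = SESSION_DIR.match(session_dir.name)
        if not (session_dir.is_dir() and match):
            continue  # skip anything that is not a participant-session folder
        participant, session = int(match.group(1)), match.group(2)
        # Recording runs are the numerically named subfolders (0, 1, ...).
        run_dirs = [d for d in session_dir.iterdir()
                    if d.is_dir() and d.name.isdigit()]
        for run_dir in sorted(run_dirs, key=lambda d: int(d.name)):
            raw = run_dir / "RawData"
            if raw.is_dir():
                yield participant, session, int(run_dir.name), raw
```

Calling list(list_runs("data")) from the dataset's top level would then give one entry per run, ordered by participant, session, and run number.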
Per recording, the following sensor files exist in anonymized form, each carrying the suffix '_anonymized.csv':
Empatica E4: 'ACC.csv', 'BVP.csv', 'GSR.csv', and 'TEMP.csv'
Muse S: 'ACC.csv', 'EEG.csv', 'GYRO.csv', and 'PPG.csv'
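To get a first look at the sensor files listed above, a small helper along the following lines may be useful. It collects every file ending in '_anonymized.csv' inside a 'RawData' folder and returns the first few rows of a file. The helper names anonymized_csvs and peek are assumptions for this example, and since the column layout of the files is not specified here, the sketch deliberately makes no assumption about headers or units.

```python
import csv
from pathlib import Path


def anonymized_csvs(raw_data_dir):
    """Return all files in raw_data_dir whose name ends with '_anonymized.csv'."""
    return sorted(p for p in Path(raw_data_dir).iterdir()
                  if p.is_file() and p.name.endswith("_anonymized.csv"))


def peek(csv_path, n_rows=3):
    """Return the first n_rows rows of a CSV file as lists of strings."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        # zip stops after n_rows rows or at the end of the file.
        return [row for _, row in zip(range(n_rows), reader)]
```

Inspecting the first rows this way shows whether a given file starts with a header line before any numeric parsing is attempted.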
Additionally, the folder 'features_and_labels_pckls' contains pre-processed data, extracted features, and the respective labels for the extracted time-windows, all in .pkl format (e.g., 'P001_S1.pkl', 'P001_S1_with_all_info.pkl').
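The pickle files mentioned above can be opened with Python's standard pickle module. The following is a minimal sketch: the internal structure of the files is not documented here, so the hypothetical describe helper only reports generic properties (type, dictionary keys, shape, or length) without assuming a particular layout. Pickle files should only be loaded from trusted sources, and if the stored objects were created with libraries such as pandas or NumPy, those libraries need to be installed for unpickling to succeed.

```python
import pickle


def load_pickle(path):
    """Load a single .pkl file (only use on files from a trusted source)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def describe(obj):
    """Return a short, structure-agnostic summary of a loaded object."""
    if isinstance(obj, dict):
        return {"type": "dict", "keys": sorted(map(str, obj.keys()))}
    if hasattr(obj, "shape"):
        # Covers array-like and table-like objects that expose .shape.
        return {"type": type(obj).__name__, "shape": tuple(obj.shape)}
    if isinstance(obj, (list, tuple)):
        return {"type": type(obj).__name__, "length": len(obj)}
    return {"type": type(obj).__name__}
```

For example, describe(load_pickle("features_and_labels_pckls/P001_S1.pkl")) would summarize that file's top-level object, assuming the path is given relative to wherever the folder has been extracted.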
Finally, the folder 'compiled_results_and_interview_answers' contains the anonymized learning outcomes, overall questionnaire results, and interview answers in .csv or .ods format.
Source Code
The full source code necessary to reproduce the study's results is available at: https://github.com/HPI-CH/cl_intervention_e_learning_2025.
The repository includes Python scripts to load and process data, extract features, and perform the machine learning analysis on the anonymized data. The Anaconda environment used to run the code can be reproduced with the file E_Learning_conda_env.yml.
Core scripts are categorized by function:
Anonymization: anonymize_data.py and anonymize_log_files.py (used to replace sensitive information like absolute timestamps with relative times or placeholders).
Experiment Setup: eye_closing.py, logging_utilities.py, questionnaires.py, stream_reader_chronjob.py, vocabulary.py, and wait_some_time.py (files needed to run the computerized E-Learning application).
ML Analysis: features_and_labels_extractor_for_multivariate_time_series_regression_on_anonymized_data.py, ml_and_preprocessing.py, and the various _classification and _regression files (scripts required for data preprocessing and all machine learning analyses).
Contact
Please feel free to reach out if you encounter any issues or have questions regarding this dataset, the experimental paradigm, the source code, or the publication. You can reach the authors via the contact information provided in the publication or via email to 'christoph.anders@hpi.de', 'christoph.anders@hpi.uni-potsdam.de', 'office-arnrich@hpi.uni-potsdam.de', or 'e_learning_2025@hpi.de'.
Files
compiled_results_and_interview_answers.zip
Additional details
Dates
- Collected: 2025
Software
- Repository URL: https://github.com/HPI-CH/cl_intervention_e_learning_2025
- Programming language: Python
- Development Status: Inactive