SynapseSet: A Synthetic EEG-to-Text Dataset for Neural Language Model Training

7enn Labs; Abdullah, Kocaman

doi:10.5281/zenodo.15316250

Published May 1, 2025 | Version v1

Dataset Open

🔬 Dataset Description

SynapseSet is a large-scale synthetic dataset designed to bridge quantitative EEG signal analysis with clinical natural language interpretation. Released in configurations of 10K, 50K, and 100K paired examples, it simulates realistic EEG parameters across more than 25 neurological conditions (including epilepsy, sleep stages, cognitive disorders, and neurodegenerative diseases) and maps them to detailed clinical narratives suitable for training instruction-tuned large language models (LLMs).

Each entry includes:

Core EEG frequency band data (delta, theta, alpha, beta, gamma)
Patient demographics and recording parameters
High-fidelity clinical interpretations with domain-specific reasoning
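
As a rough illustration of how an entry with these three parts might be read from a JSONL file, the sketch below parses one line and checks that all five frequency bands are present. The field names (`bands`, `demographics`, `recording`, `interpretation`) and every value in the sample line are hypothetical placeholders chosen for this example; the published schema may differ, so check the actual files before relying on them.

```python
import json

# Hypothetical record layout: the field names and values below are
# illustrative assumptions, not the dataset's published schema.
SAMPLE_JSONL = (
    '{"bands": {"delta": 0.30, "theta": 0.20, "alpha": 0.25, '
    '"beta": 0.15, "gamma": 0.10}, '
    '"demographics": {"age": 34, "sex": "F"}, '
    '"recording": {"sampling_rate_hz": 256}, '
    '"interpretation": "Placeholder clinical narrative."}\n'
)

# The five core frequency bands named in the dataset description.
BAND_NAMES = ("delta", "theta", "alpha", "beta", "gamma")


def parse_records(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def missing_bands(record):
    """Return the names of any core frequency bands absent from a record."""
    bands = record.get("bands", {})
    return [name for name in BAND_NAMES if name not in bands]


records = parse_records(SAMPLE_JSONL)
for record in records:
    print(missing_bands(record), record["interpretation"])
```

A check like `missing_bands` is a cheap way to catch schema mismatches early, before any records reach a training pipeline.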

Key Features:

🧠 Diverse pathology coverage reflecting real-world neurophysiology
⚙️ JSON/JSONL format optimized for supervised instruction-based NLP tasks
🧪 Designed for research in clinical NLP, biomedical reasoning, and multimodal AI
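
For supervised instruction tuning, each record would typically be turned into a prompt/response pair. The sketch below shows one possible conversion. The input field names (`bands`, `interpretation`), the output keys (`instruction`, `output`), and the prompt wording are all assumptions made for illustration and are not taken from the dataset itself.

```python
def to_instruction_pair(record):
    """Build a prompt/response pair from one record.

    Assumes hypothetical fields: "bands" (band name -> value) and
    "interpretation" (the clinical narrative used as the target text).
    """
    bands = record["bands"]
    # Sort band names so the prompt text is deterministic across records.
    band_text = ", ".join(f"{name}={bands[name]}" for name in sorted(bands))
    prompt = f"Interpret the following EEG band values: {band_text}."
    return {"instruction": prompt, "output": record["interpretation"]}


# Illustrative placeholder record, not real or dataset-derived values.
example = {
    "bands": {"theta": 0.2, "alpha": 0.5},
    "interpretation": "Placeholder clinical narrative.",
}
pair = to_instruction_pair(example)
print(pair["instruction"])
```

Keeping the conversion in a single small function makes it easy to swap in a different prompt template without touching the data-loading code.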

Important Note: SynapseSet is fully synthetic and intended strictly for research and educational purposes. Models trained using this dataset must be validated against real EEG data before any clinical application. Limitations include the absence of temporal dynamics, simplified spatial modeling, and potential distributional bias due to simulated data generation methods.

SynapseSet offers a controlled, richly annotated foundation for investigating how language models can interface with neural signals, while emphasizing transparency, reproducibility, and ethical use.
