Published May 1, 2025 | Version v1
Dataset Open

SynapseSet: A Synthetic EEG-to-Text Dataset for Neural Language Model Training

Description

🔬 Dataset Description – SynapseSet: A Synthetic EEG-to-Text Dataset for Neural Language Model Training

SynapseSet is a large-scale synthetic dataset designed to bridge quantitative EEG signal analysis with clinical natural language interpretation. It is provided in three sizes (10K, 50K, and 100K paired examples) and simulates realistic EEG parameters across more than 25 neurological conditions, including epilepsy, sleep stages, cognitive disorders, and neurodegenerative diseases. Each set of parameters is mapped to a detailed clinical narrative suitable for training instruction-tuned large language models (LLMs).

Each entry includes:

  • Core EEG frequency band data (delta, theta, alpha, beta, gamma)

  • Patient demographics and recording parameters

  • High-fidelity clinical interpretations with domain-specific reasoning

Key Features:

  • 🧠 Diverse pathology coverage reflecting real-world neurophysiology

  • ⚙️ JSON/JSONL format optimized for supervised instruction-based NLP tasks

  • 🧪 Designed for research in clinical NLP, biomedical reasoning, and multimodal AI
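
Since the data is distributed as JSON/JSONL, records can be streamed one line at a time without loading a whole file into memory. The following is a minimal sketch, assuming the JSONL files hold one JSON object per line. The helper names iter_jsonl and key_counts are illustrative, and because the field names are not documented on this page, the code only reports whichever top-level keys it finds.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed object per non-empty line (JSONL convention)."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            # Report the offending line so a damaged download is easy to locate.
            raise ValueError(f"line {line_no}: invalid JSON ({exc.msg})") from exc
        yield record


def key_counts(records: Iterable[dict]) -> Counter:
    """Count how often each top-level key appears, to discover the schema."""
    counts: Counter = Counter()
    for record in records:
        if isinstance(record, dict):
            counts.update(record.keys())
    return counts
```

A file handle opened with open(path, encoding="utf-8") can be passed straight to iter_jsonl, because Python file objects iterate line by line. Running key_counts over the first few thousand records is a quick way to confirm which fields carry the EEG band values, the demographics, and the clinical interpretation before building a training pipeline.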

Important Note: SynapseSet is fully synthetic and intended strictly for research and educational purposes. Models trained using this dataset must be validated against real EEG data before any clinical application. Limitations include the absence of temporal dynamics, simplified spatial modeling, and potential distributional bias due to simulated data generation methods.

SynapseSet offers a controlled, richly annotated foundation for investigating how language models can interface with neural signals, while emphasizing transparency, reproducibility, and ethical use.

Files

SynapseSet_ A Synthetic EEG-to-Text Dataset for Neural Language Model Training.pdf

Files (174.3 MB)

Size       Checksum
113.3 MB   md5:7d11d8eda0b84437225b711fd25b3ab2
10.6 MB    md5:eb3e890017dd9357695caa13c704bca8
50.3 MB    md5:cb808063e41f2c2e87cbf273730343b4
197.8 kB   md5:470310917ca8b0f53810eb661d165459
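
The md5 values in the listing can be used to confirm that a download arrived intact. The following is a minimal sketch using Python's standard hashlib module. The function names are illustrative, and since the listing does not show file names, any path passed in is a placeholder chosen by the user.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path: Path, listed: str) -> bool:
    """Compare a file against a listed checksum such as 'md5:7d11...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, matches_listing(Path("downloaded_file"), "md5:7d11d8eda0b84437225b711fd25b3ab2") should return True for an intact copy of the 113.3 MB file. MD5 is adequate here for detecting corrupted transfers, though it is not a safeguard against deliberate tampering.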

Additional details

Software

Repository URL
https://huggingface.co/NextGenC
Programming language
Python
Development Status
Active