SynapseSet: A Synthetic EEG-to-Text Dataset for Neural Language Model Training
Authors/Creators
Description
Dataset Description – SynapseSet: A Synthetic EEG-to-Text Dataset for Neural Language Model Training
SynapseSet is a large-scale synthetic dataset designed to bridge quantitative EEG signal analysis with clinical natural language interpretation. It is provided in three sizes (10K, 50K, and 100K paired examples) and simulates realistic EEG parameters across more than 25 neurological conditions, including epilepsy, sleep stages, cognitive disorders, and neurodegenerative diseases. Each set of parameters is mapped to a detailed clinical narrative suitable for training instruction-tuned large language models (LLMs).
Each entry includes:
- Core EEG frequency band data (delta, theta, alpha, beta, gamma)
- Patient demographics and recording parameters
- High-fidelity clinical interpretations with domain-specific reasoning
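To make the entry layout above concrete, here is a minimal Python sketch of how one record might be parsed and checked. The field names (`band_power`, `demographics`, `recording`, `interpretation`) and all sample values are assumptions made for illustration; they are not taken from the dataset, so check the published files for the actual schema before relying on them.

```python
import json
from dataclasses import dataclass

# The five frequency bands named in the dataset description.
BANDS = ("delta", "theta", "alpha", "beta", "gamma")


@dataclass
class EEGRecord:
    # All field names here are hypothetical, not the dataset's real schema.
    band_power: dict      # band name -> power value (units assumed)
    demographics: dict    # e.g. age, sex
    recording: dict       # e.g. sampling rate, montage
    interpretation: str   # clinical narrative paired with the signal data


def parse_record(line: str) -> EEGRecord:
    """Parse one JSON object and verify that all five bands are present."""
    raw = json.loads(line)
    missing = [b for b in BANDS if b not in raw["band_power"]]
    if missing:
        raise ValueError(f"missing bands: {missing}")
    return EEGRecord(
        band_power={b: float(raw["band_power"][b]) for b in BANDS},
        demographics=raw.get("demographics", {}),
        recording=raw.get("recording", {}),
        interpretation=raw["interpretation"],
    )


def relative_power(record: EEGRecord) -> dict:
    """Normalize band powers so that they sum to 1."""
    total = sum(record.band_power.values())
    return {band: power / total for band, power in record.band_power.items()}


# Made-up sample entry, for illustration only.
sample_line = json.dumps({
    "band_power": {"delta": 4.0, "theta": 3.0, "alpha": 10.0, "beta": 2.0, "gamma": 1.0},
    "demographics": {"age": 34, "sex": "F"},
    "recording": {"sampling_rate_hz": 256},
    "interpretation": "Placeholder narrative.",
})
record = parse_record(sample_line)
rel = relative_power(record)
```

Normalizing to relative power is one common way to compare band distributions across records; whether the dataset stores absolute or relative values is not stated here.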
Key Features:
- Diverse pathology coverage reflecting real-world neurophysiology
- JSON/JSONL format optimized for supervised instruction-based NLP tasks
- Designed for research in clinical NLP, biomedical reasoning, and multimodal AI
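As a rough illustration of how JSONL entries could feed a supervised instruction-tuning pipeline, the sketch below reads records line by line and turns each into an instruction/response pair. The keys `band_power` and `interpretation`, the prompt wording, and the sample lines are all hypothetical; the real files may be organized differently.

```python
import io
import json


def read_jsonl(stream):
    """Yield one parsed JSON object per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def to_instruction_pair(entry: dict) -> dict:
    """Build a prompt/response pair from one entry (hypothetical keys)."""
    bands = ", ".join(f"{name}={value}" for name, value in entry["band_power"].items())
    prompt = f"Interpret the following EEG band powers: {bands}."
    return {"instruction": prompt, "response": entry["interpretation"]}


# In-memory stand-in for a JSONL file; the values are made up.
fake_file = io.StringIO(
    '{"band_power": {"delta": 4.0, "alpha": 10.0}, "interpretation": "Placeholder text."}\n'
    "\n"
    '{"band_power": {"delta": 9.0, "alpha": 2.0}, "interpretation": "Another placeholder."}\n'
)
pairs = [to_instruction_pair(entry) for entry in read_jsonl(fake_file)]
```

With a real file, `fake_file` would be replaced by `open(path, encoding="utf-8")`, and the prompt template would be adapted to whatever format the target model expects.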
Important Note: SynapseSet is fully synthetic and intended strictly for research and educational purposes. Models trained using this dataset must be validated against real EEG data before any clinical application. Limitations include the absence of temporal dynamics, simplified spatial modeling, and potential distributional bias due to simulated data generation methods.
SynapseSet offers a controlled, richly annotated foundation for investigating how language models can interface with neural signals, while emphasizing transparency, reproducibility, and ethical use.
Files
SynapseSet_ A Synthetic EEG-to-Text Dataset for Neural Language Model Training.pdf
Additional details
Software
- Repository URL: https://huggingface.co/NextGenC
- Programming language: Python
- Development Status: Active