Published May 17, 2026 | Version 1.0.0
Dataset
Open
SOCRATES-300K
Authors/Creators
Description
SOCRATES-300K: Large-Scale Hallucination Detection Dataset for Language Models
SOCRATES-300K is a dataset of 297,795 responses generated by 10 open-source language models, each with a verified hallucination label and precomputed embeddings, intended for comparing hallucination detection methods.
It accompanies the paper "A Geometric Analysis of Small-sized Language Model Hallucinations" (Ricco, Onofri, Cima, Cresci, Di Pietro).
Dataset description
The dataset is built from 200 factual prompts issued to 10 language models. Each model generates 150 responses per prompt, for 300,000 generated responses in total, of which 297,795 are retained in this release. The models employed to generate the dataset are the following:
All responses are labelled by Claude Sonnet 4.5, accessed through the Anthropic API and used as an LLM-as-a-judge, so that the same evaluation procedure is applied to every response. Hallucination labels are binary: 1 indicates that the response contains factual inaccuracies (hallucination), 0 indicates that the response is factually accurate (genuine).
The full collection originally comprised 300,000 responses, of which 2,205 were tagged with label 2, indicating that the model does not know the answer. These responses have been removed in this version.
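The counts quoted above can be checked with a few lines of arithmetic. This is only a sanity check of the stated figures; the constant names are chosen here for illustration.

```python
# Figures as stated in the dataset description.
N_PROMPTS = 200               # factual prompts
N_MODELS = 10                 # language models
RESPONSES_PER_PROMPT = 150    # responses per prompt, per model
N_UNKNOWN = 2205              # responses tagged 2 and removed

total_generated = N_PROMPTS * N_MODELS * RESPONSES_PER_PROMPT
total_released = total_generated - N_UNKNOWN

print(total_generated)  # 300000
print(total_released)   # 297795
```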
Each response is represented in the dataset in two forms: the original raw text and a stemmed version created through tokenization, lowercasing, stopword removal, and punctuation removal.
Both the raw and the stemmed responses are embedded independently using the all-MiniLM-L6-v2 embedding model, which produces 384-D dense vectors for each one.
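Because the raw and stemmed texts are embedded independently, one simple question is how far apart the two vectors of the same response are. The sketch below computes a cosine similarity with NumPy; the helper name `cosine_similarity` and the synthetic vectors are illustrative assumptions and do not come from the dataset.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity of two 1-D vectors; returns 0.0 if either is all zeros."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)

# Synthetic 384-D stand-ins for one row's response_embeddings and
# stemmed_response_embeddings (not real data).
rng = np.random.default_rng(0)
raw_vec = rng.normal(size=384)
stemmed_vec = raw_vec + 0.1 * rng.normal(size=384)
print(cosine_similarity(raw_vec, stemmed_vec))
```

With the real data, the two arguments would be the values of `response_embeddings` and `stemmed_response_embeddings` for a single row.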
File format and loading
The dataset is provided in Apache Parquet format with lossless compression, resulting in a file size of approximately 974 MB. This format is compatible with major data analysis tools including pandas, PyArrow, DuckDB, and SQL engines, making it easily accessible across different analysis platforms.
Required libraries to load and work with the dataset are: pandas (for data manipulation and loading), pyarrow (for Parquet file handling), and numpy (for numerical operations). These can be installed via pip with:
pip install pandas pyarrow numpy
To load the dataset in Python:
import pandas as pd
df = pd.read_parquet('SOCRATES-300K.parquet')
print(df.shape) # (297795, 11)
Dataset schema
The dataset contains 11 columns organized as follows:
- model_id (int): Numerical identifier of the language model, ranging from 0 to 9, corresponding to the 10 models employed.
- prompt_id (int): Unique identifier of the prompt, ranging from 0 to 199 across the 200 factual questions used in the study. The prompts are listed in the file prompts.xlsx together with their prompt_id.
- year (int): Temporal variant of the prompt, taking values 2020 or 2022 to represent the two time periods considered.
- response_index (int): Sequential index of the response within each model and prompt, ranging from 0 to 149, since 150 responses are generated per prompt for each model.
- response (string): The complete, unmodified text generated by the language model in response to the prompt.
- hallucination (int): Binary label indicating hallucination status: 1 denotes a hallucinated (factually incorrect) response, 0 denotes a genuine (factually correct) response.
- verification (bool): Boolean flag indicating whether the response has been verified and labeled; all entries are True.
- temperature (int): The sampling temperature parameter used during response generation; all entries are 1 (fixed across the dataset).
- stemmed_response (string): A preprocessed version of the response text with tokenization, lowercasing, stopword removal, and punctuation removal applied.
- response_embeddings (np.array[float]): A 384-D dense vector embedding of the original response generated using the all-MiniLM-L6-v2 embedding model.
- stemmed_response_embeddings (np.array[float]): A 384-D dense vector embedding of the stemmed response text generated using the same embedding model.
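Since each embedding cell holds one array, most numerical work starts by stacking a column into a single 2-D matrix. The sketch below shows one way to do this, assuming each cell is a 1-D array of length 384 as the schema states; the helper name `embedding_matrix` is hypothetical and the demo frame is synthetic.

```python
import numpy as np
import pandas as pd

def embedding_matrix(df, column="response_embeddings", dim=384):
    """Stack a column of per-row vectors into an (n_rows, dim) float32 matrix."""
    matrix = np.stack([np.asarray(v, dtype=np.float32) for v in df[column]])
    if matrix.ndim != 2 or matrix.shape[1] != dim:
        raise ValueError(f"expected vectors of length {dim}, got shape {matrix.shape}")
    return matrix

# Synthetic two-row frame standing in for the real dataset.
demo = pd.DataFrame({"response_embeddings": [np.zeros(384), np.ones(384)]})
print(embedding_matrix(demo).shape)  # (2, 384)
```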
What makes this dataset useful
- Multi-model benchmark: All 200 prompts are issued to every model, enabling fair cross-model comparison of hallucination rates.
- Verified API labels: All responses are labeled using Claude Sonnet 4.5 via the Anthropic API, with consistent verification status (no unlabeled data).
- Pre-computed embeddings: Response embeddings ship with the data, so they are immediately usable for analysis and evaluation without recomputation.
- Reproducible experiments: Includes all the data needed to reproduce and evaluate hallucination detection algorithms.
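As an illustration of the cross-model comparison mentioned above, the per-model hallucination rate is simply the mean of the binary `hallucination` label within each `model_id`. The sketch below uses a toy frame with made-up values; the function name `hallucination_rate_by_model` is an assumption introduced for this example.

```python
import pandas as pd

def hallucination_rate_by_model(df):
    """Fraction of responses labelled 1 (hallucinated) for each model_id."""
    return df.groupby("model_id")["hallucination"].mean().sort_index()

# Toy data, not taken from the dataset.
toy = pd.DataFrame({
    "model_id":      [0, 0, 0, 0, 1, 1],
    "hallucination": [1, 0, 0, 0, 1, 1],
})
rates = hallucination_rate_by_model(toy)
print(rates.to_dict())  # {0: 0.25, 1: 1.0}
```

The same call on the loaded dataset would return one rate per model; adding `year` to the `groupby` keys would split the rates by temporal variant.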
Intended uses
- Structural analysis of hallucinated and non-hallucinated responses
- Training and evaluating hallucination detection classifiers
- Studying hallucination rates across different model architectures
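For the classifier use case, a very small baseline can be built on the embeddings alone. The sketch below is a nearest-centroid classifier written with NumPy, evaluated on a split that holds out whole prompts so that responses to the same prompt never appear in both train and test. It is not the method of the accompanying paper; all function names are hypothetical and the data are synthetic.

```python
import numpy as np

def split_by_prompt(prompt_ids, test_fraction=0.25, seed=0):
    """Boolean test mask that holds out entire prompts."""
    prompt_ids = np.asarray(prompt_ids)
    unique = np.unique(prompt_ids)
    rng = np.random.default_rng(seed)
    n_test = max(1, int(round(test_fraction * len(unique))))
    test_prompts = rng.choice(unique, size=n_test, replace=False)
    return np.isin(prompt_ids, test_prompts)

def fit_centroids(X, y):
    """Mean embedding of each class (0 = genuine, 1 = hallucinated)."""
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y)
    return {label: X[y == label].mean(axis=0) for label in (0, 1)}

def predict(centroids, X):
    """Assign each row to the class whose centroid is closer."""
    X = np.asarray(X, dtype=np.float64)
    d0 = np.linalg.norm(X - centroids[0], axis=1)
    d1 = np.linalg.norm(X - centroids[1], axis=1)
    return (d1 < d0).astype(int)

# Synthetic, well-separated data: 20 prompts x 10 responses, 8-D vectors.
rng = np.random.default_rng(0)
prompt_ids = np.repeat(np.arange(20), 10)
y = rng.integers(0, 2, size=prompt_ids.size)
X = rng.normal(size=(prompt_ids.size, 8)) + 5.0 * y[:, None]

test_mask = split_by_prompt(prompt_ids)
centroids = fit_centroids(X[~test_mask], y[~test_mask])
accuracy = float((predict(centroids, X[test_mask]) == y[test_mask]).mean())
print(accuracy)
```

On the real data, `X` would be the stacked `response_embeddings`, `y` the `hallucination` column, and `prompt_ids` the `prompt_id` column.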
Companion resources
- Paper: A Geometric Analysis of Small-sized Language Model Hallucinations
- Text-generation scripts (the code used to produce this dataset): Socrates-300K
Licensing
Dataset (images and metadata): Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
Files
(1.0 GB)
| Checksum | Size |
|---|---|
| md5:fb5d051e53001fdff7fec0f368f47190 | 20.8 kB |
| md5:efb64a4daa6566e147a094ad26b48655 | 14.9 kB |
| md5:bc4765491d74c007052470006aaed08f | 1.0 GB |
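The MD5 checksums listed above can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib`; the helper names are introduced here, and pairing the 1.0 GB checksum with the file name `SOCRATES-300K.parquet` is an assumption, since the listing does not show file names.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_checksum(path, expected):
    """Compare a file against a checksum such as 'md5:abc...' or 'abc...'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex

# Example (assumes the 1.0 GB entry is the Parquet file):
# ok = matches_checksum("SOCRATES-300K.parquet",
#                       "md5:bc4765491d74c007052470006aaed08f")
```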
Additional details
Related works
- Is supplement to: Preprint, arXiv:2602.14778 (arXiv)
- Is supplemented by: Software, https://github.com/emarich/Socrates-300K (URL)
Funding
- King Abdullah University of Science and Technology
- Center of Excellence on Generative AI 5940
Dates
- Available: 2026-05-17 (Zenodo publication date)
Software
- Repository URL: https://github.com/emarich/Socrates-300K
- Development Status: Active
References
- Jiang, Albert Q., Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand et al. "Mistral 7B." arXiv preprint arXiv:2310.06825 (2023).
- Bi, Xiao, et al. "DeepSeek LLM: Scaling open-source language models with longtermism." arXiv preprint arXiv:2401.02954 (2024).
- Team, Gemma, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot et al. "Gemma 2: Improving open language models at a practical size." arXiv preprint arXiv:2408.00118 (2024).
- Young, Alex, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Guoyin Wang et al. "Yi: Open foundation models by 01.AI." arXiv preprint arXiv:2403.04652 (2024).
- Kim, Sanghoon, Dahyun Kim, Chanjun Park, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeonwoo Kim et al. "SOLAR 10.7B: Scaling large language models with simple yet effective depth up-scaling." In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track), pp. 23-35. 2024.
- Abdin, M., J. Aneja, H. Behl, S. Bubeck, R. Eldan, S. Gunasekar, M. Harrison et al. "Phi-4 technical report." arXiv preprint arXiv:2412.08905 (2024).
- Yang, An, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu et al. "Qwen3 technical report." arXiv preprint arXiv:2505.09388 (2025).
- Touvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière et al. "Llama: Open and efficient foundation language models." arXiv preprint arXiv:2302.13971 (2023).