Published May 17, 2026 | Version 1.0.0
Dataset Open

SOCRATES-300K

  • 1. King Abdullah University of Science and Technology
  • 2. Istituto per le Applicazioni del Calcolo Mauro Picone
  • 3. Roma Tre University
  • 4. University of Pisa
  • 5. Institute of Informatics and Telematics
  • 6. King Abdullah University of Science and Technology Department of Computer Science

Description

SOCRATES-300K: Large-Scale Hallucination Detection Dataset for Language Models


SOCRATES-300K is a dataset of 297,795 responses from 10 open-source language models, each with a verified hallucination label and precomputed embeddings, intended for comparing hallucination detection methods.
It accompanies the paper "A Geometric Analysis of Small-sized Language Model Hallucinations" (Ricco, Onofri, Cima, Cresci, Di Pietro).
 

Dataset description

The dataset contains 297,795 text responses generated by 10 language models from 200 factual prompts. Each model generates 150 responses for each of the 200 prompts, for 300,000 responses in total; 297,795 of them remain after the filtering described below. The models employed to generate the dataset are the following:
 
 
All responses are labelled with Claude Sonnet 4.5, accessed through the Anthropic API and used as an LLM-as-a-judge, so that the same evaluation procedure is applied to every response. Hallucination labels are binary: 1 indicates that the response contains factual inaccuracies (hallucination), 0 indicates that the response is factually accurate (genuine).
The full collection comprised 300,000 responses, of which 2,205 were tagged 2, indicating that the model does not know the answer. These responses are removed in this version.
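The counts quoted above can be checked with a few lines of arithmetic. This is only a sanity check on the numbers stated in this description; the constant names are illustrative.

```python
# Counts as stated in the dataset description
N_PROMPTS = 200               # factual prompts
N_RESPONSES_PER_PROMPT = 150  # responses per prompt, per model
N_MODELS = 10                 # language models
N_TAGGED_2 = 2_205            # responses tagged 2 and removed

total_generated = N_PROMPTS * N_RESPONSES_PER_PROMPT * N_MODELS
total_released = total_generated - N_TAGGED_2

print(total_generated, total_released)  # 300000 297795
```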
Each response is represented in the dataset in two forms: the original raw text and a stemmed version created through tokenization, lowercasing, stopword removal, and punctuation removal. Both the raw and the stemmed responses are embedded independently using the all-MiniLM-L6-v2 embedding model, which produces a 384-D dense vector for each one.
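The schema describes the stemmed text as the result of tokenization, lowercasing, stopword removal, and punctuation removal. The sketch below illustrates that kind of preprocessing with the standard library only. It is not the authors' pipeline: the whitespace tokenizer and the tiny STOPWORDS set are assumptions made for illustration, and the actual tokenizer and stopword list are not stated here.

```python
import string

# Hypothetical, deliberately tiny stopword list (the list used for the dataset is not stated)
STOPWORDS = {"a", "an", "the", "is", "was", "of", "in", "and", "to"}

def preprocess(text: str) -> str:
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return " ".join(t for t in tokens if t not in STOPWORDS)

print(preprocess("The capital of France is Paris."))  # capital france paris
```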
 
 

File format and loading

The dataset is provided in Apache Parquet format with lossless compression, resulting in a file size of approximately 974 MB. This format is compatible with major data analysis tools including pandas, PyArrow, DuckDB, and SQL engines, making it easily accessible across different analysis platforms.
 
The libraries required to load and work with the dataset are pandas (data loading and manipulation), pyarrow (Parquet file handling), and numpy (numerical operations). They can be installed via pip with:
  pip install pandas pyarrow numpy

To load the dataset in Python:
  import pandas as pd
  df = pd.read_parquet('SOCRATES-300K.parquet')
  print(df.shape)  # (297795, 11)
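
When only labels and text are needed, it may help to skip the two embedding columns, which presumably account for much of the file size (an assumption, not a measured figure). The sketch below relies on the columns parameter of pandas.read_parquet; LIGHT_COLUMNS and load_light are hypothetical names, and the column names are taken from the schema in this description.

```python
import pandas as pd

# Column names from the dataset schema, leaving out the two embedding columns
LIGHT_COLUMNS = [
    "model_id", "prompt_id", "year", "response_index", "response",
    "hallucination", "verification", "temperature", "stemmed_response",
]

def load_light(path: str = "SOCRATES-300K.parquet") -> pd.DataFrame:
    """Read only the non-embedding columns of the Parquet file."""
    return pd.read_parquet(path, columns=LIGHT_COLUMNS)
```

Calling load_light() should return a frame with 9 columns instead of 11.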
 

Dataset schema

The dataset contains 11 columns organized as follows:

  • model_id (int): Numerical identifier of the language model, ranging from 0 to 9, corresponding to the 10 models employed.
  • prompt_id (int): Unique identifier for each prompt, ranging from 0 to 199 across the 200 factual questions used in the study. The prompts are listed in the file prompts.xlsx together with the corresponding prompt_id.
  • year (int): Temporal variant of the prompt, taking values 2020 or 2022 to represent the two time periods considered.
  • response_index (int): Sequential index of the response for each prompt, ranging from 0 to 149, since each model generates 150 responses per prompt.
  • response (string): The complete, unmodified text generated by the language model in response to the prompt.
  • hallucination (int): Binary label indicating hallucination status—1 denotes a hallucinated (factually incorrect) response, 0 denotes a genuine (factually correct) response.
  • verification (bool): Boolean flag indicating whether the response has been verified and labeled; all entries are True.
  • temperature (int): The sampling temperature parameter used during response generation; all entries are 1 (fixed across the dataset).
  • stemmed_response (string): A preprocessed version of the response text with tokenization, lowercasing, stopword removal, and punctuation removal applied.
  • response_embeddings (np.array[float]): A 384-D dense vector embedding of the original response generated using the all-MiniLM-L6-v2 embedding model.
  • stemmed_response_embeddings (np.array[float]): A 384-D dense vector embedding of the stemmed response text generated using the same embedding model.
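
Because each embedding cell holds its own 384-D vector, most numerical code will first want a single 2-D matrix. The sketch below shows one way to build it. It runs on a small synthetic frame that reuses the column names above; the values are random and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with the dataset's column names (values are made up)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "model_id": [0, 0, 1, 1],
    "hallucination": [1, 0, 1, 1],
    "response_embeddings": [rng.normal(size=384) for _ in range(4)],
})

# Stack the per-row vectors into one (n_rows, 384) matrix
X = np.stack(df["response_embeddings"].tolist())
y = df["hallucination"].to_numpy()

print(X.shape, y.shape)  # (4, 384) (4,)
```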

What makes this dataset useful

- Multi-model benchmark: All 200 prompts are issued to every model, enabling fair cross-model comparison of hallucination rates.

- Verified API labels: All responses are labeled with Claude Sonnet 4.5 via the Anthropic API, and every entry has a verified label (no unlabeled data).

- Pre-computed embeddings: Response embeddings ship with the data, so they are immediately usable for analysis and evaluation without recomputation.
 
- Reproducible experiments: Includes all the data needed to reproduce and evaluate hallucination detection algorithms.
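
As an illustration of the cross-model comparison mentioned above: since the hallucination label is 0 or 1, its mean per model_id is the hallucination rate. The sketch uses a toy frame with made-up labels, not figures from the dataset.

```python
import pandas as pd

# Toy frame with the dataset's column names (labels are made up)
df = pd.DataFrame({
    "model_id":      [0, 0, 0, 1, 1, 1],
    "hallucination": [1, 0, 0, 1, 1, 0],
})

# Mean of a 0/1 label is the fraction of hallucinated responses
rates = df.groupby("model_id")["hallucination"].mean()
print(rates)
```

The same groupby applied to the loaded dataset would give one rate per model_id; grouping by ["model_id", "year"] would split the rates by prompt variant.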
 
 

Intended uses

- Structural analysis of hallucinated and non-hallucinated responses
- Training and evaluating hallucination detection classifiers
- Studying hallucination rates across different model architectures
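
As a rough illustration of the classifier use case, the sketch below fits a nearest-centroid baseline on 384-D vectors. It is not the method of the accompanying paper. The data are two synthetic Gaussian clouds standing in for embeddings of genuine (0) and hallucinated (1) responses, so the resulting accuracy says nothing about the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 384

# Synthetic stand-in: two Gaussian clouds with shifted means (not real embeddings)
X = np.vstack([
    rng.normal(loc=0.0, size=(200, dim)),  # label 0
    rng.normal(loc=0.5, size=(200, dim)),  # label 1
])
y = np.array([0] * 200 + [1] * 200)

# Random train/test split
idx = rng.permutation(len(y))
train, test = idx[:300], idx[300:]

# One centroid per class, computed on the training split only
centroids = np.stack([X[train][y[train] == c].mean(axis=0) for c in (0, 1)])

# Assign each test vector to the closest centroid (Euclidean distance)
dists = np.linalg.norm(X[test][:, None, :] - centroids[None, :, :], axis=2)
pred = dists.argmin(axis=1)

accuracy = (pred == y[test]).mean()
print(f"accuracy on synthetic data: {accuracy:.3f}")
```

With the real data, X would be the stacked response_embeddings and y the hallucination column; splitting by prompt_id rather than by row may be preferable, so that responses to the same prompt do not appear in both splits.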
 

Companion resources

Licensing

Dataset (responses, labels, and embeddings): Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).

Files

Files (1.0 GB)

Checksum Size
md5:fb5d051e53001fdff7fec0f368f47190 20.8 kB
md5:efb64a4daa6566e147a094ad26b48655 14.9 kB
md5:bc4765491d74c007052470006aaed08f 1.0 GB

Additional details

Related works

Is supplement to
Preprint: arXiv:2602.14778 (arXiv)
Is supplemented by
Software: https://github.com/emarich/Socrates-300K (URL)

Funding

King Abdullah University of Science and Technology
Center of Excellence on Generative AI 5940

Dates

Available
2026-05-17
Zenodo publication date

Software

Repository URL
https://github.com/emarich/Socrates-300K
Development Status
Active

References

  • Jiang, Albert Q., Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand et al. "Mistral 7B."
  • Bi, Xiao, et al. "Deepseek llm: Scaling open-source language models with longtermism." arXiv preprint arXiv:2401.02954 (2024).
  • Team, Gemma, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot et al. "Gemma 2: Improving open language models at a practical size." arXiv preprint arXiv:2408.00118 (2024).
  • Young, Alex, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Guoyin Wang et al. "Yi: Open foundation models by 01. ai." arXiv preprint arXiv:2403.04652 (2024).
  • Kim, Sanghoon, Dahyun Kim, Chanjun Park, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeonwoo Kim et al. "Solar 10.7 b: Scaling large language models with simple yet effective depth up-scaling." In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track), pp. 23-35. 2024.
  • Abdin, M., Aneja, J., Behl, H., Bubeck, S., Eldan, R., Gunasekar, S., Harrison, M., Hewett, R.J., Javaheripi, M., Kauffmann, P. and Lee, J.R., 2024. Phi-4 technical report. arXiv preprint arXiv:2412.08905.
  • Yang, An, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu et al. "Qwen3 technical report." arXiv preprint arXiv:2505.09388 (2025).
  • Touvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière et al. "Llama: Open and efficient foundation language models." arXiv preprint arXiv:2302.13971 (2023).