
Published January 18, 2026 | Version v1
Preprint | Open Access

Humanity's Last Hallucination: A Forensic Audit of the Scientific Insolvency in GPQA and HLE

Authors/Creators

Description

Forensic Audit Report: Scientific Insolvency in GPQA & HLE

Title: Humanity’s Last Hallucination: A Forensic Audit of the Scientific Insolvency in GPQA and HLE
Date: January 2026
Type: Independent Forensic Audit / Technical Report
Files: English Edition (Primary) & Chinese Edition (Original)

---

Critical Audit Findings

This study conducts a forensic audit of GPQA (Diamond) and Humanity's Last Exam (HLE), widely regarded as "gold standards" for AGI benchmarking. Through a four-phase progressive verification process, we reveal severe scientific insolvency:


*   GPQA Diamond: Inherent error rate lower bound is 26.8%.
    *   Forensic Characterization: "An advanced intellectual booby trap from the old era."

*   Humanity's Last Exam (HLE): Inherent error rate lower bound is 58.0% (37.2% overall).
    *   Forensic Characterization: "A scientific ruin dismembered by adversarial filtering mechanisms."

*   Verdict: The ruler used to measure AI evolution is severely distorted. Detailed evidence of "factual errors," "missing parameters," and "transcription mistakes" is documented in the full report.

 "The game of science is, in principle, without end. He who decides one day that scientific statements do not call for any further test... retires from the game." — Sir Karl Popper

---

Abstract

This study conducts a forensic audit of GPQA and Humanity's Last Exam (HLE), widely regarded as the "gold standards" for probing the limits of machine intelligence. Through a four-phase progressive verification process, we reveal severe scientific insolvency: the inherent error rate lower bounds for GPQA and HLE reach 26.8% and 58.0%, respectively. These systematic fallacies stem from factual errors, missing parameters, and transcription mistakes. Our research shows that GPQA is essentially an "advanced intellectual booby trap from the old era," while HLE is a "scientific ruin dismembered by its adversarial filtering mechanism." Both benchmarks have degenerated from rulers for measuring intelligence into noise generators that measure how closely models fit logical fallacies.

---

Integrity & Verification

* Please refer to the attached `Audit_Integrity_Record.txt` for the MD5/SHA256 checksums of the manuscript, which document the chain of custody.
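
For readers who want to check their download, the snippet below is a minimal sketch of how such checksums can be recomputed with Python's standard `hashlib`. The file name and the expected digest are placeholders; substitute the values listed in `Audit_Integrity_Record.txt`.

```python
import hashlib
from pathlib import Path

def file_digests(path: Path, chunk_size: int = 1 << 20) -> dict[str, str]:
    """Compute MD5 and SHA256 digests of a file by streaming it in chunks."""
    md5, sha256 = hashlib.md5(), hashlib.sha256()
    with path.open("rb") as fh:
        while chunk := fh.read(chunk_size):
            md5.update(chunk)
            sha256.update(chunk)
    return {"md5": md5.hexdigest(), "sha256": sha256.hexdigest()}

if __name__ == "__main__":
    # Placeholder values: replace with the real file name and the
    # SHA256 digest published in Audit_Integrity_Record.txt.
    manuscript = Path("manuscript_v1.pdf")
    expected_sha256 = "<sha256 from Audit_Integrity_Record.txt>"

    digests = file_digests(manuscript)
    print(digests)
    print("SHA256 match:", digests["sha256"] == expected_sha256)
```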

---

Reproducibility Roadmap

The verification toolchain is fully functional and archived.

Status:
The author is an independent researcher. To prevent accidental leakage of personal credentials and ensure the scripts run smoothly in community environments, the codebase is currently undergoing API key removal and environment configuration decoupling.

> Release Schedule: The full code (including 139 verification scripts) will be validated for clean execution and released within two weeks.

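Until the repository is published, the sketch below illustrates the kind of credential decoupling described above: API keys are resolved from environment variables instead of being hard-coded in scripts. The variable names (`DSOC_API_KEY`, `DSOC_BASE_URL`) and the default URL are illustrative placeholders, not the released configuration.

```python
import os
from dataclasses import dataclass

@dataclass
class AuditConfig:
    """Runtime configuration resolved from the environment, never from source."""
    api_key: str
    base_url: str

def load_config() -> AuditConfig:
    # Hypothetical variable names; the released scripts may use different ones.
    api_key = os.environ.get("DSOC_API_KEY")
    if not api_key:
        raise RuntimeError("Set DSOC_API_KEY before running the verification scripts.")
    base_url = os.environ.get("DSOC_BASE_URL", "https://api.example.com/v1")
    return AuditConfig(api_key=api_key, base_url=base_url)
```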

Keywords: GPQA, HLE, Benchmark Insolvency, AI Safety, DeepSeek-Overclock, Forensic Audit, Inherent Error Rate, AGI Theory

Table of contents

1. Results: Dataset Quality Audit & Model Performance Ladder
2. Introduction
3. Research Methodology: Four-phase Progressive Verification
4. Exemplary Case Studies: From "Arduous" to "Absurd"
5. Why GPQA & HLE Exhibit Scientific Insolvency
6. Limitations and Generational Reconstruction
7. Acknowledgments
Appendix: Audit Integrity Record

Methods

This audit employs a four-phase progressive verification method:
Phase I: Data Extraction and Preliminary Validation (failure-saturation sampling).
Phase II: Deep Validation and Manual Review (dual-judge model verification; see the first sketch below).
Phase III: Comprehensive Validation and Error Rate Estimation (brute-force mathematical verification; see the lower-bound sketch below).
Phase IV: Result Consolidation and Forensic Analysis.
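
To make the phase structure concrete, here is a minimal, hypothetical sketch of the Phase II dual-judge gate: an item is escalated to manual review only when two independent judge models both dispute the official answer key. The `Judge` callable is a placeholder for whatever judging backend is used; this is not the released pipeline code.

```python
from typing import Callable

# A judge callable takes (question, official_answer) and returns True
# if it independently *disputes* the official answer.
Judge = Callable[[str, str], bool]

def dual_judge_gate(question: str, official_answer: str,
                    judge_a: Judge, judge_b: Judge) -> str:
    """Return the Phase II disposition of a single benchmark item."""
    a_disputes = judge_a(question, official_answer)
    b_disputes = judge_b(question, official_answer)
    if a_disputes and b_disputes:
        return "escalate_to_manual_review"   # both judges reject the answer key
    if a_disputes or b_disputes:
        return "recheck"                     # disagreement: sample the item again
    return "accept"                          # both judges accept the answer key
```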

Toolchain: DeepSeek-Overclock (dsoc) for logical probing; Independent Python verification environments.
Audit Philosophy: Popperian Falsifiability ("Logic as the Sole Arbiter").
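
Phase III turns confirmed defect counts into error-rate lower bounds. The report documents its exact estimator in the full text; purely as a generic illustration, the sketch below computes the trivial lower bound (confirmed defects over items audited) alongside a one-sided Wilson score lower confidence bound. The counts in the example are made-up placeholders, not the audit's data.

```python
import math

def wilson_lower_bound(defects: int, audited: int, z: float = 1.645) -> float:
    """One-sided Wilson score lower bound (default ~95% confidence) for the
    true defect proportion, given `defects` confirmed errors out of `audited` items."""
    if audited == 0:
        return 0.0
    p_hat = defects / audited
    denom = 1.0 + z * z / audited
    centre = p_hat + z * z / (2.0 * audited)
    margin = z * math.sqrt(p_hat * (1.0 - p_hat) / audited + z * z / (4.0 * audited ** 2))
    return max(0.0, (centre - margin) / denom)

if __name__ == "__main__":
    # Placeholder counts for illustration only; see the report for the real data.
    defects, audited = 29, 100
    print(f"point lower bound  : {defects / audited:.1%}")   # confirmed defects / audited items
    print(f"wilson lower bound : {wilson_lower_bound(defects, audited):.1%}")
```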

Files

人类最后幻觉_对 GPQA 与 HLE 科学失格的法医级清算_v1.pdf

Additional details

Additional titles

Translated title
人类最后幻觉:对 GPQA 与 HLE 科学失格的法医级清算

References

  • Karl R. Popper. The Logic of Scientific Discovery. Trans. by Ru Qiang Zha, Ren Zong Qiu, and Muchun Wan. Hangzhou: China Academy of Art Press, 2008.
  • L. D. Landau and E. M. Lifshitz. Statistical Physics, Part 1. 3rd ed. Course of Theoretical Physics, Vol. 5. Oxford: Butterworth-Heinemann, 1980.
  • Bradley W. Carroll and Dale A. Ostlie. An Introduction to Modern Astrophysics. 2nd ed. Cambridge: Cambridge University Press, 2017.
  • Robin Hartshorne. Algebraic Geometry. Graduate Texts in Mathematics, No. 52. New York: Springer-Verlag, 1977.
  • Sammy Zeng. Policy Entropy as Order Parameter: Landau Theory Migration for Intelligence Dynamics Framework. July 2025. Version 1.2.
  • David Rein et al. GPQA: A Graduate-Level Google-Proof Q&A Benchmark. In: arXiv preprint (2023). eprint: 2311.12022v1.
  • Long Phan et al. Humanity's Last Exam. In: arXiv preprint (2025). eprint: 2501.14249v9.
  • DeepSeek-AI. DeepSeek-V3 Technical Report. In: arXiv preprint (2024). eprint: 2412.19437.
  • DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. In: arXiv preprint (2025). eprint: 2501.12948.
  • DeepSeek-AI. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models. In: arXiv preprint (2025). eprint: 2512.02556.
  • DeepSeek-AI. DeepSeek-V3.2-Exp: Boosting Long-Context Efficiency with DeepSeek Sparse Attention. Experimental Report. DeepSeek-AI, 2025.
  • Haoran Wei, Yaofeng Sun, and Yukun Li. DeepSeek-OCR: Contexts Optical Compression. In: arXiv preprint (2024). eprint: 2412.20303.
  • Zhihong Shao et al. DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning. In: arXiv preprint (2025). eprint: 2511.22570.
  • Curtis G. Northcutt, Anish Athalye, and Jonas Mueller. Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks. In: arXiv preprint (2021). eprint: 2103.14749v4.
  • Samuel R. Bowman and George E. Dahl. What Will it Take to Fix Benchmarking in Natural Language Understanding? In: arXiv preprint (2021). eprint: 2104.02145v3.
  • Douwe Kiela et al. Dynabench: Rethinking Benchmarking in NLP. In: arXiv preprint (2021). eprint: 2104.14337v1.
  • Luyu Gao et al. PAL: Program-aided Language Models. In: arXiv preprint (2023). eprint: 2211.10435v2.
  • Hunter Lightman et al. Let's Verify Step by Step. In: arXiv preprint (2023). eprint: 2305.20050v1.
  • Xuezhi Wang et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In: International Conference on Learning Representations (ICLR). 2023.
  • Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are Emergent Abilities of Large Language Models a Mirage? In: arXiv preprint (2023). eprint: 2304.15004v2.
  • Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, et al. Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models. In: arXiv preprint (2022). eprint: 2206.04615.
  • Jason Wei et al. Chain of Thought Prompting Elicits Reasoning in Large Language Models. In: Proceedings of the 39th International Conference on Machine Learning. PMLR, 2022, pp. 22140–22154.
  • Stephen Brunauer, P. H. Emmett, and Edward Teller. Adsorption of Gases in Multimolecular Layers. In: Journal of the American Chemical Society 60.2 (Feb. 1938), pp. 309–319.
  • Yixin Nie et al. Adversarial NLI: A New Benchmark for Natural Language Understanding. In: arXiv preprint (2019). eprint: 1910.14599.
  • Yubo Wang et al. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. In: arXiv preprint (2024). eprint: 2406.01574v6.