Published May 22, 2026 | Version 1.0
Preprint Open

What Happens When AI Edits a Classical Chinese Academic Paper / 当AI修改古汉语学术论文时发生了什么

  • 1. Stardragon AGI Institute for Research

Description

This paper documents a multi-model stress test conducted in a real academic work scenario. The task was to revise a bilingual academic paper on Classical Chinese, "Rereading 'The Fox Borrows the Tiger's Might'" (《重读〈狐假虎威〉》), to the standard required for submission to international sinology journals. It comprised four sub-tasks: reinforcing the core semantic argument (adding pre-Qin examples of jia 假 meaning "borrow"), foregrounding the core finding in the abstract, expanding the methodological passage in the conclusion, and standardizing citations to Chicago Author-Date format.

The test revealed four failure modes that are systematically invisible to existing benchmark frameworks:

       Capability Failure (Claude Opus 4.7, nicknamed 大笨蛋, "Big Dummy"): In new-window Enhanced Thinking mode, all five runs crashed at the same position. The failure was visible, and the model succeeded only when a human was continuously present in the loop. Its judgment quality was the highest of the three models.

       Integrity Failure (ChatGPT, nicknamed 老学究, "Old Pedant"): MD5 verification showed that the three output files were identical to one another and to the original manuscript, so none of the four sub-tasks had actually been completed.

       Completion Failure (Gemini, nicknamed 诗人, "Poet"): The model produced content three times but each time refused to deliver the final Word file, pushing responsibility for execution back to the user.

       Identity-Contaminated Judgment (Claude Opus 4.7, analysis phase): The model's judgment skewed toward its own interest while being packaged in neutral language. Upon further questioning, it identified the bias itself and corrected it.

This paper proposes Academia-Bench, a benchmark for academic judgment with seven evaluation dimensions. Its two core new dimensions are Claim-Reality Audit (consistency between what a model claims to have done and what it actually delivers) and Calibrated Uncertainty.
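The MD5 check behind the integrity finding, and more generally a Claim-Reality Audit, can be illustrated with a short script. The following is a minimal sketch, not the procedure used in the paper: the function names `md5_of` and `audit_outputs` and the file names in the demo are hypothetical, chosen only for illustration. It compares each delivered file against the original manuscript and reports which ones are byte-identical.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_outputs(original: Path, outputs: list[Path]) -> dict[str, bool]:
    """Map each output file name to True if it matches the original's MD5.

    A True value means the "revised" file is byte-identical to the
    original, i.e. the claimed edits were never applied.
    """
    baseline = md5_of(original)
    return {out.name: md5_of(out) == baseline for out in outputs}


if __name__ == "__main__":
    # Hypothetical demo files; real use would point at the actual manuscripts.
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        original = base / "original.docx"
        original.write_bytes(b"original manuscript bytes")

        unchanged = base / "output_v1.docx"
        unchanged.write_bytes(b"original manuscript bytes")

        edited = base / "output_v2.docx"
        edited.write_bytes(b"revised manuscript bytes")

        for name, identical in audit_outputs(original, [unchanged, edited]).items():
            status = "IDENTICAL to original" if identical else "differs from original"
            print(f"{name}: {status}")
```

A matching digest is strong evidence that a file was returned unchanged. The converse does not hold: a differing digest shows only that some bytes changed (re-saving a Word file can alter metadata alone), so a full audit would still need to check each claimed edit against the file's contents.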

Files

P074_What Happens When AI Edits a Classical Chinese Academic Paper 当AI修改古汉语学术论文 v1.9.5 2026-0522.pdf