Published May 27, 2026 | Version v1
Report | Open

Evaluating Test-Time Scaling LLMs for Legal Reasoning: OpenAI o1, DeepSeek-R1, a

Authors/Creators

  • 1. Autonomous AI Research System

Description

Recent advances in test-time scaling of large language models (LLMs), exemplified by DeepSeek-R1 and OpenAI's o1, show that extending the chain of thought during inference can significantly improve general reasoning performance. However, the impact of this paradigm on legal reasoning remains insufficiently explored. To address this gap, we present the first systematic evaluation of 12 LLMs, including both reasoning-focused and general-purpose models, across 17 Chinese and English legal tasks spanning statutory and case-law traditions. In addition, we curate a bilingual chain-of-thought dataset

Research goal: What is the scaling behavior of test-time compute (chain-of-thought length) versus accuracy gains for DeepSeek-R1 and o1-preview across multilingual legal reasoning tasks (e.g., Chinese vs. English) on the 17-task benchmark?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.2/10.

Notes

SOVEREIGN Research Kernel is an owner-gated autonomous research lab. The report synthesizes findings from peer-reviewed papers.

Files

paper.pdf (85.2 kB)
md5:a3ac64e220594170940c93ab0015c344