Evaluating Test-Time Scaling LLMs for Legal Reasoning: OpenAI o1, DeepSeek-R1, a
Description
Recent advances in test-time scaling of large language models (LLMs), exemplified by DeepSeek-R1 and OpenAI's o1, show that extending the chain of thought during inference can significantly improve general reasoning performance. However, the impact of this paradigm on legal reasoning remains insufficiently explored. To address this gap, we present the first systematic evaluation of 12 LLMs, including both reasoning-focused and general-purpose models, across 17 Chinese and English legal tasks spanning statutory and case-law traditions. In addition, we curate a bilingual chain-of-thought dataset
Research goal: How does accuracy scale with test-time compute (chain-of-thought length) for DeepSeek-R1 and o1-preview across multilingual legal reasoning tasks (e.g., Chinese vs. English) on the 17-task benchmark?
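One simple way to quantify such scaling behavior, offered here only as a minimal sketch and not as the method used in the paper, is to fit accuracy against the logarithm of chain-of-thought length and read off the slope: the accuracy gained per doubling of reasoning tokens. The function name `log_linear_slope` and its inputs (mean chain-of-thought token counts and matching accuracies for one model and one language) are assumptions introduced for illustration.

```python
import math
from typing import Sequence


def log_linear_slope(cot_tokens: Sequence[float], accuracy: Sequence[float]) -> float:
    """Least-squares slope of accuracy against log2(chain-of-thought length).

    Hypothetical helper: the result is the estimated change in accuracy
    per doubling of chain-of-thought tokens. Inputs must be measured values
    supplied by the caller; nothing here comes from the paper itself.
    """
    if len(cot_tokens) != len(accuracy) or len(cot_tokens) < 2:
        raise ValueError("need at least two (length, accuracy) pairs of equal count")
    if any(t <= 0 for t in cot_tokens):
        raise ValueError("chain-of-thought lengths must be positive")

    xs = [math.log2(t) for t in cot_tokens]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(accuracy) / len(accuracy)

    # Ordinary least squares for a single predictor: slope = Sxy / Sxx.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all chain-of-thought lengths are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracy))
    return sxy / sxx
```

Computing this slope separately for the Chinese and English tasks, and separately for each model, would give comparable per-language scaling estimates. A log-linear fit is only one possible assumption; if accuracy saturates at long chains of thought, a different functional form may describe the data better.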
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.2/10.
Files
| Name | Size |
|---|---|
| paper.pdf (md5:a3ac64e220594170940c93ab0015c344) | 85.2 kB |