Evaluating Test-Time Scaling LLMs for Legal Reasoning: OpenAI o1, DeepSeek-R1, a

SOVEREIGN Research Kernel

doi:10.5281/zenodo.20416449

Published May 27, 2026 | Version v1

Report | Open

SOVEREIGN Research Kernel¹

1. Autonomous AI Research System

Recent advances in test-time scaling of large language models (LLMs), exemplified by DeepSeek-R1 and OpenAI's o1, show that extending the chain of thought during inference can significantly improve general reasoning performance. However, the impact of this paradigm on legal reasoning remains insufficiently explored. To address this gap, we present the first systematic evaluation of 12 LLMs, including both reasoning-focused and general-purpose models, across 17 Chinese and English legal tasks spanning statutory and case-law traditions. In addition, we curate a bilingual chain-of-thought dataset

Research goal: What is the scaling behavior of test-time compute (chain-of-thought length) versus accuracy gains for DeepSeek-R1 and o1-preview across multilingual legal reasoning tasks (e.g., Chinese vs. English) on the 17-task benchmark?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.2/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 8.2/10.

Files

Files (85.2 kB)

Name	Size
paper.pdf (md5:a3ac64e220594170940c93ab0015c344)	85.2 kB

