TurkishLegalBench: A Comprehensive Multi-Task Benchmark Suite for Turkish Legal NLP
Description
TurkishLegalBench is a large-scale, multi-task benchmark suite specifically designed for Natural Language Processing (NLP) in the Turkish legal domain. The dataset consists of 38,009 authentic legal documents curated from Turkish high courts. It covers seven distinct legal tasks categorized under three pillars: Classification (TurkVerdict, TurkVenue, TurkCanon), Extraction (TurkChronos, TurkCite), and Reasoning (TurkCoherence, TurkAudit). This resource aims to bridge the gap in low-resource legal NLP for Turkish and provides a standardized evaluation framework for researchers.
Methods
Under Review: This repository contains the official dataset and codebase for the paper "TurkLexBench: A Comprehensive Multi-Task Benchmark Suite for Turkish Legal NLP", currently under review for KDD 2026 (Datasets & Benchmarks Track). While the data is open for reproducibility, please cite the work if you use it.
The Tasks (The 3 Pillars)
We organize the benchmark into three pillars representing different levels of legal cognition:
I. The Gavel (High-Level Classification)
| Task | Description | Metric | Size (Train/Dev/Test) |
|---|---|---|---|
| TurkVerdict | Predict the judgment outcome (e.g., Affirmation, Reversal) from the case rationale. | Macro-F1 | 18k / 2.2k / 2.2k |
| TurkVenue | Identify the competent court chamber (Daire) based on case facts (36 classes). | Macro-F1 | 19.2k / 2.7k / 5.5k |
| TurkCanon | Classify legislative documents into types (Law, Regulation, Decree, etc.). | Macro-F1 | 6.3k / 0.9k / 1.8k |
II. The Quill (Information Extraction)
| Task | Description | Metric | Size (Train/Dev/Test) |
|---|---|---|---|
| TurkChronos | Identify the decision year of a case amidst distractor dates. | Accuracy | 19.4k / 2.7k / 5.5k |
| TurkCite | Extract citations (Law No. & Article No.) from unstructured text (NER). | Entity F1 | 10.9k / 1.5k / 3.1k |
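As a rough illustration of how an entity-level F1 for TurkCite could be computed, the sketch below scores exact matches between gold and predicted citations. It assumes each citation is reduced to a `(law_no, article_no)` pair and that scores are micro-averaged over documents; both are assumptions, since the benchmark's official scorer is not described here, and the law and article numbers are purely illustrative.

```python
from typing import List, Set, Tuple

# Hypothetical representation of one citation: (law number, article number).
Citation = Tuple[str, str]


def entity_f1(gold: List[Set[Citation]], pred: List[Set[Citation]]) -> float:
    """Micro-averaged exact-match F1 over citation entities across documents."""
    tp = fp = fn = 0
    for gold_doc, pred_doc in zip(gold, pred):
        tp += len(gold_doc & pred_doc)   # citations found and correct
        fp += len(pred_doc - gold_doc)   # citations predicted but not in gold
        fn += len(gold_doc - pred_doc)   # gold citations that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative values only: two documents, one wrong article number.
gold = [{("6100", "353")}, {("5237", "53"), ("5271", "231")}]
pred = [{("6100", "353")}, {("5237", "53"), ("5271", "230")}]
score = entity_f1(gold, pred)
print(f"Entity F1: {score:.3f}")
```

Under exact matching, a citation with the right law number but the wrong article number counts as both a false positive and a false negative.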
III. The Scale (Legal Reasoning)
| Task | Description | Metric | Size (Train/Dev/Test) |
|---|---|---|---|
| TurkCoherence | Natural Language Inference (NLI) to check if the reasoning supports the verdict. | Macro-F1 | 4.9k / 0.7k / 1.4k |
| TurkAudit | Detect "legal hallucinations" and anachronistic citations (e.g., citing a 2016 law in 2010). | Weighted-F1 | 7k / 1k / 2k |
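The anachronism case in TurkAudit can be pictured with a simple rule: a citation is suspicious when the cited law was enacted after the decision year. The sketch below only illustrates that label definition; the function name and the `(name, enactment_year)` input format are hypothetical, and the benchmark itself frames TurkAudit as a learned detection task rather than a rule lookup.

```python
from typing import List, Tuple


def anachronistic_citations(
    decision_year: int, citations: List[Tuple[str, int]]
) -> List[str]:
    """Return the names of cited laws enacted after the decision year."""
    return [name for name, enacted in citations if enacted > decision_year]


# Mirrors the example in the table: a 2016 law cited in a 2010 decision.
flagged = anachronistic_citations(2010, [("Law A", 2004), ("Law B", 2016)])
print(flagged)
```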
Benchmark Results
We evaluated baseline and domain-adapted models on the test set of all seven tasks. The table below reports the primary metric for each task: Macro-F1 for Verdict, Venue, Canon, and Coherence; Accuracy for Chronos; Entity F1 for Cite; and Weighted-F1 for Audit.
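To make the difference between Macro-F1 and Weighted-F1 concrete, here is a minimal pure-Python sketch. It follows the usual definitions (unweighted mean of per-class F1 versus a mean weighted by class support); the toy labels are illustrative, and the official evaluation scripts may differ in details such as label handling.

```python
from collections import Counter
from typing import Dict, Sequence


def per_class_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """F1 for every label that appears in the gold or predicted labels."""
    scores = {}
    for label in set(y_true) | set(y_pred):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1: rare classes count as much as common ones."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Per-class F1 averaged with weights proportional to gold support."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * n for label, n in support.items()) / len(y_true)


# A classifier that always predicts the majority class.
y_true = ["Affirmation"] * 4 + ["Reversal"]
y_pred = ["Affirmation"] * 5
print(f"Macro-F1:    {macro_f1(y_true, y_pred):.3f}")
print(f"Weighted-F1: {weighted_f1(y_true, y_pred):.3f}")
```

On imbalanced data the majority-class predictor scores much lower under Macro-F1 than under Weighted-F1, which is why the choice of metric matters when comparing columns of the results table.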
| Model | Verdict (m-F1) | Venue (m-F1) | Canon (m-F1) | Chronos (Acc) | Cite (F1) | Coherence (m-F1) | Audit (W-F1) |
|---|---|---|---|---|---|---|---|
| TFIDF-SVM | 68.4 | 41.7 | 35.8 | 56.0 | 76.4 | - | 87.6 |
| BERTurk | 81.9 | 79.4 | 92.6 | 94.1 | 95.5 | 56.8 | 96.9 |
| Legal-BERT (Eng) | 68.1 | 63.1 | 91.1 | 92.3 | 95.5 | 50.1 | 95.1 |
| XLM-RoBERTa | 79.2 | 64.4 | 70.4 | 70.2 | 95.4 | 33.7 | 96.6 |
| Longformer | 69.3 | 59.9 | 92.6 | 93.7 | 94.5 | 43.9 | 98.5 |
| BERT-TR-128k | 82.6 | 82.2 | 93.8 | 94.0 | 96.0 | 50.4 | 97.6 |
- The BERT-TR-128k model (with expanded vocabulary) achieves the best score on 4 of the 7 tasks (Verdict, Venue, Canon, and Cite), outperforming standard BERTurk on TurkCite (96.0 vs. 95.5) and most clearly on the 36-class TurkVenue task (82.2 vs. 79.4).
- Longformer achieves the highest TurkAudit score (98.5%), which suggests that large context windows help on "needle-in-a-haystack" retrieval tasks where the anomaly appears late in the document.
- All models struggle with the TurkCoherence (NLI) task (best score: 56.8 Macro-F1, by BERTurk), indicating that the evaluated models are better at surface-level pattern matching than at deep legal logical entailment.
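The per-task winners behind these observations can be recomputed directly from the results table. The snippet below is a small sketch that transcribes the table's scores into a dictionary (the `-` entry becomes `None`) and counts how often each model is best; the variable names are illustrative.

```python
from collections import Counter

tasks = ["Verdict", "Venue", "Canon", "Chronos", "Cite", "Coherence", "Audit"]

# Scores transcribed from the results table; None marks the missing entry.
results = {
    "TFIDF-SVM":        [68.4, 41.7, 35.8, 56.0, 76.4, None, 87.6],
    "BERTurk":          [81.9, 79.4, 92.6, 94.1, 95.5, 56.8, 96.9],
    "Legal-BERT (Eng)": [68.1, 63.1, 91.1, 92.3, 95.5, 50.1, 95.1],
    "XLM-RoBERTa":      [79.2, 64.4, 70.4, 70.2, 95.4, 33.7, 96.6],
    "Longformer":       [69.3, 59.9, 92.6, 93.7, 94.5, 43.9, 98.5],
    "BERT-TR-128k":     [82.6, 82.2, 93.8, 94.0, 96.0, 50.4, 97.6],
}

best = {}
for i, task in enumerate(tasks):
    scored = {m: s[i] for m, s in results.items() if s[i] is not None}
    top = max(scored.values())
    # Keep every model that reaches the top score, in case of ties.
    best[task] = sorted(m for m, v in scored.items() if v == top)

wins = Counter(model for models in best.values() for model in models)
for task in tasks:
    print(f"{task:<10} {', '.join(best[task])}")
print(dict(wins))
```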
License
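The terms below are those of a Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) license. Under this license, you are free to: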
- Share — copy and redistribute the material in any medium or format.
- Adapt — remix, transform, and build upon the material.
Under the following terms:
- Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made.
- NonCommercial — You may not use the material for commercial purposes.
- ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
Note: The underlying raw texts (court decisions and laws) are public records. This license applies to the curated benchmark, annotations, and structured dataset created by the authors.
Citation
If you use TurkLexBench (data, code, or models) in your research, please cite our paper:
@inproceedings{erkan2026turklexbench,
  title={TurkLexBench: A Comprehensive Multi-Task Benchmark Suite for Turkish Legal NLP},
  author={Erkan, Mehmet Ali and Yozgatlıgil, Ceylan},
  booktitle={Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '26)},
  year={2026},
  publisher={ACM},
  doi={10.5281/zenodo.18555735},
  url={https://doi.org/10.5281/zenodo.18555735},
  note={Under Review}
}
Files
TurkAudit.zip
Total size of all files: 267.7 MB
| MD5 | Size |
|---|---|
| 98e12a4de602e23c6c941626bee03aeb | 17.1 MB |
| 7db311ec7ba4367ad8b45159d5afbad0 | 62.6 MB |
| a58b448efd845fb0e7e32d5ca7c22d34 | 48.7 MB |
| 36d9a27ab6145b559db60a360d396eef | 43.5 MB |
| 5f079ce006697c78ba896413512f01c8 | 17.0 MB |
| e536bcaaf577cc13cea8edd1a9bde9d3 | 41.3 MB |
| 3aeb3eca2944c302b5223055191a0e80 | 37.5 MB |
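The MD5 checksums listed above can be used to confirm that a downloaded archive is intact. The following is a minimal sketch using Python's standard `hashlib`; the helper names are hypothetical, and because this listing does not pair checksums with file names, the caller has to supply the expected value for each archive.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_md5: str) -> bool:
    """Return True if the file's MD5 matches the published checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Example call (path and pairing with a checksum are up to the reader):
# verify_download("some_archive.zip", "<md5 from the file list>")
```

Reading in chunks keeps memory use flat even for the larger archives.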
Additional details
Software
- Repository URL: https://github.com/mrkn7/TurkishLegalBench
- Programming language: Python