Published February 28, 2026 | Version 3.0.0
Dataset Open

UzLegalNER v3_fixed: Uzbek Legal Contracts Named Entity Recognition Dataset (PER/ORG/LOC/POSITION/DATE/MONEY/DOCNO)

  • 1. Novosibirsk State University

Description

UzLegalNER v3_fixed is a named entity recognition (NER) dataset for Uzbek legal contracts and related official documents. The dataset uses a seven-label schema: PER, ORG, LOC, POSITION, DATE, MONEY, DOCNO. We release: (i) a master spreadsheet (XLSX) with sentence-level metadata and character-level entity spans, (ii) a JSONL version with span annotations, and (iii) CoNLL BIO splits (train/dev/test) for standard NER training and benchmarking.
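The CoNLL BIO splits can be read with a few lines of Python. The sketch below assumes the common layout of one token per line, whitespace-separated columns with the tag in the last column, and a blank line between sentences; the record does not spell out the exact column layout, so adjust if the files differ. The sample tokens are made up for illustration and are not taken from the dataset.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # list of (token, BIO tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-style lines into sentences of (token, tag) pairs.

    Assumption: token in the first column, tag in the last column,
    blank line between sentences.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("-DOCSTART-"):
            # Skip document markers, which some CoNLL files contain.
            continue
        parts = line.split()
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


# Illustrative input only (not real dataset content).
sample = """Toshkent B-LOC
shahri I-LOC
, O

2024 B-DATE
yil I-DATE
"""
sentences = read_conll(sample.splitlines())
```

For a real split, pass an open file handle instead of `sample.splitlines()`.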

Key fields: sent_id (unique per sentence), doc_id (document/group identifier for doc-level splitting), doc_type, script (latin), split, text, and entities (start/end/label/text). For CoNLL compatibility, overlapping and nested spans are resolved by retaining only the longest span.
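The longest-span rule and the character-level span fields can be illustrated with a short sketch. It is one possible reading of the rule, not the authors' released code: it assumes `end` is exclusive (so `text[start:end]` equals the entity `text`) and breaks length ties by earlier start, neither of which the record states. `make_span` is a hypothetical helper and the example sentence is made up.

```python
from typing import Any, Dict, List

Entity = Dict[str, Any]  # keys follow the record: start, end, label, text


def make_span(text: str, surface: str, label: str) -> Entity:
    """Hypothetical helper: build an entity dict from a surface string."""
    start = text.index(surface)
    return {"start": start, "end": start + len(surface),
            "label": label, "text": surface}


def keep_longest_spans(entities: List[Entity]) -> List[Entity]:
    """Greedily keep the longest spans and drop any span overlapping them."""
    # Longest first; ties go to the span that starts earlier (assumption).
    ordered = sorted(entities,
                     key=lambda e: (-(e["end"] - e["start"]), e["start"]))
    kept: List[Entity] = []
    for ent in ordered:
        # Keep a span only if it is disjoint from every span kept so far.
        if all(ent["end"] <= k["start"] or ent["start"] >= k["end"]
               for k in kept):
            kept.append(ent)
    return sorted(kept, key=lambda e: e["start"])


def spans_align(text: str, entities: List[Entity]) -> bool:
    """Check that each span's offsets reproduce its stored surface text."""
    return all(text[e["start"]:e["end"]] == e["text"] for e in entities)


# Illustrative record only (not real dataset content).
text = "Toshkent shahar hokimligi"
record = {
    "sent_id": "s1",
    "text": text,
    "entities": [
        make_span(text, "Toshkent", "LOC"),                   # nested span
        make_span(text, "Toshkent shahar hokimligi", "ORG"),  # longest span
    ],
}
resolved = keep_longest_spans(record["entities"])
```

Here the nested LOC span is dropped and only the ORG span survives, which is what a flat BIO tagging can represent.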

Intended use: training and evaluating Transformer-based NER models and gazetteer-enhanced methods, with a particular focus on robustness to unseen entity surface forms in legal text.
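One simple way to quantify exposure to unseen entity surface forms is to count test mentions whose (label, text) pair never occurs in the training split. The sketch below assumes records shaped like the JSONL fields described above and matches surface forms case-insensitively; both the matching rule and the function name are illustrative choices, not something the dataset prescribes. The toy records are made up.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]
Mention = Tuple[str, str]  # (label, case-folded surface form)


def unseen_surface_forms(train: List[Record],
                         test: List[Record]) -> Tuple[List[Mention], float]:
    """Return test mentions absent from train and their share of all test mentions."""
    seen = {(e["label"], e["text"].casefold())
            for rec in train for e in rec["entities"]}
    test_mentions = [(e["label"], e["text"].casefold())
                     for rec in test for e in rec["entities"]]
    unseen = [m for m in test_mentions if m not in seen]
    # Guard against an empty test set to avoid division by zero.
    rate = len(unseen) / len(test_mentions) if test_mentions else 0.0
    return unseen, rate


# Toy records for illustration only (not real dataset content).
train_records = [
    {"entities": [{"label": "ORG", "text": "Adliya vazirligi"}]},
]
test_records = [
    {"entities": [{"label": "ORG", "text": "adliya vazirligi"},
                  {"label": "PER", "text": "A. Karimov"}]},
]
unseen, rate = unseen_surface_forms(train_records, test_records)
```

Reporting model scores separately on the unseen subset gives a rough view of how well a model generalizes beyond memorized entity strings.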

Files

changelog.md

Files (473.2 kB)

Checksum Size
md5:145de68609e49ec881b087e67a97547b 416 Bytes
md5:b2ce03e50accc4f4f0e6b7552b36847e 463 Bytes
md5:3609fb9574c55dddbb0d076aa5300559 2.0 kB
md5:f5a1dd0ec6dc3844c2df36d15e76032f 453 Bytes
md5:281fb88dc31452580f9cf3108341cebe 2.3 kB
md5:373ec995c0ec5689c046278f194d24c1 467.6 kB
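The MD5 checksums listed above can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib`; the function names are hypothetical, and the demo file is created locally only to show the call pattern.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path: Path, listed: str) -> bool:
    """Compare a file against a checksum written as 'md5:<hex>' or plain hex."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Demo on a throwaway local file (not one of the dataset files).
demo = Path("uzlegalner_md5_demo.bin")
demo.write_bytes(b"hello")
demo_ok = matches_listing(demo, "md5:" + hashlib.md5(b"hello").hexdigest())
demo.unlink()
```

To check a real download, call `matches_listing` with the local path and the corresponding `md5:` value from the listing.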

Additional details

Software

Programming language: Python
Development status: Active