Published January 3, 2026 | Version v4
Dataset Open

NoorGhateh: A Benchmark Dataset for Training and Evaluating Arabic Morphological Analysis Systems

  • 1. Iran University of Science and Technology

Description

Noor-Ghateh: A Benchmark Dataset for Evaluating Arabic Word Segmentation Tools in the Hadith Domain

📘 Overview

Noor-Ghateh is a manually annotated Classical Arabic morphological dataset derived from the jurisprudential text Sharayeʿ al-Islam.
The dataset provides fine-grained clitic segmentation, 15 morphological attributes, and gold-standard human annotation, making it a valuable benchmark for:

  • Morphological analyzers

  • Segmentation systems

  • Lemmatizers & root extractors

  • Classical Arabic NLP research

  • Benchmarking domain sensitivity across analyzers

The full dataset contains 223,690 tokens; a 313-token sample is publicly released in XML, JSON, and CSV-embedded-XML formats.

🧱 Data Format

1. XML Format (Primary)

The XML structure uses a <Base> → <Root> → <word> hierarchy.
Each <word> element includes 15 morphological attributes:

  • Seq — morpheme order

  • Slice — surface form

  • Affix — prefix/suffix/stem

  • Pos, Lemma, Case, Categ, DervT, Num, Root

  • TOV, Time, Voic, Kol, Lang
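A minimal sketch of reading this hierarchy with Python's standard library follows. It assumes that the fields are stored as XML attributes on <word>, that each <Root> groups the morphemes of one token, and that Affix holds literal values such as "prefix" and "stem"; none of this is confirmed by the description above. The sample fragment and its Pos values are invented for illustration and are not taken from the dataset.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, NOT taken from the dataset. It only mimics the
# <Base> -> <Root> -> <word> nesting and a few of the attribute names above.
SAMPLE = """<Base>
  <Root>
    <word Seq="2" Slice="قال" Affix="stem" Pos="VERB" Lemma="قال" Root="قول"/>
    <word Seq="1" Slice="و" Affix="prefix" Pos="CONJ"/>
  </Root>
</Base>"""


def read_morphemes(xml_text):
    """Return one list of attribute dicts per <Root>, ordered by Seq."""
    base = ET.fromstring(xml_text)
    tokens = []
    for root_el in base.findall("Root"):
        morphemes = [dict(word.attrib) for word in root_el.findall("word")]
        # Seq is assumed to be an integer morpheme index stored as text.
        morphemes.sort(key=lambda m: int(m.get("Seq", 0)))
        tokens.append(morphemes)
    return tokens


tokens = read_morphemes(SAMPLE)
for morphemes in tokens:
    # Join the surface slices to view the segmentation of each token.
    print("+".join(m["Slice"] for m in morphemes))
```

Sorting by Seq keeps the morpheme order stable even if the <word> elements are not stored in sequence.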

2. JSON Format

Direct JSON mapping of the XML hierarchy for machine learning pipelines.
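Because the exact key layout of the JSON file is not documented here, the sketch below avoids hard-coding it: it walks whatever structure is loaded and collects every object that carries a Slice key. The sample string is hypothetical and only illustrates one possible mapping of the XML hierarchy.

```python
import json

# Hypothetical JSON, NOT taken from the dataset; the real key layout may differ.
SAMPLE_JSON = """
{"Base": {"Root": [{"word": [
    {"Seq": "1", "Slice": "و", "Affix": "prefix"},
    {"Seq": "2", "Slice": "قال", "Affix": "stem"}
]}]}}
"""


def iter_morphemes(node):
    """Yield every dict that has a 'Slice' key, in document order."""
    if isinstance(node, dict):
        if "Slice" in node:
            yield node
        for value in node.values():
            yield from iter_morphemes(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_morphemes(item)


records = list(iter_morphemes(json.loads(SAMPLE_JSON)))
print(len(records), "morpheme records found")
```

The resulting flat list of dicts can be passed to a feature extractor or a data-frame library without further knowledge of the nesting.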

3. CSV-embedded XML

Each row contains:
Surface form — Segmented form — XML annotation block
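The row layout above can be read with the standard csv and xml modules, as in the following sketch. It assumes comma-separated rows without a header, exactly three columns in the order given, and an XML block whose <word> elements carry a Slice attribute; the sample row is generated in code and is not taken from the dataset.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Build one hypothetical row in memory so CSV quoting of the XML is correct.
xml_block = '<Root><word Seq="1" Slice="و"/><word Seq="2" Slice="قال"/></Root>'
buffer = io.StringIO()
csv.writer(buffer).writerow(["وقال", "و+قال", xml_block])
buffer.seek(0)


def read_rows(handle):
    """Parse rows of (surface form, segmented form, XML annotation block)."""
    rows = []
    for surface, segmented, xml_text in csv.reader(handle):
        annotation = ET.fromstring(xml_text)
        slices = [word.get("Slice") for word in annotation.iter("word")]
        rows.append({"surface": surface, "segmented": segmented, "slices": slices})
    return rows


rows = read_rows(buffer)
for row in rows:
    # Sanity check: do the annotated slices concatenate back to the surface form?
    print(row["surface"], "".join(row["slices"]) == row["surface"])
```

For a real file, the handle would come from `open(path, newline="", encoding="utf-8")`; the encoding is an assumption.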

🎯 Intended Use Cases

  • Training and evaluating morphological segmentation systems

  • Testing Classical Arabic analyzers (Farasa, CAMeL Tools, ALP, MADAMIRA)

  • Building lemmatizers and root extractors

  • Domain-sensitivity analysis

  • Digital humanities research in Hadith & jurisprudence

  • Linguistic studies of Classical Arabic morphology
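For the segmentation-evaluation use case, one common way to score a tool against gold annotation is boundary-level precision, recall, and F1. The sketch below is a generic illustration of that metric, not the evaluation protocol of the dataset authors. The function names and the toy gold and predicted segmentations are hypothetical.

```python
def boundaries(segments):
    """Character offsets at which one segment ends and the next begins."""
    offsets, position = set(), 0
    for segment in segments[:-1]:
        position += len(segment)
        offsets.add(position)
    return offsets


def boundary_prf(gold, predicted):
    """Micro-averaged boundary precision, recall, and F1 over aligned words."""
    tp = fp = fn = 0
    for gold_segments, pred_segments in zip(gold, predicted):
        gold_b, pred_b = boundaries(gold_segments), boundaries(pred_segments)
        tp += len(gold_b & pred_b)
        fp += len(pred_b - gold_b)
        fn += len(gold_b - pred_b)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: the second word is over-segmented by the hypothetical tool.
gold = [["و", "قال"], ["كتاب"]]
predicted = [["و", "قال"], ["ك", "تاب"]]
precision, recall, f1 = boundary_prf(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

The offsets are computed on the segmented strings, so the comparison is only meaningful when gold and predicted segments concatenate to the same word form.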

Files

NoorGhateh_v3.0.csv

Files (315.8 kB)

Name Size
md5:03cfe1d66be1e77223a9d4236d804209 104.9 kB
md5:16f8f2a5ba245778ef8cc0acd94c95bb 112.7 kB
md5:689382f3d4277981a3f2f9b84cf7ad36 98.2 kB
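A downloaded file can be checked against the MD5 values listed above. The following sketch uses Python's hashlib; the helper names are hypothetical. Since the listing does not say which checksum belongs to which file, it only tests whether a file's digest matches any of the three.

```python
import hashlib

# Checksums copied from the file listing above.
LISTED_MD5 = {
    "03cfe1d66be1e77223a9d4236d804209",
    "16f8f2a5ba245778ef8cc0acd94c95bb",
    "689382f3d4277981a3f2f9b84cf7ad36",
}


def md5_of(path, chunk_size=1 << 16):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path):
    """Return True if the file's MD5 equals one of the listed checksums."""
    return md5_of(path) in LISTED_MD5
```

For example, `matches_listing("NoorGhateh_v3.0.csv")` would check a local copy, assuming the file was saved under that name in the working directory.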

Additional details

Related works

Is described by
Dataset: arXiv:2307.09630 (arXiv)

Dates

Created
2020-01-01