NoorGhateh: A Benchmark Dataset for Training and Evaluating Arabic Morphological Analysis Systems
Description
Noor-Ghateh: A Benchmark Dataset for Evaluating Arabic Word Segmentation Tools in the Hadith Domain
📘 Overview
Noor-Ghateh is a manually annotated Classical Arabic morphological dataset derived from the jurisprudential text Sharayeʿ al-Islam.
The dataset provides fine-grained clitic segmentation, 15 morphological attributes, and gold-standard human annotation, making it a valuable benchmark for:
- Morphological analyzers
- Segmentation systems
- Lemmatizers & root extractors
- Classical Arabic NLP research
- Benchmarking domain sensitivity across analyzers
The full dataset contains 223,690 tokens. A 313-token sample is publicly available in XML, JSON, and CSV-embedded XML formats.
🧱 Data Format
1. XML Format (Primary)
The XML structure uses a <Base> → <Root> → <word> hierarchy.
Each <word> element includes 15 morphological attributes:
- Seq: morpheme order
- Slice: surface form
- Affix: prefix/suffix/stem
- Pos, Lemma, Case, Categ, DervT, Num, Root
- TOV, Time, Voic, Kol, Lang
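As a rough illustration of how this hierarchy might be read, the sketch below parses a small hand-written fragment with Python's standard library. Only the element names and the Seq, Slice, and Affix attribute names come from the description above. The attribute values, the assumption that each <Root> groups the morphemes of one token, and the helper name `read_segments` are hypothetical and may not match the released files.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: attribute values and the grouping of <word>
# elements under <Root> are illustrative assumptions, not dataset content.
SAMPLE = """
<Base>
  <Root>
    <word Seq="1" Slice="و" Affix="prefix"/>
    <word Seq="2" Slice="قال" Affix="stem"/>
  </Root>
</Base>
"""


def read_segments(xml_text):
    """Return one list of attribute dicts per <Root>, ordered by Seq."""
    base = ET.fromstring(xml_text.strip())
    groups = []
    for root_el in base.iter("Root"):
        morphemes = [dict(w.attrib) for w in root_el.iter("word")]
        # Seq is described as the morpheme order, so sort on it numerically.
        morphemes.sort(key=lambda m: int(m.get("Seq", 0)))
        groups.append(morphemes)
    return groups


groups = read_segments(SAMPLE)
# Joining the Slice values in Seq order rebuilds the surface string.
surface = "".join(m["Slice"] for m in groups[0])
```

If the real files store the features as child elements instead of XML attributes, `w.attrib` would need to be replaced by a loop over the children of each <word>.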
2. JSON Format
A direct JSON mapping of the XML hierarchy, intended for machine learning pipelines.
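One way such a mapping could be produced is sketched below. The key names `tag`, `attrs`, and `children`, the helper `element_to_dict`, and the sample fragment are all assumptions made for illustration. The JSON file shipped with the dataset may use a different layout.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(el):
    """Recursively map an element to a plain dict (hypothetical layout)."""
    return {
        "tag": el.tag,
        "attrs": dict(el.attrib),
        "children": [element_to_dict(child) for child in el],
    }


# Hypothetical fragment; the attribute values are illustrative only.
xml_text = '<Base><Root><word Seq="1" Slice="قال" Affix="stem"/></Root></Base>'
doc = element_to_dict(ET.fromstring(xml_text))

# ensure_ascii=False keeps the Arabic script readable in the output.
json_text = json.dumps(doc, ensure_ascii=False, indent=2)
```

Keeping the attributes in their own dict avoids name clashes between element names and attribute names, which matters here because Root appears as both.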
3. CSV-embedded XML
Each row contains three fields: the surface form, the segmented form, and the XML annotation block.
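A minimal sketch of reading rows in this layout follows. It assumes a comma delimiter, no header row, and a single well-formed XML element in the third field. The "+" separator in the segmented form, the sample row, and the helper name `read_rows` are hypothetical. The sample is written with `csv.writer` first so that the quotes inside the XML are escaped the way the `csv` module expects.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical row: values and the "+" separator are illustrative only.
xml_block = (
    '<Root><word Seq="1" Slice="و" Affix="prefix"/>'
    '<word Seq="2" Slice="قال" Affix="stem"/></Root>'
)
buffer = io.StringIO()
csv.writer(buffer).writerow(["وقال", "و+قال", xml_block])
buffer.seek(0)


def read_rows(handle):
    """Yield (surface, segmented, morphemes) for each three-field row."""
    for surface, segmented, annotation in csv.reader(handle):
        element = ET.fromstring(annotation)
        morphemes = [dict(w.attrib) for w in element.iter("word")]
        yield surface, segmented, morphemes


rows = list(read_rows(buffer))
```

For a file on disk, the same function could be given a handle such as `open("NoorGhateh_v3.0.csv", newline="", encoding="utf-8")`, with the UTF-8 encoding being a further assumption.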
🎯 Intended Use Cases
- Training and evaluating morphological segmentation systems
- Testing Classical Arabic analyzers (Farasa, CAMeL Tools, ALP, MADAMIRA)
- Building lemmatizers and root extractors
- Domain-sensitivity analysis
- Digital humanities research in Hadith & jurisprudence
- Linguistic studies of Classical Arabic morphology
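For the segmentation use case, one common way to score a system against gold annotation is boundary-level precision, recall, and F1. The sketch below is a generic illustration and not necessarily the metric used by the dataset authors. The function names and the two example words are hypothetical, and the code assumes the gold and predicted segments of each word concatenate to the same surface string.

```python
def boundaries(segments):
    """Character offsets at which one segment ends and the next begins."""
    offsets, position = set(), 0
    for segment in segments[:-1]:
        position += len(segment)
        offsets.add(position)
    return offsets


def boundary_prf(gold_words, predicted_words):
    """Micro-averaged boundary precision, recall, and F1 over aligned words."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_words, predicted_words):
        g, p = boundaries(gold), boundaries(pred)
        tp += len(g & p)   # boundaries the system found correctly
        fp += len(p - g)   # boundaries the system invented
        fn += len(g - p)   # gold boundaries the system missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: the system misses the suffix split in the second word.
gold = [["و", "قال"], ["كتاب", "ه"]]
pred = [["و", "قال"], ["كتابه"]]
scores = boundary_prf(gold, pred)
```

Scoring boundaries instead of whole words gives partial credit when a system finds some but not all clitic splits in a token.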
Files
NoorGhateh_v3.0.csv
Additional details
Related works
- Is described by
- Dataset: arXiv:2307.09630 (arXiv)
Dates
- Created: 2020-01-01