Mkulima: A Domain-Specific Swahili Corpus for Agricultural NLP and Information Retrieval
Description
Mkulima (Swahili for "farmer") is the first large-scale domain-specific Swahili corpus for agricultural natural language processing (NLP) and information retrieval. The corpus comprises 4,021 documents totalling 35.2 million characters (5.4 million words) drawn from five distinct source types spanning over three decades (1995–2026): government reports, FAO publications, agricultural extension materials, news articles, and agricultural blogs.
The corpus covers 25 agricultural domains, including staple crops (mahindi, mpunga, muhogo), cash crops (kahawa, korosho, pamba), livestock (mifugo, kuku, ufugaji), fisheries (samaki, uvuvi), and agricultural inputs (mbolea, mbegu, umwagiliaji). The vocabulary contains 137,980 unique word types, giving the corpus a type-token ratio (TTR) of 0.0272.
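For readers unfamiliar with the statistic, TTR is the number of unique word types divided by the total number of word tokens. The sketch below shows that calculation on a toy sentence. The tokenizer (lowercasing, splitting on runs of letters, keeping word-internal apostrophes as in "ng'ombe") is an assumption made for illustration and may differ from the one used to produce the reported figure.

```python
import re

def type_token_ratio(text: str) -> float:
    """Return unique word types divided by total word tokens."""
    # Assumed tokenizer: lowercase, runs of letters, word-internal apostrophes kept.
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Toy sentence (not taken from the corpus): 7 tokens, 5 distinct types.
sample = "Mkulima analima mahindi na mkulima anavuna mahindi"
print(round(type_token_ratio(sample), 4))  # 0.7143
```

TTR generally falls as a corpus grows, because frequent words repeat faster than new types appear, so a low value is expected for a multi-million-word corpus.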
This release includes:
- mkulima_all.jsonl: Full corpus (4,021 documents)
- mkulima_offline.jsonl: Offline documents only (653 documents extracted from PDF and DOCX files)
- mkulima_online.jsonl: Web-scraped documents only (3,368 documents)
- queries.jsonl: 50 Swahili agricultural retrieval queries
- qrels.tsv: Binary relevance judgments (0/1) for the retrieval benchmark
- README.md: Full documentation, schema description, and usage instructions
Each document follows a standardized JSONL schema with fields: doc_id, title, source, source_type, date, language, file_type, text, url, original_filename, word_count.
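A minimal sketch of how the JSONL files might be read and checked against the field list above. It assumes one JSON object per line. The helper name `parse_documents` and the demo record are hypothetical, and the field types are not validated because the listing does not state them.

```python
import json
from typing import Iterable, Iterator

# Field names as listed in the schema description.
EXPECTED_FIELDS = {
    "doc_id", "title", "source", "source_type", "date", "language",
    "file_type", "text", "url", "original_filename", "word_count",
}

def parse_documents(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line, checking for missing fields."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        doc = json.loads(line)
        missing = EXPECTED_FIELDS - doc.keys()
        if missing:
            raise ValueError(f"line {line_number}: missing fields {sorted(missing)}")
        yield doc

# Hypothetical record used only to exercise the parser.
demo = {name: "" for name in EXPECTED_FIELDS}
demo.update({"doc_id": "demo-1", "language": "sw", "word_count": 0})
for doc in parse_documents([json.dumps(demo)]):
    print(doc["doc_id"], doc["language"])
```

To read a downloaded file, pass an open file handle, for example `parse_documents(open("mkulima_all.jsonl", encoding="utf-8"))`.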
Two benchmark evaluations accompany the corpus:
(1) Domain language modeling: fine-tuning AfroXLMR-base on Mkulima reduces perplexity by 8.8% (3.44 → 3.13) on agricultural Swahili text.
(2) Agricultural document retrieval: BM25 over 50 Swahili queries achieves nDCG@10 = 0.6041, MAP = 0.5870, MRR = 0.7463.
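The retrieval metrics above can be illustrated for a single query with binary relevance judgments. This is a sketch using the standard definitions, not the authors' evaluation code. The listing does not say which tool produced the reported scores, and the document IDs below are hypothetical. MAP and MRR are the means of average precision and reciprocal rank over all queries.

```python
import math
from typing import Sequence, Set

def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@rank over the ranks where a relevant document appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """nDCG@k with binary gains and a log2(rank + 1) discount."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Hypothetical ranking and judgments for one query.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(reciprocal_rank(ranked, relevant))    # 0.5
print(average_precision(ranked, relevant))  # 0.5
print(round(ndcg_at_k(ranked, relevant), 4))
```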
This corpus is introduced in the paper "Mkulima: A Domain-Specific Swahili Corpus for Agricultural NLP and Information Retrieval" submitted to CIKM 2026.
Files (74.1 MB)
| MD5 checksum | Size |
|---|---|
| 2f61953ff4784e7d5b29c6ebc3d47e4e | 36.9 MB |
| 5603a98c3a9b31e846bfb088a9a4e41c | 25.9 MB |
| c5ad662ca43b844fc55b62c1301398c1 | 11.0 MB |
| a25610cd00a018838ddb3c918151f003 | 216.0 kB |
| fb33f8d200170db9c9f8600729222d86 | 3.3 kB |
| 3040a2ba898e1ee7730bbc54eb01312b | 1.0 kB |
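The MD5 checksums listed above can be used to confirm that a download is intact. The sketch below hashes a file in chunks with Python's standard `hashlib` module. The helper names are hypothetical, and since this listing does not show which checksum belongs to which file, pair each file with the value shown beside it on the record page.

```python
import hashlib

def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_checksum(path: str, expected_md5: str) -> bool:
    """Compare a file's digest with a listed value, with or without the 'md5:' prefix."""
    expected = expected_md5.strip().lower().removeprefix("md5:")
    return md5_of_file(path) == expected
```

For example, `matches_checksum("mkulima_all.jsonl", "md5:...")` returns `True` when the file matches the value copied from the record page in place of the `...`. The `str.removeprefix` call requires Python 3.9 or later.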