Published June 4, 2026 | Version v1

Mkulima: A Domain-Specific Swahili Corpus for Agricultural NLP and Information Retrieval

  • 1. Hanyang University
  • 2. Sokoine University of Agriculture

Description

Mkulima (Swahili for "farmer") is the first large-scale domain-specific Swahili corpus for agricultural natural language processing (NLP) and information retrieval. The corpus comprises 4,021 documents totalling 35.2 million characters (5.4 million words) drawn from five distinct source types spanning over three decades (1995–2026): government reports, FAO publications, agricultural extension materials, news articles, and agricultural blogs.

The corpus covers 25 agricultural domains including staple crops (mahindi, mpunga, muhogo), cash crops (kahawa, korosho, pamba), livestock (mifugo, kuku, ufugaji), fisheries (samaki, uvuvi), and agricultural inputs (mbolea, mbegu, umwagiliaji). The vocabulary contains 137,980 unique word types with a Type-Token Ratio (TTR) of 0.0272.
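For readers unfamiliar with the measure, the Type-Token Ratio is the number of unique word types divided by the total number of tokens. The sketch below shows that calculation on two short made-up Swahili phrases. The lowercased whitespace tokenization and the function name `type_token_ratio` are assumptions for illustration only; the tokenizer behind the reported figure is not stated here, so this sketch will not necessarily reproduce 0.0272 on the corpus.

```python
def type_token_ratio(texts):
    """Return (types, tokens, ttr) for an iterable of strings.

    Assumption: tokens are lowercased, whitespace-separated strings.
    """
    types = set()
    tokens = 0
    for text in texts:
        words = text.lower().split()
        tokens += len(words)
        types.update(words)
    ttr = len(types) / tokens if tokens else 0.0
    return len(types), tokens, ttr


# Illustrative input only, not taken from the corpus.
sample = ["mbegu bora za mahindi", "mbolea na mbegu za mpunga"]
n_types, n_tokens, ttr = type_token_ratio(sample)
print(n_types, n_tokens, round(ttr, 4))
```

TTR falls as a corpus grows, because frequent words keep repeating while new types appear more slowly, so values are only comparable between corpora of similar size.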

This release includes:
- mkulima_all.jsonl: Full corpus (4,021 documents)
- mkulima_offline.jsonl: Offline documents only (653 documents from PDF and DOCX files)
- mkulima_online.jsonl: Web-scraped documents only (3,368 documents)
- queries.jsonl: 50 Swahili agricultural retrieval queries
- qrels.tsv: Binary relevance judgments (0/1) for the retrieval benchmark
- README.md: Full documentation, schema description, and usage instructions

Each document follows a standardized JSONL schema with fields: doc_id, title, source, source_type, date, language, file_type, text, url, original_filename, word_count.
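A minimal sketch of reading the corpus with the schema above, assuming one JSON object per line encoded as UTF-8. The helper names `iter_documents` and `load_corpus` are hypothetical and not part of the release. The field types are not specified in this description, so the check below only verifies that each field name is present.

```python
import json

# Field names as listed in the schema description.
EXPECTED_FIELDS = {
    "doc_id", "title", "source", "source_type", "date", "language",
    "file_type", "text", "url", "original_filename", "word_count",
}


def iter_documents(lines):
    """Yield one dict per non-empty JSONL line, checking field names."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        doc = json.loads(line)
        missing = EXPECTED_FIELDS - doc.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        yield doc


def load_corpus(path):
    """Read a whole JSONL file into a list of dicts."""
    with open(path, encoding="utf-8") as handle:
        return list(iter_documents(handle))
```

With the release downloaded, `docs = load_corpus("mkulima_all.jsonl")` should return one dict per document. For memory-constrained use, iterating over `iter_documents(handle)` directly avoids holding all documents at once.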

Two benchmark evaluations accompany the corpus:
(1) Domain language modeling: fine-tuning AfroXLMR-base on Mkulima reduces perplexity by 8.8% (3.44 → 3.13) on agricultural Swahili text.
(2) Agricultural document retrieval: BM25 over 50 Swahili queries achieves nDCG@10 = 0.6041, MAP = 0.5870, MRR = 0.7463.
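For reference, the three retrieval metrics can be computed per query from a ranked list of document IDs and the set of documents judged relevant (label 1 in the qrels). The sketch below uses the standard binary-relevance definitions. It is an illustration, not the evaluation code behind the reported scores, and the function names and the toy ranking are assumptions. It also assumes a ranked list contains no duplicate IDs.

```python
import math


def ndcg_at_k(ranked_ids, relevant, k=10):
    """nDCG@k with binary gains and a log2(rank + 1) discount."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def average_precision(ranked_ids, relevant):
    """Mean of precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranked_ids, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Toy example with made-up IDs: relevant documents sit at ranks 2 and 4.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
ndcg = ndcg_at_k(ranking, relevant)
ap = average_precision(ranking, relevant)
rr = reciprocal_rank(ranking, relevant)
```

MAP and MRR are the means of `average_precision` and `reciprocal_rank` over all queries, and the reported nDCG@10 is likewise an average over the 50 queries.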

This corpus is introduced in the paper "Mkulima: A Domain-Specific Swahili Corpus for Agricultural NLP and Information Retrieval" submitted to CIKM 2026.

Files

Files (74.1 MB)

Checksum                               Size
md5:2f61953ff4784e7d5b29c6ebc3d47e4e   36.9 MB
md5:5603a98c3a9b31e846bfb088a9a4e41c   25.9 MB
md5:c5ad662ca43b844fc55b62c1301398c1   11.0 MB
md5:a25610cd00a018838ddb3c918151f003   216.0 kB
md5:fb33f8d200170db9c9f8600729222d86   3.3 kB
md5:3040a2ba898e1ee7730bbc54eb01312b   1.0 kB
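
A downloaded file can be checked against its listed MD5 checksum with a short script. The sketch below is one way to do so using Python's standard `hashlib`; the helper names are hypothetical. The listing above gives checksums and sizes only, so which checksum belongs to which file should be confirmed on the record page before comparing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a listed value such as 'md5:2f61...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listed_checksum("mkulima_all.jsonl", "md5:<value from the record page>")` returns `True` when the download is intact. The value is left as a placeholder here because the file-to-checksum pairing is not shown in this listing.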