Published May 30, 2026 | Version v2
Dataset Open

Ajami Handwritten Text Recognition Dataset

  • 1. Uppsala University
  • 2. Murid Islamic Community in America, Inc. (MICA, Inc.)
  • 3. Ahmadu Bello University
  • 4. University of Maiduguri
  • 5. Bayero University Kano
  • 6. Stockholm University
  • 7. University of Southern Denmark
  • 8. Lund University

Description

Technical Details

This dataset contains images of Fulfulde, Hausa, and Wolof Ajami manuscripts, together with polygon coordinates for each page's segmentation (regions and lines) and a transcription of each line. There are 48 manuscripts (5 Fulfulde, 24 Hausa, 19 Wolof), totaling 713 pages and 10,875 lines. The manuscripts are sourced from Boston University's "The Four Languages" and "African Ajami Library", Arewa House at Ahmadu Bello University, and the Hill Museum and Manuscript Library (HMML). The majority of these manuscripts are poems. The Fulfulde manuscripts originally come from Guinea (Fuuta Jalon and Conakry) and Mali (Timbuktu), the Hausa manuscripts all come from Nigeria (Zaria and Kano), and the Wolof manuscripts all come from Touba, Senegal.

 

Data Structure

The dataset is organized first by language, then by transcription method. The gold-standard diplomatic transcriptions of all manuscripts (reproducing the text as it appears in the original manuscripts) are in the "manual" directory/folder for each manuscript. All other directories/folders for a manuscript contain automatic transcription attempts by various Arabic-script OCR/HTR models, which are of much lower quality. To access the ground-truth transcriptions, you only need to extract the "manual" folder for each manuscript.
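As a minimal sketch of that last step, the Python snippet below extracts only the entries that sit under a "manual" directory from one of the language archives. The exact nesting of folders inside each archive is not spelled out here, so the code assumes only that a path component named "manual" appears somewhere above each ground-truth file. The function names and the example paths are hypothetical.

```python
import zipfile
from pathlib import PurePosixPath


def manual_members(names, folder="manual"):
    """Return the archive paths that lie inside a directory named `folder`.

    Assumption: ground-truth files have a path component equal to `folder`
    somewhere above the file name; the exact depth does not matter.
    """
    selected = []
    for name in names:
        parts = PurePosixPath(name).parts
        # parts[:-1] are the parent directories; the last part is the entry itself,
        # so bare directory entries such as ".../manual/" are skipped.
        if folder in parts[:-1]:
            selected.append(name)
    return selected


def extract_manual(zip_path, dest, folder="manual"):
    """Extract only the `folder` subtrees of a zip archive into `dest`."""
    with zipfile.ZipFile(zip_path) as zf:
        members = manual_members(zf.namelist(), folder)
        zf.extractall(dest, members=members)
    return members
```

For example, `extract_manual("Fulfulde.zip", "ground_truth")` would write only the manual transcriptions and their sibling files to a `ground_truth` directory, leaving the automatic OCR/HTR outputs inside the archive.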

Files

Fulfulde.zip

Files (40.2 GB)

Checksum                                Size
md5:4621cd99a1edbb6cab22dd024bb4dd27    2.6 GB
md5:16a4453e6a8192c88b6a071bcb79e2d9    14.2 GB
md5:94bead235f1ebf305699e89e7c7b1697    23.4 GB
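Because the archives are large, it can be worth checking a download against the md5 checksums listed above before unpacking it. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and which checksum belongs to which archive is not shown in this listing, so that pairing needs to be confirmed on the record page.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A call such as `matches_listed_md5("Fulfulde.zip", "md5:<checksum from the listing>")` returns `True` only when the downloaded file is byte-for-byte identical to the published one.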

Additional details