Ajami Handwritten Text Recognition Dataset
Authors/Creators
Description
Technical Details
This dataset contains images of Fulfulde, Hausa, and Wolof Ajami manuscripts, together with the polygon coordinates for each page's segmentation (region and line) and a transcription of each line. There are 48 manuscripts (5 Fulfulde, 24 Hausa, 19 Wolof), totaling 713 pages and 10,875 lines. The manuscripts are sourced from Boston University's "The Four Languages" and "African Ajami Library", Arewa House at Ahmadu Bello University, and the Hill Museum and Manuscript Library (HMML). The majority of these manuscripts are poems. The Fulfulde manuscripts originally come from Guinea (Fuuta Jalon and Conakry) and Mali (Timbuktu), the Hausa manuscripts all come from Nigeria (Zaria and Kano), and the Wolof manuscripts all come from Touba, Senegal.
Data Structure
The dataset is organized first by language, then by transcription method. The gold-standard diplomatic transcriptions of all manuscripts (reproducing the text as it appears in the original manuscripts) are in the "manual" directory/folder for each manuscript. All other directories/folders for a manuscript contain automatic transcription attempts by various Arabic-script OCR/HTR models, which are of much lower quality. To access the ground-truth transcriptions, you only need to extract the "manual" folder for each manuscript.
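As a rough illustration of the layout described above, the following Python sketch collects every folder named "manual" under an extracted copy of the dataset and groups the results by top-level (language) folder. It assumes only that the gold-standard folders are literally named "manual" and that the first folder level below the root is the language; the exact nesting depth, the other folder names, and the root path are not specified here, so the helper names and any path passed in are hypothetical.

```python
from pathlib import Path


def find_manual_dirs(root):
    """Return every directory named 'manual' under root, sorted by path.

    Assumption: gold-standard transcriptions live in folders literally
    named 'manual'; the search is recursive because the exact nesting
    depth is not specified in the description.
    """
    root = Path(root)
    return sorted(p for p in root.rglob("manual") if p.is_dir())


def group_by_language(root, manual_dirs):
    """Group 'manual' directories by the first folder level below root.

    Assumption: the first level below root is the language folder.
    """
    root = Path(root)
    groups = {}
    for d in manual_dirs:
        language = d.relative_to(root).parts[0]
        groups.setdefault(language, []).append(d)
    return groups
```

For example, calling `find_manual_dirs` on the folder where the archive was extracted and passing the result to `group_by_language` should give one list of ground-truth folders per language, which can then be compared against the manuscript counts stated in the description. The automatic OCR/HTR output folders are skipped simply because they are not named "manual".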