Published June 13, 2022 | Version 0.4
Dataset Open

Kiswahili-Tz-Hansard

Creators

Description

This is a dataset of publicly available Tanzania Hansard documents in Kiswahili. It contains 2735 PNG images of pages from PDF documents, together with text files containing transcripts produced by the OCR tool tesseract-ocr. The images were obtained by rendering the PDF pages with ImageMagick. The dataset is intended for studying how improvements to language/word sequence modeling can improve OCR in a low-resource setting, and to serve as a record of the accuracy of pre-existing OCR tools that use language models, before any other methods are applied.
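The description names the tools but not the exact commands, so the following is only a minimal sketch of how such a PDF-to-PNG-to-text pipeline might look. The 300 dpi density, the output file naming, and the use of the Tesseract Swahili model (`swa`) are assumptions, not settings confirmed by the dataset authors. The helper names are hypothetical. `convert` is the legacy ImageMagick entry point; newer installations may expose it as `magick` instead.

```python
import subprocess
from pathlib import Path


def rasterize_command(pdf_path, out_dir, density=300):
    """Build an ImageMagick command that renders each PDF page to a PNG.

    The density (dpi) and the "-%04d" page suffix are illustrative assumptions.
    """
    out_pattern = str(Path(out_dir) / (Path(pdf_path).stem + "-%04d.png"))
    return ["convert", "-density", str(density), str(pdf_path), out_pattern]


def ocr_command(png_path, lang="swa"):
    """Build a tesseract command for one page image.

    tesseract appends ".txt" to the output base, so page.png yields page.txt.
    The "swa" language model is an assumption and must be installed separately.
    """
    out_base = str(Path(png_path).with_suffix(""))
    return ["tesseract", str(png_path), out_base, "-l", lang]


def run_pipeline(pdf_path, out_dir):
    """Render every page of one PDF, then OCR each resulting PNG."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_command(pdf_path, out_dir), check=True)
    for png in sorted(out_dir.glob("*.png")):
        subprocess.run(ocr_command(png), check=True)
```

Keeping command construction separate from execution makes it easy to inspect or log the exact invocation before running it on a large batch of documents.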

Files

Swahili-Tz-Hansard Datasheet-v0.4.pdf

Files (372.0 MB)

Checksum                                Size
md5:da26941a3d3a4d623739ceefd0f19c22    1.2 MB
md5:7e819a4ee597fb0565ac5666706a001b    370.7 MB
md5:79de112e612901716dd2684667adf979    75.1 kB
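
The listed MD5 checksums can be used to confirm that a download completed intact. The following is a minimal sketch using Python's standard `hashlib` module. The function names are hypothetical, and the local file path is whatever name the downloaded file was saved under, since the listing above does not preserve the file names.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a checksum given as 'md5:<hex>' or bare hex."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listed_checksum(local_path, "md5:7e819a4ee597fb0565ac5666706a001b")` should return `True` for an intact copy of the 370.7 MB file, where `local_path` is a placeholder for the saved file.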