Published June 13, 2022 | Version 0.4
Dataset
Open
Kiswahili-Tz-Hansard
Creators
Description
This is a dataset of publicly available Tanzania Hansard documents in Kiswahili. It contains 2735 PNG images of pages from PDF documents, and text files containing transcripts obtained from the OCR tool tesseract-ocr. The images were obtained by converting the PDF pages with ImageMagick. The dataset is intended for studying how improvements to language/word-sequence modeling can improve OCR in a low-resource setting, and as a record of the accuracy of pre-existing OCR tools that use language models, before any other methods are applied.
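The record does not state the exact commands used to build the dataset, so the following is only a minimal sketch of a PDF-to-PNG-to-text pipeline of the kind described above. The rendering density, the output naming pattern, the use of the `convert` executable (ImageMagick 7 installs it as `magick`), and the choice of Tesseract's Swahili model (`swa`) are all assumptions, not details taken from the dataset.

```python
import subprocess
from pathlib import Path


def rasterize_cmd(pdf: Path, out_dir: Path, density: int = 300,
                  imagemagick: str = "convert") -> list:
    """Build an ImageMagick command that renders each PDF page to a PNG.

    The density (DPI) and the '-%04d' page-numbering pattern are assumptions.
    """
    pattern = out_dir / f"{pdf.stem}-%04d.png"
    return [imagemagick, "-density", str(density), str(pdf), str(pattern)]


def ocr_cmd(png: Path, lang: str = "swa") -> list:
    """Build a Tesseract command; Tesseract writes <outputbase>.txt.

    Using the Swahili model ('swa') is an assumption about the setup.
    """
    output_base = png.with_suffix("")  # page-0001.png -> page-0001
    return ["tesseract", str(png), str(output_base), "-l", lang]


def run_pipeline(pdf: Path, out_dir: Path) -> None:
    """Render one PDF to page images, then OCR every page image."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_cmd(pdf, out_dir), check=True)
    for png in sorted(out_dir.glob(f"{pdf.stem}-*.png")):
        subprocess.run(ocr_cmd(png), check=True)
```

Calling `run_pipeline(Path("hansard.pdf"), Path("pages"))` (a hypothetical input file) requires both ImageMagick and Tesseract, with the Swahili language data, to be installed and on the `PATH`.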
Files
Swahili-Tz-Hansard Datasheet-v0.4.pdf
(372.0 MB)
File (MD5 checksum) | Size |
---|---|
md5:da26941a3d3a4d623739ceefd0f19c22 | 1.2 MB |
md5:7e819a4ee597fb0565ac5666706a001b | 370.7 MB |
md5:79de112e612901716dd2684667adf979 | 75.1 kB |
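The MD5 checksums listed for each file can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib`; the function names and the example file name are hypothetical, since the record as shown here does not preserve the file names.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum written as 'md5:<hex>' or plain '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of(path) == expected_hex
```

For example, `matches_checksum("downloaded_file.zip", "md5:7e819a4ee597fb0565ac5666706a001b")` would return `True` only if the local file (hypothetical name) is byte-identical to the 370.7 MB file listed above.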