Kiswahili-Tz-Hansard

Brian Muhia

doi:10.5281/zenodo.6643278

Published June 13, 2022 | Version 0.4

Dataset (open access)

This is a dataset of publicly available Tanzania Hansard documents in Kiswahili. It contains 2735 PNG images of pages from PDF documents, together with text files containing transcripts produced by the OCR tool tesseract-ocr. The images were obtained by converting the PDF files to images with ImageMagick. The dataset is intended for studying how improvements to language and word-sequence modeling can improve OCR in a low-resource setting, and it serves as a record of the accuracy of pre-existing OCR tools that use language models, before any other methods are applied.
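The PDF-to-PNG-to-text pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: the function names `build_commands` and `run_pipeline`, the 300 dpi density, the output naming pattern, and the use of the `swa` Tesseract language pack are all assumptions. It also assumes the legacy ImageMagick `convert` entry point (newer installations use `magick`) and a working Ghostscript install for PDF input.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(pdf_path, out_dir, density=300, lang="swa"):
    """Return (rasterize_cmd, ocr_cmd_for) for one PDF.

    The density and language values are illustrative assumptions,
    not settings documented for this dataset.
    """
    pdf = Path(pdf_path)
    out = Path(out_dir)
    # ImageMagick writes one numbered PNG per PDF page.
    page_pattern = out / f"{pdf.stem}-%04d.png"
    rasterize = ["convert", "-density", str(density), str(pdf), str(page_pattern)]

    def ocr_cmd_for(png_path):
        png = Path(png_path)
        # tesseract appends ".txt" to the output base name itself.
        return ["tesseract", str(png), str(png.with_suffix("")), "-l", lang]

    return rasterize, ocr_cmd_for


def run_pipeline(pdf_path, out_dir):
    """Rasterize a PDF and OCR each page; requires both tools on PATH."""
    for tool in ("convert", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    rasterize, ocr_cmd_for = build_commands(pdf_path, out)
    subprocess.run(rasterize, check=True)
    for png in sorted(out.glob(f"{Path(pdf_path).stem}-*.png")):
        subprocess.run(ocr_cmd_for(png), check=True)
```

Separating command construction from execution makes it possible to inspect or log the exact commands before running them on a large batch of documents.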

Files (372.0 MB)

Name	MD5	Size
extracted-tz-parliament.tar.gz	da26941a3d3a4d623739ceefd0f19c22	1.2 MB
source-png.tar.gz	7e819a4ee597fb0565ac5666706a001b	370.7 MB
Swahili-Tz-Hansard Datasheet-v0.4.pdf	79de112e612901716dd2684667adf979	75.1 kB
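After downloading, the files can be checked against the MD5 digests listed above. The following is a small sketch using only the Python standard library; the helper names `md5_of` and `verify` are hypothetical, and the code assumes the downloaded files keep their original names and sit in one directory.

```python
import hashlib
from pathlib import Path

# MD5 digests as listed in the file table of this record.
EXPECTED_MD5 = {
    "extracted-tz-parliament.tar.gz": "da26941a3d3a4d623739ceefd0f19c22",
    "source-png.tar.gz": "7e819a4ee597fb0565ac5666706a001b",
    "Swahili-Tz-Hansard Datasheet-v0.4.pdf": "79de112e612901716dd2684667adf979",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each expected file name to True if present with a matching digest."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify().items():
        print(f"{name}: {'OK' if ok else 'missing or mismatched'}")
```

Reading in chunks keeps memory use low for the 370.7 MB image archive.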

	All versions	This version
Views	351	123
Downloads	67	45
Data volume	4.8 GB	1.9 GB

DOI: 10.5281/zenodo.6643278
Resource type: Dataset
Publisher: Zenodo
Languages: Swahili (macrolanguage)

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows redistribution and reuse of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: June 14, 2022
Modified: July 16, 2024
