Printed Urdu Base Model Trained on the OpenITI Corpus

Benjamin Kiessling

doi:10.5281/zenodo.14585602

Published January 1, 2025 | Version v3

Resource type: Other | Access: Open

Benjamin Kiessling¹

1. École Pratique des Hautes Études, PSL University

This is a text recognition model trained on the OpenITI dataset of printed Arabic-script text (https://github.com/OpenITI/arabic_print_data.git), in its state of 2022-09-03. The training material comprises roughly 11k lines of Urdu in a variety of typefaces. The model was obtained by fine-tuning the Arabic-script base model on the purely Urdu subset of the corpus.

The ground truth was lightly normalized to NFD but is otherwise untouched.
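Any new ground truth used to fine-tune this model should presumably be brought into the same normalization form, since a composed and a decomposed spelling of the same letter are different code point sequences. A minimal sketch using only the Python standard library follows; the helper name `normalize_nfd` is illustrative and not part of kraken.

```python
import unicodedata


def normalize_nfd(line: str) -> str:
    """Return the canonical decomposition (NFD) of a transcription line."""
    return unicodedata.normalize("NFD", line)


# U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE decomposes into
# U+0627 ALEF followed by the combining mark U+0653 MADDAH ABOVE.
composed = "\u0622"
decomposed = normalize_nfd(composed)
print([f"U+{ord(ch):04X}" for ch in decomposed])  # ['U+0627', 'U+0653']
```

Normalization is idempotent, so applying it to text that is already in NFD leaves it unchanged.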

Architecture

The default model architecture and hyperparameters of kraken 4.x were used.

Uses

As the model is trained on a variety of highly diverse typefaces, it is mostly intended as a base model from which more specific models can be fine-tuned. In line with this, it has not been extensively verified or optimized.

How to Get Started with the Model

Follow the instructions for installing and using kraken on the kraken website (https://kraken.re).
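As a rough illustration, the sketch below assembles the kind of command line shown in the kraken quickstart (baseline segmentation followed by recognition with a given model). It assumes kraken is installed and on the PATH, and that the model file has been downloaded as `urdu_best.mlmodel`; the helper names and the image and output paths are hypothetical, and the exact options should be checked against the kraken version in use.

```python
import subprocess


def build_kraken_command(image: str, output: str, model: str) -> list[str]:
    """Assemble a kraken call: baseline segmentation, then recognition."""
    return ["kraken", "-i", image, output, "segment", "-bl", "ocr", "-m", model]


def recognize(image: str, output: str, model: str = "urdu_best.mlmodel") -> None:
    """Run kraken on one page image; requires kraken to be installed."""
    subprocess.run(build_kraken_command(image, output, model), check=True)


# Hypothetical file names, shown only to illustrate the resulting command.
print(" ".join(build_kraken_command("page.png", "page.txt", "urdu_best.mlmodel")))
```

Calling `recognize("page.png", "page.txt")` would then write the recognized text to `page.txt`, provided the installed kraken accepts these options.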

Metrics

CER: 4.13%
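CER stands for character error rate. The record does not state how the figure was computed or on which evaluation set, so the following is only a sketch of the common definition: total character-level edit distance between recognized and reference lines, divided by the total number of reference characters. The function names are illustrative, and kraken's own evaluation may count characters differently.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references: list[str], hypotheses: list[str]) -> float:
    """Character error rate over paired reference and hypothesis lines."""
    total = sum(len(r) for r in references)
    if total == 0:
        raise ValueError("reference text is empty")
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return errors / total


# Two toy lines: one substitution and one deletion over 8 reference characters.
print(cer(["abcd", "efgh"], ["abxd", "efg"]))  # 0.25
```

Under this definition, a CER of 4.13% corresponds to roughly four character errors per hundred reference characters.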

Files (16.3 MB)

Name	MD5	Size
.README.md.swp	165b2a0e69b4a079c45a92939d79616e	12.3 kB
metadata.json	ea2cf4baf331624b961227e60ec29023	2.6 kB
README.md	0731de4a7e7329f115263364c6ef1e5e	1.6 kB
urdu_best.mlmodel	6b9a7f3f8fc2ae68019b8dd457b0b1f3	16.3 MB
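The MD5 checksums in the file list can be used to confirm that a download is complete and uncorrupted. A minimal sketch with the Python standard library follows; the helper names are illustrative, and the local path passed to `verify` is an assumption about where the file was saved.

```python
import hashlib


def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Checksum of urdu_best.mlmodel as given in the file list of this record.
EXPECTED_MD5 = "6b9a7f3f8fc2ae68019b8dd457b0b1f3"


def verify(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of(path) == expected.lower()
```

For example, `verify("urdu_best.mlmodel")` should return `True` for an intact download saved under that name in the current directory.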

Additional details

Is derived from: Dataset: https://github.com/OpenITI/arabic_print_data.git (URL); Other: 10.5281/zenodo.7050296 (DOI)

	All versions	This version
Views	735	319
Downloads	6,182	1,054
Data volume	9.4 GB	5.9 GB
