Printed Urdu Base Model Trained on the OpenITI Corpus

Benjamin Kiessling

doi:10.5281/zenodo.7051646

There is a newer version of the record available.

Published September 5, 2022 | Version v1

Other Open

Benjamin Kiessling¹

1. École Pratique des Hautes Études, Aoroc - CNRS PSL

This is a text recognition model trained on the OpenITI dataset of printed Arabic-script text [0], in its state as of 2022-09-03. The training data comprises Urdu material (~11k lines) in a variety of typefaces. The model was obtained by fine-tuning the Arabic-script base model [1] on the purely Urdu subset of the corpus. Because it is trained on highly diverse typefaces, it is intended mainly as a base model from which more specific models can be fine-tuned. Accordingly, it has not been extensively verified or optimized. The ground truth was lightly normalized to NFD but is otherwise untouched.

[0]: https://github.com/OpenITI/arabic_print_data.git
[1]: doi:10.5281/zenodo.7050270
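Since the ground truth was normalized to NFD, transcriptions prepared for fine-tuning on top of this model would plausibly benefit from the same normalization, so that precomposed and decomposed forms of a character are not treated as different labels. The following is a minimal sketch using only Python's standard library; the function name `to_nfd` and the sample string are illustrative assumptions and do not come from the record.

```python
import unicodedata


def to_nfd(text: str) -> str:
    """Return text in Unicode Normalization Form D (canonical decomposition)."""
    return unicodedata.normalize("NFD", text)


# Illustrative sample: U+0622 (ALEF WITH MADDA ABOVE) followed by U+067E (PEH).
sample = "\u0622\u067e"
decomposed = to_nfd(sample)

# U+0622 decomposes canonically into U+0627 (ALEF) + U+0653 (MADDAH ABOVE);
# U+067E has no canonical decomposition and is left unchanged.
print([f"U+{ord(ch):04X}" for ch in decomposed])
# prints ['U+0627', 'U+0653', 'U+067E']
```

Applying the same function to both training transcriptions and any reference text used for evaluation keeps comparisons consistent.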

Files (16.3 MB)

Name	MD5	Size
metadata.json	ea2cf4baf331624b961227e60ec29023	2.6 kB
urdu_best.mlmodel	6b9a7f3f8fc2ae68019b8dd457b0b1f3	16.3 MB
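The MD5 checksums listed with the files can be used to confirm that a download is intact. The sketch below assumes both files have been saved under their listed names in a local directory; the helper names `md5_of` and `verify` are hypothetical and chosen for illustration.

```python
import hashlib
from pathlib import Path

# Checksums as listed in the record's file table.
EXPECTED_MD5 = {
    "metadata.json": "ea2cf4baf331624b961227e60ec29023",
    "urdu_best.mlmodel": "6b9a7f3f8fc2ae68019b8dd457b0b1f3",
}


def md5_of(path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory=".") -> dict:
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of(path) == expected
    return results


if __name__ == "__main__":
    # Assumption: the files were downloaded into the current directory.
    for name, ok in verify().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

MD5 is adequate here for detecting corrupted or truncated downloads; it is not a guarantee against deliberate tampering.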


	All versions	This version
Views	875	368
Downloads	6,398	4,693
Data volume	10.7 GB	3.1 GB


Resource type: Other
Publisher: Zenodo

License: Creative Commons Zero v1.0 Universal

CC0 waives copyright interest in a work you've created and dedicates it to the world-wide public domain. Use CC0 to opt out of copyright entirely and ensure your work has the widest reach.

Technical metadata

Created: September 5, 2022
Modified: September 6, 2022
