Printed Urdu Base Model Trained on the OpenITI Corpus
Description
This is a text recognition model trained on the OpenITI dataset of printed Arabic-script text, as of 2022-09-03 (the dataset is linked under Related works). The training data comprises Urdu material (~11k lines) in a variety of typefaces. The model was obtained by fine-tuning the Arabic-script base model on the purely Urdu subset of the corpus.
The ground truth was lightly normalized to NFD but is otherwise untouched.
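Since the ground truth is in NFD, text compared against the model's output (or added as new training data) probably needs the same normalization. A minimal sketch using Python's standard library follows; the helper name `normalize_nfd` is hypothetical and not part of kraken.

```python
import unicodedata


def normalize_nfd(text: str) -> str:
    """Return text in Unicode Normalization Form D (canonical decomposition)."""
    return unicodedata.normalize("NFD", text)


# U+0622 (ARABIC LETTER ALEF WITH MADDA ABOVE) canonically decomposes into
# U+0627 (ALEF) followed by U+0653 (MADDAH ABOVE).
composed = "\u0622"
decomposed = normalize_nfd(composed)

print([f"U+{ord(ch):04X}" for ch in decomposed])
```

Applying the same function to both reference and predicted text before comparing them avoids counting purely representational differences as errors.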
Architecture
The default model architecture and hyperparameters of kraken 4.x were used.
Uses
Because the model is trained on a variety of highly diverse typefaces, it is mostly intended as a base model from which to fine-tune more specific models. Accordingly, it has not been extensively verified or optimized.
How to Get Started with the Model
Follow the installation and usage instructions on the kraken website.
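As a rough illustration of what a recognition run might look like, the sketch below only assembles a kraken command line and prints it; it does not execute anything. The file names (`page.png`, `page.txt`, `urdu_base.mlmodel`) are hypothetical placeholders, and the choice of baseline segmentation (`segment -bl`) is an assumption rather than something this record specifies. The kraken documentation remains the authoritative reference.

```python
import shlex
from typing import List


def build_kraken_command(image_path: str, output_path: str, model_path: str) -> List[str]:
    """Assemble a kraken CLI call: segment the page, then recognize its lines.

    All three paths are supplied by the caller; nothing is run here.
    """
    return [
        "kraken",
        "-i", image_path, output_path,  # input image and output text file
        "segment", "-bl",               # baseline segmentation (assumed)
        "ocr", "-m", model_path,        # recognition with the given model
    ]


# Hypothetical file names, used for illustration only.
command = build_kraken_command("page.png", "page.txt", "urdu_base.mlmodel")
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run` once kraken is installed and the model file has been downloaded.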
Metrics
CER: 4.13%
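For readers unfamiliar with the metric, the character error rate is commonly defined as the edit distance between predicted and reference text divided by the reference length. The sketch below implements that common definition in plain Python; it is not kraken's own evaluation code, and the exact procedure behind the figure above is not stated in this record.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER as edit distance divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substituted character out of four reference characters.
example_cer = character_error_rate("abcd", "abxd")
print(f"{example_cer:.2%}")
```

Under this definition, a CER of 4.13% corresponds to roughly four character errors per hundred reference characters.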
Files
metadata.json
Additional details
Related works
- Is derived from
- Dataset: https://github.com/OpenITI/arabic_print_data.git (URL)
- Other: 10.5281/zenodo.7050296 (DOI)