There is a newer version of the record available.

Published December 20, 2024 | Version beta
Other Open

TraPrInq - Transcription model to recognize Portuguese handwritten texts from the 16th to 19th centuries

Authors/Creators

  • 1. Author

Description

This model was trained on the TraPrInq Project dataset, a collection of Portuguese handwritten documents from the 16th to 19th centuries. The dataset, openly available on Zenodo under Open Science principles (Portuguese Handwriting 16th-19th c. Dataset), includes over 6,400 transcribed pages from the Portuguese Inquisition records, divided into nine training sets and one final validation set.

The dataset was transcribed and curated by members of the TraPrInq project in line with paleographic transcription standards. The model builds on these transcriptions and aims to facilitate the recognition of Portuguese historical manuscripts, addressing challenges such as degraded texts, varying handwriting styles, and historical abbreviations.

 

Model Details (BETA VERSION)

The model was trained by Weslley Oliveira with Kraken v.5.3.1 using:

Datasets: training set 146740 lines, validation set 16305 lines.

Normalization: Unicode Normalization Form Decomposed (NFD).
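As a small illustration of what NFD normalization does to Portuguese text, the sketch below uses Python's standard `unicodedata` module. The sample word and the helper name `to_nfd` are chosen here for illustration and do not come from the training pipeline.

```python
import unicodedata


def to_nfd(text: str) -> str:
    """Return the text in Unicode Normalization Form Decomposed (NFD)."""
    return unicodedata.normalize("NFD", text)


# Illustrative sample only; not taken from the dataset.
composed = unicodedata.normalize("NFC", "Inquisição")
decomposed = to_nfd(composed)

# In NFD, "ç" becomes "c" + U+0327 and "ã" becomes "a" + U+0303.
combining_marks = [ch for ch in decomposed if unicodedata.combining(ch)]

print(len(composed), len(decomposed))
print([f"U+{ord(ch):04X}" for ch in combining_marks])
```

Because the model's output is decomposed, text compared against it should be normalized to NFD first. Combining marks such as U+0303 carry the Unicode script property Inherited, which is presumably why the evaluation report lists an Inherited row alongside Latin and Common.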

Specification: [1,120,0,1 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,13,64 Do0.1,2 Mp2,2 Cr3,9,128 Do0.1,2 Mp2,2 Cr3,9,256 Do0.1,2 S1(1x0)1,3 Lbx400 Do0.1,2 Lbx400 Do.1,2 Lbx400 Do]
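The specification string uses kraken's VGSL notation. The sketch below splits it into tokens and labels each one; the layer descriptions are a reading of kraken's VGSL documentation rather than something stated in this record, and the function name `describe` is hypothetical.

```python
from collections import Counter

SPEC = (
    "[1,120,0,1 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,13,64 Do0.1,2 Mp2,2 "
    "Cr3,9,128 Do0.1,2 Mp2,2 Cr3,9,256 Do0.1,2 S1(1x0)1,3 "
    "Lbx400 Do0.1,2 Lbx400 Do.1,2 Lbx400 Do]"
)

# Prefix -> human-readable layer kind (assumed from kraken's VGSL docs).
LAYER_KINDS = {
    "Cr": "convolution (ReLU)",
    "Do": "dropout",
    "Mp": "max pooling",
    "S": "reshape",
    "Lbx": "bidirectional LSTM along x",
}


def describe(spec: str):
    """Split a VGSL spec into its input shape and a list of labeled layers."""
    tokens = spec.strip("[]").split()
    # The first token is the input block: batch, height, width, channels.
    batch, height, width, channels = (int(v) for v in tokens[0].split(","))
    shape = {"batch": batch, "height": height, "width": width, "channels": channels}
    layers = []
    for token in tokens[1:]:
        kind = next(
            (name for prefix, name in LAYER_KINDS.items() if token.startswith(prefix)),
            "unknown",
        )
        layers.append((token, kind))
    return shape, layers


shape, layers = describe(SPEC)
print(shape)
for kind, n in Counter(kind for _, kind in layers).items():
    print(f"{n} x {kind}")
```

Read this way, the network takes single-channel line images scaled to a height of 120 pixels with variable width (a width of 0), passes them through four convolution blocks, and feeds the reshaped features to three bidirectional LSTM layers of 400 units each.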

Optimization: Ground truth corrections based on manual transcription of historical texts.

The resulting model is designed to recognize Portuguese handwritten documents from a variety of historical contexts, with potential applications in digitization projects, archival research, and cultural heritage preservation.

 

Results (BETA VERSION)

Report generated using random 4% of the dataset:

  • 93.16%  Character Accuracy
  • 93.59%  Character Accuracy (Case-insensitive)
  • 94.50%  Latin Character Accuracy
  • 94.45%  Common Character Accuracy

Report

258977  Characters
17726   Errors
93.16%  Character Accuracy
93.59%  Character Accuracy (Case-insensitive)
74.59%  Word Accuracy

7434    Insertions
2089    Deletions
8203    Substitutions

Count   Missed  %Right
204481  11241   94.50%  Latin
48830   2711    94.45%  Common
5666    1685    70.26%  Inherited
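The figures in the report can be cross-checked with a few lines of arithmetic. The sketch below assumes that errors are the sum of insertions, deletions, and substitutions, and that accuracy is one minus errors divided by the character count; the report's numbers are consistent with both assumptions, but the formulas themselves are not stated in this record.

```python
# Values copied from the report above.
CHARACTERS = 258977
INSERTIONS, DELETIONS, SUBSTITUTIONS = 7434, 2089, 8203

# Script name -> (character count, missed characters).
SCRIPTS = {
    "Latin": (204481, 11241),
    "Common": (48830, 2711),
    "Inherited": (5666, 1685),
}


def accuracy(count: int, missed: int) -> float:
    """Percentage of correctly recognized characters, rounded to two decimals."""
    return round(100 * (1 - missed / count), 2)


errors = INSERTIONS + DELETIONS + SUBSTITUTIONS
print("errors:", errors)
print("character accuracy:", accuracy(CHARACTERS, errors))
for script, (count, missed) in SCRIPTS.items():
    print(script, accuracy(count, missed))
```

The per-script counts add up to the total character count. The Inherited row accounts for only about 2% of the characters but has a markedly lower accuracy than the Latin and Common rows.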

Reference

Portuguese Handwriting 16th-19th c. - TraPrInq - https://zenodo.org/records/13986218

e-Inquisition - Transcribing the court records of the Portuguese Inquisition (1536-1821) - https://traprinq.hypotheses.org/

kraken - Kraken is a turn-key OCR system optimized for historical and non-Latin script material - https://github.com/mittagessen/kraken

Files

metadata.json

Files (90.3 MB)

Name Size
md5:dd6749b2caed732a76e7fdb90d2cfeef 2.5 kB
md5:9e85b400cac29a66644ae50cc0a60e5b 90.3 MB
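A downloaded file can be checked against the MD5 values listed above. The sketch below uses Python's standard `hashlib`; the local file name is a hypothetical placeholder, since the actual file name is not shown in this listing.

```python
import hashlib
import os


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path: str, listed: str) -> bool:
    """Compare a file's digest with a listed value such as 'md5:9e85...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Checksum of the 90.3 MB file as listed on this page.
LISTED_CHECKSUM = "md5:9e85b400cac29a66644ae50cc0a60e5b"
# Hypothetical local path; replace it with the name of the downloaded file.
LOCAL_PATH = "downloaded_model_file"

if os.path.exists(LOCAL_PATH):
    print("checksum ok:", matches_listing(LOCAL_PATH, LISTED_CHECKSUM))
```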