TraPrInq - Transcription model to recognize Portuguese handwritten texts from the 16th to 19th centuries
Description
This model was trained on the TraPrInq Project dataset, a collection of Portuguese handwritten documents from the 16th to 19th centuries. The dataset, published on Zenodo under Open Science principles (Portuguese Handwriting 16th-19th c. Dataset), includes over 6,400 transcribed pages from the Portuguese Inquisition records, divided into nine training sets and one final validation set.
The dataset was transcribed and curated by members of the TraPrInq project, following paleographic transcription standards. The model builds on these transcriptions and aims to facilitate the recognition of Portuguese historical manuscripts, addressing challenges such as degraded text, varying handwriting styles, and historical abbreviations.
Model Details (BETA VERSION)
The model was trained by Weslley Oliveira with Kraken v.5.3.1 using:
Datasets: Training set 146740 lines, validation set 16305 lines.
Normalization: Unicode Normalization Form D (NFD, canonical decomposition).
Specification: [1,120,0,1 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,13,64 Do0.1,2 Mp2,2 Cr3,9,128 Do0.1,2 Mp2,2 Cr3,9,256 Do0.1,2 S1(1x0)1,3 Lbx400 Do0.1,2 Lbx400 Do.1,2 Lbx400 Do]
Optimization: Ground truth corrections based on manual transcription of historical texts.
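The Specification line above is a VGSL network description. As a reading aid, the following minimal sketch splits that string into its input block and layer tokens and counts the layer types. The `LAYER_KINDS` mapping and both helper functions are illustrative assumptions based on a common reading of VGSL prefixes; they are not part of Kraken's API.

```python
from collections import Counter

# Specification string copied verbatim from the model details above.
SPEC = (
    "[1,120,0,1 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,13,64 Do0.1,2 Mp2,2 "
    "Cr3,9,128 Do0.1,2 Mp2,2 Cr3,9,256 Do0.1,2 S1(1x0)1,3 "
    "Lbx400 Do0.1,2 Lbx400 Do.1,2 Lbx400 Do]"
)

# Assumed meaning of the token prefixes (hypothetical helper table).
LAYER_KINDS = {
    "C": "convolution",
    "Do": "dropout",
    "Mp": "max pooling",
    "S": "reshape",
    "L": "recurrent (LSTM)",
}


def split_spec(spec):
    """Return (input_block, layer_tokens) for a bracketed VGSL string."""
    tokens = spec.strip("[]").split()
    return tokens[0], tokens[1:]


def layer_kind(token):
    """Classify a layer token by its prefix, trying longer prefixes first."""
    for prefix in sorted(LAYER_KINDS, key=len, reverse=True):
        if token.startswith(prefix):
            return LAYER_KINDS[prefix]
    return "unknown"


input_block, layers = split_spec(SPEC)
kind_counts = Counter(layer_kind(token) for token in layers)

# The input block is read here as batch, height, width, channels.
print("input:", input_block)
for kind, count in kind_counts.items():
    print(f"{count} x {kind}")
```

Under this reading, the network takes single-channel line images of height 120 and variable width, passes them through four convolutional layers with 32 to 256 filters, and ends in three bidirectional recurrent layers of 400 units each.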
The resulting model is designed to recognize Portuguese handwritten documents from a variety of historical contexts, with potential applications in digitization projects, archival research, and cultural heritage preservation.
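Because the training text was normalized to NFD, the model's output can be expected to use decomposed characters (a base letter followed by combining marks), although this is an assumption about the output and not something stated above. The sketch below shows why a reference transcription should be normalized the same way before it is compared with recognized text; the sample word and the `to_nfd` helper are illustrative only.

```python
import unicodedata


def to_nfd(text):
    """Return the text in Unicode Normalization Form D."""
    return unicodedata.normalize("NFD", text)


# "Inquisição" typed with precomposed characters (ç = U+00E7, ã = U+00E3).
composed = "Inquisi\u00e7\u00e3o"
decomposed = to_nfd(composed)

# The two strings render identically but differ code point by code point,
# so a naive comparison would count every accented letter as an error.
print(len(composed), len(decomposed))
print(composed == decomposed)

# Normalizing both sides makes the comparison meaningful.
print(to_nfd(composed) == to_nfd(decomposed))
```

Applying the same normalization to ground truth and model output keeps accented letters from being counted as spurious substitutions or insertions.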
Results (BETA VERSION)
Report generated using a random 4% sample of the dataset:
- 93.16% Character Accuracy
- 93.59% Character Accuracy (Case-insensitive)
- 94.50% Latin and 94.45% Common Character Accuracy
Report
258977 Characters
17726 Errors
93.16% Character Accuracy
93.59% Character Accuracy (Case-insensitive)
74.59% Word Accuracy

7434 Insertions
2089 Deletions
8203 Substitutions

| Count | Missed | %Right | Script |
|---|---|---|---|
| 204481 | 11241 | 94.50% | Latin |
| 48830 | 2711 | 94.45% | Common |
| 5666 | 1685 | 70.26% | Inherited |
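The percentages in the report follow directly from the raw counts. This short sketch recomputes them from the figures above; the `accuracy` helper and the variable names are illustrative and not taken from Kraken.

```python
# Counts copied from the report above.
characters = 258977
errors = 17726
insertions, deletions, substitutions = 7434, 2089, 8203

# Per-script rows: script name -> (count, missed).
scripts = {
    "Latin": (204481, 11241),
    "Common": (48830, 2711),
    "Inherited": (5666, 1685),
}


def accuracy(count, missed):
    """Percentage of items that were not missed."""
    return 100.0 * (1 - missed / count)


# The three edit operations add up to the total error count.
total_edits = insertions + deletions + substitutions

overall = accuracy(characters, errors)
per_script = {name: accuracy(c, m) for name, (c, m) in scripts.items()}

print(f"edits: {total_edits}, overall: {overall:.2f}%")
for name, value in per_script.items():
    print(f"{name}: {value:.2f}%")
```

The per-script character counts also sum to the overall character total, and the low figure for the Inherited script (which covers combining marks in NFD text) shows that diacritics are recognized less reliably than base letters.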
Reference
Portuguese Handwriting 16th-19th c. - TraPrInq - https://zenodo.org/records/13986218
e-Inquisition - Transcribing the court records of the Portuguese Inquisition (1536-1821) - https://traprinq.hypotheses.org/
kraken - Kraken is a turn-key OCR system optimized for historical and non-Latin script material - https://github.com/mittagessen/kraken
Files
Total size: 90.3 MB
| Name | Size | Checksum |
|---|---|---|
| metadata.json | 2.5 kB | md5:dd6749b2caed732a76e7fdb90d2cfeef |
| | 90.3 MB | md5:9e85b400cac29a66644ae50cc0a60e5b |