Published September 18, 2024 | Version v1
Dataset Open

Manually validated PageXML files for images in the monograph "Mémoire sur St Domingue par H ? M. Michel"

Description

Transcription of the monograph "Mémoire sur St Domingue par H ? M. Michel", dated 1797 and dealing with slavery in Haiti (103 pages in total). The dataset contains 61 pages in PageXML format, suitable for training a handwritten text recognition (HTR) model. The PageXML files were created by applying a Transkribus model (French Model 1, see https://readcoop.eu/model/french-general-model/, or the non-public The Text Titan I) to the images at https://europeana.transcribathon.eu/documents/story/?story=12733. The PageXML output was then corrected automatically using the flat-text manual transcription available with these images, and the resulting PageXML files were validated manually. The software that corrects OCR output automatically from flat-text manual transcriptions (and thereby adds the link between image and text that the flat-text files lack) was developed in the AI4Culture project (https://pro.europeana.eu/project/ai4culture-an-ai-platform-for-the-cultural-heritage-data-space). Note: transcriptions for pages 21, 22, 34 and 58 are not yet available.
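To prepare HTR training data, each PageXML file can be reduced to its text lines, pairing the line polygon with its transcription. The following is a minimal Python sketch, not the AI4Culture software: it assumes the files follow the usual PAGE layout (TextLine elements with a Coords child carrying a points attribute and a TextEquiv/Unicode child), and it ignores the XML namespace because the schema version used in this dataset is not stated here. The sample document, its IDs and its text are invented for illustration.

```python
import xml.etree.ElementTree as ET
import zipfile


def local_name(tag):
    """Strip the '{namespace}' prefix so the code works across PAGE schema versions."""
    return tag.rsplit("}", 1)[-1]


def extract_lines(page_xml):
    """Return one dict per TextLine with its id, polygon points and line-level text."""
    root = ET.fromstring(page_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        points, text = None, ""
        # Only direct children are inspected, so word-level TextEquiv is skipped.
        for child in elem:
            name = local_name(child.tag)
            if name == "Coords":
                points = child.get("points")
            elif name == "TextEquiv":
                for sub in child:
                    if local_name(sub.tag) == "Unicode":
                        text = sub.text or ""
        lines.append({"id": elem.get("id"), "points": points, "text": text})
    return lines


def extract_lines_from_zip(zip_path):
    """Map each .xml member of the archive to its extracted lines."""
    result = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".xml"):
                result[name] = extract_lines(archive.read(name))
    return result


# Invented sample in the assumed PAGE layout (not taken from the dataset).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="10,10 900,10 900,200 10,200"/>
      <TextLine id="r1l1">
        <Coords points="10,10 900,10 900,60 10,60"/>
        <TextEquiv><Unicode>exemple de ligne</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

if __name__ == "__main__":
    for line in extract_lines(SAMPLE):
        print(line["id"], line["points"], line["text"])
```

Calling extract_lines_from_zip("page_xml_corrected.zip") on the downloaded archive would return the lines per file, provided the archive holds the PageXML files as .xml members, which is an assumption about its internal layout.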

Files

page_xml_corrected.zip

Files (274.7 kB)

Name: page_xml_corrected.zip
md5:32cd1c98f3edabeab45924ffddceb4c3
Size: 274.7 kB
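
The MD5 checksum listed for page_xml_corrected.zip can be used to confirm that a download is complete and unmodified. A minimal Python sketch follows; the function names and the local file path are assumptions for illustration, and only the checksum value comes from this record.

```python
import hashlib

# Checksum as listed on this record for page_xml_corrected.zip.
EXPECTED_MD5 = "32cd1c98f3edabeab45924ffddceb4c3"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True when the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Hypothetical local path; adjust to where the archive was saved.
    print(verify_download("page_xml_corrected.zip"))
```

MD5 is adequate for detecting accidental corruption during transfer, which is what a repository checksum is for; it is not a guarantee against deliberate tampering.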