Manually validated PageXML files for images in "Dagboek Ernest Clarysse"
Creators
Description
Transcribed diary of Belgian citizen Ernest Clarysse from World War I, in PageXML format. The diary has 41 pages in total; pages 3 to 36 were transcribed. These files are useful for training a handwritten text recognition model. The PageXML files were created in three steps: applying Transkribus' The Dutchess I model (https://readcoop.eu/model/the-dutchess-i/) to the images at https://europeana.transcribathon.eu/documents/story/?story=138148, automatically correcting the output using the flat-text manual transcription available with these images, and manually validating the resulting PageXML files. The software for automatically correcting OCR output using flat-text manual transcriptions (which also adds a link between image and text that is not present in the flat-text files) was developed in the AI4Culture project.
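For readers who want to turn these files into training pairs, the following is a minimal sketch of reading line-level text and polygon coordinates from a PageXML document. It assumes the standard PAGE layout (TextLine elements with a Coords child and a TextEquiv/Unicode child). The namespace URI, the image filename, and the line texts in SAMPLE are made-up placeholders, not content from this dataset, and the function name read_text_lines is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up sample for illustration only; the real files may use a different
# PAGE schema version, so the namespace is detected rather than hard-coded.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_0003.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="10,10 990,10 990,200 10,200"/>
      <TextLine id="r1l1">
        <Coords points="10,10 990,10 990,100 10,100"/>
        <TextEquiv><Unicode>placeholder line one</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <Coords points="10,110 990,110 990,200 10,200"/>
        <TextEquiv><Unicode>placeholder line two</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def read_text_lines(xml_text):
    """Return one dict per TextLine with its id, polygon points, and text."""
    root = ET.fromstring(xml_text)
    # Derive the namespace from the root tag, e.g. "{uri}PcGts" -> "{uri}".
    prefix = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    lines = []
    for line in root.iter(f"{prefix}TextLine"):
        # Only direct children are inspected, so word-level text is ignored.
        coords = line.find(f"{prefix}Coords")
        unicode_el = line.find(f"{prefix}TextEquiv/{prefix}Unicode")
        points = []
        if coords is not None and coords.get("points"):
            points = [
                tuple(int(v) for v in pair.split(","))
                for pair in coords.get("points").split()
            ]
        text = (unicode_el.text or "") if unicode_el is not None else ""
        lines.append({"id": line.get("id"), "points": points, "text": text})
    return lines


lines = read_text_lines(SAMPLE)
for entry in lines:
    print(entry["id"], entry["points"][0], entry["text"])
```

Each returned polygon can be used to crop the matching line from the page image, which gives the image and text pairs that a recognition model is typically trained on.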
Files
Name | Size |
---|---|
pagexml.zip (md5:d2d3d401a3628aa3a073f24eee45b137) | 3.4 MB |
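After downloading, the archive can be checked against the MD5 checksum listed for pagexml.zip. The sketch below uses only the Python standard library. The helper names md5_of_file and is_intact are hypothetical, and the local path passed to them is an assumption about where the file was saved.

```python
import hashlib

# Checksum as published in the file listing for pagexml.zip.
EXPECTED_MD5 = "d2d3d401a3628aa3a073f24eee45b137"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """Return True when the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, is_intact("pagexml.zip") should return True for an uncorrupted download saved under that name in the working directory.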