Published October 16, 2023
| Version 1.0
Dataset
Open
Greetings From! Historical Postcards Address Transcription Dataset
Description
This dataset provides both Ground Truth (GT) and Handwritten Text Recognition (HTR) transcriptions of historical postcard addresses, stemming from a project to extract address information from historical picture postcards from Belgium, France, Germany, Luxembourg, the Netherlands, and the UK. The dataset encapsulates the back of 500 historically significant postcards.
The research associated with this dataset will be presented at Computational Humanities Research Conference, December 6--8, 2023, Paris, France.
Scope and Content:
- HTR Material: Handwritten Text Recognition outputs for 500 postcards.
- GT Material: Ground Truth transcriptions created by human transcribers for the same set of 500 postcards.
File Structure and Formats:
For both HTR and GT Material, the following files are provided:
- JPEG Images: Scanned or digitized images of the postcards.
- .txt: Plain text transcriptions of the postcards.
- _tei.xml: Transcriptions rendered in the TEI XML format.
- .pdf: PDF presentation of the postcards along with their transcriptions.
- mets.xml: METS (Metadata Encoding and Transmission Standard) schema for the data.
- page folder: XML files for individual images, offering metadata and structural information.
- metadata.xml: metadata concerning the dataset.
- GT_addresses_GPT4.json & HTR_addresses_GPT4.json: JSON files detailing individual address data for each postcard in structured format.
Annotation and Transcription:
- GT: Ground Truth data was annotated by human transcribers who examined both the images of the postcards and the outputs of the HTR system. Transcribers made corrections according to predefined conventions: using # for illegible characters, * at the start of lines without address information (e.g., person's name), and starting a line with @ for irrelevant lines.
- HTR: The HTR versions emerged from state-of-the-art HTR systems (Transkribus Text Titan I). The .json files hold precise address details derived from the main data, which were processed using OpenAI's GPT-4 Large Language Model.