Published October 16, 2023 | Version 1.0
Dataset Open

Greetings From! Historical Postcards Address Transcription Dataset

  • 1. Princeton University
  • 2. University of Antwerp

Description

This dataset provides both Ground Truth (GT) and Handwritten Text Recognition (HTR) transcriptions of historical postcard addresses. It stems from a project to extract address information from historical picture postcards from Belgium, France, Germany, Luxembourg, the Netherlands, and the UK. The dataset covers the backs of 500 historically significant postcards.

The research associated with this dataset will be presented at the Computational Humanities Research Conference, December 6-8, 2023, Paris, France.

Scope and Content:

  • HTR Material: Handwritten Text Recognition outputs for 500 postcards.
  • GT Material: Ground Truth transcriptions created by human transcribers for the same set of 500 postcards.

File Structure and Formats:

For both HTR and GT Material, the following files are provided:

  • JPEG Images: Scanned or digitized images of the postcards.
  • .txt: Plain text transcriptions of the postcards.
  • _tei.xml: Transcriptions rendered in the TEI XML format.
  • .pdf: PDF presentation of the postcards along with their transcriptions.
  • mets.xml: METS (Metadata Encoding and Transmission Standard) schema for the data.
  • page folder: XML files for individual images, offering metadata and structural information.
  • metadata.xml: metadata concerning the dataset.
  • GT_addresses_GPT4.json & HTR_addresses_GPT4.json: JSON files detailing individual address data for each postcard in structured format.

Annotation and Transcription:

  • GT: Ground Truth data was annotated by human transcribers who examined both the images of the postcards and the outputs of the HTR system. Transcribers made corrections according to predefined conventions: # marks an illegible character, * at the start of a line marks a line without address information (e.g., a person's name), and @ at the start of a line marks an irrelevant line.
  • HTR: The HTR versions were produced with a state-of-the-art HTR system (Transkribus Text Titan I). The .json files hold precise address details derived from the main data, which was processed using OpenAI's GPT-4 Large Language Model.
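
The GT conventions above can be applied programmatically when reading the .txt transcriptions. The following is a minimal sketch, assuming each transcription is plain text with one postcard line per text line and that the * and @ markers appear as the first character of a line. The names GTLine, classify_gt_line, and address_lines are hypothetical helpers, and the sample text is invented for illustration rather than taken from the dataset.

```python
from dataclasses import dataclass


@dataclass
class GTLine:
    text: str       # line content with any leading marker removed
    kind: str       # "address", "non_address", or "irrelevant"
    illegible: int  # number of '#' placeholders in the line


def classify_gt_line(raw: str) -> GTLine:
    """Classify one GT line using the #, * and @ conventions."""
    line = raw.strip()
    if line.startswith("@"):
        # '@' at the start of a line: irrelevant line
        kind, text = "irrelevant", line[1:].strip()
    elif line.startswith("*"):
        # '*' at the start of a line: no address information (e.g. a name)
        kind, text = "non_address", line[1:].strip()
    else:
        kind, text = "address", line
    # Each '#' stands for one illegible character
    return GTLine(text=text, kind=kind, illegible=text.count("#"))


def address_lines(transcription: str) -> list[str]:
    """Return only the non-empty lines that carry address information."""
    parsed = (classify_gt_line(raw) for raw in transcription.splitlines())
    return [g.text for g in parsed if g.text and g.kind == "address"]


# Invented sample, not an actual record from the dataset
sample = "*Mr. J. Janssens\nKerkstraat 1#\nAntwerpen\n@Groeten uit Parijs"
print(address_lines(sample))
```

Keeping the '#' placeholders in the returned text, rather than stripping them, preserves the position of illegible characters for any later comparison against the HTR output.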

Files

GreetingsFrom_GT.zip

Files (103.5 MB)

  • md5:c5199605ed683736cd373bf2720fea5c (53.7 MB)
  • md5:8bda4e3702fb5f40097a341bf768f788 (49.6 MB)
  • md5:a55c326ccbb6646bcfb1d61839340abb (72.9 kB)
  • md5:0cc30f7a480f648064e053e34796fa6c (107.3 kB)
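
The md5 checksums listed above can be used to confirm that a downloaded file is intact. The following is a minimal sketch using Python's standard hashlib module; md5_of_file and matches_checksum are hypothetical helper names. The listing does not state which checksum belongs to which file, so the pairing must be taken from the record page.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large archives need not fit in memory
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: Path, expected: str) -> bool:
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Usage (the path is an assumption about where the download was saved):
# ok = matches_checksum(Path("GreetingsFrom_GT.zip"), "md5:<checksum from the record page>")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the file should be fetched again.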