Published June 30, 2025 | Version 1.0
Dataset Open

The Revolutionary City Corpus (1758-1805): Ground Truth for Handwritten Text Recognition (HTR) for 18th Century Documents in English

  • 1. American Philosophical Society
  • 2. University of Pennsylvania
  • 3. University of Oklahoma
  • 4. Millersville University
  • 5. Washington College

Description

Project Overview

The Revolutionary City is a partnership between the American Philosophical Society, the Historical Society of Pennsylvania, and the Library Company of Philadelphia to digitize all manuscript material related to Philadelphia and the American Revolution (1763-1804). This dataset is a transcribed subset of the larger digitized corpus. The material is overwhelmingly in English, though a few letters in French have been included. It is a mixture of correspondence and journals and represents a wide variety of social classes. A significant subset of the documents comprises writings by women. The correspondence has been annotated to distinguish between the different parts of a letter (Salutation, Date and Address, Addressee, Address, Closing, Postscript). The transcriptions were produced by staff and interns at the American Philosophical Society, and each document was reviewed at least once by another transcriber. The corpus exhibits wide variation in hands, handwriting styles, paper quality, and levels of damage. It spans 1758 to 1805, but the majority of the documents fall between 1774 and 1783.

Ground Truth Features

Language: The vast majority of the documents are in English. Fewer than ten pages are fully in French, though French and Latin text appears occasionally in the corpus.

Pages 3316
Lines 95990
Text Regions 9439
Characters 2,997,053

Transcription Conventions

Transcribers followed diplomatic transcription conventions. All abbreviations were transcribed as written. Superscript text was transcribed with a caret (^). Word breaks at line ends were normalized to a plain dash (-) regardless of the character used in the original document. The early modern glyph for "per" was transcribed as "per". Other infrequently occurring glyphs were transcribed as written. Transcribers were advised to give their best guess for hard-to-read text; text that was illegible due to damage or other insurmountable problems was transcribed as [___]. Square brackets were used to indicate crossed-out text.
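As a rough illustration of how these conventions might be detected in a transcribed line, here is a minimal Python sketch. The function name and the sample line are hypothetical. It assumes that any bracketed span other than [___] is crossed-out text, and it only flags the presence of a caret, since the description does not say exactly how far a caret's scope extends. A trailing dash is treated as a possible word break, though it could also be an ordinary dash in the source.

```python
import re

# Marker for text that could not be read (from the conventions above).
ILLEGIBLE = "[___]"

# Any bracketed span; spans equal to ILLEGIBLE are filtered out below.
# Assumption: brackets are not nested.
BRACKETED = re.compile(r"\[([^\[\]]*)\]")


def summarize_line(text: str) -> dict:
    """Report which transcription markers occur in one transcribed line."""
    crossed_out = [
        m.group(1) for m in BRACKETED.finditer(text) if m.group(0) != ILLEGIBLE
    ]
    return {
        "illegible_gaps": text.count(ILLEGIBLE),
        "crossed_out": crossed_out,
        # Presence only: the caret's exact scope is not specified.
        "has_superscript": "^" in text,
        # A trailing dash may mark a word broken across lines.
        "ends_with_break": text.rstrip().endswith("-"),
    }


if __name__ == "__main__":
    # Hypothetical sample line, not taken from the corpus.
    print(summarize_line("I rec^d your [letter] favour of the [___] inst-"))
```

A summary like this could help decide, for example, whether to exclude lines containing [___] when preparing training data.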

Dataset Contents

The dataset contains 3316 images in either JPG or PNG format and 3316 corresponding XML files in ALTO format. Transcriptions were generated and manually corrected using eScriptorium.
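For readers who want to pull the line-level transcriptions out of the XML files, the following is a minimal sketch using only the Python standard library. It assumes the usual ALTO layout, in which each TextLine element holds one or more String elements whose CONTENT attribute carries the text. It matches on local tag names so that it does not depend on a particular ALTO namespace version. The function names and the file path in the usage example are hypothetical.

```python
import xml.etree.ElementTree as ET


def local(tag: str) -> str:
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(root) -> list:
    """Return the text of every TextLine under an ALTO root element."""
    lines = []
    for element in root.iter():
        if local(element.tag) != "TextLine":
            continue
        # Join the CONTENT of each String child; SP (space) elements are skipped.
        parts = [
            child.get("CONTENT", "")
            for child in element
            if local(child.tag) == "String"
        ]
        lines.append(" ".join(parts))
    return lines


def read_alto(path: str) -> list:
    """Parse one ALTO file from disk and return its transcribed lines."""
    return alto_lines(ET.parse(path).getroot())


if __name__ == "__main__":
    # Hypothetical path; substitute a file from the unpacked dataset.
    for line in read_alto("example_page.xml"):
        print(line)
```

Pairing each returned line with its image region would additionally require reading the coordinate attributes on the TextLine elements, which this sketch leaves out.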

Files

revcity-htr.zip

Files (10.0 GB)

Name Size
md5:654f05888fdfb5365920b141bc0902be 2.6 kB
md5:c291f2ebda6e5698f788a06b9520c709 10.0 GB
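Because the main archive is about 10 GB, it may be worth checking a download against the MD5 checksums listed above. The sketch below reads the file in chunks, so the whole archive never has to fit in memory. The function names are hypothetical, and the check accepts either listed checksum because the listing does not say which file each checksum belongs to. It assumes the archive was saved locally as revcity-htr.zip.

```python
import hashlib

# Checksums copied from the file listing above.
LISTED_MD5 = {
    "654f05888fdfb5365920b141bc0902be",
    "c291f2ebda6e5698f788a06b9520c709",
}


def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path: str) -> bool:
    """True if the file's MD5 equals one of the listed checksums."""
    return md5_of(path) in LISTED_MD5


if __name__ == "__main__":
    # Assumed local file name for the downloaded archive.
    print(matches_listing("revcity-htr.zip"))
```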