Published June 30, 2025 | Version 1.0
Dataset
Open
The Revolutionary City Corpus (1758-1805): Ground Truth for Handwritten Text Recognition (HTR) for 18th Century Documents in English
Authors/Creators
Description
Project Overview
The Revolutionary City is a partnership between the American Philosophical Society, the Historical Society of Pennsylvania, and the Library Company of Philadelphia to digitize all manuscript material related to Philadelphia and the American Revolution (1763-1804). This dataset is a transcribed subset of the larger digitized corpus. The material is overwhelmingly in English, though a few letters in French have been included. It consists of a mixture of correspondence and journals and represents a wide range of social classes. A significant subset of the documents comprises writings by women. The correspondence has been annotated to distinguish between the different parts of a letter (Salutation, Date and Address, Addressee, Address, Closing, Postscript). The transcriptions were produced by staff and interns at the American Philosophical Society, and each document was reviewed at least once by another transcriber. The corpus exhibits wide variation in hands, handwriting styles, paper quality, and levels of damage. It spans 1758 to 1805, but the majority of the documents date from 1774 to 1783.
Ground Truth Features
Language: The vast majority of the documents are in English. Fewer than ten pages are fully in French, though French and Latin text appears occasionally in the corpus.
| Feature | Count |
|---|---|
| Pages | 3,316 |
| Lines | 95,990 |
| Text Regions | 9,439 |
| Characters | 2,997,053 |
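The counts above are enough to derive a few per-page and per-line averages, which can help when sizing an HTR training run. The following is a minimal sketch of that arithmetic. The figures are copied from the table, the variable names are illustrative, and the results are plain means that say nothing about how line length or page density is distributed across the corpus.

```python
# Counts copied from the Ground Truth Features table.
PAGES = 3316
LINES = 95990
TEXT_REGIONS = 9439
CHARACTERS = 2_997_053


def per_unit(total, units):
    """Return the plain arithmetic mean of `total` spread over `units`."""
    return total / units


# Simple averages derived from the four published counts.
averages = {
    "lines per page": per_unit(LINES, PAGES),
    "text regions per page": per_unit(TEXT_REGIONS, PAGES),
    "characters per line": per_unit(CHARACTERS, LINES),
    "characters per page": per_unit(CHARACTERS, PAGES),
}

for label, value in averages.items():
    print(f"{label}: {value:.1f}")
```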
Transcription Conventions
Transcribers followed diplomatic transcription conventions. All abbreviations were transcribed as written. Superscript text was transcribed with a caret (^). Line breaks were normalized to use a plain dash (-) regardless of the character used in the original document. The early modern glyph for "per" (⅌) was transcribed as "per". Other infrequently occurring glyphs were transcribed as written. Transcribers were advised to produce their best guesses for illegible text, but text that was illegible due to damage or other insurmountable problems was transcribed as [___]. Square brackets were used to indicate crossed-out text.
Dataset Contents
The dataset contains 3316 images in either jpg or png format and 3316 corresponding XML files in ALTO format. Transcriptions were generated and manually corrected using eScriptorium.
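As an illustration of how the ALTO files and the transcription conventions might be consumed together, here is a minimal sketch using only the Python standard library. It makes several assumptions that the record does not confirm: that each line's text sits in the `CONTENT` attribute of `String` elements under `TextLine` (the usual ALTO layout), that the illegible-text marker is exactly `[___]`, and that any other bracketed span is crossed-out text. The sample XML is invented for the example and is not taken from the corpus.

```python
import re
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(root):
    """Return the transcribed text of every TextLine under `root`."""
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Assumption: joining String contents with spaces approximates the line.
        parts = [
            child.get("CONTENT", "")
            for child in elem.iter()
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(parts))
    return lines


# Assumed marker shapes, based on the conventions described in this record.
ILLEGIBLE = "[___]"
STRUCK_OUT = re.compile(r"\[(?!_+\])([^\[\]]+)\]")


def convention_markers(line):
    """Count the convention markers that appear in one transcribed line."""
    return {
        "illegible": line.count(ILLEGIBLE),
        "struck_out": STRUCK_OUT.findall(line),
        "carets": line.count("^"),
    }


# Invented sample page, for illustration only.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="Rec^d of M^r Smith [ten] five [___]"/></TextLine>
    <TextLine><String CONTENT="pounds per order"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

sample_lines = alto_lines(ET.fromstring(SAMPLE))
for text in sample_lines:
    print(text, convention_markers(text))
```

For a file on disk, `ET.parse(path).getroot()` can be passed to `alto_lines` in place of the parsed sample string.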
Files
revcity-htr.zip
Total size: 10.0 GB
| Checksum | Size |
|---|---|
| md5:654f05888fdfb5365920b141bc0902be | 2.6 kB |
| md5:c291f2ebda6e5698f788a06b9520c709 | 10.0 GB |
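The md5 values in the file listing can be used to check a download for corruption. The sketch below streams a file through `hashlib.md5` in chunks, which avoids loading a 10 GB archive into memory. The function names are illustrative, and which local file corresponds to which listed checksum is for the reader to confirm against the record.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listed value such as 'md5:0123abcd...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A call such as `matches_listing("revcity-htr.zip", "md5:c291f2ebda6e5698f788a06b9520c709")` would return `True` for an intact copy, assuming the 10.0 GB entry in the listing is that archive and it was saved under that name in the working directory.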