
Dataset Open Access

19th Century United States Newspaper images predicted as Photographs with labels for "human", "animal", "human-structure" and "landscape"

van Strien, Daniel

Data collector(s)
Bond-Harris, Catherine

This dataset contains images derived from Newspaper Navigator (news-navigator.labs.loc.gov/), a dataset of images drawn from the Library of Congress Chronicling America collection (chroniclingamerica.loc.gov/).

[The Newspaper Navigator dataset] consists of extracted visual content for 16,358,041 historic newspaper pages in Chronicling America. The visual content was identified using an object detection model trained on annotations of World War 1-era Chronicling America pages, including annotations made by volunteers as part of the Beyond Words crowdsourcing project.

source: https://news-navigator.labs.loc.gov/

One of the categories of visual content in the Newspaper Navigator dataset is 'photographs'. This dataset contains a sample of these images, each annotated with zero or more of the following labels: "human", "animal", "human-structure" and "landscape".

The data is organised as follows:

  • The images themselves can be found in `images.zip`
  • `newspaper-navigator-sample-metadata.csv` contains metadata about each image drawn from the Newspaper Navigator Dataset.
  • `multi_label.csv` contains the labels for the images as a CSV file
  • `annotations.csv` contains the labels for the images with additional metadata
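
As a rough illustration of working with the label file, the sketch below counts how often each label occurs. It uses only the Python standard library. The column name `label` and the `|` separator between multiple labels are assumptions, not something this record documents, so check the header of `multi_label.csv` before relying on them. The sample rows are made up for the example.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file, label_column="label", delimiter="|"):
    """Count label occurrences in a multi-label CSV.

    `label_column` and `delimiter` are assumptions about the file layout;
    adjust them to match the actual header of `multi_label.csv`.
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        raw = (row.get(label_column) or "").strip()
        # An empty cell means the image carries none of the labels.
        labels = [part.strip() for part in raw.split(delimiter) if part.strip()]
        counts.update(labels)
    return counts


# Made-up rows standing in for the real file (hypothetical file names).
sample = io.StringIO(
    "file,label\n"
    "a.jpg,human\n"
    "b.jpg,human|human-structure\n"
    "c.jpg,\n"
    "d.jpg,landscape|animal\n"
)
label_counts = count_labels(sample)
print(label_counts.most_common())
```

To run this against the real data, pass an open file handle instead of the in-memory sample, for example `open("multi_label.csv", newline="")`.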

This dataset was created for use in an under-review Programming Historian tutorial (http://programminghistorian.github.io/ph-submissions/lessons/computer-vision-deep-learning-pt2). The primary aim was to provide a realistic example dataset for teaching computer vision with digitised heritage material. The data is shared here since it may be useful for others. This data documentation is a work in progress and will be updated when the Programming Historian tutorial is released publicly.

The metadata CSV file contains the following columns:

- filepath
- pub_date
- page_seq_num
- edition_seq_num
- batch
- lccn
- box
- score
- ocr
- place_of_publication
- geographic_coverage
- name
- publisher
- url
- page_url
- month
- year
- iiif_url
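
The columns above can be used to summarise the sample. The sketch below counts images per publication year and optionally keeps only rows at or above a minimum `score`. It assumes that `year` holds an integer and that `score` is a numeric value where higher means a more confident prediction. Neither assumption is confirmed by this record. The sample rows are invented for the example.

```python
import csv
import io
from collections import Counter


def images_per_year(csv_file, min_score=0.0):
    """Count images per year, keeping rows with score >= min_score.

    Assumes `year` parses as an integer and `score` as a float;
    rows where either value is missing or malformed are skipped.
    """
    per_year = Counter()
    for row in csv.DictReader(csv_file):
        try:
            score = float(row["score"])
            year = int(row["year"])
        except (KeyError, TypeError, ValueError):
            continue
        if score >= min_score:
            per_year[year] += 1
    return dict(sorted(per_year.items()))


# Invented rows using a subset of the documented columns.
SAMPLE = (
    "filepath,year,score\n"
    "x1.jpg,1900,0.95\n"
    "x2.jpg,1900,0.52\n"
    "x3.jpg,1910,0.99\n"
    "x4.jpg,,0.80\n"
)
print(images_per_year(io.StringIO(SAMPLE), min_score=0.9))
```

For the real file, replace the in-memory sample with `open("newspaper-navigator-sample-metadata.csv", newline="")`.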

Files (887.4 MB)

