Published November 15, 2024 | Version 1.0.0
Dataset Open

Illustrated London News Illustration Dataset (1842-1890)

  • 1. University of Amsterdam

Description

This dataset contains metadata for 72,081 illustrations extracted from the Illustrated London News (ILN) published between 1842 and 1890. The ILN was the first and most influential illustrated newspaper of the Victorian era, making this dataset a valuable resource for researchers in digital humanities, media history, and visual culture studies. The dataset provides detailed information about each illustration, enabling large-scale analysis of Victorian visual culture and the evolution of newspaper illustration practices.

Content

The dataset consists of a CSV file containing the following information for each illustration:

  • Publication date (YYYY-MM-DD format)
  • Volume and issue number
  • Page number within issue
  • Bounding box coordinates (in YOLO format)
  • Model confidence score from the detection model
  • Illustration sequence number on page (indicating reading order)
  • OCR-extracted caption text
  • Original Internet Archive item identifier
  • Page URL for accessing the original scan
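YOLO-format boxes are normalized: each box is stored as a center point plus a width and height, all expressed as fractions of the page dimensions. A minimal sketch of converting such a box to pixel corners for cropping follows. The function name, argument order, and example page size are illustrative assumptions, and the actual column names should be checked against the CSV header.

```python
def yolo_to_pixel_box(x_center, y_center, width, height, page_width, page_height):
    """Convert a normalized YOLO box to (left, top, right, bottom) pixel corners."""
    left = (x_center - width / 2) * page_width
    top = (y_center - height / 2) * page_height
    right = (x_center + width / 2) * page_width
    bottom = (y_center + height / 2) * page_height
    # Round to whole pixels and clamp so the box never leaves the page.
    return (
        max(0, round(left)),
        max(0, round(top)),
        min(page_width, round(right)),
        min(page_height, round(bottom)),
    )


# Hypothetical example: a box centered on a 2000 x 3000 pixel scan,
# half the page wide and a quarter of the page tall.
box = yolo_to_pixel_box(0.5, 0.5, 0.5, 0.25, 2000, 3000)
print(box)
```

The resulting tuple can be passed to an image library's crop function once the scan has been downloaded from the page URL.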

The dataset also contains a .pt file with multimodal embeddings (OpenCLIP) for all the illustrations.
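Embeddings of this kind are typically compared by cosine similarity, which is one way image-to-image retrieval over the illustrations can work. The sketch below ranks a few toy vectors in plain Python. The identifiers and three-dimensional vectors are made up for illustration, and the internal layout of the embeddings file (which would normally be read with PyTorch's torch.load) is not documented here, so it should be inspected before use.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as similar to nothing.
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, embeddings, top_k=3):
    """Return the top_k (identifier, score) pairs, most similar first."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy stand-ins for real embeddings; the keys are hypothetical identifiers.
toy_embeddings = {
    "illustration_a": [1.0, 0.0, 0.0],
    "illustration_b": [0.6, 0.8, 0.0],
    "illustration_c": [0.0, 0.0, 1.0],
}
ranking = rank_by_similarity([1.0, 0.0, 0.0], toy_embeddings, top_k=2)
print([key for key, _ in ranking])
```

For the full set of 72,081 vectors, a vectorized matrix product over normalized embeddings would be far faster than this loop; the plain-Python version is only meant to show the comparison itself.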

Methods

The illustrations were systematically extracted using several computational steps:

  1. Collection of 56,699 digitized pages from the Internet Archive's Serials in Microfilm Collection
  2. Fine-tuning of a YOLOv8 object detection model on 908 manually annotated pages (mAP50: 0.964, mAP95: 0.92)
  3. Automated extraction of illustrations using the fine-tuned model
  4. Caption text extraction using Tesseract OCR
  5. Generation of multimodal embeddings using the LAION OpenCLIP model (ViT-L-14-DataComp.XL-s13B-b90K)
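The mAP50 figure in step 2 rests on intersection over union (IoU): a detected box counts as a match when its IoU with an annotated box is at least 0.5. A small sketch of that overlap measure follows, with boxes given as (left, top, right, bottom) corners. The function name and example boxes are illustrative and are not taken from the project's code.

```python
def iou(box_a, box_b):
    """Intersection over union of two (left, top, right, bottom) boxes."""
    inter_left = max(box_a[0], box_b[0])
    inter_top = max(box_a[1], box_b[1])
    inter_right = min(box_a[2], box_b[2])
    inter_bottom = min(box_a[3], box_b[3])

    # Width and height collapse to zero when the boxes do not overlap.
    inter_width = max(0, inter_right - inter_left)
    inter_height = max(0, inter_bottom - inter_top)
    intersection = inter_width * inter_height

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical boxes: the prediction covers the top half of the annotation.
annotated = (0, 0, 100, 100)
predicted = (0, 0, 100, 50)
score = iou(annotated, predicted)
print(score, score >= 0.5)
```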

Code Availability

All code used to create this dataset is available in two GitHub repositories:

Repository: https://github.com/tpsmi/multimodaliln 

  • Jupyter notebooks for downloading ILN pages
  • YOLOv8 fine-tuning code
  • Illustration extraction pipeline
  • OCR processing scripts
  • Embedding generation code

Repository: https://github.com/tpsmi/ilnmultimodalsearch 

  • Multimodal search implementation
  • Text-to-image and image-to-image retrieval
  • User interface code
  • API endpoints for search functionality

Original Data Source

The original page scans are freely available through the Internet Archive's Serials in Microfilm Collection. This dataset builds upon these public domain materials by providing structured metadata and computational annotations.

Citation

Please cite our dataset paper (to be added) or this dataset (Zenodo DOI).

Related Publications

[Publication details when available]

Files

iln_text_date_volume_issue_page.csv

Files (242.7 MB)

Checksum Size
md5:134246507b967cb6a0d0a2ae30121aa4 21.3 MB
md5:dd011727d1f4feaad89b81f81cdc86ac 221.4 MB

Additional details

Dates

Collected
2024-06-01
Available
2024-11-18

Software

Repository URL
https://github.com/tpsmi/multimodaliln
Programming language
Python