Illustrated London News Illustration Dataset (1842-1890)

Smits, Thomas

doi:10.5281/zenodo.14169699

Published November 15, 2024 | Version 1.0.0

Dataset Open

Smits, Thomas (Project leader)¹

1. University of Amsterdam

Description

This dataset contains metadata for 72,081 illustrations extracted from the Illustrated London News (ILN) published between 1842 and 1890. The ILN was the first and most influential illustrated newspaper of the Victorian era, which makes this dataset a valuable resource for researchers in digital humanities, media history, and visual culture studies. The per-illustration metadata enables large-scale analysis of Victorian visual culture and of the evolution of newspaper illustration practices.

Content

The dataset consists of a CSV file containing the following information for each illustration:

Publication date (YYYY-MM-DD format)
Volume and issue number
Page number within issue
Bounding box coordinates (in YOLO format)
Model confidence score from the detection model
Illustration sequence number on page (indicating reading order)
OCR-extracted caption text
Original Internet Archive item identifier
Page URL for accessing the original scan

The dataset also contains a .pt file (OpenClipILNfull.pt) with multimodal OpenCLIP embeddings for all the illustrations.
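The bounding boxes listed above are in YOLO format, which conventionally stores a box as its centre and size, each normalised to the range 0 to 1 relative to the image dimensions. The following is a minimal sketch of converting such a box to pixel corners for cropping. It assumes the four values are available as separate numbers (how the CSV encodes them is not described here), and the function name and the page dimensions in the usage note are hypothetical.

```python
def yolo_to_pixel_box(x_center, y_center, width, height, page_width, page_height):
    """Convert a normalised YOLO box to pixel corners (left, top, right, bottom).

    Assumption: the inputs follow the usual YOLO convention, i.e. centre and
    size expressed as fractions of the page width and height.
    """
    left = (x_center - width / 2) * page_width
    top = (y_center - height / 2) * page_height
    right = (x_center + width / 2) * page_width
    bottom = (y_center + height / 2) * page_height

    # Round to whole pixels and clamp to the page so the box is safe to crop with.
    left = min(max(0, round(left)), page_width)
    top = min(max(0, round(top)), page_height)
    right = min(max(0, round(right)), page_width)
    bottom = min(max(0, round(bottom)), page_height)
    return left, top, right, bottom
```

For example, on a hypothetical 2000 x 3000 pixel scan, `yolo_to_pixel_box(0.5, 0.5, 0.5, 0.25, 2000, 3000)` yields the corners of a box centred on the page that is half the page wide and a quarter of the page high.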

Methods

The illustrations were systematically extracted using several computational steps:

Collection of 56,699 digitized pages from the Internet Archive's Serials in Microfilm Collection
Fine-tuning of a YOLOv8 object detection model on 908 manually annotated pages (mAP50: 0.964, mAP95: 0.92)
Automated extraction of illustrations using the fine-tuned model
Caption text extraction using Tesseract OCR
Generation of multimodal embeddings using the LAION OpenCLIP model (ViT-L-14-DataComp.XL-s13B-b90K)
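The mAP50 score reported for the detection model is based on intersection over union (IoU): a predicted box counts as matching an annotated box when their IoU is at least 0.5. The sketch below illustrates that matching criterion only; it is not the evaluation code used for this dataset, and the function names are hypothetical. Boxes are assumed to be given as (left, top, right, bottom) corners.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (left, top, right, bottom)."""
    # Corners of the overlapping rectangle, if any.
    inter_left = max(box_a[0], box_b[0])
    inter_top = max(box_a[1], box_b[1])
    inter_right = min(box_a[2], box_b[2])
    inter_bottom = min(box_a[3], box_b[3])

    # Width and height collapse to zero when the boxes do not overlap.
    inter_width = max(0, inter_right - inter_left)
    inter_height = max(0, inter_bottom - inter_top)
    intersection = inter_width * inter_height

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_match_at_50(predicted_box, annotated_box):
    """True when the prediction overlaps the annotation with IoU >= 0.5."""
    return iou(predicted_box, annotated_box) >= 0.5
```

Average precision at this threshold is then computed over all detections ranked by confidence, which this sketch does not cover.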

Code Availability

All code used to create this dataset is available in two GitHub repositories:

Repository: https://github.com/tpsmi/multimodaliln

Jupyter notebooks for downloading ILN pages
YOLOv8 fine-tuning code
Illustration extraction pipeline
OCR processing scripts
Embedding generation code

Repository: https://github.com/tpsmi/ilnmultimodalsearch

Multimodal search implementation
Text-to-image and image-to-image retrieval
User interface code
API endpoints for search functionality
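Text-to-image and image-to-image retrieval with CLIP-style embeddings is commonly done by ranking stored image embeddings by cosine similarity to a query embedding. The following is a sketch of that general idea using only the Python standard library and toy vectors. It is not taken from the repository above, the function names are hypothetical, and the internal layout of the embedding file is not assumed.

```python
import math


def cosine_similarity(vector_a, vector_b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(vector_a, vector_b))
    norm_a = math.sqrt(sum(x * x for x in vector_a))
    norm_b = math.sqrt(sum(y * y for y in vector_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_embedding, embeddings, top_k=3):
    """Return the ids of the top_k embeddings most similar to the query.

    `embeddings` is assumed to be a dict mapping an illustration id to its
    embedding vector (a hypothetical structure for this sketch).
    """
    scored = [
        (cosine_similarity(query_embedding, vector), item_id)
        for item_id, vector in embeddings.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [item_id for _, item_id in scored[:top_k]]
```

The same ranking works for both retrieval modes: the query embedding comes from the text encoder for text-to-image search and from the image encoder for image-to-image search. For a collection of this size, a vectorised or indexed implementation would be preferable to this pure-Python loop.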

Original Data Source

The original page scans are freely available through the Internet Archive's Serials in Microfilm Collection. This dataset builds upon these public domain materials by providing structured metadata and computational annotations.

Citation

Please cite our dataset paper (to be added) or this dataset via its Zenodo DOI (doi:10.5281/zenodo.14169699).

Related Publications

[Publication details when available]

Files

Files (242.7 MB)

Name	Size	MD5
iln_text_date_volume_issue_page.csv	21.3 MB	134246507b967cb6a0d0a2ae30121aa4
OpenClipILNfull.pt	221.4 MB	dd011727d1f4feaad89b81f81cdc86ac
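The MD5 checksums listed for the two files can be used to check that a download completed without corruption. The sketch below uses Python's standard `hashlib` module. The helper names and the local paths passed to them are hypothetical; the expected checksums are the ones given in the file listing.

```python
import hashlib

# Checksums as published in the file listing for this record.
EXPECTED_MD5 = {
    "iln_text_date_volume_issue_page.csv": "134246507b967cb6a0d0a2ae30121aa4",
    "OpenClipILNfull.pt": "dd011727d1f4feaad89b81f81cdc86ac",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, filename):
    """True when the file at `path` matches the published checksum for `filename`."""
    return md5_of_file(path) == EXPECTED_MD5[filename]
```

For example, `verify_download("downloads/OpenClipILNfull.pt", "OpenClipILNfull.pt")` would check a copy saved under a hypothetical `downloads/` directory.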

Additional details

Collected: 2024-06-01
Available: 2024-11-18

Repository URL: https://github.com/tpsmi/multimodaliln
Programming language: Python

