Illustrated London News Illustration Dataset (1842-1890)
Description
This dataset contains metadata for 72,081 illustrations extracted from the Illustrated London News (ILN) published between 1842 and 1890. The ILN was the first and most influential illustrated newspaper of the Victorian era, which makes this dataset a valuable resource for researchers in digital humanities, media history, and visual culture studies. The per-illustration metadata enables large-scale analysis of Victorian visual culture and of how newspaper illustration practices evolved.
Content
The dataset consists of a CSV file containing the following information for each illustration:
- Publication date (YYYY-MM-DD format)
- Volume and issue number
- Page number within issue
- Bounding box coordinates (in YOLO format)
- Confidence score from the detection model
- Illustration sequence number on page (indicating reading order)
- OCR-extracted caption text
- Original Internet Archive item identifier
- Page URL for accessing the original scan
The dataset also contains a .pt file with multimodal embeddings (OpenCLIP) for all the illustrations.
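As an illustration of how the bounding box field might be used, the sketch below converts a YOLO-style box to pixel coordinates for cropping. It assumes the standard normalized YOLO form (x_center, y_center, width, height, each in the range 0 to 1). The column names and the sample row are hypothetical, since the CSV header is not listed here, and the page dimensions are made-up values.

```python
import csv
import io


def yolo_to_pixel_box(x_center, y_center, width, height, page_w, page_h):
    """Convert a normalized YOLO box to (left, top, right, bottom) in pixels."""
    left = (x_center - width / 2) * page_w
    top = (y_center - height / 2) * page_h
    right = (x_center + width / 2) * page_w
    bottom = (y_center + height / 2) * page_h
    return tuple(round(v) for v in (left, top, right, bottom))


# Hypothetical sample row: the real CSV's column names may differ.
sample = io.StringIO(
    "date,page,x_center,y_center,width,height,confidence\n"
    "1842-05-14,1,0.5,0.25,0.4,0.3,0.97\n"
)

for row in csv.DictReader(sample):
    box = yolo_to_pixel_box(
        float(row["x_center"]),
        float(row["y_center"]),
        float(row["width"]),
        float(row["height"]),
        page_w=2000,  # assumed scan width in pixels
        page_h=3000,  # assumed scan height in pixels
    )
    print(row["date"], box)
```

The resulting tuple follows the (left, top, right, bottom) convention that most image libraries expect for cropping, so it could be passed to a crop call once the page scan has been downloaded.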
Methods
The illustrations were systematically extracted using several computational steps:
- Collection of 56,699 digitized pages from the Internet Archive's Serials in Microfilm Collection
- Fine-tuning of a YOLOv8 object detection model on 908 manually annotated pages (mAP50: 0.964, mAP95: 0.92)
- Automated extraction of illustrations using the fine-tuned model
- Caption text extraction using Tesseract OCR
- Generation of multimodal embeddings using the LAION OpenCLIP model (ViT-L-14-DataComp.XL-s13B-b90K)
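The mAP50 figure reported for the detection model rests on intersection over union (IoU): a predicted box counts as a match when its IoU with an annotated box is at least 0.5. The sketch below shows that overlap measure only. It is not the evaluation code used for this dataset, and the two example boxes are invented.

```python
def iou(box_a, box_b):
    """Intersection over union of two (left, top, right, bottom) boxes."""
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Invented boxes for illustration only.
predicted = (100, 100, 300, 300)
annotated = (150, 150, 350, 350)

score = iou(predicted, annotated)
print(f"IoU = {score:.3f}, match at 0.5 threshold: {score >= 0.5}")
```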
Code Availability
All code used to create this dataset is available in two GitHub repositories:
Repository: https://github.com/tpsmi/multimodaliln
- Jupyter notebooks for downloading ILN pages
- YOLOv8 fine-tuning code
- Illustration extraction pipeline
- OCR processing scripts
- Embedding generation code
Repository: https://github.com/tpsmi/ilnmultimodalsearch
- Multimodal search implementation
- Text-to-image and image-to-image retrieval
- User interface code
- API endpoints for search functionality
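Retrieval over CLIP-style embeddings is commonly done by ranking stored vectors by cosine similarity to a query vector. The sketch below shows that idea with toy three-dimensional vectors in plain Python. It is not the implementation in the repository above: the real embeddings are stored in the .pt file, whose internal layout is not described here, and the function names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, embeddings, k=2):
    """Return the indices of the k embeddings most similar to the query."""
    scored = [
        (cosine_similarity(query, vec), idx)
        for idx, vec in enumerate(embeddings)
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


# Toy stand-ins for illustration embeddings (real ones are much longer).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
# Stand-in for an encoded text or image query.
query = [1.0, 0.05, 0.0]

print(top_k(query, toy_embeddings, k=2))
```

Because text and images share one embedding space in CLIP-style models, the same ranking step can serve both text-to-image and image-to-image search; only the way the query vector is produced changes.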
Original Data Source
The original page scans are freely available through the Internet Archive's Serials in Microfilm Collection. This dataset builds upon these public domain materials by providing structured metadata and computational annotations.
Citation
Please cite our dataset paper (will be added) or this dataset (Zenodo DOI)
Related Publications
[Publication details when available]
Files
iln_text_date_volume_issue_page.csv
Total size: 242.7 MB
| Size | MD5 checksum |
|---|---|
| 21.3 MB | 134246507b967cb6a0d0a2ae30121aa4 |
| 221.4 MB | dd011727d1f4feaad89b81f81cdc86ac |
Additional details
Dates
- Collected: 2024-06-01
- Available: 2024-11-18
Software
- Repository URL: https://github.com/tpsmi/multimodaliln
- Programming language: Python