Heritage Made Digital Digitised Newspapers (full text in CSV format)

Beelen, Kaspar

doi:10.5281/zenodo.15056046

Published March 20, 2025 | Version v1

Dataset Open

Beelen, Kaspar¹

1. School of Advanced Study

Introduction

The dataset contains 14 newspaper titles digitised by the Heritage Made Digital project. In total, this Zenodo record comprises 2,901,554 items (articles and other content) published between 1801 and 1871. This figure shows the distribution of the items over time.

Processing

The data were derived from the METS/ALTO XML format. The alto2txt tool converted them to plain text and XML files, one file per item, and the alto2txt2csv tool then converted these files to a structured CSV format. Each newspaper title is stored in its own CSV file, whose filename corresponds to the NLP identifier used by the British Library Catalogue. The files contain the following columns:

Description

article_headline (string): the headline or title of the article

item_type (string): content type, e.g. 'ARTICLE' or 'ADVERT'

ocr_quality_mean (float): average word OCR quality

ocr_quality_sd (float): standard deviation of the word OCR quality

word_count (integer): number of words

plain_text_file (string): original location of the plain text file

date (string): date of publication in YYYY-MM-DD format

newspaper_title (string): the title of the newspaper (see hmd-titles.txt for an overview)

location (string): place of publication

source (string): data provider (mostly British Library)

text (string): full text content of the item

year (integer): year of publication

month (integer): month of publication

day (integer): day of publication

NLP (integer): identifier of the digitised newspaper title as recorded in the British Library Catalogue

issue (integer): issue number of the newspaper

art_num (string): item identifier within the newspaper
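
As a minimal sketch of how these columns might be used, the following reads a CSV with pandas and keeps only items of one type above an OCR quality threshold. The two sample rows, the 0.8 threshold and the helper name load_items are illustrative assumptions and not part of the dataset; in practice the path to one of the per-title CSV files would be passed instead of the in-memory sample.

```python
import io

import pandas as pd

# Two made-up rows standing in for a per-title CSV file (illustrative values only).
SAMPLE_CSV = io.StringIO(
    "article_headline,item_type,ocr_quality_mean,word_count,date,text,year,NLP\n"
    "A HEADLINE,ARTICLE,0.91,120,1850-03-02,Some text,1850,1\n"
    "AN ADVERT,ADVERT,0.55,30,1850-03-02,Other text,1850,1\n"
)


def load_items(source, min_quality=0.8, item_type="ARTICLE"):
    """Read one newspaper CSV and keep items of one type above an OCR threshold."""
    df = pd.read_csv(source)
    mask = (df["item_type"] == item_type) & (df["ocr_quality_mean"] >= min_quality)
    return df.loc[mask].reset_index(drop=True)


articles = load_items(SAMPLE_CSV)
print(articles[["article_headline", "date", "word_count"]])
```

The date column is left as a string here so that it stays in the YYYY-MM-DD format described above. The full files are large, so reading them in chunks (the chunksize argument of pandas.read_csv) may be preferable on machines with limited memory.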

Contextualisation

We have written some code that enables users to contextualise the newspaper titles by adding information about political leaning, price and more. These additional reference data are derived from historical press directories. For more information, please read our recent paper on the environmental scan.

Beelen, K., Lawrence, J., McDonough, K. and Wilson, D.C.S. Whose News? Critical methods for assessing bias in large historical datasets. Computational Humanities Research, pp.1-21.

Beelen, K., Lawrence, J., Wilson, D.C. and Beavan, D., 2023. Bias and representativeness in digitized newspaper collections: Introducing the environmental scan. Digital Scholarship in the Humanities, 38(1), pp.1-22.

To add contextual information, please consult this notebook and reuse the function shown below. The df variable contains the data for one newspaper title; the df_metadata variable contains the metadata/reference data derived from the press directories.

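def add_context(df, df_metadata):
    # Work on a copy so the caller's DataFrame is not modified.
    df = df.copy()
    # Build a 'YYYY-MM' join key; this replaces the integer 'month' column.
    df['month'] = df['date'].str[:7]
    # Left join: every item is kept, even when no metadata matches its month.
    return df.merge(df_metadata, on='month', how='left')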
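
The following usage sketch shows what the function does on toy data. It assumes that df_metadata carries a month column in YYYY-MM format, which is what the merge key implies but which is not listed in the metadata description below; all values, including the political leaning and price, are made up for illustration.

```python
import pandas as pd


def add_context(df, df_metadata):
    # Same logic as the function above: derive a 'YYYY-MM' key and left-join on it.
    df = df.copy()
    df["month"] = df["date"].str[:7]
    return df.merge(df_metadata, on="month", how="left")


# Toy item table (made-up values).
df = pd.DataFrame({
    "date": ["1850-03-02", "1851-07-15"],
    "text": ["first item", "second item"],
})

# Toy reference table; the 'month' column is an assumption about df_metadata.
df_metadata = pd.DataFrame({
    "month": ["1850-03"],
    "S-POL": ["liberal"],
    "S-PRICE": ["4d"],
})

enriched = add_context(df, df_metadata)
print(enriched[["date", "S-POL", "S-PRICE"]])
```

Because the merge is a left join, the second item is kept but its metadata columns are empty (NaN), since no reference row exists for its month.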

Metadata Description

Title (string): Newspaper title as recorded in the British Library catalogue

index_npd (string): index of newspaper description within the press directory edition

id (string): index of the newspaper with

chain_id (string): identifier linking newspaper titles over time

S-TITLE (string): Newspaper title as recorded in the press directories

D-EST (string): Date of establishment

S-POL (string): political leaning of the newspaper title

S-PRICE (string): price of the newspaper title, usually per issue, but sometimes per year

DISTRICT_PUB (string): district of publication

COUNTY (string): county of publication

E-LOC (string): location where the newspaper title claims to circulate

E-ORG (string): organisation mentioned in the newspaper description

E-PER (string): person mentioned in the newspaper description

Files

Files (6.0 GB)

Name	MD5	Size
hmd-csv.zip	2f30eda67232ea6cf14584d60dec978f	6.0 GB
hmd-metadata.zip	9fbabc699f3832de3e92dd44fa04cd2b	15.1 kB
hmd-titles.txt	d8573226c4b7e8cb37c09a47a09b6e05	365 Bytes
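
The MD5 checksums listed above can be used to check that a download completed without corruption. The sketch below computes a file's MD5 digest in chunks using Python's standard hashlib module; the helper name md5sum and the local file path in the final comment are assumptions, and the demonstration runs on a small temporary file rather than the real 6.0 GB archive.

```python
import hashlib
import os
import tempfile


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a small temporary file instead of the real archive.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_path = tmp.name
demo_digest = md5sum(demo_path)
os.remove(demo_path)
print(demo_digest)

# For the real download, one would compare against the published value, e.g.
# md5sum("hmd-csv.zip") == "2f30eda67232ea6cf14584d60dec978f"
```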

Additional details

Funding

Arts and Humanities Research Council
Living with Machines, grant AH/S01179X/1
