Published March 20, 2025 | Version v1
Dataset Open

Heritage Made Digital Digitised Newspapers (full text in CSV format)

  • 1. School of Advanced Study

Description

Introduction

The dataset contains 14 newspaper titles digitised by the Heritage Made Digital project. In total, this Zenodo record comprises 2,901,554 items (articles and other content) published between 1801 and 1871. This figure shows the distribution of the items over time. 

Processing

The data were derived from the METS/ALTO XML format. The alto2txt tool first converted them to text and XML files, one file per item. The alto2txt2csv tool then converted these files to a structured CSV format. Each newspaper title is stored as one CSV file, named after its NLP identifier as used by the British Library Catalogue. The files contain the following columns:

Description

article_headline (string): the headline or title of the article

item_type (string): content type, e.g. 'ARTICLE' or 'ADVERT'

ocr_quality_mean (float): average word OCR quality 

ocr_quality_sd (float): standard deviation of the word OCR quality

word_count (integer): number of words

plain_text_file (string): original location of the plain text file

date (string): date of publication in YYYY-MM-DD format

newspaper_title (string): the title of the newspaper (see lwm-titles.txt for an overview)

location (string): place of publication

source (string): data provider (mostly British Library)

text (string): full text content of the item 

year (integer): year of publication

month (integer): month of publication

day (integer): day of publication

NLP (integer): identifier for digitised newspaper as recorded by the British Library Catalogue

issue (integer): issue number of the newspaper 

art_num (string): item identifier within the newspaper
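
As a rough illustration of working with the columns listed above, the sketch below streams one title's CSV out of a zip archive and keeps only items of a given type above an OCR-quality threshold. It makes several assumptions that the record does not confirm: that each CSV has a header row with the column names above, that the files are UTF-8 encoded, and that ocr_quality_mean is a score between 0 and 1. The function name iter_items, the member name sample_title.csv and the sample rows are hypothetical; the archive is built in memory so the sketch runs without the 6.0 GB download.

```python
import csv
import io
import zipfile

# Item texts can be long; raise the csv module's per-field limit (arbitrary value).
csv.field_size_limit(10_000_000)


def iter_items(zip_file, member, item_type="ARTICLE", min_quality=0.0):
    """Lazily yield rows (as dicts) from one CSV inside a zip archive."""
    with zipfile.ZipFile(zip_file) as archive:
        with archive.open(member) as raw:
            # Assumption: UTF-8 text with a header row naming the columns.
            text_stream = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for row in csv.DictReader(text_stream):
                if row["item_type"] != item_type:
                    continue
                quality = row["ocr_quality_mean"]
                # Skip rows with a missing score or one below the threshold.
                if quality == "" or float(quality) < min_quality:
                    continue
                yield row


# Invented sample rows standing in for a real title file (subset of columns).
sample_csv = (
    "article_headline,item_type,ocr_quality_mean,word_count,date,text\n"
    "A FIRE,ARTICLE,0.91,4,1851-03-02,A fire broke out\n"
    "SOAP,ADVERT,0.95,3,1851-03-02,Buy our soap\n"
    "NOTICE,ARTICLE,0.42,2,1851-03-03,N0t1ce h3re\n"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("sample_title.csv", sample_csv)

# 0.8 is an arbitrary threshold chosen for the example.
kept = list(iter_items(buffer, "sample_title.csv", min_quality=0.8))
print([row["article_headline"] for row in kept])
```

Reading row by row keeps memory use low, which may matter given the size of the archive; for the real data, zip_file would be the path to the downloaded archive and member the name of a CSV inside it.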

 

Contextualisation

We have written some code that enables users to contextualise the newspaper titles by adding information about political leaning, price and more. These additional reference data are derived from historical press directories. For more information, please read our recent paper on the environmental scan.

Beelen, K., Lawrence, J., McDonough, K. and Wilson, D.C.S. Whose News? Critical methods for assessing bias in large historical datasets. Computational Humanities Research, pp.1-21.

Beelen, K., Lawrence, J., Wilson, D.C. and Beavan, D., 2023. Bias and representativeness in digitized newspaper collections: Introducing the environmental scan. Digital Scholarship in the Humanities, 38(1), pp.1-22.

 

To add contextual information, please consult this notebook and reuse the function shown below. The df variable holds the data for one newspaper title; the df_metadata variable holds all the metadata/reference data derived from the press directories.

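def add_context(df, df_metadata):
    # Build a 'YYYY-MM' join key from 'date'; this overwrites the integer 'month' column
    df['month'] = df['date'].str[:7]
    # Left join keeps every item, even when no metadata row matches its month
    df = df.merge(df_metadata, on='month', how='left')
    return df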
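
A small usage sketch may help show what the merge produces. It assumes that df_metadata carries a month column of 'YYYY-MM' strings (the function's join key implies this, but the record does not spell it out) and that date is read as a string. The two data frames and all their values are invented for illustration, and add_context is restated so the block runs on its own.

```python
import pandas as pd


def add_context(df, df_metadata):
    # Same logic as the function above, restated to keep the example self-contained.
    df['month'] = df['date'].str[:7]
    df = df.merge(df_metadata, on='month', how='left')
    return df


# Invented items for one newspaper title.
df = pd.DataFrame({
    "date": ["1851-03-02", "1851-04-10"],
    "text": ["A fire broke out", "Market prices"],
})

# Invented reference data; assumes a 'month' column in 'YYYY-MM' format.
df_metadata = pd.DataFrame({
    "month": ["1851-03"],
    "S-POL": ["liberal"],
    "S-PRICE": ["3d"],
})

enriched = add_context(df, df_metadata)

# With a left join, items without a metadata match get missing values.
missing = int(enriched["S-POL"].isna().sum())
print(enriched[["date", "month", "S-POL", "S-PRICE"]])
print(f"{missing} of {len(enriched)} items have no matching metadata")
```

Counting missing values after the join is a quick way to see how much of a title is actually covered by the press directory data.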

Metadata Description

Title (string): Newspaper title as recorded in the British Library catalogue
 
index_npd (string): index of newspaper description within the press directory edition
 
id (string): index of the newspaper with
 
chain_id (string): identifier linking newspaper titles over time
 
S-TITLE (string): Newspaper title as recorded in the press directories
 
D-EST (string): Date of establishment
 
S-POL (string): political leaning of the newspaper title
 
S-PRICE (string): price of newspaper title, usually price per issue, but sometimes annually
 
DISTRICT_PUB (string): district of publication
 
COUNTY (string): county of publication
 
E-LOC (string): location where the newspaper title claims to circulate
 
E-ORG (string): organisation mentioned in the newspaper description
 
E-PER (string): person mentioned in the newspaper description
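
Once the metadata fields above are attached to the items, they can be used as grouping keys. The sketch below counts items and words per political leaning. The data frame stands in for the output of add_context; its titles and values are invented, and the actual labels used in S-POL may look different.

```python
import pandas as pd

# Invented rows standing in for items enriched with press directory metadata.
enriched = pd.DataFrame({
    "newspaper_title": ["Title A", "Title A", "Title B", "Title C"],
    "word_count": [120, 90, 200, 50],
    "S-POL": ["liberal", "liberal", "conservative", None],
})

# dropna=False keeps items without a recorded leaning as their own group.
summary = (
    enriched
    .groupby("S-POL", dropna=False)["word_count"]
    .agg(items="count", words="sum")
    .sort_values("words", ascending=False)
)
print(summary)
```

Keeping the group of items without a recorded leaning visible makes gaps in the reference data easier to spot.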

 

Files

hmd-csv.zip

Files (6.0 GB)

Checksum Size
md5:2f30eda67232ea6cf14584d60dec978f 6.0 GB
md5:9fbabc699f3832de3e92dd44fa04cd2b 15.1 kB
md5:d8573226c4b7e8cb37c09a47a09b6e05 365 Bytes
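
The MD5 checksums listed above can be used to check that a download completed without corruption. The sketch below hashes a file in chunks, so that a multi-gigabyte archive does not need to fit in memory, and compares the result with a checksum written in the 'md5:<hex>' form used in the listing. The helper names md5_of_file and matches_listing are hypothetical, and the demonstration uses a small temporary file rather than the real archive; which checksum belongs to which file should be confirmed on the record page.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>'."""
    return md5_of_file(path) == listed.removeprefix("md5:").lower()


# Demonstration on a temporary file standing in for a downloaded file.
payload = b"stand-in for a downloaded file"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    tmp_path = tmp.name

listed = "md5:" + hashlib.md5(payload).hexdigest()
ok = matches_listing(tmp_path, listed)
tampered = matches_listing(tmp_path, "md5:" + "0" * 32)
os.remove(tmp_path)
print(ok, tampered)
```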

Additional details

Funding

Arts and Humanities Research Council
Living with Machines AH/S01179X/1