Heritage Made Digital Digitised Newspapers (full text in CSV format)
Description
Introduction
The dataset contains 14 newspaper titles digitised by the Heritage Made Digital project. In total, this Zenodo record comprises 2,901,554 items (articles and other content) published between 1801 and 1871. This figure shows the distribution of the items over time.
Processing
The data were derived from the METS/ALTO XML format. The alto2txt tool converted them to text and XML files, one file per item; the alto2txt2csv tool then converted these files to a structured CSV format. Each newspaper title is stored as one CSV file, with a filename corresponding to the NLP identifier used by the British Library Catalogue. The files contain the following columns:
article_headline (string): the headline or title of the article
item_type (string): content type, e.g. 'ARTICLE' or 'ADVERT'
ocr_quality_mean (float): average word OCR quality
ocr_quality_sd (float): standard deviation of the word OCR quality
word_count (integer): number of words
plain_text_file (string): original location of the plain text file
date (string): date of publication in YYYY-MM-DD format
newspaper_title (string): the title of the newspaper (see lwm-titles.txt for an overview)
location (string): place of publication
source (string): data provider (mostly British Library)
text (string): full text content of the item
year (integer): year of publication
month (integer): month of publication
day (integer): day of publication
NLP (integer): identifier for digitised newspaper as recorded by the British Library Catalogue
issue (integer): issue number of the newspaper
art_num (string): item identifier within the newspaper
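As a rough illustration of how the columns listed above might be used, the following sketch reads one CSV with pandas and keeps only articles above an OCR-quality threshold. The sample rows are invented stand-ins for a real per-title file, load_title is a hypothetical helper name, and the 0.8 threshold assumes that ocr_quality_mean lies between 0 and 1, which this record does not state explicitly.

```python
import io
import pandas as pd

# Tiny invented sample standing in for one per-title CSV file; the real
# files carry all the columns listed above and far more rows.
SAMPLE_CSV = io.StringIO(
    "article_headline,item_type,ocr_quality_mean,word_count,date,text\n"
    "EXAMPLE HEADLINE,ARTICLE,0.91,5,1851-05-01,some example text goes here\n"
    "EXAMPLE NOTICE,ADVERT,0.42,3,1851-05-02,noisy ocr text\n"
)

def load_title(source, min_quality=0.8):
    """Read one title's CSV and keep articles at or above an OCR-quality threshold.

    min_quality is an arbitrary illustrative value (assumed 0-1 scale).
    """
    df = pd.read_csv(source)
    # Keep only items typed as articles whose mean word OCR quality is high enough
    keep = (df["item_type"] == "ARTICLE") & (df["ocr_quality_mean"] >= min_quality)
    return df[keep].reset_index(drop=True)

articles = load_title(SAMPLE_CSV)
print(articles[["article_headline", "date", "word_count"]])
```

For a real file, pass its path instead of SAMPLE_CSV. The date column is deliberately left as a YYYY-MM-DD string here, since the contextualisation function further down slices it as text.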
Contextualisation
We have written some code that enables users to contextualise the newspaper titles by adding information about political leaning, price and more. These additional reference data are derived from historical press directories. For more information, please read our recent paper on the environmental scan.
Beelen, K., Lawrence, J., McDonough, K. and Wilson, D.C.S. Whose News? Critical methods for assessing bias in large historical datasets. Computational Humanities Research, pp.1-21.
Beelen, K., Lawrence, J., Wilson, D.C. and Beavan, D., 2023. Bias and representativeness in digitized newspaper collections: Introducing the environmental scan. Digital Scholarship in the Humanities, 38(1), pp.1-22.
To add contextual information, please consult this notebook and reuse the function shown below. The df variable contains the data for one newspaper title; the df_metadata variable contains all the metadata/reference data derived from the press directories.
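def add_context(df, df_metadata):
    # Overwrite 'month' with a 'YYYY-MM' key taken from the date string
    df['month'] = df['date'].str[:7]
    # Left join on 'month' keeps every item, even without matching metadata
    df = df.merge(df_metadata, on='month', how='left')
    return df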
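The sketch below shows one way the function might be called, using two small invented data frames. Only the month key is implied by the merge; the politics and price column names and all the values are assumptions made for illustration, not the actual layout of the press-directory reference data.

```python
import pandas as pd

def add_context(df, df_metadata):
    # Same logic as the function above, repeated so the sketch runs on its own;
    # the copy is an addition so the caller's frame is not modified
    df = df.copy()
    df["month"] = df["date"].str[:7]
    return df.merge(df_metadata, on="month", how="left")

# Invented rows; real data come from the per-title CSV files
df = pd.DataFrame({
    "date": ["1851-05-01", "1851-06-15"],
    "text": ["first item", "second item"],
})

# Hypothetical reference table: 'politics' and 'price' are assumed column names
df_metadata = pd.DataFrame({
    "month": ["1851-05"],
    "politics": ["liberal"],
    "price": ["3d"],
})

enriched = add_context(df, df_metadata)
# With how='left', items without a metadata match are kept with missing values
unmatched = int(enriched["politics"].isna().sum())
print(enriched)
print(f"{unmatched} item(s) without press-directory context")
```

Counting the unmatched rows after the merge is a quick way to see how much of a title falls outside the period covered by the reference data.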
Metadata Description