---
title: hmd_newspaper_dl
keywords: fastai
sidebar: home_sidebar
summary: "Download Heritage Made Digital Newspapers from the BL repository"
description: "Download Heritage Made Digital Newspapers from the BL repository"
nb_path: "00_core.ipynb"
---
{% raw %}
{% endraw %}

The aim of this code is to make it easier to download all of the Heritage Made Digital Newspapers from the British Library's Research Repository.

{% raw %}
{% endraw %}

The newspapers are currently organised by newspaper title under a collection:

Under each title you can download a zip file representing a year for that particular newspaper title.

If we only want a subset of years or titles we could download these manually, but if we're interested in using computational methods that's a bit slow. What we need to do is grab all of the URLs for each title so we can bulk download them.

{% raw %}
{% endraw %}

This is a small helper function that generates the correct URL once we have an ID for a title.
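As a rough sketch of the idea (the function name and exact URL template here are assumptions, modelled on the dataset links scraped elsewhere in this notebook):

```python
def make_title_url(dataset_id: str) -> str:
    """Build the repository page URL for a newspaper title's dataset ID.

    Hypothetical sketch: the name and URL template are assumptions based on
    the dataset links this notebook works with.
    """
    return f"https://bl.iro.bl.uk/concern/datasets/{dataset_id}?locale=en"

url = make_title_url("93ec8ab4-3348-409c-bf6d-a9537156f654")
```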

{% raw %}

get_newspaper_links[source]

get_newspaper_links()

Returns titles from the Newspaper Collection

{% endraw %} {% raw %}
{% endraw %}

This function starts from the Newspaper collection page and then uses BeautifulSoup to scrape all of the URLs which link to a newspaper title. We have a hard-coded URL here, which isn't great practice, but since we're writing this code for a fairly narrow purpose we won't worry about that here.
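The scraping step can be sketched with the standard library alone. This is not the module's actual implementation (which uses BeautifulSoup against the live collection page); the class name and the HTML snippet are invented for illustration:

```python
from html.parser import HTMLParser

class TitleLinkParser(HTMLParser):
    """Collect (title, href) pairs for anchors pointing at dataset pages.

    Stdlib stand-in for the BeautifulSoup scraping in get_newspaper_links.
    """
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the anchor we are currently inside

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href", "")
        if tag == "a" and "/concern/datasets/" in href:
            self._href = href

    def handle_data(self, data):
        if self._href:  # text inside a dataset anchor is the title
            self.links.append((data.strip(), self._href))
            self._href = None

# a toy snippet standing in for the fetched collection page
html = '<a href="https://bl.iro.bl.uk/concern/datasets/93ec8ab4?locale=en">The Express</a>'
parser = TitleLinkParser()
parser.feed(html)
```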

If we call this function we get a bunch of links back.

{% raw %}
links = get_newspaper_links()
links
[('The Express',
  'https://bl.iro.bl.uk/concern/datasets/93ec8ab4-3348-409c-bf6d-a9537156f654?locale=en'),
 ('The Press.',
  'https://bl.iro.bl.uk/concern/datasets/2f70fbcd-9530-496a-903f-dfa4e7b20d3b?locale=en'),
 ('The Star',
  'https://bl.iro.bl.uk/concern/datasets/dd9873cf-cba1-4160-b1f9-ccdab8eb6312?locale=en'),
 ('National Register.',
  'https://bl.iro.bl.uk/concern/datasets/f3ecea7f-7efa-4191-94ab-e4523384c182?locale=en'),
 ('The Statesman',
  'https://bl.iro.bl.uk/concern/datasets/551cdd7b-580d-472d-8efb-b7f05cf64a11?locale=en'),
 ('The British Press; or, Morning Literary Advertiser',
  'https://bl.iro.bl.uk/concern/datasets/aef16a3c-53b6-4203-ac08-d102cb54f8fa?locale=en'),
 ('The Sun',
  'https://bl.iro.bl.uk/concern/datasets/b9a877b8-db7a-4e5f-afe6-28dc7d3ec988?locale=en'),
 ('The Liverpool Standard etc',
  'https://bl.iro.bl.uk/concern/datasets/fb5e24e3-0ac9-4180-a1f4-268fc7d019c1?locale=en'),
 ('Colored News',
  'https://bl.iro.bl.uk/concern/datasets/bacd53d6-86b7-4f8a-af31-0a12e8eaf6ee?locale=en'),
 ('The Northern Daily Times etc',
  'https://bl.iro.bl.uk/concern/datasets/5243dccc-3fad-4a9e-a2c1-d07e750c46a6?locale=en')]
{% endraw %}

Although this code has a fairly narrow scope, we might still want some tests to check we're not completely off. nbdev makes this super easy. Here we check that we get back what we expect in terms of tuple length and that our URLs look like URLs.

{% raw %}
assert len(links[0]) == 2 #test tuple len
assert next(iter(set(map(urlvalid, map(itemgetter(1), links))))) == True #check second item valid url
{% endraw %} {% raw %}
assert len(links) == 10
assert type(links[0]) == tuple
assert (list(map(itemgetter(1), links))[-1]).startswith("https://")
{% endraw %} {% raw %}

get_download_urls[source]

get_download_urls(url:str)

Given a dataset page on the IRO repo return all download links for that page

{% endraw %} {% raw %}
{% endraw %}

get_download_urls takes a 'title' URL and then grabs all of the URLs for the zip files related to that title.
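The extraction can be sketched as a filter over the page's anchors. The helper name and the regex approach below are illustrative only; the real get_download_urls fetches the page and uses BeautifulSoup:

```python
import re

# hrefs on a dataset page that point at downloadable zip files
DOWNLOAD_HREF = re.compile(r'href="(/downloads/[^"]+)"')

def extract_download_urls(page_html: str, base: str = "https://bl.iro.bl.uk") -> list:
    """Hypothetical helper: pull /downloads/ links out of dataset-page HTML."""
    return [base + path for path in DOWNLOAD_HREF.findall(page_html)]

page = '<a href="/downloads/9c24784d-56e6-44c1-bcc6-774fadc87718?locale=en">1846.zip</a>'
urls = extract_download_urls(page)
```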

{% raw %}
get_download_urls("https://bl.iro.bl.uk/concern/datasets/93ec8ab4-3348-409c-bf6d-a9537156f654")
['https://bl.iro.bl.uk/downloads/9c24784d-56e6-44c1-bcc6-774fadc87718?locale=en',
 'https://bl.iro.bl.uk/downloads/5072df1a-75f3-4379-961a-59ac3566bc2f?locale=en',
 'https://bl.iro.bl.uk/downloads/e272c936-24ac-4702-bdee-483ec9b0c8be?locale=en',
 'https://bl.iro.bl.uk/downloads/9c4f2fd6-d58c-4a57-8fac-a5dd273f8ed3?locale=en',
 'https://bl.iro.bl.uk/downloads/e89ca9c4-b101-44bf-b1de-15052eb63d5e?locale=en',
 'https://bl.iro.bl.uk/downloads/80708825-d96a-4301-9496-9598932520f4?locale=en',
 'https://bl.iro.bl.uk/downloads/ebd5d9eb-e0ec-40b0-ae10-132cdfbaa4e1?locale=en',
 'https://bl.iro.bl.uk/downloads/7ac7a0cb-29a2-4172-8b79-4952e2c9b128?locale=en',
 'https://bl.iro.bl.uk/downloads/54d974ba-fcb2-4566-a5ac-b66d85954963?locale=en',
 'https://bl.iro.bl.uk/downloads/b40aabab-b366-4148-975e-4481d30ba182?locale=en',
 'https://bl.iro.bl.uk/downloads/17b6e110-8ed0-46cb-8030-6cc7f387ade5?locale=en',
 'https://bl.iro.bl.uk/downloads/319d5656-94b0-4cbf-8f0d-d3ce0aa3ab40?locale=en',
 'https://bl.iro.bl.uk/downloads/2997c3ff-323f-45e6-ac1c-4a147b7c78ff?locale=en',
 'https://bl.iro.bl.uk/downloads/3fd6b687-feb0-4d92-b8d7-4ea0acc5346c?locale=en',
 'https://bl.iro.bl.uk/downloads/5b450972-990c-4ed5-a979-2c3fef6d0c4a?locale=en',
 'https://bl.iro.bl.uk/downloads/30b3e2ac-2e49-410d-8635-dfa69b23f65c?locale=en',
 'https://bl.iro.bl.uk/downloads/aa8b9145-a7d9-4869-8f3e-07d864238ff0?locale=en',
 'https://bl.iro.bl.uk/downloads/0ea7aa1f-3b4f-4972-bc12-b7559769471f?locale=en',
 'https://bl.iro.bl.uk/downloads/7c2cf32f-5767-4632-87d0-3001fc5689cc?locale=en',
 'https://bl.iro.bl.uk/downloads/050096c0-0166-4af4-89d7-29143ce8c73c?locale=en',
 'https://bl.iro.bl.uk/downloads/50ebdb11-9186-4c24-90e5-27caf73d3f11?locale=en',
 'https://bl.iro.bl.uk/downloads/0fd85a65-bfa3-4db8-8b92-7fc305cab4d4?locale=en',
 'https://bl.iro.bl.uk/downloads/a7a674bf-2517-4fbc-ad20-14d61646d80e?locale=en']
{% endraw %} {% raw %}

create_session[source]

create_session()

returns a requests session

{% endraw %} {% raw %}
{% endraw %}

create_session just adds some extra things to our requests session to try and make it a little more robust. This is probably not necessary here, but it can be useful to bump up the number of retries.
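A common way to do this, and roughly the shape you might expect create_session to have, is to mount an HTTPAdapter configured with urllib3's Retry. The retry counts, backoff and status list below are illustrative defaults, not necessarily the module's actual settings:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_session(retries: int = 5, backoff_factor: float = 0.5) -> requests.Session:
    """Return a requests Session that retries transient failures.

    Sketch only: parameter values are illustrative assumptions.
    """
    session = requests.Session()
    retry = Retry(
        total=retries,
        backoff_factor=backoff_factor,  # exponential backoff between attempts
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session

s = create_session()
```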

{% raw %}
{% endraw %} {% raw %}
#     s = create_session()
#     r = s.get(url, stream=True, timeout=(30))
#     print("_".join(r.headers["Content-Disposition"].split('"')[1].split("_")[0:5]))
{% endraw %} {% raw %}
# r = s.get(test_url, stream=True, timeout=(30))
# "_".join(r.headers["Content-Disposition"].split('"')[1].split("_")[0:5])
{% endraw %}

This downloads a file and logs an exception if something goes wrong. Again we do a little test.
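A stdlib-only sketch of that shape is below. The real _download streams through the retrying requests session and logs with loguru; the filename fallback here is an assumption added so the sketch is self-contained:

```python
import logging
from pathlib import Path
from typing import Optional
from urllib.request import urlopen

def _download(url: str, save_dir: Path) -> Optional[str]:
    """Fetch url into save_dir and return the saved filename,
    or log the exception and return None on failure."""
    try:
        with urlopen(url, timeout=30) as r:
            disposition = r.headers.get("Content-Disposition", "")
            if '"' in disposition:
                # filename is quoted inside the Content-Disposition header
                name = disposition.split('"')[1]
            else:
                # hypothetical fallback: last path segment of the URL
                name = url.rsplit("/", 1)[-1]
            (Path(save_dir) / name).write_bytes(r.read())
            return name
    except Exception as e:
        logging.error(e)  # log and swallow, as the original does
        return None

# a malformed URL fails immediately, exercising the error path offline
result = _download("not-a-real-url", Path("."))
```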

{% raw %}
test_url = "https://bl.iro.bl.uk/downloads/0ea7aa1f-3b4f-4972-bc12-b7559769471f?locale=en"
test_dir = Path("test_dir")
test_dir.mkdir()
_download(test_url, test_dir)
'BLNewspapers_0002642_TheExpress_1848_f1c4cb8d-6bd5-401f-831f-a19199d47c0a.zip'
{% endraw %} {% raw %}
assert list(test_dir.iterdir())[0].suffix == ".zip"
assert len(list(test_dir.iterdir())) == 1
# tidy up
[f.unlink() for f in test_dir.iterdir()]
test_dir.rmdir()
{% endraw %} {% raw %}
bad_link = "https://bl.oar.bl.uk/fail_uploads/download_file?fileset_id=0ea7aa1-3b4f-4972-bc12-b75597694f"
_download(bad_link, "test_dir")
2021-10-15 11:12:11.647 | ERROR    | __main__:_download:17 - HTTPSConnectionPool(host='bl.oar.bl.uk', port=443): Max retries exceeded with url: /fail_uploads/download_file?fileset_id=0ea7aa1-3b4f-4972-bc12-b75597694f (Caused by SSLError(SSLCertVerificationError("hostname 'bl.oar.bl.uk' doesn't match either of '*.oar.notch8.cloud', 'oar.notch8.cloud'")))
{% endraw %} {% raw %}

download_from_urls[source]

download_from_urls(urls:List[str], save_dir:Union[str, Path], n_threads:int=8)

Downloads from an input list of urls and saves to save_dir, with the option to set n_threads (default = 8)

{% endraw %} {% raw %}
{% endraw %}

download_from_urls takes a list of URLs and downloads them to a specified directory.
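The fan-out over threads can be sketched as below. The `download_one` parameter is an injection point added purely for this illustration (so it can run without the network); the real function calls the module's own _download and takes only `(urls, save_dir, n_threads)`:

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List, Union

def download_from_urls(
    urls: List[str],
    save_dir: Union[str, Path],
    download_one: Callable,
    n_threads: int = 8,
) -> int:
    """Run downloads across a thread pool and return how many succeeded
    (i.e. returned a filename rather than None)."""
    save_dir = Path(save_dir)
    save_dir.mkdir(exist_ok=True)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(lambda u: download_one(u, save_dir), urls))
    return sum(r is not None for r in results)

# exercise it with a fake downloader so no network is needed
def fake_download(url, save_dir):
    name = url.rsplit("/", 1)[-1] + ".zip"
    (Path(save_dir) / name).write_text("stub")
    return name

with tempfile.TemporaryDirectory() as tmp:
    n_ok = download_from_urls(["https://x/a", "https://x/b"], tmp, fake_download, n_threads=2)
```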

{% raw %}
test_links = [
    "https://bl.iro.bl.uk/downloads/0ea7aa1f-3b4f-4972-bc12-b7559769471f?locale=en",
    "https://bl.iro.bl.uk/downloads/80708825-d96a-4301-9496-9598932520f4?locale=en",
]
{% endraw %} {% raw %}
download_from_urls(test_links, "test_dir")
 50%|██████████████████████████████████████████████████████████████████████████▌                                                                          | 1/2 [00:26<00:26, 26.91s/it]
2021-10-15 11:12:38.600 | INFO     | __main__:download_from_urls:22 - https://bl.iro.bl.uk/downloads/0ea7aa1f-3b4f-4972-bc12-b7559769471f?locale=en downloaded to BLNewspapers_0002642_TheExpress_1848_f1c4cb8d-6bd5-401f-831f-a19199d47c0a.zip
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:27<00:00, 13.77s/it]
2021-10-15 11:12:39.225 | INFO     | __main__:download_from_urls:22 - https://bl.iro.bl.uk/downloads/80708825-d96a-4301-9496-9598932520f4?locale=en downloaded to BLNewspapers_0002642_TheExpress_1847_8f13ba53-0e13-4409-a384-830ba2b160db.zip

2
{% endraw %} {% raw %}
assert len(test_links) == len(os.listdir("test_dir"))
test_dir = Path("test_dir")
[f.unlink() for f in test_dir.iterdir()]
test_dir.rmdir()
{% endraw %} {% raw %}
test_some_bad_links = [
    "https://bl.oar.bl.uk/fail_uploads/download_file?fileset_id=0ea7aa1f-3b4f-4972-bc12-b7559769471f",
    "https://bl.oar.bl.uk/fail_uploads/download_file?fileset_id=7ac7a0cb-29a2-4172-8b79-4952e2c9b",
]
download_from_urls(test_some_bad_links, "test_dir")
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00,  5.00s/it]
2021-10-15 11:12:49.308 | ERROR    | __main__:_download:17 - HTTPSConnectionPool(host='bl.oar.bl.uk', port=443): Max retries exceeded with url: /fail_uploads/download_file?fileset_id=0ea7aa1f-3b4f-4972-bc12-b7559769471f (Caused by SSLError(SSLCertVerificationError("hostname 'bl.oar.bl.uk' doesn't match either of '*.oar.notch8.cloud', 'oar.notch8.cloud'")))
2021-10-15 11:12:49.309 | ERROR    | __main__:_download:17 - HTTPSConnectionPool(host='bl.oar.bl.uk', port=443): Max retries exceeded with url: /fail_uploads/download_file?fileset_id=7ac7a0cb-29a2-4172-8b79-4952e2c9b (Caused by SSLError(SSLCertVerificationError("hostname 'bl.oar.bl.uk' doesn't match either of '*.oar.notch8.cloud', 'oar.notch8.cloud'")))

0
{% endraw %} {% raw %}
test_dir = Path("test_dir")
[f.unlink() for f in test_dir.iterdir()]
test_dir.rmdir()
{% endraw %} {% raw %}

cli[source]

cli(save_dir:"Output Directory", n_threads:"Number threads to use"=8, subset:"Download subset of HMD"=None)

Download HMD newspaper from iro to save_dir using n_threads

{% endraw %} {% raw %}
{% endraw %}

We finally use fastcore to make a little CLI that we can use to download all of our files. We even get a little help flag for free 😀. We can either call this as a Python function, or, when we install the Python package, it gets registered under console_scripts and can be used like other command-line tools.
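For reference, a plain-argparse parser with the same shape would look like the sketch below. This is purely illustrative: the real tool derives its parser automatically from cli()'s annotations via fastcore, and the program name here is an assumption:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Illustrative argparse equivalent of the fastcore-generated CLI."""
    p = argparse.ArgumentParser(
        prog="hmd_newspaper_dl",  # assumed entry-point name
        description="Download HMD newspapers from the BL repository",
    )
    p.add_argument("save_dir", help="Output Directory")
    p.add_argument("--n_threads", type=int, default=8, help="Number threads to use")
    p.add_argument("--subset", default=None, help="Download subset of HMD")
    return p

args = build_parser().parse_args(["newspapers", "--n_threads", "4"])
```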

{% raw %}
 
[('The Express', 'https://bl.iro.bl.uk/concern/datasets/93ec8ab4-3348-409c-bf6d-a9537156f654?locale=en'), ('The Press.', 'https://bl.iro.bl.uk/concern/datasets/2f70fbcd-9530-496a-903f-dfa4e7b20d3b?locale=en'), ('The Star', 'https://bl.iro.bl.uk/concern/datasets/dd9873cf-cba1-4160-b1f9-ccdab8eb6312?locale=en'), ('National Register.', 'https://bl.iro.bl.uk/concern/datasets/f3ecea7f-7efa-4191-94ab-e4523384c182?locale=en'), ('The Statesman', 'https://bl.iro.bl.uk/concern/datasets/551cdd7b-580d-472d-8efb-b7f05cf64a11?locale=en'), ('The British Press; or, Morning Literary Advertiser', 'https://bl.iro.bl.uk/concern/datasets/aef16a3c-53b6-4203-ac08-d102cb54f8fa?locale=en'), ('The Sun', 'https://bl.iro.bl.uk/concern/datasets/b9a877b8-db7a-4e5f-afe6-28dc7d3ec988?locale=en'), ('The Liverpool Standard etc', 'https://bl.iro.bl.uk/concern/datasets/fb5e24e3-0ac9-4180-a1f4-268fc7d019c1?locale=en'), ('Colored News', 'https://bl.iro.bl.uk/concern/datasets/bacd53d6-86b7-4f8a-af31-0a12e8eaf6ee?locale=en'), ('The Northern Daily Times etc', 'https://bl.iro.bl.uk/concern/datasets/5243dccc-3fad-4a9e-a2c1-d07e750c46a6?locale=en')]
  0%|                                                                                                                                                             | 0/2 [00:00<?, ?it/s]
['https://bl.iro.bl.uk/downloads/d8be50c9-3fc7-4ff9-8e23-f591d6db641a?locale=en', 'https://bl.iro.bl.uk/downloads/80708825-d96a-4301-9496-9598932520f4?locale=en']
 50%|██████████████████████████████████████████████████████████████████████████▌                                                                          | 1/2 [00:11<00:11, 11.02s/it]
2021-10-15 11:13:06.802 | INFO     | __main__:download_from_urls:22 - https://bl.iro.bl.uk/downloads/d8be50c9-3fc7-4ff9-8e23-f591d6db641a?locale=en downloaded to BLNewspapers_0002194_TheSun_1802.zip
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:21<00:00, 10.99s/it]
2021-10-15 11:13:17.757 | INFO     | __main__:download_from_urls:22 - https://bl.iro.bl.uk/downloads/80708825-d96a-4301-9496-9598932520f4?locale=en downloaded to BLNewspapers_0002642_TheExpress_1847_8f13ba53-0e13-4409-a384-830ba2b160db.zip

{% endraw %} {% raw %}
# assert len(list(Path("test_dir").iterdir())) == 2
{% endraw %} {% raw %}
from nbdev.export import notebook2script
notebook2script()
Converted 00_core.ipynb.
Converted index.ipynb.
{% endraw %} {% raw %}
test_dir = Path("test_dir")
[f.unlink() for f in test_dir.iterdir()]
test_dir.rmdir()
{% endraw %}