---
title: hmd_newspaper_dl
keywords: fastai
sidebar: home_sidebar
summary: "Download Heritage Made Digital Newspapers from the BL repository"
description: "Download Heritage Made Digital Newspapers from the BL repository"
nb_path: "00_core.ipynb"
---
The aim of this code is to make it easier to download all of the Heritage Made Digital Newspapers from the British Library's Research Repository.
The newspapers are currently organised by newspaper title under a collection. Under each title you can download a zip file representing a year for that particular newspaper title.
If we only want a subset of years or titles we could download these manually, but if we're interested in using computational methods that's a bit slow. What we need to do is grab all of the URLs for each title so we can bulk download them.
This is a small helper function that will generate the correct URL once we have an ID for a title.
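As a rough sketch, such a helper might look like the following; the function name here is hypothetical, but the path template matches the title URL we use later in this notebook.

```python
def create_title_url(title_id: str) -> str:
    "Build the repository URL for a newspaper title from its ID (sketch; name is hypothetical)"
    return f"https://bl.iro.bl.uk/concern/datasets/{title_id}"

create_title_url("93ec8ab4-3348-409c-bf6d-a9537156f654")
# 'https://bl.iro.bl.uk/concern/datasets/93ec8ab4-3348-409c-bf6d-a9537156f654'
```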
This function starts from the Newspaper collection and then uses BeautifulSoup to scrape all of the URLs which link to a newspaper title. We have a hard-coded URL here, which isn't very good practice, but since we're writing this code for a fairly narrow purpose we won't worry about that here.
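The scraping step might look roughly like this; the collection URL below is a placeholder for the hard-coded one, and the real function may filter the links differently.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def get_newspaper_links() -> list:
    "Return (title, url) tuples for each newspaper title in the collection (sketch)"
    collection_url = "https://bl.iro.bl.uk/collections/..."  # placeholder for the hard-coded collection URL
    soup = BeautifulSoup(requests.get(collection_url).text, "html.parser")
    return [
        (a.text.strip(), urljoin("https://bl.iro.bl.uk", a["href"]))
        for a in soup.find_all("a", href=True)
        if "/concern/datasets/" in a["href"]  # assumption: title pages live under this path
    ]
```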
If we call this function we get a bunch of links back.
```python
links = get_newspaper_links()
links
```
Although this code has a fairly narrow scope, we might still want some tests to check we're not completely off. nbdev makes this super easy. Here we check that we get back what we expect in terms of tuple length and that our URLs look like URLs.
```python
from operator import itemgetter

assert len(links[0]) == 2  # each link is a (title, url) tuple
assert all(urlvalid(url) for _, url in links)  # every second item looks like a valid url
assert len(links) == 10  # one link per newspaper title
assert type(links[0]) == tuple
assert list(map(itemgetter(1), links))[-1].startswith("https://")
```
`get_download_urls` takes a 'title' URL and then grabs all of the URLs for the zip files related to that title.
get_download_urls("https://bl.iro.bl.uk/concern/datasets/93ec8ab4-3348-409c-bf6d-a9537156f654")
`create_session` just adds some extra things to our Requests session to try and make it a little more robust. This is probably not necessary here, but it can be useful to bump up the number of retries.
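A minimal sketch of what this might look like; the retry settings are assumptions, and the real helper may configure the session differently.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_session(retries: int = 5) -> requests.Session:
    "Return a requests Session that retries failed requests (sketch; settings assumed)"
    session = requests.Session()
    retry = Retry(total=retries, backoff_factor=0.5, status_forcelist=[500, 502, 503, 504])
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```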
```python
# Exploring the Content-Disposition header to work out a filename for each download
# s = create_session()
# r = s.get(url, stream=True, timeout=(30))
# print("_".join(r.headers["Content-Disposition"].split('"')[1].split("_")[0:5]))
# r = s.get(test_url, stream=True, timeout=(30))
# "_".join(r.headers["Content-Disposition"].split('"')[1].split("_")[0:5])
```
`_download` downloads a file and logs an exception if something goes wrong.
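A sketch of roughly what this helper might do; the filename handling is an assumption based on the `Content-Disposition` exploration above.

```python
import logging
from pathlib import Path

def _download(url: str, save_dir) -> None:
    "Download `url` into `save_dir`, logging any exception rather than raising it (sketch)"
    try:
        r = create_session().get(url, stream=True, timeout=30)
        r.raise_for_status()
        fname = r.headers["Content-Disposition"].split('"')[1]  # filename suggested by the server
        with open(Path(save_dir) / fname, "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    except Exception:
        logging.exception(f"Failed to download {url}")
```

Again, we do a little test.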
```python
from pathlib import Path

test_url = "https://bl.iro.bl.uk/downloads/0ea7aa1f-3b4f-4972-bc12-b7559769471f?locale=en"
test_dir = Path("test_dir")
test_dir.mkdir()
_download(test_url, test_dir)
assert list(test_dir.iterdir())[0].suffix == ".zip"
assert len(list(test_dir.iterdir())) == 1
```
```python
# tidy up
for f in test_dir.iterdir():
    f.unlink()
test_dir.rmdir()
```
```python
# a deliberately broken link should just log an exception rather than raise
bad_link = "https://bl.oar.bl.uk/fail_uploads/download_file?fileset_id=0ea7aa1-3b4f-4972-bc12-b75597694f"
_download(bad_link, "test_dir")
```
`download_from_urls` takes a list of URLs and downloads them to a specified directory.
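A minimal sketch, assuming the function simply maps `_download` over the list; the real implementation may parallelise the downloads.

```python
from pathlib import Path

def download_from_urls(urls: list, save_dir: str) -> None:
    "Download each URL in `urls` into `save_dir` (sketch; sequential for simplicity)"
    Path(save_dir).mkdir(exist_ok=True)
    for url in urls:
        _download(url, save_dir)
```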
```python
import os

test_links = [
    "https://bl.iro.bl.uk/downloads/0ea7aa1f-3b4f-4972-bc12-b7559769471f?locale=en",
    "https://bl.iro.bl.uk/downloads/80708825-d96a-4301-9496-9598932520f4?locale=en",
]
download_from_urls(test_links, "test_dir")
assert len(test_links) == len(os.listdir("test_dir"))
```
```python
# tidy up
test_dir = Path("test_dir")
for f in test_dir.iterdir():
    f.unlink()
test_dir.rmdir()
```
```python
test_some_bad_links = [
    "https://bl.oar.bl.uk/fail_uploads/download_file?fileset_id=0ea7aa1f-3b4f-4972-bc12-b7559769471f",
    "https://bl.oar.bl.uk/fail_uploads/download_file?fileset_id=7ac7a0cb-29a2-4172-8b79-4952e2c9b",
]
download_from_urls(test_some_bad_links, "test_dir")

# tidy up
test_dir = Path("test_dir")
for f in test_dir.iterdir():
    f.unlink()
test_dir.rmdir()
```
We finally use fastcore to make a little CLI that we can use to download all of our files. We even get a little help flag for free 😀. We can either call this as a Python function, or, when we install the Python package, it gets registered as a `console_scripts` entry point and can be used like other command line tools.
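A sketch of what the fastcore-powered CLI might look like; `call_parse` and `Param` come from `fastcore.script`, but the command name and its parameter are assumptions.

```python
from fastcore.script import Param, call_parse

@call_parse
def download_newspapers(
    save_dir: Param("Directory to save the zip files into", str) = "newspapers",  # hypothetical parameter
):
    "Download all of the Heritage Made Digital newspaper zip files (sketch)"
    links = get_newspaper_links()
    urls = []
    for _title, title_url in links:
        urls.extend(get_download_urls(title_url))
    download_from_urls(urls, save_dir)
```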
```python
# assert len(list(Path("test_dir").iterdir())) == 2
```
```python
from nbdev.export import notebook2script

notebook2script()
```
```python
# tidy up
test_dir = Path("test_dir")
for f in test_dir.iterdir():
    f.unlink()
test_dir.rmdir()
```