Published May 4, 2021 | Version v1
Presentation Open

Making Smithsonian Open Access Accessible with Python and Dask

  • 1. Smithsonian Institution
  • 2. George Mason University

Description

In February 2020, the Smithsonian Institution released almost 3 million images and over 12 million collections metadata records under the Creative Commons Zero (CC0) license. The release was made available through a web API, a GitHub repository, and the Registry of Open Data on Amazon Web Services (AWS). The format of the GitHub and AWS releases made the data well suited to parallelized analysis, but only for users with deep knowledge of the complex data structures. In this talk we will discuss how we used the Python Dask library to unlock this parallelization and make the data more accessible, and describe a student intern project that used Python tools to uncover insights into the holdings of the National Museum of American History.
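
As a rough illustration of the kind of parallelized analysis described above, the following minimal sketch counts metadata records per museum unit with a Dask bag. It is not taken from the slides: it assumes the metadata is stored as line-delimited JSON, and the sample records, the `unitCode` field name, and the `count_by_unit` helper are hypothetical stand-ins chosen for the example.

```python
import json

import dask.bag as db

# Hypothetical sample records standing in for line-delimited JSON metadata.
# The field names and values here are assumptions, not the release's schema.
SAMPLE_LINES = [
    '{"id": "a1", "unitCode": "NMAH", "title": "Telegraph key"}',
    '{"id": "b2", "unitCode": "NASM", "title": "Flight suit"}',
    '{"id": "c3", "unitCode": "NMAH", "title": "Sewing machine"}',
]


def count_by_unit(lines, npartitions=2):
    """Count records per unit code across partitions of JSON lines."""
    # Split the input lines into partitions that Dask can process in parallel.
    bag = db.from_sequence(lines, npartitions=npartitions)
    # Parse each line into a dict, then keep only the (assumed) unit code field.
    unit_codes = bag.map(json.loads).pluck("unitCode")
    # frequencies() yields (value, count) pairs; the threaded scheduler keeps
    # the sketch runnable without spawning worker processes.
    pairs = unit_codes.frequencies().compute(scheduler="threads")
    return dict(pairs)


if __name__ == "__main__":
    print(count_by_unit(SAMPLE_LINES))
```

For files on disk or in object storage, `dask.bag.read_text` could take the place of `from_sequence`, with one partition per file or block; the paths to pass would depend on how the release is actually laid out.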

Files

slides.pdf (5.3 MB)
md5:419bb6075a92c390d99af912d9dca390