Presentation Open Access

Making Smithsonian Open Access Accessible with Python and Dask

Trizna, Mike; McManus, Patrick

In February 2020, the Smithsonian Institution released almost 3 million images and over 12 million collections metadata records under the Creative Commons Zero (CC0) license. The release was made available via a web API, a GitHub repository, and the Registry of Open Data on Amazon Web Services (AWS). The format of the release on GitHub and AWS made the data well suited to parallelized analysis, but only for users with deep knowledge of its complex data structures. In this talk, we will discuss how we used the Python Dask library to unlock this parallelization and make the data more accessible. We will also discuss a student intern project that used Python tools to uncover insights into the holdings of the National Museum of American History.
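As a rough illustration of the kind of parallelization described above, here is a minimal sketch using Dask bags. It is not the presenters' code. It assumes the metadata is stored as text files with one JSON record per line, and that each record carries a `unitCode` field naming the museum unit. The sample records, the file names, and the `count_by_unit` helper are hypothetical, and the real records are far more deeply nested.

```python
import json
import os
import tempfile

import dask.bag as db

# Hypothetical, heavily simplified records (assumption: the real ones are much more nested).
SAMPLE_RECORDS = [
    {"id": "a1", "unitCode": "NMAH", "title": "Telegraph key"},
    {"id": "a2", "unitCode": "NMAH", "title": "Sewing machine"},
    {"id": "b1", "unitCode": "NMNH", "title": "Bee specimen"},
]


def write_sample_files(directory):
    """Write one JSON record per line, split across two files."""
    for index, chunk in enumerate((SAMPLE_RECORDS[:2], SAMPLE_RECORDS[2:])):
        path = os.path.join(directory, f"part-{index}.txt")
        with open(path, "w", encoding="utf-8") as handle:
            for record in chunk:
                handle.write(json.dumps(record) + "\n")


def count_by_unit(url_pattern):
    """Count records per unit code; each matched file becomes one Dask partition."""
    records = db.read_text(url_pattern).map(json.loads)
    counts = records.pluck("unitCode").frequencies()
    # The threaded scheduler keeps this small sketch safe to run in any context.
    return dict(counts.compute(scheduler="threads"))


with tempfile.TemporaryDirectory() as tmp:
    write_sample_files(tmp)
    unit_counts = count_by_unit(os.path.join(tmp, "*.txt"))

print(unit_counts)
```

Because `read_text` accepts glob patterns and remote URLs, the same function could in principle be pointed at the AWS copy of the release (with the `s3fs` package installed). The bucket layout is not described here, so that path is left to the reader.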

Files (5.3 MB)
slides.pdf, 5.3 MB, md5:419bb6075a92c390d99af912d9dca390
