Presentation Open Access
In February 2020, the Smithsonian Institution released almost 3 million images and over 12 million collections metadata records under the Creative Commons Zero (CC0) license. The release was made available through a web API, a GitHub repository, and the Registry of Open Data on Amazon Web Services (AWS). The format of the data in the GitHub and AWS sources made it well suited to parallelized analysis, but only for users with deep knowledge of its complex data structures. In this talk we will discuss how we used the Python Dask library to unlock this parallelization and make the data more accessible. We will also describe a student intern project that used Python tools to uncover insights into the holdings of the National Museum of American History.