Published October 14, 2021 | Version v1
Conference paper Open

Integration of Clowder Research Data Framework with NCSA Labs Workbench

  • 1. University of Illinois at Urbana-Champaign

Description

NCSA maintains two open-source applications focused on research data management and accessibility. Clowder is a scalable data repository with extensive metadata search capabilities and support for automated extraction of metadata from uploaded files. Labs Workbench is an application catalog that can register and run instances of containerized research environments in the cloud. Both applications are designed to make research data available and interactive for a broad set of communities with minimal effort from the user. In the summer and fall of 2021, we are integrating the two so that users can move data seamlessly between Clowder instances, where files are organized, and Workbench applications, where files can be examined and processed and outputs can be shared back into the Clowder environment.
As scientific research has increasingly moved toward cloud computing, big data, and dense software dependency trees, it has become increasingly difficult for researchers to replicate exactly the environments needed to house and analyze these datasets. Large files are slow to move, individual laptops may lack the necessary storage or processing power, and differences between operating systems cause inconsistencies. Clowder has a user-friendly GUI for uploading files and datasets, tagging and searching for them, and submitting them to extractors for processing. Labs Workbench provides a cloud management environment for containers that run applications in the browser, such as Jupyter notebooks or GIS interfaces. We recognized an opportunity to bring these complementary feature sets together to build a complete platform for data storage, sharing, analysis, and discovery.
The goal of our integration is to provide a direct, seamless path for datasets to move between Clowder instances and Workbench applications, particularly in environments where data storage can be mounted on co-located Clowder and Workbench virtual machines. Users will see options in the Clowder interface to send their data to their chosen Workbench instance, while in Workbench the files will land in a shared home directory accessible to all of the user’s applications. Behind the scenes, we have leveraged Clowder’s existing interface for submitting datasets to extractors, which are containers running prepared scripts that each perform a single task, such as running speech-to-text on an audio file or creating a face recognition mask on a photograph. This extractor framework is widely used, and we wanted to provide simple ways for researchers to develop new extractors and share them back to the community. Workbench offers such a path: researchers can move a sample subset of data from Clowder to Workbench, develop an algorithm to extract meaningful metadata from it, and deploy that algorithm as a Clowder extractor in an identical container environment to process the full dataset. Where data are co-located on a shared mount, large datasets can be made available for processing almost instantly, since they need not be copied.
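The shared-mount case described above can be illustrated with a minimal sketch. It is not the integration's actual code: the function name `stage_dataset`, its parameters, and the directory layout are all hypothetical, and the sketch assumes that linking a file is an acceptable stand-in for moving it when Clowder storage and the Workbench home directory sit on the same filesystem.

```python
import os
import shutil
from pathlib import Path


def stage_dataset(file_paths, workbench_home, dataset_name):
    """Expose dataset files under a user's Workbench home directory.

    Hypothetical helper (not part of Clowder or Labs Workbench):
    symlinks each file when source and destination share a filesystem,
    and falls back to a full copy otherwise.
    Returns a list of (destination_path, method) tuples.
    """
    target_dir = Path(workbench_home) / dataset_name
    target_dir.mkdir(parents=True, exist_ok=True)

    staged = []
    for src in map(Path, file_paths):
        dst = target_dir / src.name
        # Replace any stale file or link left by an earlier transfer.
        if dst.exists() or dst.is_symlink():
            dst.unlink()

        # Same device id means both paths live on the same mount.
        same_mount = os.stat(src).st_dev == os.stat(target_dir).st_dev
        if same_mount:
            # No bytes are copied, so file size does not affect the cost.
            os.symlink(src.resolve(), dst)
            staged.append((dst, "symlink"))
        else:
            # Separate storage: the data has to be copied across.
            shutil.copy2(src, dst)
            staged.append((dst, "copy"))
    return staged
```

Calling `stage_dataset(["/data/clowder/sample.csv"], "/home/alice", "demo")` (paths invented for illustration) would return one entry per file, reporting whether it was linked or copied. The fallback branch corresponds to deployments where Clowder and Workbench do not share storage.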

Files

Gateways2021_paper_30.pdf (1.3 MB)
md5:80e2850334a979f24cbbea9c9387679f