Published September 25, 2023 | Version v1
Presentation Open

Next generation research data repositories: bringing computation to data for exploratory data analysis and visualisation

Description

Data repositories provide open access to research data, but most repository platforms have limitations that reduce FAIRness. Lack of support for folders results in datasets being published as large compressed archive files, which reduces accessibility and interoperability. Similarly, limited support for random data access prevents cloud-native formats from being used effectively and efficiently through the repositories. Because the data preview capabilities of the repositories are also limited, in practice researchers need to download large datasets and find ways to explore them to understand their content and quality.

Open Data Explorer aims to reduce the need for such downloads and to facilitate rapid exploratory data analysis and visualisation of research data by providing a ready-to-use interactive computing platform where research data is directly available for computing. Each user is provided with a dedicated computing environment based on JupyterLab that supports interactive notebooks and a rich set of data access, analysis, and visualisation packages in multiple languages (e.g., Python, R). To enable access to a large number of datasets with zero waiting time, the platform caches research data in a ready-to-use state (i.e., uncompressed), both on demand and proactively by monitoring popular and new datasets available on selected data repositories. Example exploratory data analysis notebooks are automatically generated for each dataset based on its file types (e.g., CSV, NetCDF, GeoTIFF), and they are further tailored to the dataset content so that they can be used directly for analysis with minimal user input. With these features, the open-source Open Data Explorer platform prevents unnecessary and inefficient downloading of datasets and reduces the time needed to explore research data. It also removes the need for a separate computing environment to explore research data, and facilitates the exploration task through tailor-made template notebooks. This talk will provide in-depth information about the architecture of the platform and its capabilities. The platform will also be demonstrated using different research data repositories and research datasets.
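To illustrate the idea of generating an exploratory notebook from a dataset's file types, here is a minimal sketch in Python. It is not the platform's actual implementation: the `SNIPPETS` mapping, the `build_notebook` function, and the example file paths are all hypothetical, and the sketch only assumes that a notebook can be expressed as a plain nbformat 4 JSON structure with one starter cell per recognised file.

```python
import json
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to a starter code cell.
# The calls inside the snippets (pandas.read_csv, xarray.open_dataset,
# rasterio.open) are real library functions, but choosing them here is
# an assumption made for illustration only.
SNIPPETS = {
    ".csv": "import pandas as pd\ndf = pd.read_csv({path!r})\ndf.head()",
    ".nc": "import xarray as xr\nds = xr.open_dataset({path!r})\nds",
    ".tif": "import rasterio\nsrc = rasterio.open({path!r})\nsrc.profile",
}
SNIPPETS[".tiff"] = SNIPPETS[".tif"]


def build_notebook(title, file_paths):
    """Return an nbformat-4 style notebook dict with one code cell
    for every file whose extension has a known starter snippet."""
    cells = [{"cell_type": "markdown", "metadata": {}, "source": f"# {title}"}]
    for path in file_paths:
        # Match on the lower-cased extension so "DATA.CSV" is recognised.
        ext = PurePosixPath(path).suffix.lower()
        snippet = SNIPPETS.get(ext)
        if snippet is None:
            continue  # unknown file type: no cell is generated
        cells.append({
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": snippet.format(path=path),
        })
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


if __name__ == "__main__":
    # Hypothetical dataset listing; README.txt is skipped.
    nb = build_notebook(
        "Example dataset",
        ["data/stations.csv", "grid/temperature.nc", "README.txt"],
    )
    print(json.dumps(nb, indent=1))
```

Writing the printed JSON to a file with the `.ipynb` extension would give a notebook that JupyterLab can open. Tailoring the cells to the dataset content, as the description mentions, would require inspecting the files themselves and is outside this sketch.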

Files

20230925-SURF-DCC-OpenDataExplorer.pdf (1.6 MB)
md5:3edfcb0deccce89e8a12a4ebc51087bd

Additional details

Funding

SURF-DCC Pilots 2022 221219-RoAuKaSw-005
SURF