An Adaptable Indexing Pipeline for Enriching Meta Information of Datasets from Heterogeneous Repositories

Published May 17, 2022 | Version camera ready

Conference paper Open

Dataset repositories continuously publish a significant number of datasets across a variety of domains, such as biodiversity and oceanography. To conduct multidisciplinary research, scientists and practitioners must discover datasets from disciplines unfamiliar to them. Well-known search engines, such as Google Dataset Search and Mendeley Data, try to support researchers with cross-domain dataset discovery based on dataset contents. However, as datasets typically contain scientific observations or data collected from service providers, their contextual information is limited. Accordingly, indexing datasets effectively enough to increase their Findability, Accessibility, Interoperability, and Reusability (FAIRness) can be impossible when it relies on this contextual information alone. This paper presents an indexing pipeline that extends the contextual information of datasets based on their scientific domains, using topic modeling together with a set of rules and domain keywords (such as essential variables in environmental science) suggested by domain experts. The pipeline relies on an open ecosystem in which dataset providers publish semantically enhanced metadata on their data repositories. We aggregate, normalize, and reconcile such metadata, providing a dataset search engine that enables research communities to find, access, integrate, and reuse datasets. We evaluated our approach on a manually created gold standard and in a user study.
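As a rough illustration of the normalize-then-enrich idea described in the abstract, the following minimal sketch maps repository-specific records onto one common schema and tags them with domains by keyword matching. It is not the paper's implementation: the field names, the `Record` class, the `normalize` and `enrich` helpers, and the keyword lists are all hypothetical, and the topic-modeling step is omitted entirely.

```python
import re
from dataclasses import dataclass, field

# Hypothetical domain keyword lists. The paper's actual rules and keywords
# come from domain experts and are not reproduced here.
DOMAIN_KEYWORDS = {
    "oceanography": {"salinity", "sea surface temperature", "chlorophyll"},
    "biodiversity": {"species", "abundance", "habitat"},
}


@dataclass
class Record:
    """Common schema assumed for this sketch."""
    identifier: str
    title: str
    description: str
    domains: list = field(default_factory=list)


def normalize(raw: dict) -> Record:
    """Map a repository-specific record onto the common schema.

    The alternative field names (id/identifier, title/name,
    description/abstract) are assumptions about how repositories differ.
    """
    return Record(
        identifier=str(raw.get("id") or raw.get("identifier") or ""),
        title=(raw.get("title") or raw.get("name") or "").strip(),
        description=(raw.get("description") or raw.get("abstract") or "").strip(),
    )


def enrich(record: Record) -> Record:
    """Attach every domain whose keywords occur in the title or description."""
    text = re.sub(r"\s+", " ", f"{record.title} {record.description}".lower())
    record.domains = sorted(
        domain
        for domain, keywords in DOMAIN_KEYWORDS.items()
        if any(re.search(rf"\b{re.escape(k)}\b", text) for k in keywords)
    )
    return record


# Usage: a record using one repository's (assumed) field names.
raw = {
    "identifier": "ds-1",
    "name": "Argo float profiles",
    "abstract": "Salinity and temperature profiles.",
}
example = enrich(normalize(raw))
print(example.domains)  # ['oceanography']
```

Whole-word matching keeps the sketch simple and predictable; a fuller pipeline would presumably combine such rules with topic modeling, as the abstract describes.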

Files

Name	Size
2022.conference.akdd.caera.pdf md5:54a487ac165b3f6ea22f4c9b2c12ca0d	294.5 kB

European Commission
ARTICONF - smART socIal media eCOsytstem in a blockchaiN Federated environment 825134
European Commission
Blue Cloud - Blue-Cloud: Piloting innovative services for Marine Research & the Blue Economy 862409
European Commission
ENVRI-FAIR - ENVironmental Research Infrastructures building Fair services Accessible for society, Innovation and Research 824068