Conference paper Open Access

An Adaptable Indexing Pipeline for Enriching Meta Information of Datasets from Heterogeneous Repositories

Farshidi, Siamak; Zhao, Zhiming

Dataset repositories continuously publish large numbers of datasets across a variety of domains, such as biodiversity and oceanography. To conduct multidisciplinary research, scientists and practitioners must discover datasets from disciplines that are unfamiliar to them. Well-known search engines, such as Google Dataset Search and Mendeley Data, try to support cross-domain dataset discovery based on dataset contents. However, because datasets typically contain scientific observations or data collected from service providers, their contextual information is limited. Consequently, indexing datasets effectively enough to increase their Findability, Accessibility, Interoperability, and Reusability (FAIRness) can be impossible when it relies on that contextual information alone. This paper presents an indexing pipeline that extends the contextual information of datasets according to their scientific domains, using topic modeling together with a set of rules and domain keywords (such as essential variables in environmental science) suggested by domain experts. The pipeline relies on an open ecosystem in which dataset providers publish semantically enhanced metadata on their data repositories. We aggregate, normalize, and reconcile this metadata to provide a dataset search engine that enables research communities to find, access, integrate, and reuse datasets. We evaluated our approach with a manually created gold standard and a user study.
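The aggregate, normalize, and enrich steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the field names, the `Record` type, the helper functions, and the keyword lists are all hypothetical, and simple keyword matching stands in for the paper's topic modeling and expert-suggested rules.

```python
from dataclasses import dataclass, field

# Hypothetical domain keywords; the paper's actual rules and keyword sets
# (e.g. essential variables) are not listed on this page.
DOMAIN_KEYWORDS = {
    "oceanography": {"sea surface temperature", "salinity", "ocean"},
    "biodiversity": {"species", "abundance", "habitat"},
}


@dataclass
class Record:
    """A normalized metadata record with the enrichment result attached."""
    title: str
    description: str
    domains: list = field(default_factory=list)


def normalize(raw: dict) -> Record:
    """Map differently named source fields onto one schema and tidy whitespace."""
    title = raw.get("title") or raw.get("name") or ""
    description = raw.get("description") or raw.get("abstract") or ""
    return Record(" ".join(title.split()), " ".join(description.split()))


def enrich(record: Record) -> Record:
    """Attach every domain whose keywords occur in the record's text."""
    text = f"{record.title} {record.description}".lower()
    record.domains = sorted(
        domain
        for domain, keywords in DOMAIN_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    )
    return record


def index(raw_records: list) -> list:
    """Aggregate raw records from several repositories into one enriched list."""
    return [enrich(normalize(raw)) for raw in raw_records]
```

Calling `index` on records harvested from repositories with different field names (for instance `name`/`abstract` in one and `title`/`description` in another) yields records in a single schema, each tagged with the domains whose keywords it mentions. A search engine could then filter or rank on the `domains` field.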

Files (294.5 kB)
2022.conference.akdd.caera.pdf (294.5 kB)
md5:54a487ac165b3f6ea22f4c9b2c12ca0d