Farshidi, Siamak
Zhao, Zhiming
2022-05-17
<p>Dataset repositories continuously publish a significant number of datasets</p>
<p>across a variety of domains, such as biodiversity</p>
<p>and oceanography. To conduct multidisciplinary research, scientists</p>
<p>and practitioners must discover datasets from various disciplines unfamiliar</p>
<p>to them. Well-known search engines, such as Google Dataset Search and</p>
<p>Mendeley Data, try to support researchers with cross-domain dataset</p>
<p>discovery based on their contents. However, as datasets typically contain</p>
<p>scientific observations or collected data from service providers, their</p>
<p>contextual information is limited. Accordingly, indexing datasets effectively</p>
<p>enough to increase their Findability, Accessibility, Interoperability,</p>
<p>and Reusability (FAIRness) can be impossible when it relies on that contextual information alone.</p>
<p>This paper presents an indexing pipeline to extend contextual information</p>
<p>of datasets based on their scientific domains by using topic modeling</p>
<p>and a set of rules and domain keywords (such as essential variables</p>
<p>in environmental science) suggested by domain experts. The</p>
<p>pipeline relies on an open ecosystem, where dataset providers publish</p>
<p>semantically enhanced metadata on their data repositories. We aggregate,</p>
<p>normalize, and reconcile such metadata, providing a dataset search</p>
<p>engine that enables research communities to find, access, integrate, and</p>
<p>reuse datasets. We evaluated our approach against a manually created gold</p>
<p>standard and in a user study.</p>
https://doi.org/10.1007/978-3-031-05936-0_37
oai:zenodo.org:6555644
Zenodo
https://zenodo.org/communities/envri
https://zenodo.org/communities/bluecloud
https://zenodo.org/communities/eu
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
PAKDD, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Chengdu, China, May 16 2022
Dataset indexing
Dataset discovery
Inverted indexing
Metadata standard
Data repository
An Adaptable Indexing Pipeline for Enriching Meta Information of Datasets from Heterogeneous Repositories
info:eu-repo/semantics/conferencePaper