There is a newer version of the record available.

Published August 4, 2024 | Version v1
Presentation Open

Semantic Search and Natural Language Query over HDF5

  • 1. Lawrence Berkeley National Laboratory

Description

The ability to query HDF5 files effectively is a prerequisite for fully leveraging their potential. Over the years, a series of lexical matching solutions has been proposed to address the metadata search problem for HDF5 files. However, these traditional lexical matching approaches often ignore the semantic relationship between the query and the actual metadata and data in the datasets. With such systems, users need a deep understanding of the format and structure of their data, as well as a precise sense of their own intent, to find data of interest. A metadata querying mechanism is therefore needed that captures the semantic meaning of each query, bridging the gap between what the user intends and the data that is actually of interest.

Toward the goal of providing an advanced search service for HDF5 files, our research has progressed through several stages, each addressing a different challenge. Initially, with kv2vec and PSQS (Parallel Semantic Querying Service), we moved beyond lexical matching to semantic search, focusing on capturing the semantics of keywords. This method extracts keywords from metadata attributes, builds semantic representations of the metadata, and performs semantic searches over scientific datasets. As the work evolved, we recognized the need to handle complete sentences as input rather than keywords alone. This shift from keyword-based search to full-sentence queries reflects the growing capability of our methods. By leveraging large language models (LLMs), our new approach processes natural language queries and returns the desired results from scientific files, making scientific data discovery more efficient. This advancement holds substantial potential for scientific data discovery within the HDF5 community and beyond.
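
The difference between lexical and semantic matching over metadata keywords can be illustrated with a minimal sketch. It is not the kv2vec or PSQS implementation: the metadata triples, the hand-made vectors in TOY_VECTORS, and the function names are all hypothetical, chosen only to show why a keyword such as "heat" finds nothing under exact matching but can still reach a "temperature" attribute when keywords are compared in a vector space.

```python
import math

# Hypothetical (object path, attribute name, attribute value) triples,
# standing in for metadata attributes that would be read from HDF5 files.
METADATA = [
    ("/run1/sensor_a", "quantity", "temperature"),
    ("/run1/sensor_b", "quantity", "pressure"),
    ("/run2/image", "instrument", "microscope"),
]

# Toy, hand-made vectors (assumption for illustration only; a real system
# would obtain keyword vectors from a trained embedding model).
TOY_VECTORS = {
    "temperature": (0.9, 0.1, 0.0),
    "heat": (0.8, 0.2, 0.0),
    "pressure": (0.1, 0.9, 0.0),
    "microscope": (0.0, 0.1, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def lexical_search(keyword):
    """Return paths whose attribute value matches the keyword exactly."""
    return [path for path, _, value in METADATA if value == keyword]


def semantic_search(keyword, top_k=1):
    """Return the top_k paths ranked by vector similarity to the keyword."""
    query_vec = TOY_VECTORS[keyword]
    scored = [
        (cosine(query_vec, TOY_VECTORS[value]), path)
        for path, _, value in METADATA
    ]
    scored.sort(reverse=True)
    return [path for _, path in scored[:top_k]]


print(lexical_search("heat"))   # exact matching finds no attribute named "heat"
print(semantic_search("heat"))  # similarity ranking surfaces the temperature sensor
```

In this toy setup the exact match returns an empty list, while the similarity ranking places /run1/sensor_a first because the "heat" and "temperature" vectors point in nearly the same direction. Handling full-sentence queries, as described above, would require a language model rather than a fixed keyword table.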

Files (15.4 MB)