Published August 4, 2025 | Version v1
Conference paper
Open
Developing a tool supporting mapping metadata schemes for health data - The use of Large Language Models to support metadata mappings
Authors/Creators
- 1. Heidelberg Institute for Theoretical Studies (HITS), Heidelberg
Contributors
Editor (2):
- 1. Nationale Forschungsdateninfrastruktur (NFDI) e.V.
- 2. University of Amsterdam
Description
Mapping metadata schemes is crucial for the interoperability of the services that implement them. However, manual mapping between schemes is time-consuming and tedious, especially when more than one schema has to be mapped. To tackle this challenge, we are exploring the potential of Large Language Models (LLMs) to support and improve metadata mappings in the domain of health research. By leveraging the capabilities of LLMs, we aim to develop a more efficient and accurate process for aligning metadata across different metadata standards. Our project focuses on mappings between the metadata schema (MDS) of the German National Research Data Infrastructure for Personal Health Data (NFDI4Health) [1],[2] and the metadata schemes of other resources for health research studies. The MDS has been implemented in NFDI4Health services, such as the German Central Health Study Hub [3] and the Local Data Hub (LDH) [4],[5], to make health research data searchable and findable. To enhance the interoperability of these services, we are working with organizations such as the European Clinical Research Infrastructure Network (ECRIN) [6], the German Human Genome-Phenome Archive (GHGA) [7], and the European Platform on Rare Disease Registration (ERDRI) [8] to map our NFDI4Health metadata to their schemes. With the help of LLMs such as Meta Llama and Qwen 2.5, we aim to identify meaningful connections between metadata elements in the different schemes and to suggest possible matches between them. This will support metadata experts in mapping data items between the schemes and in matching the value sets of corresponding items, making the mapping process much more efficient than the current workflow, which relies entirely on manual comparison. We combine prompt engineering techniques with classical similarity measures. This hybrid strategy allows us to reduce errors and capture the nuances behind each metadata element.
The results are then presented to human users for quality control, which prepares, reduces, and supports the manual work. This manual review ensures mapping consistency and guards against LLM hallucinations. The overall reduction of manual work makes the process more scalable. We are convinced that this work contributes to the development of a more interoperable, accurate, and interlinked metadata infrastructure in health research that is aligned with the FAIR principles.
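The hybrid idea described above, combining an LLM-based judgment with classical similarity measures and handing ranked suggestions to a human reviewer, might look roughly like the following sketch. It is not the tool's actual implementation: the function names, the equal weighting, the choice of similarity measures (character-level ratio and token Jaccard), and the example element names are all assumptions made for illustration. The LLM is represented only by an optional callable, `llm_score`, which a real system would back with a prompted model.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Optional, Tuple


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of the lower-cased word sets of two element names."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def classical_similarity(a: str, b: str) -> float:
    """Average of a character-level ratio and a token-level Jaccard score."""
    char_ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return 0.5 * char_ratio + 0.5 * token_jaccard(a, b)


def suggest_matches(
    source_items: List[str],
    target_items: List[str],
    llm_score: Optional[Callable[[str, str], float]] = None,  # hypothetical LLM scorer in [0, 1]
    weight_llm: float = 0.5,  # assumed weighting, not taken from the paper
    top_k: int = 3,
) -> Dict[str, List[Tuple[str, float]]]:
    """Rank target elements for each source element; a human reviews the result."""
    suggestions: Dict[str, List[Tuple[str, float]]] = {}
    for src in source_items:
        scored = []
        for tgt in target_items:
            score = classical_similarity(src, tgt)
            if llm_score is not None:
                # Blend the classical score with the LLM's judgment.
                score = (1 - weight_llm) * score + weight_llm * llm_score(src, tgt)
            scored.append((tgt, round(score, 3)))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        suggestions[src] = scored[:top_k]
    return suggestions


if __name__ == "__main__":
    # Hypothetical element names, not taken from any of the schemes named above.
    ranked = suggest_matches(
        ["study title"],
        ["title of the study", "recruitment status"],
    )
    for source, candidates in ranked.items():
        print(source, "->", candidates)
```

Keeping the LLM behind a plain callable means the classical measures still produce a ranking when no model is available, and the final decision stays with the metadata expert who reviews the suggested matches.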
Files
| Name | Size |
|---|---|
| CoRDI_2025_paper_343.pdf (md5:250613552e3cc0cb0f0e00da9d2d482a) | 61.9 kB |