Developing a tool supporting mapping metadata schemes for health data - The use of Large Language Models to support metadata mappings

Radice, Beatrice; Golebiewski, Martin; Müller, Wolfgang

doi:10.5281/zenodo.16736326

Published August 4, 2025 | Version v1

Conference paper Open

1. Heidelberg Institute for Theoretical Studies (HITS), Heidelberg

Contributors

Editor (2):

1. Nationale Forschungsdateninfrastruktur (NFDI) e.V.
2. University of Amsterdam

Mapping metadata schemes is crucial for the interoperability of the services that implement them. However, manually mapping between schemes is time-consuming and tedious, especially when more than one schema has to be mapped. To tackle this challenge, we are exploring the potential of Large Language Models (LLMs) to support and improve metadata mappings in the domain of health research. By leveraging the capabilities of LLMs, we aim to develop a more efficient and accurate process for aligning metadata across different metadata standards.

Our project focuses on mappings between the metadata schema (MDS) of the German National Research Data Infrastructure for Personal Health Data (NFDI4Health) [1],[2] and the metadata schemes of other resources in the domain of health research studies. The MDS has been implemented in NFDI4Health services, such as the German Central Health Study Hub [3] and the Local Data Hub (LDH) [4],[5], to make health research data searchable and findable. To enhance the interoperability of these services, we are working with organizations such as the European Clinical Research Infrastructure Network (ECRIN) [6], the German Human Genome-Phenome Archive (GHGA) [7], and the European Platform on Rare Disease Registration (ERDRI) [8], and are mapping our NFDI4Health metadata to their schemes.

With the help of LLMs such as Meta Llama and Qwen 2.5, we aim to identify meaningful connections between metadata elements in the different schemes and to suggest possible matches between them. This will support metadata experts in mapping data items between the schemes and in matching the value sets of corresponding items, making the mapping process much more efficient than the current workflow, which relies entirely on manual comparisons. We combine prompt engineering techniques with classical similarity measures. This hybrid strategy allows us to reduce errors and capture the nuances behind each metadata element.

The results are then presented to human users for quality control, which prepares, reduces, and supports the manual work. This manual review is needed to ensure mapping consistency and to avoid LLM hallucinations. The overall reduction of manual work makes the process more scalable. We are convinced that this work contributes to the development of a more interoperable, accurate, and interlinked metadata infrastructure in health research that is aligned with the FAIR principles.
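
The hybrid strategy described in the abstract (a classical similarity measure combined with an LLM-based judgment, with ranked suggestions handed to a human reviewer) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' tool: the element names, the token-based Jaccard measure, the prompt wording, the `llm_score` callable, and the 0.5 weighting are all hypothetical and do not come from the paper.

```python
import re

def tokens(text):
    """Lowercase a label and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Token-level Jaccard similarity (one possible classical measure)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def element_text(element):
    """Combine an element's name and description into one string."""
    return f"{element['name']} {element['description']}"

def build_prompt(source, candidates):
    """Assemble a hypothetical prompt asking an LLM to judge candidate matches."""
    lines = [
        "Which target elements, if any, correspond to the source element?",
        f"Source: {source['name']}: {source['description']}",
        "Targets:",
    ]
    lines += [f"- {c['name']}: {c['description']}" for c in candidates]
    return "\n".join(lines)

def suggest_matches(source, targets, top_k=3, llm_score=None, weight=0.5):
    """Rank target elements for one source element.

    llm_score is an assumed callable (source, target) -> float in [0, 1]
    that wraps whatever LLM is used; if it is None, only the classical
    similarity is used. The result is a suggestion list for human review.
    """
    scored = []
    for target in targets:
        score = jaccard(element_text(source), element_text(target))
        if llm_score is not None:
            score = (1 - weight) * score + weight * llm_score(source, target)
        scored.append((target["name"], round(score, 3)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical metadata elements, invented for illustration only.
example_source = {"name": "study_title", "description": "Title of the study"}
example_targets = [
    {"name": "contact_email", "description": "Email address of the contact person"},
    {"name": "display_title", "description": "Title of the study shown to users"},
]

if __name__ == "__main__":
    print(build_prompt(example_source, example_targets))
    print(suggest_matches(example_source, example_targets))
```

Keeping the LLM behind a plain callable means the classical score still yields a ranking when no model is available, and the combined score stays a suggestion that a metadata expert confirms or rejects. A real setup would likely need stop-word handling and a measure suited to value sets as well as element labels.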

Files

CoRDI_2025_paper_343.pdf (61.9 kB, md5:250613552e3cc0cb0f0e00da9d2d482a)

