Published October 8, 2025 | Version 1.0.0
Poster Open

AI-assisted research data annotation in biomedical consortia

Description

Annotating research data is a key element of Open Science, and annotated data has gained additional value as training input for artificial intelligence. However, developing metadata schemas poses several challenges: the schemas must be optimised, they must cover the research domain completely, and annotations must remain complete and of consistent quality over time. We employ large language models (LLMs) to address some of these challenges while keeping researchers in the loop to ensure that annotations are reliable.
Our research data management group currently supports seven biomedical research consortia. We develop customised metadata schemas together with consortium members, drawing on established controlled vocabularies (Engel et al. 2025). Schemas are implemented on the fredato research data platform developed at the IMBI (Watter et al. 2023). Schemas are documented and published as knowledge graphs adhering to the Resource Description Framework (RDF), relating metadata to research processes as modelled by commonly used ontologies.
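As an illustration of what documenting a schema field as RDF can look like, the following minimal Python sketch serialises one metadata field as N-Triples using only the standard library. The `https://example.org/` namespace, the field name `organism`, and the function `field_to_ntriples` are hypothetical and are not taken from the consortia's schemas or from the fredato platform; only the `rdf:type`, `rdf:Property`, `rdfs:label`, and `rdfs:range` terms are standard W3C vocabulary.

```python
# Hypothetical namespace for illustration; not the consortia's actual schema IRIs.
EX = "https://example.org/schema/"

# Standard W3C RDF and RDFS terms.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDF_PROPERTY = "http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RDFS_RANGE = "http://www.w3.org/2000/01/rdf-schema#range"


def iri(value: str) -> str:
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text: str) -> str:
    """Quote a plain string literal (escapes only backslashes and quotes)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def field_to_ntriples(field_name: str, label: str, range_iri: str) -> list[str]:
    """Describe one metadata field as three RDF statements."""
    subject = iri(EX + field_name)
    triples = [
        (subject, iri(RDF_TYPE), iri(RDF_PROPERTY)),
        (subject, iri(RDFS_LABEL), literal(label)),
        # The range points to a class from a controlled vocabulary (assumed IRI).
        (subject, iri(RDFS_RANGE), iri(range_iri)),
    ]
    return [f"{s} {p} {o} ." for s, p, o in triples]


if __name__ == "__main__":
    for line in field_to_ntriples(
        "organism", "Organism", "https://example.org/vocab/Organism"
    ):
        print(line)
```

In practice, a dedicated RDF library and the IRIs of the chosen ontologies would likely replace the hand-built strings shown here.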
LLMs are employed to develop initial schema drafts from related research literature and to predict dataset annotations from scientific papers (Giuliani et al. 2025). The models have performed well on these tasks, helping researchers improve metadata coverage in their consortia.
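The researcher-in-the-loop idea can be sketched as a simple review step: model suggestions stay pending until a researcher accepts or rejects them, and coverage is computed from accepted annotations only. This is a minimal sketch under assumptions, not the workflow used on the platform; the `Suggestion` class, the `review` and `coverage` functions, and the example field names and values are all hypothetical, and the model call itself is left out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Suggestion:
    """One model-predicted annotation for a schema field (hypothetical structure)."""
    field: str
    value: str
    status: str = "pending"  # "pending", "accepted", or "rejected"


def review(
    suggestions: list[Suggestion], decide: Callable[[Suggestion], bool]
) -> list[Suggestion]:
    """Record the researcher's decision on each suggestion; return the accepted ones."""
    for suggestion in suggestions:
        suggestion.status = "accepted" if decide(suggestion) else "rejected"
    return [s for s in suggestions if s.status == "accepted"]


def coverage(accepted: list[Suggestion], schema_fields: list[str]) -> float:
    """Fraction of schema fields that have at least one accepted annotation."""
    if not schema_fields:
        return 0.0
    filled = {s.field for s in accepted} & set(schema_fields)
    return len(filled) / len(schema_fields)


if __name__ == "__main__":
    # Illustrative values only; in a real setting these would come from a model.
    predicted = [
        Suggestion("organism", "Mus musculus"),
        Suggestion("tissue", "liver"),
        Suggestion("assay", "RNA-seq"),
    ]
    # Stand-in for an interactive decision: the researcher rejects the assay.
    accepted = review(predicted, lambda s: s.field != "assay")
    print(coverage(accepted, ["organism", "tissue", "assay", "disease"]))
```

Keeping the decision as a separate callable makes it straightforward to swap the stand-in for an interactive prompt or a review form without changing how coverage is computed.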

Files

2025-10-08-OpenScience_Conference-IMBI_Freiburg-print.pdf (1.6 MB)

Additional details

Funding

Deutsche Forschungsgemeinschaft
Collaborative Research Centres

Dates

Submitted
2025-05-23
Accepted
2025-07-09

References

  • Engel, F., Benadi, G., Giuliani, C., Werner, J., Watter, M., Zeiser, R., Köttgen, A., Binder, H., & Kaier, K. (2025). Development of Metadata Schemas For Collaborative Research Centers. FreiData. https://doi.org/10.60493/K1XE3-NPC10
  • Watter, M., Kahle, L., Brunswiek, B., Fichtner, U., Pfaffenlehner, M., Werner, F., Gebele, D., Binder, H., & Knaus, J. (2023). Standardized metadata collection in a research data management tool to strengthen collaboration in Collaborative Research Centers. E-Science-Tage, Heidelberg. https://doi.org/10.11588/HEIDOK.00033131
  • Giuliani, C., Benadi, G., Engel, F., Werner, J., Watter, M., Schwarzer, G., Groß, O., Zeiser, R., Binder, H., & Kaier, K. (2025). Identifying biomedical entities for datasets in scientific articles – A 4-step cache-augmented generation approach using GPT-4o and PubTator 3.0. medRxiv. https://doi.org/10.1101/2025.03.04.25323310
  • Watter, M., Giuliani, C., Benadi, G., Engel, F., Binder, H., & Kaier, K. (2025). Automated Identification of Contextually Relevant Biomedical Entities with Grounded LLMs. medRxiv. https://doi.org/10.1101/2025.07.07.25331004