Published May 14, 2025 | Version v1
Dataset Open

Ontology- and LLM-based Data Alignment Evaluation: Mapping Patient Outcomes and ICD-10 codes to MONDO and HPO ontologies

  • 1. University of Amsterdam
  • 2. Indiana University School of Medicine

Description

This repository contains data and scripts for our research paper: 
"Ontology- and LLM-based Data Harmonization for Federated Learning in Healthcare"

The paper presents a generic LLM-based pipeline to enable data harmonization across distributed data sources. Collaboration with the MPRINT project contributed to the motivation articulated in the manuscript’s Section III (Aligning Biomedical Data Via Ontologies) and provided the data for the experiments and evaluation presented in Section IV (Data Alignment For Drug Reporting Use Case).

More broadly, the overall pipeline becomes a function within the Federated Learning (FL) Brane/EPI framework (discussed in Section II). Such FL frameworks are deployed within the firewall of a health organization to convert data from their source format to the target format expected by researchers designing federated studies. Because the data themselves are not copied, data privacy and integrity are not affected.

In the MPRINT scenario presented in the paper, the goal was to map source data to the target Mondo Disease Ontology (MONDO) or Human Phenotype Ontology (HPO) format. The source data were either (a) not annotated, i.e., outcomes were given in plain English, or (b) annotated with an unrelated ontology, the International Classification of Diseases, 10th Revision (ICD-10). Here, we evaluate the LLM-based mapping pipeline that bridges the source and target formats by comparing its decisions with those of a human operator.
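
The comparison between LLM and human decisions can be illustrated with a minimal sketch. It assumes each candidate pair carries a boolean accept/reject decision from the LLM and from the human operator, and it treats the human decision as the reference. The function name, the toy decisions, and the choice of metrics (accuracy, precision, recall, Cohen's kappa) are illustrative assumptions and are not taken from eval_llm.py.

```python
from collections import Counter


def agreement_metrics(llm, human):
    """Compare LLM accept/reject decisions with human ones (human = reference).

    Hypothetical helper for illustration; not the repository's eval_llm.py.
    """
    if len(llm) != len(human):
        raise ValueError("decision lists must have the same length")
    counts = Counter(zip(llm, human))
    tp = counts[(True, True)]    # both accept
    fp = counts[(True, False)]   # LLM accepts, human rejects
    fn = counts[(False, True)]   # LLM rejects, human accepts
    tn = counts[(False, False)]  # both reject
    n = tp + fp + fn + tn
    if n == 0:
        raise ValueError("no decisions to compare")
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (accuracy - p_expected) / (1 - p_expected) if p_expected < 1 else 1.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "kappa": kappa}


# Toy decisions for six candidate pairs (made up, not from the dataset).
llm_accepts = [True, True, False, True, False, False]
human_accepts = [True, False, False, True, True, False]
metrics = agreement_metrics(llm_accepts, human_accepts)
```

Cohen's kappa is included alongside accuracy because raw agreement can look high by chance when most candidate pairs fall into one class.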

- scripts
- data
- input
  - hp.json - Human Phenotype Ontology (HPO) (https://obofoundry.org/ontology/hp.html)
  - mondo.json - Mondo Disease Ontology (https://mondo.monarchinitiative.org/pages/download/)
  - snomed.txt - SNOMED CT content from snomed/Full/Refset/Map/der2_iisssccRefset_ExtendedMapFull_US1000124_20240901.txt (https://www.nlm.nih.gov/research/umls/mapping_projects/snomedct_to_icd10cm.html)
  - MPRINT_MarketScan_Phenotypes.csv - Input ICD-10 codes from the MPRINT dataset
- output (files produced by methods in gen_snomed_mondo_hpo.py)
  - full_icd10_to_snomed.csv - Extracted ICD-10 to SNOMED code mapping
  - filtered_icd10_to_snomed.tsv - Subset of the ICD-10 to SNOMED mapping relevant to our project
  - snomed_to_mondo.csv - SNOMED to MONDO code mapping from mondo.json
  - snomed_to_hpo.csv - SNOMED to HPO code mapping from hp.json
  - icd10_to_mondo_hpo_snomed.tsv - ICD-10 to MONDO/HPO mapping for the input ICD-10 subset
  - icd_to_mondo_hpo_rag_llm.tsv - LLM acceptance of ICD-10 to MONDO/HPO candidate pairs, full set
- eval
  - subset-eval_icd10_to_mondo_hpo_rag_llm-vs-human.tsv - LLM vs human acceptance evaluation for a subset of candidate pairs produced by RAG
  - subset-eval_icd10_to_mondo_hpo_snomed_llm-vs-human.tsv - LLM vs human acceptance evaluation for a subset of candidate pairs produced via SNOMED mapping
- gen_llm_rag_mondo_hpo.py - script to generate candidate matching pairs by vector similarity search (RAG)
- gen_snomed_mondo_hpo.py - script to generate candidate matching pairs via SNOMED mapping
- eval_llm.py - script to evaluate candidate matching pairs
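
The SNOMED route listed above, from ICD-10 to SNOMED and from SNOMED to MONDO/HPO, amounts to composing two mappings on their shared SNOMED identifier. The following is a minimal sketch of that composition under stated assumptions: each mapping is a list of (source, target) pairs, the function name is hypothetical, and all identifiers are placeholders rather than real codes. It does not reproduce gen_snomed_mondo_hpo.py or the actual column layout of the files.

```python
from collections import defaultdict


def chain_mappings(icd10_to_snomed, snomed_to_target):
    """Compose ICD-10 -> SNOMED and SNOMED -> target pairs into
    ICD-10 -> target candidate lists (hypothetical helper)."""
    # Index the second mapping by SNOMED id for constant-time lookup.
    by_snomed = defaultdict(set)
    for snomed_id, target_id in snomed_to_target:
        by_snomed[snomed_id].add(target_id)

    # Collect every target reachable from each ICD-10 code.
    candidates = defaultdict(set)
    for icd10_code, snomed_id in icd10_to_snomed:
        candidates[icd10_code] |= by_snomed.get(snomed_id, set())

    # Drop ICD-10 codes for which no target was found.
    return {code: sorted(ids) for code, ids in candidates.items() if ids}


# Placeholder identifiers only; these are not real ICD-10, SNOMED,
# MONDO, or HPO codes.
icd10_to_snomed = [("X00.0", "S1"), ("X00.0", "S2"), ("X00.1", "S3")]
snomed_to_target = [
    ("S1", "MONDO:EXAMPLE1"),
    ("S2", "HP:EXAMPLE1"),
    ("S2", "MONDO:EXAMPLE1"),
]
pairs = chain_mappings(icd10_to_snomed, snomed_to_target)
```

In this toy input, X00.1 has no SNOMED concept with a MONDO or HPO counterpart and therefore yields no candidates. A case like this is where a similarity-based route such as the RAG script would be the remaining option.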

 

Files

icd10-mondo-hpo-mapping.zip (53.3 MB)
md5:fdfc696d64faa7fdcbb2d1ddac2a84ba

Additional details

Dates

Submitted
2025-05-14
Evaluation datasets and scripts released

Software

Programming language
Python