Ontology- and LLM-based Data Alignment Evaluation: Mapping Patient Outcomes and ICD-10 codes to MONDO and HPO ontologies

Kokash, Natallia; de Bono, Bernard

doi:10.5281/zenodo.15411810

Published May 14, 2025 | Version v1

Dataset Open

1. University of Amsterdam
2. Indiana University School of Medicine

This repository contains data and scripts for our research paper: 
"Ontology- and LLM-based Data Harmonization for Federated Learning in Healthcare"

The paper presents a generic LLM-based pipeline for data harmonization across distributed data sources. Collaboration with the MPRINT project contributed to the motivation articulated in Section III of the manuscript (Aligning Biomedical Data Via Ontologies) and provided the data for the experiments and evaluation presented in Section IV (Data Alignment For Drug Reporting Use Case).

More broadly, the overall pipeline becomes a function within the Federated Learning (FL) Brane/EPI framework (discussed in Section II). Such FL frameworks are deployed within the firewall of a health organization to convert data from its source format to the target format expected by researchers designing federated studies. Because the data itself is not copied, data privacy and integrity are not affected.

In the MPRINT scenario presented in the paper, the goal was to map source data to the target Mondo Disease Ontology (MONDO) or Human Phenotype Ontology (HPO) format. The source data were either (a) not annotated, i.e., outcomes were given in plain English, or (b) annotated with an unrelated ontology, the International Classification of Diseases, 10th Revision (ICD-10). Here, we evaluate how well an LLM-based mapping pipeline bridges the source and target formats by comparing its decisions with those of a human operator.
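A comparison of LLM and human acceptance decisions can be summarized with standard agreement measures. The following is a minimal sketch, not the evaluation code from the repository: it assumes each candidate pair carries one boolean accept/reject decision from the LLM and one from the human, and it treats the human decision as the reference when computing precision and recall. The function name `agreement_metrics` and the choice of metrics are assumptions made for illustration.

```python
from collections import Counter


def agreement_metrics(llm, human):
    """Compare LLM accept/reject decisions with human decisions.

    Both arguments are equal-length lists of booleans (True = pair accepted).
    The human labels are treated as the reference (an assumption).
    """
    if len(llm) != len(human) or not llm:
        raise ValueError("need two non-empty label lists of equal length")

    counts = Counter(zip(llm, human))
    tp = counts[(True, True)]    # both accept
    fp = counts[(True, False)]   # LLM accepts, human rejects
    fn = counts[(False, True)]   # LLM rejects, human accepts
    tn = counts[(False, False)]  # both reject
    n = tp + fp + fn + tn

    observed = (tp + tn) / n
    # Agreement expected by chance, from each rater's marginal accept rate.
    expected = ((tp + fp) / n) * ((tp + fn) / n) + ((fn + tn) / n) * ((fp + tn) / n)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0

    return {
        "accuracy": observed,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "kappa": kappa,
    }
```

Cohen's kappa is included alongside raw accuracy because acceptance labels are often imbalanced, and kappa discounts the agreement that two raters would reach by chance.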

- scripts
    - data
        - input
            - hp.json - Human Phenotype Ontology (HPO) (https://obofoundry.org/ontology/hp.html)
            - mondo.json - Mondo Disease Ontology (https://mondo.monarchinitiative.org/pages/download/)
            - snomed.txt - SNOMED CT content from snomed/Full/Refset/Map/der2_iisssccRefset_ExtendedMapFull_US1000124_20240901.txt
              (https://www.nlm.nih.gov/research/umls/mapping_projects/snomedct_to_icd10cm.html)
            - MPRINT_MarketScan_Phenotypes.csv - Input ICD-10 codes from the MPRINT dataset
        - output (files produced by methods in gen_snomed_mondo_hpo.py)
            - full_icd10_to_snomed.csv - Extracted ICD-10 to SNOMED code mapping
            - filtered_icd10_to_snomed.tsv - Subset of ICD-10 to SNOMED mapping relevant for our project
            - snomed_to_mondo.csv - SNOMED to MONDO code mapping from mondo.json
            - snomed_to_hpo.csv - SNOMED to HPO code mapping from hp.json
            - icd10_to_mondo_hpo_snomed.tsv - ICD-10 to MONDO/HPO mapping for input ICD-10 subset
            - icd_to_mondo_hpo_rag_llm.tsv - LLM acceptance of ICD-10 to MONDO/HPO candidate pairs, full set
        - eval
            - subset-eval_icd10_to_mondo_hpo_rag_llm-vs-human.tsv - LLM vs human acceptance evaluation for a subset of candidate pairs produced by RAG
            - subset-eval_icd10_to_mondo_hpo_snomed_llm-vs-human.tsv - LLM vs human acceptance evaluation for a subset of candidate pairs produced via SNOMED mapping
    - gen_llm_rag_mondo_hpo.py - script to generate candidate matching pairs by vector similarity search (RAG)
    - gen_snomed_mondo_hpo.py - script to generate candidate matching pairs via SNOMED mapping
    - eval_llm.py - script to evaluate candidate matching pairs
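The SNOMED route listed above chains two mappings: ICD-10 to SNOMED, then SNOMED to MONDO or HPO. The sketch below illustrates that composition as a join on the shared SNOMED code. It is not taken from gen_snomed_mondo_hpo.py: the function name `compose_mappings`, the pair-of-strings input shape, and the identifiers in the example are hypothetical placeholders, not real codes or the actual column layout of the output files.

```python
from collections import defaultdict


def compose_mappings(icd10_to_snomed, snomed_to_target):
    """Join (icd10, snomed) pairs with (snomed, target) pairs on the SNOMED code.

    Returns a sorted list of (icd10, snomed, target) candidate triples.
    ICD-10 codes whose SNOMED code has no target mapping are dropped.
    """
    targets_by_snomed = defaultdict(set)
    for snomed, target in snomed_to_target:
        targets_by_snomed[snomed].add(target)

    candidates = set()
    for icd10, snomed in icd10_to_snomed:
        for target in targets_by_snomed.get(snomed, ()):
            candidates.add((icd10, snomed, target))
    return sorted(candidates)


# Placeholder identifiers for illustration only (not real ontology codes).
example = compose_mappings(
    [("ICD-A", "SCT-1"), ("ICD-B", "SCT-2"), ("ICD-C", "SCT-9")],
    [("SCT-1", "MONDO-X"), ("SCT-1", "HP-Y"), ("SCT-2", "MONDO-Z")],
)
```

Because one SNOMED concept may map to several target terms, the join can yield more than one candidate per ICD-10 code, which is why the resulting pairs are candidates that still need an acceptance decision.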

Files

icd10-mondo-hpo-mapping.zip

Files (53.3 MB)

Name	Checksum	Size
icd10-mondo-hpo-mapping.zip	md5:fdfc696d64faa7fdcbb2d1ddac2a84ba	53.3 MB
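The MD5 checksum listed for icd10-mondo-hpo-mapping.zip can be used to check that a downloaded copy is intact. The sketch below uses only the Python standard library; the helper names `md5_of_file` and `verify_archive` are illustrative, and the local path passed to them is whatever location the archive was saved to.

```python
import hashlib

# Checksum as published in the file listing for icd10-mondo-hpo-mapping.zip.
EXPECTED_MD5 = "fdfc696d64faa7fdcbb2d1ddac2a84ba"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()
```

Reading in chunks keeps memory use flat regardless of archive size. MD5 here serves only as a transfer-integrity check, not as a security guarantee.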

Additional details

Submitted: 2025-05-14

Evaluation datasets and scripts released

Programming language: Python

	All versions	This version
Views	95	95
Downloads	10	10
Data volume	533.3 MB	533.3 MB
