Ontology- and LLM-based Data Alignment Evaluation: Mapping Patient Outcomes and ICD-10 codes to MONDO and HPO ontologies
Description
This repository contains data and scripts for our research paper:
"Ontology- and LLM-based Data Harmonization for Federated Learning in Healthcare"
The paper presents a generic LLM-based pipeline for harmonizing data across distributed data sources. Collaboration with the MPRINT project motivated the discussion in Section III of the manuscript (Aligning Biomedical Data Via Ontologies) and provided the data for the experiments and evaluation presented in Section IV (Data Alignment For Drug Reporting Use Case).
More broadly, the overall pipeline serves as a function within the Brane/EPI Federated Learning (FL) framework (discussed in Section II). Such FL frameworks are deployed inside the firewall of a health organization and convert data from its source format to the target format expected by researchers designing federated studies. Because the data itself is not copied, data privacy and integrity are not affected.
In the MPRINT scenario presented in the paper, the goal was to map source data to the target Mondo Disease Ontology (MONDO) or Human Phenotype Ontology (HPO) format. The source data were either (a) not annotated, i.e., outcomes were given in plain English, or (b) annotated with an unrelated ontology, the International Classification of Diseases, 10th Revision (ICD-10). Here, we evaluate how well an LLM-based mapping pipeline bridges the source and target formats by comparing its decisions with those of a human operator.
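As a rough illustration of what comparing LLM and human acceptance decisions can look like, the sketch below counts agreements and disagreements over candidate pairs and derives accuracy, precision, and recall with the human decision as the reference. The `Judgement` record, its field names, and the toy rows are hypothetical; they are not taken from `eval_llm.py` or from the evaluation files, and the paper may report different metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgement:
    """One candidate pair with both verdicts (hypothetical layout)."""
    source_code: str     # e.g. an ICD-10 code or a plain-English outcome
    target_id: str       # e.g. a MONDO or HPO term identifier
    llm_accepts: bool
    human_accepts: bool


def agreement_metrics(rows):
    """Score LLM verdicts against human verdicts, human as reference."""
    tp = sum(r.llm_accepts and r.human_accepts for r in rows)
    fp = sum(r.llm_accepts and not r.human_accepts for r in rows)
    fn = sum(not r.llm_accepts and r.human_accepts for r in rows)
    tn = sum(not r.llm_accepts and not r.human_accepts for r in rows)
    total = len(rows)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up placeholder rows, for illustration only.
toy_rows = [
    Judgement("SRC:1", "TARGET:0001", True, True),
    Judgement("SRC:1", "TARGET:0002", True, False),
    Judgement("SRC:2", "TARGET:0003", False, True),
    Judgement("SRC:3", "TARGET:0004", False, False),
    Judgement("SRC:4", "TARGET:0005", True, True),
]

metrics = agreement_metrics(toy_rows)
print(metrics)
```

Treating the human verdict as the reference is a design choice of this sketch; a symmetric agreement statistic could be used instead if neither rater is considered ground truth.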
- scripts
  - data
    - input
      - hp.json - Human Phenotype Ontology (HPO) (https://obofoundry.org/ontology/hp.html)
      - mondo.json - Mondo Disease Ontology (https://mondo.monarchinitiative.org/pages/download/)
      - snomed.txt - SNOMED CT content from snomed/Full/Refset/Map/der2_iisssccRefset_ExtendedMapFull_US1000124_20240901.txt (https://www.nlm.nih.gov/research/umls/mapping_projects/snomedct_to_icd10cm.html)
      - MPRINT_MarketScan_Phenotypes.csv - Input ICD-10 codes from the MPRINT dataset
    - output (files produced by methods in gen_snomed_mondo_hpo.py)
      - full_icd10_to_snomed.csv - Extracted ICD-10 to SNOMED code mapping
      - filtered_icd10_to_snomed.tsv - Subset of ICD-10 to SNOMED mapping relevant for our project
      - snomed_to_mondo.csv - SNOMED to MONDO code mapping from mondo.json
      - snomed_to_hpo.csv - SNOMED to HPO code mapping from hp.json
      - icd10_to_mondo_hpo_snomed.tsv - ICD-10 to MONDO/HPO mapping for input ICD-10 subset
      - icd_to_mondo_hpo_rag_llm.tsv - LLM acceptance of ICD-10 to MONDO/HPO candidate pairs, full set
    - eval
      - subset-eval_icd10_to_mondo_hpo_rag_llm-vs-human.tsv - LLM vs human acceptance evaluation for a subset of candidate pairs produced by RAG
      - subset-eval_icd10_to_mondo_hpo_snomed_llm-vs-human.tsv - LLM vs human acceptance evaluation for a subset of candidate pairs produced via SNOMED mapping
  - gen_llm_rag_mondo_hpo.py - script to generate candidate matching pairs by vector similarity search (RAG)
  - gen_snomed_mondo_hpo.py - script to generate candidate matching pairs via SNOMED mapping
  - eval_llm.py - script to evaluate candidate matching pairs
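The output files listed above suggest a two-step chain: ICD-10 codes are first mapped to SNOMED codes, and SNOMED codes are then mapped to MONDO or HPO terms through cross-references in the ontology files. The sketch below illustrates that idea on toy data. It assumes the ontology JSON follows an OBO Graphs-style layout (`graphs` > `nodes` > `meta` > `xrefs` > `val`) and that SNOMED cross-references carry a prefix such as `SCTID:` or `SNOMEDCT_US:`; both are assumptions to check against the actual files. The function names and all identifiers are hypothetical and are not taken from `gen_snomed_mondo_hpo.py`.

```python
from collections import defaultdict

# Assumed SNOMED cross-reference prefixes; verify against hp.json and mondo.json.
SNOMED_PREFIXES = ("SCTID:", "SNOMEDCT_US:")


def snomed_xrefs(ontology, prefixes=SNOMED_PREFIXES):
    """Map each SNOMED id to the set of ontology term ids that cite it."""
    mapping = defaultdict(set)
    for graph in ontology.get("graphs", []):
        for node in graph.get("nodes", []):
            for xref in node.get("meta", {}).get("xrefs", []):
                value = xref.get("val", "")
                for prefix in prefixes:
                    if value.startswith(prefix):
                        mapping[value[len(prefix):]].add(node["id"])
    return mapping


def compose(icd10_to_snomed, snomed_to_target):
    """Chain ICD-10 -> SNOMED -> target term; drop codes with no match."""
    result = defaultdict(set)
    for icd10_code, snomed_ids in icd10_to_snomed.items():
        for snomed_id in snomed_ids:
            result[icd10_code] |= snomed_to_target.get(snomed_id, set())
    return {code: sorted(terms) for code, terms in result.items() if terms}


# Made-up placeholder data, for illustration only.
toy_ontology = {
    "graphs": [{
        "nodes": [
            {"id": "TARGET:0001",
             "meta": {"xrefs": [{"val": "SCTID:111"}, {"val": "OTHER:9"}]}},
            {"id": "TARGET:0002",
             "meta": {"xrefs": [{"val": "SNOMEDCT_US:222"}]}},
            {"id": "TARGET:0003"},  # node without cross-references
        ]
    }]
}
toy_icd10_to_snomed = {"X00.0": {"111", "222"}, "X00.1": {"333"}}

candidates = compose(toy_icd10_to_snomed, snomed_xrefs(toy_ontology))
print(candidates)
```

In this sketch a code with no SNOMED-linked target term is simply dropped; such codes are natural inputs for an alternative candidate generator such as the vector similarity search mentioned above.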
Files
Name | Size | MD5 |
---|---|---|
icd10-mondo-hpo-mapping.zip | 53.3 MB | fdfc696d64faa7fdcbb2d1ddac2a84ba |
Additional details
Dates
- Submitted: 2025-05-14 (Evaluation datasets and scripts released)