Published January 22, 2025 | Version v1
Software | Open Access

Generalizable and Scalable Multistage Biomedical Concept Normalization Leveraging Large Language Models

Description

Background: Biomedical entity normalization is critical to biomedical research because the richness of
free-text clinical data, such as progress notes, can often be fully leveraged only after translating words and
phrases into structured and coded representations suitable for analysis. Large Language Models (LLMs) have shown strong potential and high performance across a variety of natural language processing (NLP) tasks, but their application to normalization remains understudied.
Methods: We applied both proprietary and open-source LLMs in combination with several rule-based normalization systems commonly used in biomedical research. We used a two-step LLM integration approach: (1) using an LLM to generate alternative phrasings of a source utterance, and (2) using an LLM to prune candidate UMLS concepts, with a variety of prompting methods (sketched below). We measured results with Fβ, which weights recall over precision, and with F1.
Results: We evaluated a total of 5,523 concept terms and text contexts from a publicly available dataset
of human-annotated biomedical abstracts. Incorporating GPT-3.5-turbo increased overall Fβ and F1 in normalization systems by +16.5 and +16.2 (OpenAI embeddings), +9.5 and +7.3 (MetaMapLite), +13.9 and +10.9 (QuickUMLS), and +10.5 and +10.3 (BM25), while the open-source Vicuna model achieved +20.2 and +21.7 (OpenAI embeddings), +10.8 and +12.2 (MetaMapLite), +14.7 and +15.0 (QuickUMLS), and +15.6 and +18.7 (BM25).
Conclusions: Existing general-purpose LLMs, both proprietary and open-source, can be leveraged to greatly improve the normalization performance of existing tools, with no fine-tuning.
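
A minimal Python sketch of the two-step integration described under Methods appears below. It is an illustration under stated assumptions: the helper names (generate_paraphrases, retrieve_candidates, prune_candidates), the prompts, and the default beta value are hypothetical placeholders, not the repository's actual API.

# Minimal sketch of the two-step LLM integration described above.
# The callables `llm` (prompt -> text) and `normalizer` (phrase -> candidates)
# are hypothetical placeholders, not the repository's actual interfaces.
from dataclasses import dataclass

@dataclass
class Candidate:
    cui: str      # UMLS Concept Unique Identifier
    name: str     # preferred concept name
    score: float  # retrieval score from the base normalizer

def generate_paraphrases(llm, utterance: str, n: int = 5) -> list[str]:
    """Step 1: ask an LLM for alternative phrasings of the source utterance."""
    prompt = (f"Give {n} alternative medical phrasings for: '{utterance}'. "
              "Return one phrase per line.")
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def retrieve_candidates(normalizer, phrases: list[str]) -> list[Candidate]:
    """Run a rule-based normalizer (e.g. QuickUMLS, MetaMapLite, or BM25 retrieval)
    on the original utterance plus its paraphrases and pool candidate concepts."""
    pooled: dict[str, Candidate] = {}
    for phrase in phrases:
        for cand in normalizer(phrase):
            best = pooled.get(cand.cui)
            if best is None or cand.score > best.score:
                pooled[cand.cui] = cand
    return sorted(pooled.values(), key=lambda c: c.score, reverse=True)

def prune_candidates(llm, utterance: str, context: str,
                     candidates: list[Candidate]) -> list[Candidate]:
    """Step 2: ask an LLM which pooled UMLS candidates fit the utterance in its
    text context, and keep only those."""
    listing = "\n".join(f"{c.cui}: {c.name}" for c in candidates)
    prompt = (f"Text: {context}\nMention: {utterance}\n"
              f"Candidate UMLS concepts:\n{listing}\n"
              "List the CUIs that correctly normalize the mention.")
    kept = set(llm(prompt).split())
    return [c for c in candidates if c.cui in kept]

def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score; beta > 1 weights recall over precision, beta = 1 gives F1.
    The default beta here is illustrative, not the value used in the paper."""
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

In this sketch, paraphrases from step (1) widen the pool of candidates returned by a base normalizer, step (2) filters that pool against the mention's context, and f_beta with beta > 1 reflects the recall-favoring evaluation described above.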

Files (122.6 kB)

requirements.txt
