Published January 22, 2025 | Version v1
Software | Open Access

Generalizable and Scalable Multistage Biomedical Concept Normalization Leveraging Large Language Models

Description

Background: Biomedical entity normalization is critical to biomedical research because the richness of
free-text clinical data, such as progress notes, can often be fully leveraged only after translating words and
phrases into structured and coded representations suitable for analysis. Large Language Models (LLMs) have shown strong potential and high performance across a variety of natural language processing (NLP) tasks, but their application to normalization remains understudied.
Methods: We applied both proprietary and open-source LLMs in combination with several rule-based normalization systems commonly used in biomedical research. We used a two-step LLM integration approach: (1) using an LLM to generate alternative phrasings of a source utterance, and (2) using an LLM to prune candidate UMLS concepts, with a variety of prompting methods (sketched below). We measured results with Fβ, which weights recall over precision, and with F1.
Results: We evaluated a total of 5,523 concept terms and text contexts from a publicly available dataset
of human-annotated biomedical abstracts. Incorporating GPT-3.5-turbo increased overall Fβ and F1 in normalization systems by +16.5 and +16.2 (OpenAI embeddings), +9.5 and +7.3 (MetaMapLite), +13.9 and +10.9 (QuickUMLS), and +10.5 and +10.3 (BM25), while the open-source Vicuna model achieved +20.2 and +21.7 (OpenAI embeddings), +10.8 and +12.2 (MetaMapLite), +14.7 and +15.0 (QuickUMLS), and +15.6 and +18.7 (BM25).
Conclusions: Existing general-purpose LLMs, both proprietary and open-source, can be leveraged to greatly improve the normalization performance of existing tools, with no fine-tuning.
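
A minimal Python sketch of the two-step integration described under Methods appears below. It is an illustration under stated assumptions: the helper names (generate_paraphrases, retrieve_candidates, prune_candidates), the prompts, and the default beta value are hypothetical placeholders, not the repository's actual API.

# Minimal sketch of the two-step LLM integration described above.
# The callables `llm` (prompt -> text) and `normalizer` (phrase -> candidates)
# are hypothetical placeholders, not the repository's actual interfaces.
from dataclasses import dataclass

@dataclass
class Candidate:
    cui: str      # UMLS Concept Unique Identifier
    name: str     # preferred concept name
    score: float  # retrieval score from the base normalizer

def generate_paraphrases(llm, utterance: str, n: int = 5) -> list[str]:
    """Step 1: ask an LLM for alternative phrasings of the source utterance."""
    prompt = (f"Give {n} alternative medical phrasings for: '{utterance}'. "
              "Return one phrase per line.")
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def retrieve_candidates(normalizer, phrases: list[str]) -> list[Candidate]:
    """Run a rule-based normalizer (e.g. QuickUMLS, MetaMapLite, or BM25 retrieval)
    on the original utterance plus its paraphrases and pool candidate concepts."""
    pooled: dict[str, Candidate] = {}
    for phrase in phrases:
        for cand in normalizer(phrase):
            best = pooled.get(cand.cui)
            if best is None or cand.score > best.score:
                pooled[cand.cui] = cand
    return sorted(pooled.values(), key=lambda c: c.score, reverse=True)

def prune_candidates(llm, utterance: str, context: str,
                     candidates: list[Candidate]) -> list[Candidate]:
    """Step 2: ask an LLM which pooled UMLS candidates fit the utterance in its
    text context, and keep only those."""
    listing = "\n".join(f"{c.cui}: {c.name}" for c in candidates)
    prompt = (f"Text: {context}\nMention: {utterance}\n"
              f"Candidate UMLS concepts:\n{listing}\n"
              "List the CUIs that correctly normalize the mention.")
    kept = set(llm(prompt).split())
    return [c for c in candidates if c.cui in kept]

def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score; beta > 1 weights recall over precision, beta = 1 gives F1.
    The default beta here is illustrative, not the value used in the paper."""
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

In this sketch, paraphrases from step (1) widen the pool of candidates returned by a base normalizer, step (2) filters that pool against the mention's context, and f_beta with beta > 1 reflects the recall-favoring evaluation described above.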

Files (122.6 kB)

requirements.txt
