Enhancing BERT Performance with LLMs: Structured Data Augmentation for Biomedical Entity Recognition
Description
Abstract
Large Language Models (LLMs) have shown remarkable capabilities across many NLP tasks, but their performance on domain-specific named entity recognition (NER), such as in the biomedical field, remains limited. Meanwhile, BERT-based models continue to achieve strong results in biomedical NER but require substantial amounts of high-quality annotated data. In this work, we investigate how to harness LLMs to generate auxiliary annotation data for BERT-based NER models, offering a cost-effective alternative to manual annotation. We address three key research questions: (1) whether LLMs or fine-tuned BERT models provide more effective weak supervision for improving BERT-based NER, (2) how to best integrate augmented and gold-standard data during training, and (3) how factors such as data source and augmentation size affect downstream performance. In particular, we introduce a structured supervision framework in which an LLM is fine-tuned to generate entity annotations in a context-rich JSON format, which are then decoded into token-level labels for BERT training. Experimental results on a biomedical NER dataset show that LLM-generated auxiliary annotation data effectively enhances BERT performance. Our findings provide practical insights into designing hybrid systems that combine LLMs and BERT for scalable, high-quality biomedical NER.
This article is part of the Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference on Artificial Intelligence (IJCAI).
Files

| Name | Size |
|---|---|
| BC9_paper06.pdf (md5:5f221aa047671664922fd1cc8a05d662) | 168.0 kB |