BERT-based Symptom Extraction NER Pipeline
Creators
- University of Maribor, Faculty of Electrical Engineering and Computer Science, HUMADEX Research Group
Description
Weakly Supervised NER pipeline
The NER pipeline is designed to automatically annotate medical text in English and extend this functionality to seven additional languages. The core components include annotation using a clinical model, translation of the annotated text, and fine-tuning language-specific models.
Pipeline Workflow
Annotation of English Data
English medical texts are annotated using the Stanza clinical model. The following entity tags are used: PROBLEM, TEST, and TREATMENT.
Translation into Multiple Languages
The annotated English dataset is translated into seven languages:
- German
- Italian
- Spanish
- Greek
- Slovenian
- Polish
- Portuguese
Fine-tuning Multilingual BERT Models
Language-specific BERT models are fine-tuned using the translated datasets. The fine-tuning process adapts each model to recognize symptoms, tests, and treatments in its respective language.
Supported Languages
The pipeline supports the following languages:
- English (base language)
- German
- Italian
- Spanish
- Greek
- Slovenian
- Polish
- Portuguese
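As an illustration of the "Annotation of English Data" step, the following is a minimal sketch and not the authors' code. It assumes Stanza's publicly documented clinical package (`mimic` with the `i2b2` NER model), which emits PROBLEM, TEST, and TREATMENT tags in the BIOES scheme. The helper names `bioes_to_bio` and `annotate` are hypothetical, and converting to BIO tags is an assumption about a convenient format for later fine-tuning; the pipeline's actual tag scheme is not stated here.

```python
from typing import List, Tuple


def bioes_to_bio(tags: List[str]) -> List[str]:
    """Map BIOES tags (B-, I-, E-, S-, O) onto the simpler BIO scheme."""
    converted = []
    for tag in tags:
        if tag.startswith("S-"):
            # A single-token entity becomes the beginning of an entity.
            converted.append("B-" + tag[2:])
        elif tag.startswith("E-"):
            # The last token of an entity becomes an inside token.
            converted.append("I-" + tag[2:])
        else:
            # B-, I- and O tags are identical in both schemes.
            converted.append(tag)
    return converted


def annotate(text: str) -> List[List[Tuple[str, str]]]:
    """Tag English clinical text; returns (token, BIO tag) pairs per sentence.

    Assumption: Stanza is installed and its clinical models can be downloaded.
    """
    import stanza  # imported lazily so bioes_to_bio works without Stanza

    stanza.download("en", package="mimic", processors={"ner": "i2b2"})
    nlp = stanza.Pipeline("en", package="mimic", processors={"ner": "i2b2"})
    doc = nlp(text)

    sentences = []
    for sentence in doc.sentences:
        tokens = [token.text for token in sentence.tokens]
        tags = bioes_to_bio([token.ner for token in sentence.tokens])
        sentences.append(list(zip(tokens, tags)))
    return sentences
```

Calling `annotate("The patient reported chest pain and was given aspirin.")` would download the clinical models on first use and return token-level tags that could then be carried through translation and used as weak labels.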
Pre-Trained Models:
The pre-trained models are available at:
- English: https://huggingface.co/HUMADEX/english_medical_ner
- German: https://huggingface.co/HUMADEX/german_medical_ner
- Italian: https://huggingface.co/HUMADEX/italian_medical_ner
- Spanish: https://huggingface.co/HUMADEX/spanish_medical_ner
- Greek: https://huggingface.co/HUMADEX/greek_medical_ner
- Slovenian: https://huggingface.co/HUMADEX/slovenian_medical_ner
- Polish: https://huggingface.co/HUMADEX/polish_medical_ner
- Portuguese: https://huggingface.co/HUMADEX/portugese_medical_ner
Visit the weakly-supervised-multi-lingual-ner-pipeline collection on the Hugging Face Hub to see the models and datasets.
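The models listed above could be loaded roughly as follows. This is a hedged sketch that assumes the checkpoints are compatible with the standard Hugging Face `transformers` token-classification pipeline; the `MODEL_IDS` mapping and the `load_ner` helper are hypothetical names introduced for illustration. The repository IDs are copied from the URLs above, including the spelling of the Portuguese one.

```python
# Hub repository IDs taken from the model URLs listed above.
MODEL_IDS = {
    "english": "HUMADEX/english_medical_ner",
    "german": "HUMADEX/german_medical_ner",
    "italian": "HUMADEX/italian_medical_ner",
    "spanish": "HUMADEX/spanish_medical_ner",
    "greek": "HUMADEX/greek_medical_ner",
    "slovenian": "HUMADEX/slovenian_medical_ner",
    "polish": "HUMADEX/polish_medical_ner",
    "portuguese": "HUMADEX/portugese_medical_ner",  # spelled as in the published URL
}


def load_ner(language: str):
    """Return a token-classification pipeline for the given language.

    Assumption: the checkpoint loads with the generic pipeline API.
    """
    from transformers import pipeline  # lazy import; requires `transformers`

    model_id = MODEL_IDS[language.lower()]
    # aggregation_strategy="simple" merges subword pieces into whole entity spans.
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )
```

For example, `load_ner("english")("The patient complained of chest pain.")` would return a list of dictionaries, one per detected entity, each with its label, score, and character offsets.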
Acknowledgement:
This code was created as part of joint research by the HUMADEX research group (https://www.linkedin.com/company/101563689/) and has received funding from the European Union Horizon Europe Research and Innovation Program project SMILE (grant number 101080923) and the Marie Skłodowska-Curie Actions (MSCA) Doctoral Networks project BosomShield (grant number 101073222). Responsibility for the information and views expressed herein lies entirely with the authors.
Please cite as:
Article title: Weakly-Supervised Multilingual Medical NER For Symptom Extraction For Low-Resource Languages
DOI: 10.20944/preprints202504.1356.v1
Website: https://www.preprints.org/manuscript/202504.1356/v1
Files
Checksum | Size |
---|---|
md5:f4abb2432f28daa85d74973a6b5e2164 | 122.1 kB |
Additional details
Funding
- European Commission
- SMILE - Supporting Mental Health in Young People: Integrated Methodology for cLinical dEcisions and evidence-based interventions 101080923
- European Commission
- BosomShield - A comprehensive CAD system based on radiologic- and pathologic-image biomarkers for diagnosis and prognosis of breast cancer relapse 101073222
Software
- Repository URL
- https://github.com/HUMADEX/Weekly-Supervised-NER-pipline
- Programming language
- Python
- Development Status
- Active