BERT-based Symptom Extraction NER Pipeline

Mlakar, Izidor; Sallauka, Rigon; Arioz, Umut; Rojc, Matej

doi:10.5281/zenodo.13918323

Published October 11, 2024 | Version v1

Software Open


1. University of Maribor, Faculty of Electrical Engineering and Computer Science, HUMADEX Research Group

Weakly Supervised NER pipeline

The NER pipeline is designed to automatically annotate medical text in English and extend this functionality to seven additional languages. The core components include annotation using a clinical model, translation of the annotated text, and fine-tuning language-specific models.

Pipeline Workflow

Annotation of English Data

English medical texts are annotated using the Stanza clinical model. The following entity tags are used: PROBLEM, TEST, and TREATMENT.
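The annotation step could be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes the "Stanza clinical model" refers to Stanza's `mimic` package with the `i2b2` NER model (which emits PROBLEM, TEST, and TREATMENT), and it assumes the annotations are stored as token-level BIO tags. The helper names `annotate_with_stanza` and `spans_to_bio` are hypothetical.

```python
from typing import List, Tuple

Entity = Tuple[int, int, str]  # (start_char, end_char, label)


def annotate_with_stanza(text: str) -> List[Entity]:
    """Run Stanza clinical NER and return character-level entity spans.

    Assumption: the 'mimic' package with the 'i2b2' NER model; the record
    itself only says "Stanza clinical model".
    """
    import stanza  # imported lazily so spans_to_bio works without Stanza

    stanza.download("en", package="mimic", processors={"ner": "i2b2"})
    nlp = stanza.Pipeline("en", package="mimic", processors={"ner": "i2b2"})
    doc = nlp(text)
    return [(e.start_char, e.end_char, e.type) for e in doc.entities]


def spans_to_bio(text: str, entities: List[Entity]) -> List[Tuple[str, str]]:
    """Whitespace-tokenize text and give each token a BIO tag (assumed scheme)."""
    tagged = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)  # character offset of this token
        end = start + len(token)
        pos = end
        tag = "O"
        for ent_start, ent_end, label in entities:
            # A token is tagged only if it lies fully inside an entity span.
            if start >= ent_start and end <= ent_end:
                tag = ("B-" if start == ent_start else "I-") + label
                break
        tagged.append((token, tag))
    return tagged
```

Whitespace tokenization is used here only to keep the sketch short; punctuation attached to a token can push it outside an entity span, so a real pipeline would more likely reuse the tokens produced by Stanza.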

Translation into Multiple Languages

The annotated English dataset is translated into seven languages: German, Italian, Spanish, Greek, Slovenian, Polish, and Portuguese.

Fine-tuning Multilingual BERT Models

Language-specific BERT models are fine-tuned using the translated datasets. The fine-tuning process adapts each model to recognize symptoms, tests, and treatments in its respective language.
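One detail that fine-tuning a BERT model for token classification usually requires is aligning word-level tags with subword tokens. The sketch below shows a common way to do this, under several assumptions not stated in the record: that the tags follow a BIO scheme over PROBLEM, TEST, and TREATMENT, that only the first subword of each word is scored, and that `word_ids` has the shape returned by the `word_ids()` method of a Hugging Face fast tokenizer. The names `LABELS` and `align_labels` are hypothetical.

```python
from typing import List, Optional

# Assumed BIO label set built from the three entity tags named above.
LABELS = [
    "O",
    "B-PROBLEM", "I-PROBLEM",
    "B-TEST", "I-TEST",
    "B-TREATMENT", "I-TREATMENT",
]
LABEL2ID = {label: i for i, label in enumerate(LABELS)}
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def align_labels(word_ids: List[Optional[int]], word_labels: List[str]) -> List[int]:
    """Map word-level tags onto subword tokens.

    word_ids holds, for each subword token, the index of the word it came
    from, or None for special tokens such as [CLS] and [SEP].
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # special token: excluded from the loss
        elif word_id != previous:
            aligned.append(LABEL2ID[word_labels[word_id]])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)  # later subwords of the same word
        previous = word_id
    return aligned
```

Masking the later subwords keeps each word counted once in the loss; labelling every subword with the word's tag is an equally common alternative.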

Supported Languages

The pipeline supports the following languages:

English (base language)
German
Italian
Spanish
Greek
Slovenian
Polish
Portuguese

Pre-Trained Models:

The pre-trained models are available on the Hugging Face Hub:

English: https://huggingface.co/HUMADEX/english_medical_ner
German: https://huggingface.co/HUMADEX/german_medical_ner
Italian: https://huggingface.co/HUMADEX/italian_medical_ner
Spanish: https://huggingface.co/HUMADEX/spanish_medical_ner
Greek: https://huggingface.co/HUMADEX/greek_medical_ner
Slovenian: https://huggingface.co/HUMADEX/slovenian_medical_ner
Polish: https://huggingface.co/HUMADEX/polish_medical_ner
Portuguese: https://huggingface.co/HUMADEX/portugese_medical_ner

Visit the weakly-supervised-multi-lingual-ner-pipeline collection on the Hugging Face Hub to see the models and datasets.
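As a rough illustration of how one of the listed models might be loaded, the sketch below maps each supported language to the model identifier taken from the URLs above and wraps it in a Hugging Face `transformers` token-classification pipeline. The helper names `MODEL_IDS`, `model_id`, and `load_ner` are hypothetical, and the choice of `aggregation_strategy="simple"` is an assumption rather than a documented recommendation.

```python
# Model identifiers copied from the URLs listed above.
MODEL_IDS = {
    "english": "HUMADEX/english_medical_ner",
    "german": "HUMADEX/german_medical_ner",
    "italian": "HUMADEX/italian_medical_ner",
    "spanish": "HUMADEX/spanish_medical_ner",
    "greek": "HUMADEX/greek_medical_ner",
    "slovenian": "HUMADEX/slovenian_medical_ner",
    "polish": "HUMADEX/polish_medical_ner",
    "portuguese": "HUMADEX/portugese_medical_ner",  # spelled as in the URL above
}


def model_id(language: str) -> str:
    """Return the Hub identifier for a supported language name."""
    key = language.strip().lower()
    if key not in MODEL_IDS:
        raise ValueError(f"Unsupported language: {language!r}")
    return MODEL_IDS[key]


def load_ner(language: str):
    """Build a token-classification pipeline for the given language.

    Requires the transformers package and network access on first use.
    """
    from transformers import pipeline  # imported lazily; heavy dependency

    return pipeline(
        "token-classification",
        model=model_id(language),
        aggregation_strategy="simple",  # assumption: merge subwords into entity spans
    )
```

Calling `load_ner("english")` on a sentence should return a list of dictionaries, one per detected entity, with keys such as `entity_group`, `word`, `score`, `start`, and `end`.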

Acknowledgement:

This code was created as part of joint research of the HUMADEX research group (https://www.linkedin.com/company/101563689/) and has received funding from the European Union Horizon Europe Research and Innovation Program project SMILE (grant number 101080923) and the Marie Skłodowska-Curie Actions (MSCA) Doctoral Networks project BosomShield (grant number 101073222). Responsibility for the information and views expressed herein lies entirely with the authors.

Please cite as:

Article title: Weakly-Supervised Multilingual Medical NER For Symptom Extraction For Low-Resource Languages
DOI: 10.20944/preprints202504.1356.v1
Website: https://www.preprints.org/manuscript/202504.1356/v1

Files

Files (122.1 kB)

Name	Size
MedicalNER.docx (md5:f4abb2432f28daa85d74973a6b5e2164)	122.1 kB

Additional details

Funding

European Commission: SMILE - Supporting Mental Health in Young People: Integrated Methodology for cLinical dEcisions and evidence-based interventions (grant number 101080923)
European Commission: BosomShield - A comprehensive CAD system based on radiologic- and pathologic-image biomarkers for diagnosis and prognosis of breast cancer relapse (grant number 101073222)

Repository URL: https://github.com/HUMADEX/Weekly-Supervised-NER-pipline
Programming language: Python
Development Status: Active

