Published October 11, 2024 | Version v1
Software Open

BERT-based Symptom Extraction NER Pipeline

  • 1. University of Maribor, Faculty of Electrical Engineering and Computer Science, HUMADEX Research Group

Description

Weakly Supervised NER pipeline

The NER pipeline is designed to automatically annotate medical text in English and extend this functionality to seven additional languages. The core components include annotation using a clinical model, translation of the annotated text, and fine-tuning language-specific models.

Pipeline Workflow

 

Annotation of English Data

English medical texts are annotated using the Stanza clinical model. The following entity tags are used: PROBLEM, TEST, and TREATMENT.
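The annotation step can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes Stanza's `mimic` package with the `i2b2` NER processor (whose tag set is PROBLEM, TEST, TREATMENT), and it assumes the BIOES tags Stanza emits are converted to BIO before training. The helper names `bioes_to_bio`, `build_pipeline`, and `annotate` are hypothetical.

```python
from typing import Callable, List, Tuple


def bioes_to_bio(tags: List[str]) -> List[str]:
    """Convert BIOES tags to BIO: S- becomes B-, E- becomes I-."""
    bio = []
    for tag in tags:
        if tag.startswith("S-"):
            bio.append("B-" + tag[2:])
        elif tag.startswith("E-"):
            bio.append("I-" + tag[2:])
        else:
            bio.append(tag)
    return bio


def build_pipeline():
    """Build a Stanza clinical pipeline (assumed: mimic package, i2b2 NER)."""
    import stanza  # imported lazily so the helpers above work without it

    return stanza.Pipeline("en", package="mimic", processors={"ner": "i2b2"})


def annotate(nlp: Callable, text: str) -> List[List[Tuple[str, str]]]:
    """Return one list of (token, BIO tag) pairs per sentence."""
    doc = nlp(text)
    sentences = []
    for sent in doc.sentences:
        tokens = [tok.text for tok in sent.tokens]
        tags = bioes_to_bio([tok.ner for tok in sent.tokens])
        sentences.append(list(zip(tokens, tags)))
    return sentences


if __name__ == "__main__":
    nlp = build_pipeline()  # fetches the models on first use
    for sentence in annotate(nlp, "The patient reports chest pain."):
        print(sentence)
```

Because the labels come from a model rather than from human annotators, the resulting dataset is weakly supervised: it inherits whatever errors the clinical model makes.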

Translation into Multiple Languages

The annotated English dataset is translated into seven languages: German, Italian, Spanish, Greek, Slovenian, Polish, and Portuguese.
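The description does not say how entity labels are carried through translation. One common approach, shown here purely as an assumption, is to wrap each entity in inline markers, translate the marked text, and recover the spans afterwards. The `translate` callable is a hypothetical stand-in for whatever machine translation system is used, and all function names are illustrative.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)

_TAG = re.compile(r"<(PROBLEM|TEST|TREATMENT)>(.*?)</\1>")


def mark_entities(text: str, spans: List[Span]) -> str:
    """Wrap each non-overlapping entity span in <LABEL>...</LABEL> markers."""
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])
        out.append(f"<{label}>{text[start:end]}</{label}>")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def unmark_entities(marked: str) -> Tuple[str, List[Span]]:
    """Strip the markers and return the clean text with recomputed spans."""
    parts, spans, cursor, offset = [], [], 0, 0
    for match in _TAG.finditer(marked):
        before = marked[cursor:match.start()]
        parts.append(before)
        offset += len(before)
        entity = match.group(2)
        parts.append(entity)
        spans.append((offset, offset + len(entity), match.group(1)))
        offset += len(entity)
        cursor = match.end()
    parts.append(marked[cursor:])
    return "".join(parts), spans


def translate_annotated(
    text: str, spans: List[Span], translate: Callable[[str], str]
) -> Tuple[str, List[Span]]:
    """Translate marked text and project the labels onto the translation."""
    return unmark_entities(translate(mark_entities(text, spans)))
```

In practice a translation system may drop, reorder, or alter the markers, so sentences whose markers do not survive intact would need to be filtered out or realigned.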

Fine-tuning Multilingual BERT Models

Language-specific BERT models are fine-tuned using the translated datasets. The fine-tuning process adapts each model to recognize symptoms, tests, and treatments in its respective language.
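A step that fine-tuning BERT-style models for NER typically requires is aligning word-level labels with subword tokens. The sketch below shows one standard way to do this; the BIO label set and the choice to label only the first subword of each word are assumptions, not details confirmed by this record.

```python
from typing import List, Optional

# Assumed BIO label set derived from the three entity tags in the pipeline.
LABELS = [
    "O",
    "B-PROBLEM", "I-PROBLEM",
    "B-TEST", "I-TEST",
    "B-TREATMENT", "I-TREATMENT",
]
LABEL2ID = {label: i for i, label in enumerate(LABELS)}
IGNORE_INDEX = -100  # default ignore index of PyTorch's cross-entropy loss


def align_labels(
    word_ids: List[Optional[int]], word_labels: List[str]
) -> List[int]:
    """Map word-level labels to subword positions.

    `word_ids` has one entry per subword token: the index of the word it
    belongs to, or None for special tokens such as [CLS] and [SEP].
    Only the first subword of each word receives the word's label; the
    remaining subwords and special tokens are excluded from the loss.
    """
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(LABEL2ID[word_labels[word_id]])
        previous = word_id
    return aligned
```

With a fast Hugging Face tokenizer, `word_ids` can be obtained from `tokenizer(words, is_split_into_words=True).word_ids()`, and the aligned list is passed as `labels` to a token classification model.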

Supported Languages

The pipeline supports the following languages:
  • English (base language)
  • German
  • Italian
  • Spanish
  • Greek
  • Slovenian
  • Polish
  • Portuguese

Pre-Trained Models:

The pre-trained models are available on the Hugging Face Hub:

  • English: https://huggingface.co/HUMADEX/english_medical_ner
  • German: https://huggingface.co/HUMADEX/german_medical_ner
  • Italian: https://huggingface.co/HUMADEX/italian_medical_ner
  • Spanish: https://huggingface.co/HUMADEX/spanish_medical_ner
  • Greek: https://huggingface.co/HUMADEX/greek_medical_ner
  • Slovenian: https://huggingface.co/HUMADEX/slovenian_medical_ner
  • Polish: https://huggingface.co/HUMADEX/polish_medical_ner
  • Portuguese: https://huggingface.co/HUMADEX/portugese_medical_ner

Visit the weakly-supervised-multi-lingual-ner-pipeline collection on the Hugging Face Hub to see the models and datasets.
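The published models can be loaded with the `transformers` token classification pipeline. The following is a minimal sketch: the model identifiers are taken from the URLs listed above, while the helper names `load_ner` and `extract_entities` and the choice of `aggregation_strategy="simple"` are illustrative assumptions.

```python
from typing import Callable, Dict, List

# Model identifiers copied from the Hugging Face URLs listed above.
MODEL_IDS: Dict[str, str] = {
    "english": "HUMADEX/english_medical_ner",
    "german": "HUMADEX/german_medical_ner",
    "italian": "HUMADEX/italian_medical_ner",
    "spanish": "HUMADEX/spanish_medical_ner",
    "greek": "HUMADEX/greek_medical_ner",
    "slovenian": "HUMADEX/slovenian_medical_ner",
    "polish": "HUMADEX/polish_medical_ner",
    "portuguese": "HUMADEX/portugese_medical_ner",  # spelling as in the URL
}


def load_ner(language: str):
    """Load the NER pipeline for a language (downloads the model)."""
    from transformers import pipeline  # imported lazily

    return pipeline(
        "token-classification",
        model=MODEL_IDS[language.lower()],
        aggregation_strategy="simple",  # merge subwords into entity spans
    )


def extract_entities(ner: Callable, text: str) -> List[dict]:
    """Run the pipeline and keep the entity text, label, and score."""
    return [
        {
            "text": ent["word"],
            "label": ent["entity_group"],
            "score": float(ent["score"]),
        }
        for ent in ner(text)
    ]


if __name__ == "__main__":
    ner = load_ner("english")
    sample = "The patient complained of chest pain and was given aspirin."
    for entity in extract_entities(ner, sample):
        print(entity)
```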

Acknowledgement:

This code was created as part of joint research by the HUMADEX research group (https://www.linkedin.com/company/101563689/) and has received funding from the European Union Horizon Europe Research and Innovation Program project SMILE (grant number 101080923) and the Marie Skłodowska-Curie Actions (MSCA) Doctoral Networks project BosomShield (grant number 101073222). Responsibility for the information and views expressed herein lies entirely with the authors.

Please cite as:

Article title: Weakly-Supervised Multilingual Medical NER For Symptom Extraction For Low-Resource Languages
DOI: 10.20944/preprints202504.1356.v1
Website: https://www.preprints.org/manuscript/202504.1356/v1

Files

Files (122.1 kB)

md5:f4abb2432f28daa85d74973a6b5e2164

Additional details

Funding

European Commission
SMILE - Supporting Mental Health in Young People: Integrated Methodology for cLinical dEcisions and evidence-based interventions 101080923
European Commission
BosomShield - A comprehensive CAD system based on radiologic- and pathologic-image biomarkers for diagnosis and prognosis of breast cancer relapse 101073222

Software

Repository URL
https://github.com/HUMADEX/Weekly-Supervised-NER-pipline
Programming language
Python
Development Status
Active