MultiCardioNER Corpus: Multilingual Adaptation of Clinical NER Systems to the Cardiology Domain

Lima-López, Salvador; Farré-Maduell, Eulàlia; Rodríguez-Miret, Jan; Krallinger, Martin

doi:10.5281/zenodo.11368861

Published May 28, 2024 | Version v3

Dataset Open

1. Barcelona Supercomputing Center

MultiCardioNER

MultiCardioNER is a shared task on the adaptation of clinical NER systems to the cardiology domain. It combines two existing datasets (DisTEMIST for diseases and the newly released DrugTEMIST for medications) with a new, smaller dataset of cardiology clinical cases annotated using the same guidelines.

Participants are provided with DisTEMIST and DrugTEMIST as training data to use as they see fit (1,000 documents, with the original partitions splitting them into 750 for training and 250 for testing). The cardiology clinical cases (cardioccc) are meant to be used as a development or validation set (258 documents), although participants are encouraged to experiment freely with the documents and annotations. The evaluation is done on a separate collection of 250 cardiology clinical cases.

MultiCardioNER proposes two tracks:

- Track 1: Spanish adaptation of disease recognition systems to the cardiology domain.
- Track 2: Multilingual (Spanish, English and Italian) adaptation of medication recognition systems to the cardiology domain.

Please read the README file attached for more information on folder structure and file format.

MultiCardioNER was developed by the Barcelona Supercomputing Center's NLP for Biomedical Information Analysis group and used as part of BioASQ 2024. For more information on the corpus, annotation scheme and task in general, please visit: https://temu.bsc.es/multicardioner. This task is promoted by Spanish and European projects such as DataTools4Heart, AI4HF, BARITONE and AI4ProfHealth.

UPDATE MAY 28th 2024: The test set annotations are now out! We've also included the original background set files, as well as a file with the mappings from the masked filenames used during the evaluation phase to the original filenames. Please check the README for more information.

Resources

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Contact

If you have any questions or suggestions, please contact us at:

- Salvador Lima-López (<salvador [dot] limalopez [at] gmail [dot] com>)
- Martin Krallinger (<krallinger [dot] martin [at] gmail [dot] com>)

Additional resources and corpora

If you are interested in MultiCardioNER, you might want to check out these corpora and resources:

- DisTEMIST (Corpus of disease mentions and normalization to SNOMED CT)
- MedProcNER (Corpus of clinical procedure mentions and normalization to SNOMED CT)
- SympTEMIST (Corpus of clinical findings and normalization to SNOMED CT)
- PharmaCoNER (Corpus of medication, drug, chemical substance, gene, protein and vaccine mentions and normalization)
- MEDDOPROF (Corpus of mentions of professions, occupations and working status and normalization)
- MEDDOPLACE (Corpus of place-related entity mentions, including departments, nationalities, patient movements, etc., and normalization)
- MEDDOCAN (Corpus of mentions of Personal Health Identifiers (PHI))
- CANTEMIST (Corpus of cancer tumor morphology mentions and normalization)
- CodiESP (Corpus of clinical case reports with assigned clinical codes from ICD10, Spanish version)
- LivingNER (Corpus of mentions of species, including human/family members, pathogens, food, etc., and normalization to NCBI Taxonomy)
- SPACCC-POS (Corpus of clinical case reports in Spanish annotated with POS tags)
- SPACCC-TOKEN (Corpus of clinical case reports in Spanish annotated with token tags (word mention boundaries))
- SPACCC-SPLIT (Corpus of clinical case reports in Spanish annotated with sentence boundary tags)
- MESINESP-2 (Corpus of manually indexed records with DeCS/MeSH terms comprising scientific literature abstracts, clinical trials, and patent abstracts)

Files

Name	Size
multicardioner_train+dev+test+bg+mappings_240528.zip (md5:56055c807e2cc329ff5eeaf7e4f713ba)	98.4 MB
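
The MD5 checksum listed with the archive can be used to confirm that a download completed without corruption. The following is a minimal sketch, not an official tool: the helper names (`md5_of_file`, `verify_archive`) are hypothetical, and it assumes the archive was saved under its original name in the current working directory.

```python
import hashlib
from pathlib import Path

# Values copied from the file listing above.
ARCHIVE_NAME = "multicardioner_train+dev+test+bg+mappings_240528.zip"
EXPECTED_MD5 = "56055c807e2cc329ff5eeaf7e4f713ba"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read until handle.read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumption: the archive sits in the current working directory.
    archive = Path(ARCHIVE_NAME)
    if not archive.exists():
        print(f"{archive} not found; download it first")
    elif verify_archive(archive):
        print("MD5 checksum matches")
    else:
        print("MD5 checksum mismatch; the download may be corrupted")
```

Reading the file in chunks keeps memory use low for the 98.4 MB archive. A mismatch usually points to an incomplete or corrupted download, in which case fetching the file again is the simplest remedy.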

