SympTEMIST Corpus: Gold Standard annotations for clinical symptoms, signs and findings information extraction

Salvador Lima López; Luis Gascó Sánchez; Eulalia Farré; Laura Vigil Gimenez; Martin Krallinger

doi:10.5281/zenodo.8413866

Published October 6, 2023 | Version 2.2

1. Barcelona Supercomputing Center

SympTEMIST stands for Symptoms TExt MIning Shared Task. It is a shared task and set of resources focused on the detection, normalization and indexing of mentions of symptoms, signs and findings in Spanish medical documents. SympTEMIST complements the DisTEMIST (https://temu.bsc.es/distemist) and MedProcNER/ProcTEMIST (https://temu.bsc.es/medprocner) corpora, since all three use the same document collection.

If you use this dataset, please cite:

"Lima-López, S., Farré-Maduell, E., Gasco-Sánchez, L., Rodríguez-Miret, J. and Krallinger, M. (2023). Overview of SympTEMIST at BioCreative VIII: corpus, guidelines and evaluation of systems for the detection and normalization of symptoms, signs and findings from text. In: Proceedings of the BioCreative VIII Challenge and Workshop: Curation and Evaluation in the era of Generative Models."

This repository includes:

- Train and test sets for the three subtasks
- SympTEMIST gazetteer of SNOMED symptoms, signs and findings
- Multilingual Silver Standard in 9 languages:
  - English
  - Portuguese
  - French
  - Italian
  - Romanian
  - Catalan
  - Swedish
  - Dutch
  - Czech
- Background set of over 15,000 clinical cases

Please read the README file attached for more information on folder structure and file format.
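
Before unpacking the archive, it can help to see what it contains and to locate the README. The following is a minimal sketch using only the Python standard library. It assumes the archive has been downloaded to the working directory under the file name listed in the Files section. The helper names are illustrative, and the README's name and location inside the archive are assumptions, so the search simply matches any member whose file name starts with "readme".

```python
import zipfile
from pathlib import Path, PurePosixPath

# File name as listed in the Files section of this record.
ARCHIVE = Path("symptemist-train_all_subtasks+gazetteer+multilingual+test_all_subtasks+bg_231006.zip")


def top_level_entries(zip_path):
    """Return the sorted top-level folder/file names inside a ZIP archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted({PurePosixPath(name).parts[0]
                       for name in archive.namelist() if name})


def find_readmes(zip_path):
    """Return archive members whose file name starts with 'readme' (any case)."""
    with zipfile.ZipFile(zip_path) as archive:
        return [name for name in archive.namelist()
                if not name.endswith("/")
                and PurePosixPath(name).name.lower().startswith("readme")]


def read_member_text(zip_path, member, encoding="utf-8"):
    """Decode one archive member as text (UTF-8 is an assumption)."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.read(member).decode(encoding, errors="replace")


# Only runs when the archive is actually present locally.
if ARCHIVE.exists():
    print("Top-level entries:", top_level_entries(ARCHIVE))
    for readme in find_readmes(ARCHIVE):
        print(f"--- {readme} ---")
        print(read_member_text(ARCHIVE, readme)[:2000])
```

Reading the README straight from the archive avoids extracting the whole background set just to check the folder layout.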

SympTEMIST was developed by the Barcelona Supercomputing Center's NLP for Biomedical Information Analysis team and was used as part of BioCreative 2023. For more information on the corpus, the annotation scheme and the task in general, please visit https://temu.bsc.es/symptemist.

Resources:

Task web
BioCreative web
Annotation guidelines
BioCreative/AMIA workshop proceedings
Overview paper
YouTube videos (overview & teams)
SympTEMIST overview talk slides at BioCreative/AMIA workshop

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Contact

If you have any questions or suggestions, please contact us at:

- Salvador Lima-López (<salvador [dot] limalopez [at] gmail [dot] com>)
- Martin Krallinger (<krallinger [dot] martin [at] gmail [dot] com>)

Additional resources and corpora

If you are interested in SympTEMIST, you might want to check out these corpora and resources:

DisTEMIST (Corpus of disease mentions and normalization to SNOMED CT, same document collection)
MedProcNER (Corpus of clinical procedure mentions and normalization to SNOMED CT, same document collection)
PharmaCoNER (Corpus of medications, drugs, chemical substances, genes, proteins and vaccine mentions and normalization, same document collection)
MEDDOPROF (Corpus of mentions of professions, occupations and working status and normalization, different document collection with some overlapping documents)
MEDDOPLACE (Corpus of place-related entity mentions, including departments, nationalities and patient movements, and their normalization, different document collection with some overlapping documents)
MEDDOCAN (Corpus of mentions of Personal Health Identifiers (PHI), modified synthetic versions of the document collection)
CANTEMIST (Corpus of cancer tumor morphology mentions and normalization, different document collection)
CodiEsp (Corpus of clinical case reports with assigned clinical codes from ICD-10, Spanish version, same document collection)
LivingNER (Corpus of mentions of species, including humans/family members, pathogens and food, and normalization to NCBI Taxonomy, different document collection with some overlapping documents)
SPACCC-POS (Corpus of clinical case reports in Spanish annotated with POS-tags, same document collection)
SPACCC-TOKEN (Corpus of clinical case reports in Spanish annotated with token-tags (word mention boundaries), same document collection)
SPACCC-SPLIT (Corpus of clinical case reports in Spanish annotated with sentence boundary-tags, same document collection)
MESINESP-2 (Corpus of records manually indexed with DeCS/MeSH terms, comprising scientific literature abstracts, clinical trials and patent abstracts, different document collection)

Notes

Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).

Files

Files (52.6 MB)

Name	Size
symptemist-train_all_subtasks+gazetteer+multilingual+test_all_subtasks+bg_231006.zip (md5:28bd316f4ebdeaaa66ed237b20456567)	52.6 MB
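
The MD5 digest listed with the file can be used to confirm that a download is complete and uncorrupted. The sketch below uses Python's standard hashlib module. The file name and digest are copied from the listing above. The function names are illustrative, and the script assumes the archive sits in the current working directory.

```python
import hashlib
from pathlib import Path

# File name and MD5 digest as listed in the Files table of this record.
ARCHIVE = Path("symptemist-train_all_subtasks+gazetteer+multilingual+test_all_subtasks+bg_231006.zip")
EXPECTED_MD5 = "28bd316f4ebdeaaa66ed237b20456567"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path=ARCHIVE, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


# Only checks the real archive when it is present locally.
if ARCHIVE.exists():
    print("checksum OK" if verify_download() else "checksum mismatch")
else:
    print(f"{ARCHIVE.name} not found; download it first")
```

A mismatch usually points to an interrupted or altered download, in which case fetching the file again is the simplest fix.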

	All versions	This version
Views	4,473	642
Downloads	1,112	183
Data volume	285.4 GB	9.9 GB
