Published February 22, 2021 | Version 1
Dataset Open

CT-EBM-SP - Corpus of Clinical Trials for Evidence-Based-Medicine in Spanish

  • 1. CSIC (Spanish National Research Council)
  • 2. Medical Terminology Unit, Spanish Royal Academy of Medicine
  • 3. Centro de Salud Mental Retiro (Madrid)
  • 4. Computational Linguistics Laboratory, Universidad Autónoma de Madrid

Description

A collection of 1200 texts (292 173 tokens) on clinical trial studies and clinical trial announcements in Spanish:

- 500 abstracts from journals published under a Creative Commons license, e.g. those available in PubMed or the Scientific Electronic Library Online (SciELO).
- 700 clinical trial announcements published in the European Clinical Trials Register and the Repositorio Español de Estudios Clínicos.

Texts were annotated with entities from the following Unified Medical Language System (UMLS) semantic groups: anatomy (ANAT), pharmacological and chemical substances (CHEM), pathologies (DISO), and lab tests, diagnostic or therapeutic procedures (PROC). In total, 46 699 entities were annotated (13.98% are nested entities). Ten percent of the corpus was doubly annotated, and inter-annotator agreement (IAA) reached a mean F-measure of 85.65% (±4.79) under strict match and 93.94% (±3.31) under relaxed match.
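The difference between the strict and relaxed agreement figures can be illustrated with a minimal sketch. It assumes a common convention that this record does not itself define: a strict match requires identical span boundaries and label, while a relaxed match only requires overlapping spans with the same label. The `(start, end, label)` tuple layout, the function names, and the sample annotations are hypothetical and are not taken from the corpus or its tooling.

```python
from typing import Iterable, Tuple

# Hypothetical entity representation: (start offset, end offset exclusive, label)
Entity = Tuple[int, int, str]


def overlaps(a: Entity, b: Entity) -> bool:
    """True if both entities share a label and their character spans overlap."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def f_measure(ann_a: Iterable[Entity], ann_b: Iterable[Entity],
              relaxed: bool = False) -> float:
    """F-measure between two annotators, treating annotator B as the reference.

    Assumed definitions (not stated in the dataset record):
      strict  -> same start, same end, same label
      relaxed -> same label and any span overlap
    """
    set_a, set_b = set(ann_a), set(ann_b)
    if not set_a or not set_b:
        return 0.0
    if relaxed:
        matched_a = sum(any(overlaps(a, b) for b in set_b) for a in set_a)
        matched_b = sum(any(overlaps(b, a) for a in set_a) for b in set_b)
    else:
        matched_a = matched_b = len(set_a & set_b)
    precision = matched_a / len(set_a)
    recall = matched_b / len(set_b)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up annotations of the same text by two annotators.
annotator_a = [(0, 8, "DISO"), (12, 20, "CHEM"), (25, 30, "PROC")]
annotator_b = [(0, 8, "DISO"), (10, 20, "CHEM"), (40, 45, "ANAT")]

strict_f = f_measure(annotator_a, annotator_b)
relaxed_f = f_measure(annotator_a, annotator_b, relaxed=True)
```

In this toy case the CHEM entity differs only in its start offset, so it counts as a disagreement under strict matching but as an agreement under relaxed matching. This is why relaxed scores are typically the higher of the two.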

The corpus is freely distributed for research and educational purposes under a Creative Commons Non-Commercial Attribution (CC-BY-NC-A) License.

Files

CT-EBM-SP.zip (2.6 MB)
md5:5c0a3c52427c67eb62804a489590a90f
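After downloading the archive, its integrity can be checked against the MD5 checksum listed above. The following is a minimal sketch using only the Python standard library. The local path `CT-EBM-SP.zip` is an assumption about where the file was saved, and the helper names are illustrative.

```python
import hashlib
from typing import BinaryIO

# Checksum as listed on this record for CT-EBM-SP.zip
EXPECTED_MD5 = "5c0a3c52427c67eb62804a489590a90f"


def md5_hex(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Compare the MD5 digest of the file at `path` with the expected value."""
    with open(path, "rb") as handle:
        return md5_hex(handle) == expected.lower()


if __name__ == "__main__":
    # Assumed download location; adjust to wherever the archive was saved.
    archive = "CT-EBM-SP.zip"
    try:
        print("checksum OK" if verify_file(archive) else "checksum MISMATCH")
    except FileNotFoundError:
        print(f"{archive} not found; download it first")
```

Reading in chunks keeps memory use constant, although for a 2.6 MB archive reading the whole file at once would work equally well.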

Additional details

Funding

European Commission: InterTalentum – Programme for Post-Doctoral Talent Attraction to CEI UAM+CSIC (713366)