Published December 29, 2025 | Version v4
Dataset Open

CT-EBM-SP - Corpus of Clinical Trials for Evidence-Based-Medicine in Spanish (version 3)

  • 1. Consejo Superior de Investigaciones Científicas
  • 2. Spanish Royal Academy of Medicine
  • 3. Hospital General Universitario Gregorio Marañón
  • 4. Universitat de València
  • 5. Hospital Regional Universitario Carlos Haya
  • 6. Universidad de La Rioja

Description

A collection of 1200 texts (292173 tokens) on clinical trial studies and clinical trial announcements in Spanish:

- 500 abstracts from journals published under a Creative Commons license, e.g. those available in PubMed or the Scientific Electronic Library Online (SciELO).
- 700 clinical trial announcements published in the European Clinical Trials Register and the Repositorio Español de Estudios Clínicos.

Texts were annotated with the following entity types:

- Semantic groups from the Unified Medical Language System: ANAT, CHEM, DEVI, DISO, LIVB, PHYS and PROC.
- Medical drug information: Contraindicated, Dose_or_Strength, Form, and Route_or_Mode_of_administration.
- Temporal expressions: Age, Date, Duration_or_Interval, Frequency and Time.
- Miscellaneous medical entities: Concept, Food_or_Drink, Observation_or_Finding, Quantifier_or_Qualifier, and Result_or_Value.
- Negation/Speculation: Neg_cue, Negated, Spec_cue and Speculated.
- Attributes of temporality (Future, Family_history_of, and History_of), experiencer (Patient, Family_member and Other) and other information (Hypothetical).
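
The label inventory above can be kept as a small lookup table, for example to check labels while loading the corpus. This is only a sketch: the labels are copied from the list above, but the group keys and the names `ENTITY_GROUPS`, `LABEL_TO_GROUP` and `group_of` are hypothetical, and it assumes the labels are spelled the same way inside the corpus files.

```python
# Entity and attribute labels as listed in the description.
# The group keys are illustrative shorthand, not names used by the corpus.
ENTITY_GROUPS = {
    "umls_semantic_group": ["ANAT", "CHEM", "DEVI", "DISO", "LIVB", "PHYS", "PROC"],
    "drug_information": [
        "Contraindicated", "Dose_or_Strength", "Form",
        "Route_or_Mode_of_administration",
    ],
    "temporal_expression": ["Age", "Date", "Duration_or_Interval", "Frequency", "Time"],
    "miscellaneous": [
        "Concept", "Food_or_Drink", "Observation_or_Finding",
        "Quantifier_or_Qualifier", "Result_or_Value",
    ],
    "negation_speculation": ["Neg_cue", "Negated", "Spec_cue", "Speculated"],
    "attribute": [
        "Future", "Family_history_of", "History_of",
        "Patient", "Family_member", "Other", "Hypothetical",
    ],
}

# Reverse index: label -> group key.
LABEL_TO_GROUP = {
    label: group for group, labels in ENTITY_GROUPS.items() for label in labels
}


def group_of(label):
    """Return the group key for a label, or None if the label is not listed."""
    return LABEL_TO_GROUP.get(label)
```

A call such as `group_of("DISO")` returns the group key, and an unlisted label returns `None`, which makes unexpected labels easy to spot.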

In addition, the following semantic relationships were annotated: 

- Intervention-related relations
    • Has_Dose_or_Strength
    • Has_Drug_Form
    • Has_Route_or_Mode
    • Combined_with
    • Used_for
    • Has_Result_or_Value
- Temporal relations
    • Before
    • After
    • Overlap
    • Has_Age 
    • Has_Frequency
    • Has_Duration_or_Interval
- Event-related relations
    • Causes
    • Experiences
    • Has_Quantifier_or_Qualifier
    • Location_of
- Assertion relations
    • Negation
    • Speculation
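
As a rough illustration of how the relation inventory above might be used, the sketch below tallies relation labels by group. The relation names are taken from the list above; the group keys, the `"unknown"` bucket and the function `count_by_group` are hypothetical and are not taken from the companion repository.

```python
from collections import Counter

# Relation labels as listed in the description, keyed by illustrative group names.
RELATION_GROUPS = {
    "intervention": [
        "Has_Dose_or_Strength", "Has_Drug_Form", "Has_Route_or_Mode",
        "Combined_with", "Used_for", "Has_Result_or_Value",
    ],
    "temporal": [
        "Before", "After", "Overlap",
        "Has_Age", "Has_Frequency", "Has_Duration_or_Interval",
    ],
    "event": ["Causes", "Experiences", "Has_Quantifier_or_Qualifier", "Location_of"],
    "assertion": ["Negation", "Speculation"],
}

RELATION_TO_GROUP = {
    label: group for group, labels in RELATION_GROUPS.items() for label in labels
}


def count_by_group(relation_labels):
    """Count relation labels per group; unlisted labels go to 'unknown'."""
    counts = Counter()
    for label in relation_labels:
        counts[RELATION_TO_GROUP.get(label, "unknown")] += 1
    return counts
```

The input is any iterable of relation label strings, however they were read from the annotation files.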

81.75% of the total entities were normalized to Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs).
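
The normalization figure is a simple share: entities with a CUI divided by all entities. A minimal sketch of that arithmetic follows, assuming each entity is available as a `(label, cui)` pair with `None` for entities that were not normalized. The function name `normalization_rate` and the pair layout are hypothetical and say nothing about how the corpus files store CUIs.

```python
def normalization_rate(entities):
    """Return the percentage of entities that carry a CUI.

    `entities` is an iterable of (label, cui) pairs; cui is None or an
    empty string when the entity was not normalized.
    """
    entities = list(entities)
    if not entities:
        return 0.0
    normalized = sum(1 for _label, cui in entities if cui)
    return 100.0 * normalized / len(entities)


# Toy input with placeholder identifiers (not real annotations from the corpus).
toy_entities = [
    ("DISO", "C0000001"),
    ("CHEM", "C0000002"),
    ("PROC", "C0000003"),
    ("Quantifier_or_Qualifier", None),
]
toy_rate = normalization_rate(toy_entities)  # 3 of 4 entities have a CUI
```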

This is the final version, with the corrections made after each file was reviewed by a second reviewer.

Two annotators reviewed each corpus file.

Python code for relation extraction is available in the companion GitHub repository: https://github.com/lcampillos/ct-ebm-sp-v3

Files

CT-EBM-SP-v3.zip

Files (21.8 MB)

Name: CT-EBM-SP-v3.zip
Size: 21.8 MB
md5:cec1b6e9650bcda5d5bbeec1098b23e7
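
The md5 checksum listed for the archive can be used to confirm that a download is complete. A minimal sketch with Python's standard `hashlib` module follows; the function names `md5_of_file` and `is_intact` are hypothetical, and the local path is assumed to be wherever the archive was saved.

```python
import hashlib

# Checksum as published on this record for CT-EBM-SP-v3.zip.
EXPECTED_MD5 = "cec1b6e9650bcda5d5bbeec1098b23e7"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's md5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `is_intact("CT-EBM-SP-v3.zip")` would be called on the downloaded archive, and a `False` result suggests downloading it again.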

Additional details

Related works

Continues
Dataset: 10.5281/zenodo.13880599 (DOI)

Funding

Agencia Estatal de Investigación
CLARA-MeD, funded by MICIU/AEI/10.13039/501100011033 in project call "Proyectos I+D+i Retos Investigación" PID2020-116001RA-C33
Consejo Superior de Investigaciones Científicas
JAE Intro 2021
Agencia Estatal de Investigación
ExPlain4Health project, funded by MICIU/AEI/10.13039/501100011033 PID2024-158912NB-I00

Dates

Available
2025-12-29

Software

Repository URL
https://github.com/lcampillos/ct-ebm-sp-v3/
Programming language
Python
Development Status
Active