MESINESP2 Corpora: Annotated data for medical semantic indexing in Spanish

Gasco, Luis; Krallinger, Martin; Antonio, Miranda

doi:10.5281/zenodo.5602914

Published March 17, 2021 | Version 1.0.6

Dataset Open

1. Barcelona Supercomputing Center

Gold Standard annotations of the MESINESP2 corpora (training, development and test sets).

Please cite this paper if you use this dataset:

@inproceedings{gasco2021overview,
  title={Overview of BioASQ 2021-MESINESP track. Evaluation of advance hierarchical classification techniques for scientific literature, patents and clinical trials},
  author={Gasco, Luis and Nentidis, Anastasios and Krithara, Anastasia and Estrada-Zavala, Darryl and Murasaki, Renato Toshiyuki and Primo-Pe{\~n}a, Elena and Bojo Canales, Cristina and Paliouras, Georgios and Krallinger, Martin and others},
  year={2021},
  organization={CEUR Workshop Proceedings}
}

Introduction

The main aim of MESINESP2 is to promote the development of practically relevant semantic indexing tools for biomedical content in non-English languages. We have generated a manually annotated corpus in which domain experts have labeled a set of scientific literature, clinical trials, and patent abstracts. All documents were labeled with DeCS descriptors. DeCS is a structured controlled vocabulary created by BIREME to index scientific publications in BvSalud, the largest database of scientific documents in Spanish, which hosts records from LILACS, MEDLINE, and IBECS, among other databases.

The MESINESP track at BioASQ9 explores the efficiency of systems for assigning DeCS descriptors to different types of biomedical documents. To that end, we divided the task into three subtracks by document type, and for each one we generated an annotated corpus that was provided to the participating teams:

[Subtrack 1 corpus] MESINESP-L – Scientific Literature: It contains all records from the LILACS and IBECS databases at the Virtual Health Library (VHL) that have a non-empty abstract written in Spanish.
[Subtrack 2 corpus] MESINESP-T – Clinical Trials: It contains records from the Registro Español de Estudios Clínicos (REEC). REEC does not provide documents with the title/abstract structure needed in BioASQ, so we built artificial abstracts from the content available in the data crawled using the REEC API.
[Subtrack 3 corpus] MESINESP-P – Patents: This corpus includes patents in Spanish extracted from Google Patents which have the IPC codes “A61P” and “A61K31”.

In addition, we provide a set of complementary data: the DeCS terminology file, a silver standard with the participants' predictions for the task background set, and the mentions of medications, diseases, symptoms, and medical procedures extracted from the documents with the BSC NERs.

Files structure:

Silver_Standard_Mesinesp2.zip contains two separate sections. The first (folder join) is the union of the labels of the best model of each participating team, provided that model obtained an F-score of at least 0.2. The second (folder separated) contains the predictions of each participant's best model, included individually and anonymized. This silver standard contains a set of 8642 scientific articles, 1537 text sections from Clinical Practice Guidelines, a set of 8458 text segments from Medication Data Sheets, 461 clinical trials from REEC, and 5170 patents.
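
The construction of the join folder can be pictured with a minimal sketch. It is not the organizers' script: the in-memory layout (run name mapped to an F-score and per-document predictions) and the function name union_of_runs are assumptions made for illustration only.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical layout: run name -> (F-score, {document id -> predicted DeCS codes}).
Run = Tuple[float, Dict[str, List[str]]]


def union_of_runs(runs: Dict[str, Run], min_f: float = 0.2) -> Dict[str, Set[str]]:
    """Merge the labels of every run whose F-score is at least min_f."""
    merged: Dict[str, Set[str]] = {}
    for f_score, predictions in runs.values():
        if f_score < min_f:
            continue  # run is too weak to contribute to the silver standard
        for doc_id, codes in predictions.items():
            merged.setdefault(doc_id, set()).update(codes)
    return merged


# Toy runs with made-up scores; the DeCS codes are reused from the format example.
runs = {
    "team_a": (0.41, {"doc1": ["D006973", "D003924"]}),
    "team_b": (0.25, {"doc1": ["D003924", "D012307"], "doc2": ["D000959"]}),
    "team_c": (0.10, {"doc1": ["D002318"]}),  # below the threshold, ignored
}
silver = union_of_runs(runs)
print(sorted(silver["doc1"]))
```

A union favors recall over precision, which is worth keeping in mind when using the silver standard as training data.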

Subtrack1-Scientific_Literature.zip contains the corpora generated for subtrack 1. Content:

Subtrack1:
- Train
  - training_set_track1_all.json: Full training set for subtrack 1.
  - training_set_track1_only_articles.json: Articles training set for subtrack 1.
- Development
  - development_set_subtrack1.json: Manually annotated development set for subtrack 1.
- Test
  - test_set_subtrack1.json: Test set for subtrack 1.

Subtrack2-Clinical_Trials.zip contains the corpora generated for subtrack 2. Content:

Subtrack2:
- Train
  - training_set_subtrack2.json: Training set for subtrack 2.
- Development
  - development_set_subtrack2.json: Manually annotated development set for subtrack 2.
- Test
  - test_set_subtrack2.json: Test set for subtrack 2.

Subtrack3-Patents.zip contains the corpora generated for subtrack 3. Content:

Subtrack3:
- Development
  - development_set_subtrack3.json: Manually annotated development set for subtrack 3.
- Test
  - test_set_subtrack3.json: Test set for subtrack 3.

Additional data.zip contains the corpora with additional data for each subtrack of MESINESP2.

DeCS2020.tsv contains a DeCS table with the following structure:

DeCS code
Preferred descriptor (the preferred label in the Latin Spanish DeCS 2020 set)
List of synonyms (the descriptors and synonyms from the Latin Spanish DeCS 2020 set, separated by pipes)
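
A table with this three-column layout could be read roughly as follows. This is a sketch under assumptions: the columns are taken to be tab-separated in the order listed above, the sample rows are illustrative rather than copied from DeCS2020.tsv, and load_decs is a hypothetical helper name.

```python
import csv
import io

# Illustrative rows only (not copied from the real file): code, preferred
# descriptor, pipe-separated synonyms. The real file may start with a header row.
sample = (
    "D006973\tHipertensión\tHipertensión|Presión Arterial Alta\n"
    "D003924\tDiabetes Mellitus Tipo 2\tDiabetes Mellitus Tipo 2|Diabetes Tipo 2\n"
)


def load_decs(handle):
    """Return {code: {"preferred": str, "synonyms": [str, ...]}}."""
    table = {}
    # QUOTE_NONE keeps any quote characters inside labels untouched.
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        code, preferred, synonyms = row[0], row[1], row[2]
        table[code] = {
            "preferred": preferred,
            "synonyms": [s for s in synonyms.split("|") if s],
        }
    return table


decs = load_decs(io.StringIO(sample))
print(decs["D006973"]["synonyms"])
```

For the real file, the same function could be given open("DeCS2020.tsv", encoding="utf-8", newline=""). If a header row is present it would be stored like any other row and would need to be dropped.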

DeCS2020.obo contains the *.obo file with the hierarchical relationships between DeCS descriptors.

*Note: The obo and tsv files with DeCS2020 descriptors contain some additional COVID19 descriptors that will be included in future versions of DeCS. These items were provided by the Pan American Health Organization (PAHO), which has kindly shared this content to improve the results of the task by taking these descriptors into account.

Data format description

The input text files for the MESINESP track are JSON files with the following structure:

{
  "articles": [
    {
      "id": "ibc-FGT-907", 
      "title": "Metas de control de la presión arterial e impacto sobre desenlaces cardiovasculares 
               en pacientes con diabetes mellitus tipo 2: un análisis crítico de la literatura", 
      "abstractText": "La hipertensión arterial en individuos con diabetes mellitus tipo2 incrementa 
                      el riesgo de eventos cardiovasculares. Las guías internacionales de manejo                                             
                      recomiendan iniciar tratamiento farmacológico con valores de presión arterial 
                      >140/90mmHg Sin embargo, no existe un punto de corte óptimo a partir del cual 
                      se logre reducir los eventos cardiovasculares sin originar eventos adversos; 
                      un rango de presión arterial >130/80 y <140/90mmHg parece ser el adecuado. 
                      Estos valores pueden alcanzarse mediante intervenciones no farmacológicas 
                      (dieta, ejercicio) y farmacológicas (por fármacos que hayan demostrado reducir 
                      eventos cardiovasculares). La elección de uno o varios fármacos debe ser 
                      individualizada, de acuerdo con factores como etnia, edad, comorbilidades 
                      asociadas, entre otros", 
      "journal": "Clín. investig. arterioscler. (Ed. impr.)", 
      "year": 2019, 
      "db": "IBECS", 
      "decsCodes": [
        "D006973",
        "D000959",
        "D002318",
        "D003924",
        "D012307"
      ]
    }
  ]
}
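
As a minimal sketch of reading this structure, the snippet below parses a shortened copy of the record above and counts DeCS codes. The title and abstract are abbreviated here, and the long strings are kept on one line because JSON does not allow raw line breaks inside strings (the listing above is wrapped for display).

```python
import json
from collections import Counter

# Shortened copy of the example record; field names follow the structure above.
sample = """
{
  "articles": [
    {
      "id": "ibc-FGT-907",
      "title": "Metas de control de la presión arterial",
      "abstractText": "La hipertensión arterial en individuos con diabetes mellitus tipo2 incrementa el riesgo de eventos cardiovasculares.",
      "journal": "Clín. investig. arterioscler. (Ed. impr.)",
      "year": 2019,
      "db": "IBECS",
      "decsCodes": ["D006973", "D000959", "D002318", "D003924", "D012307"]
    }
  ]
}
"""

data = json.loads(sample)
articles = data["articles"]

# Count how often each DeCS code is assigned across the loaded records.
code_counts = Counter(code for article in articles for code in article["decsCodes"])

for article in articles:
    print(article["id"], article["db"], article["year"], len(article["decsCodes"]))
```

For a file from one of the archives, json.load on a handle opened with encoding="utf-8" would take the place of json.loads.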

MESINESP entity mention files contain automatically generated mention annotations of medications, diseases, symptoms, and medical procedures in the following JSON format:

{
  "articles": [
    {
      "id": "ibc-FGT-907",
      "diseases": [
        {"span": "hipertensión arterial", "start": "3", "end": "24"},
        {"span": "diabetes mellitus tipo2", "start": "43", "end": "66"},
        {"span": "eventos cardiovasculares", "start": "91", "end": "115"}
      ],
      "medications": [],
      "procedures": [],
      "symptoms": []
    }
  ]
}
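
In this example the offsets are stored as strings and appear to be character positions in the article's abstractText, with an exclusive end. That reading is an assumption based on this one record, not something the description states. The sketch below checks it against the first sentence of the abstract; check_mentions is a hypothetical helper.

```python
import json

# First sentence of the example abstract, on a single line.
abstract = (
    "La hipertensión arterial en individuos con diabetes mellitus tipo2 "
    "incrementa el riesgo de eventos cardiovasculares."
)

entities = json.loads("""
{
  "articles": [
    {
      "id": "ibc-FGT-907",
      "diseases": [
        {"span": "hipertensión arterial", "start": "3", "end": "24"},
        {"span": "diabetes mellitus tipo2", "start": "43", "end": "66"},
        {"span": "eventos cardiovasculares", "start": "91", "end": "115"}
      ],
      "medications": [],
      "procedures": [],
      "symptoms": []
    }
  ]
}
""")


def check_mentions(text, record):
    """Yield (entity type, span, True if text[start:end] equals the span)."""
    for entity_type in ("diseases", "medications", "procedures", "symptoms"):
        for mention in record.get(entity_type, []):
            start, end = int(mention["start"]), int(mention["end"])  # stored as strings
            yield entity_type, mention["span"], text[start:end] == mention["span"]


results = list(check_mentions(abstract, entities["articles"][0]))
for entity_type, span, ok in results:
    print(entity_type, repr(span), "matches" if ok else "does not match")
```

Running the same check over a whole file is a cheap way to confirm the offset convention before relying on it.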

Dataset description:
These corpora contain the data for each of the subtracks of the MESINESP2 shared task:

[Subtrack 1] MESINESP-L – Scientific Literature :
- Training set: It contains all records from the LILACS and IBECS databases at the Virtual Health Library (VHL) that have a non-empty abstract written in Spanish; empty and non-Spanish abstracts were filtered out. We built the training dataset from data crawled on 01/29/2021. The data is therefore a snapshot of that moment and may change over time, since LILACS and IBECS usually add or modify indexes after a record is first included in the database. We distribute two different datasets:
  - Articles training set: This corpus contains the set of 237574 Spanish scientific papers in VHL that have at least one DeCS code assigned to them.
  - Full training set: This corpus contains the whole set of 249474 Spanish documents from VHL that have at least one DeCS code assigned to them.
- Development set: We provide a development set manually indexed by our expert annotators (not VHL ones). This dataset includes 1065 articles annotated with DeCS by three expert indexers in this controlled vocabulary. The articles were initially indexed by 7 annotators. After analyzing the Inter-Annotator Agreement among their annotations, we selected the 3 best annotators and considered their annotations the valid ones to build the test set. From those 1065 records:
  - 213 articles were annotated by more than one annotator. For these, we took the union of the annotations.
  - 852 articles were annotated by only one of the three selected annotators with better performance.
- Test set: We provide a test set containing 491 abstracts from LILACS and IBECS. We used this subset to evaluate the participating systems.
[Subtrack 2] MESINESP-T- Clinical Trials:
- Training set: The training dataset contains records from the Registro Español de Estudios Clínicos (REEC). REEC does not provide documents with the title/abstract structure needed in BioASQ, so we built artificial abstracts from the content available in the data crawled using the REEC API. Because clinical trials are not indexed with DeCS terminology, we used as training data a set of 3560 clinical trials that were automatically annotated in the first edition of MESINESP and published as a Silver Standard outcome. Because the performance of the participants' models varied, we selected only predictions from runs with a MiF higher than 0.41, which corresponds to the submission of the best team.
- Development set: We provide a development set manually indexed by expert annotators. This dataset includes 147 clinical trials annotated with DeCS by seven expert indexers in this controlled vocabulary.
- Test set: The test dataset contains a collection of 248 items. We used this subset to evaluate the participating systems.
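
MiF is read here as the micro-averaged F-measure, where true positives, false positives, and false negatives are pooled over all documents before precision and recall are computed. The sketch below illustrates that definition on toy data. It is not the official BioASQ evaluation code, and the function name micro_prf is hypothetical.

```python
from typing import Dict, Set, Tuple


def micro_prf(gold: Dict[str, Set[str]], pred: Dict[str, Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F-measure over the gold documents."""
    tp = fp = fn = 0
    for doc_id, gold_codes in gold.items():
        pred_codes = pred.get(doc_id, set())  # a missing prediction counts as empty
        tp += len(gold_codes & pred_codes)
        fp += len(pred_codes - gold_codes)
        fn += len(gold_codes - pred_codes)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_measure


# Toy example: 3 correct codes, 1 spurious code, 1 missed code.
gold = {"doc1": {"D006973", "D003924", "D012307"}, "doc2": {"D000959"}}
pred = {"doc1": {"D006973", "D003924", "D002318"}, "doc2": {"D000959"}}
precision, recall, f_measure = micro_prf(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} MiF={f_measure:.2f}")
```

Predictions for documents that are absent from the gold set are ignored in this sketch.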
[Subtrack 3] MESINESP-P – Patents:
- Development set: We provide a development set manually indexed by expert annotators. This dataset includes 115 patents in Spanish extracted from Google Patents which have the IPC codes “A61P” and “A61K31”. We selected these patents based on semantic similarity to the MESINESP-L training set to facilitate model generation and to try to improve model performance.
- Test set: We provide a test set containing 119 records that correspond to a subset of patents published in Spanish with the IPC codes “A61P” and “A61K31”. As with the development set, we selected these records based on semantic similarity to the MESINESP-L training set. We used this subset to evaluate the participating systems.
Additional data:
- We provide this information to the participants as additional data in the “Additional Data” folder. For each training, development, and test set there is an additional JSON file with the structure shown here. Each file contains entities related to medications, diseases, symptoms, and medical procedures extracted with the BSC NERs.

Summary statistics:

MESINESP Corpus statistics
MESINESP-L	Docs	DeCS	Unique DeCS	Tokens
Training	237574	1988684	22434	43106663
Development	1065	11283	3750	211420
Test	491	5398	2124	93645
Total	239130	2005365	22482	43411728
MESINESP-T
Training	3560	52257	3940	4133166
Development	147	2038	771	146791
Test	248	3271	905	267031
Total	3955	57566	4410	4546988
MESINESP-P
Development	109	1092	520	38564
Test	119	1176	629	9065
Total	228	2268	989	47629

General MESINESP Corpus statistics
MESINESP	Docs	DeCS	Unique DeCS	Tokens
MESINESP-L	239130	2005365	22482	43411728
MESINESP-T	3955	57566	4410	4546988
MESINESP-P	228	2268	989	47629
Total	243313	2065199	22641	48006345
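
Per-document averages follow directly from these figures. The sketch below copies the three per-corpus rows of the general table and derives DeCS codes and tokens per document. Nothing in it goes beyond arithmetic on the numbers above.

```python
# Rows copied from the general statistics table: name, docs, DeCS, unique DeCS, tokens.
rows = """MESINESP-L\t239130\t2005365\t22482\t43411728
MESINESP-T\t3955\t57566\t4410\t4546988
MESINESP-P\t228\t2268\t989\t47629"""

stats = {}
total_docs = 0
for line in rows.splitlines():
    name, docs, decs, _unique, tokens = line.split("\t")
    total_docs += int(docs)
    stats[name] = {
        "codes_per_doc": int(decs) / int(docs),
        "tokens_per_doc": int(tokens) / int(docs),
    }

for name, values in stats.items():
    print(f"{name}: {values['codes_per_doc']:.2f} DeCS codes/doc, "
          f"{values['tokens_per_doc']:.1f} tokens/doc")
print("documents in total:", total_docs)
```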

Related resources:

For further information, please email us at luis.gasco@bsc.es

Notes

Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).

Files

Files (335.7 MB)

Name	MD5	Size
Additional data.zip	d78b0bfe07dcce33e3a9452fee417032	33.8 MB
DeCS2020.obo	7d2e94715515a27564322d5fb3f09b74	21.4 MB
DeCS2020.tsv	8c25bd99a5323bbc5678337eb551e4b8	8.1 MB
Silver_Standard_Mesinesp2.zip	cccf81d817ff3a66c6a22e18a535e42f	41.0 MB
Subtrack1-Scientific_Literature.zip	1efac9fe52246e858521f77f34877182	222.4 MB
Subtrack2-Clinical_Trials.zip	f94c1a124285f56374e89c698d02e162	9.0 MB
Subtrack3-Patents.zip	346d341caf281268a951f76377e5dae0	74.1 kB
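
The MD5 checksums listed above can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard hashlib module. The helper names md5_of_file and check_downloads are hypothetical, and the files are assumed to be in the given folder under the names shown in the list.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file list above.
EXPECTED_MD5 = {
    "Additional data.zip": "d78b0bfe07dcce33e3a9452fee417032",
    "DeCS2020.obo": "7d2e94715515a27564322d5fb3f09b74",
    "DeCS2020.tsv": "8c25bd99a5323bbc5678337eb551e4b8",
    "Silver_Standard_Mesinesp2.zip": "cccf81d817ff3a66c6a22e18a535e42f",
    "Subtrack1-Scientific_Literature.zip": "1efac9fe52246e858521f77f34877182",
    "Subtrack2-Clinical_Trials.zip": "f94c1a124285f56374e89c698d02e162",
    "Subtrack3-Patents.zip": "346d341caf281268a951f76377e5dae0",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large archives are not loaded into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(folder="."):
    """Return {file name: True/False} for every listed file found in folder."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(folder) / name
        if path.exists():
            results[name] = md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in check_downloads().items():
        print(name, "OK" if ok else "MISMATCH")
```

Files that have not been downloaded are skipped rather than reported as mismatches.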
