Published May 14, 2020 | Version 1.0
Dataset Open

MESINESP: Medical Semantic Indexing in Spanish - Train dataset

  • 1. Barcelona Supercomputing Center

Description

Please use the MESINESP2 corpus (the second edition of the shared task), since it is more carefully curated, of higher quality, and organized by document type (scientific articles, patents and clinical trials).

 

 

INTRODUCTION:

The MESINESP (Spanish BioASQ track, see https://temu.bsc.es/mesinesp) training set has a total of 369,368 records.

The training dataset contains all records from the LILACS and IBECS databases at the Virtual Health Library (VHL) that have a non-empty abstract written in Spanish. The URL used to retrieve the records is as follows:
http://pesquisa.bvsalud.org/portal/?output=xml&lang=es&sort=YEAR_DESC&format=abstract&filter[db][]=LILACS&filter[db][]=IBECS&q=&index=tw&

We have filtered out empty abstracts and non-Spanish abstracts. 

The training dataset was crawled on October 22, 2019. The data is therefore a snapshot taken at that moment and may change over time. In fact, the data is very likely to undergo minor changes, since the different databases that make up LILACS and IBECS may add or modify the indexes.

 

ZIP STRUCTURE:

The training data sets contain 369,368 records from 26,609 different journals. Two different data sets are distributed as described below:

 - Original Train set with 369,368 records that also include the qualifiers, as retrieved from VHL. 
 - Pre-processed Train set with the 318,658 records with at least one DeCS code and with no qualifiers. 
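The first condition of the pre-processed set (at least one DeCS code per record) can be sketched as below. This is only an illustration, not the script used to build the distributed files: the function name keep_indexed is hypothetical, the record layout is assumed to match the one given under CORPUS FORMAT, and qualifier removal is not shown because the source does not describe how qualifiers are encoded.

```python
from typing import Any, Dict, List


def keep_indexed(articles: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the records that carry at least one DeCS code.

    Assumes each record stores its codes in a "decsCodes" list; records
    with a missing or empty list are dropped.
    """
    return [article for article in articles if article.get("decsCodes")]


# Made-up records, for illustration only.
sample = [
    {"id": "a", "decsCodes": ["code1", "code2"]},
    {"id": "b", "decsCodes": []},
    {"id": "c"},
]
indexed = keep_indexed(sample)
```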

 

 

STATISTICS:

Abstracts’ length (measured in characters)
Min: 12
Avg: 1140.41
Median: 1094
Max: 9428

Number of DeCS codes per file
Min: 1
Avg: 8.12
Median: 7
Max: 53
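Figures of this kind could be recomputed from a loaded file along the following lines. This is a minimal sketch under assumptions: the helper names summarize and corpus_statistics are hypothetical, records are assumed to follow the layout given under CORPUS FORMAT, and abstract length is taken as the number of characters reported by Python's len, which may not match the original counting method exactly.

```python
from statistics import mean, median
from typing import Any, Dict, List, Sequence


def summarize(values: Sequence[int]) -> Dict[str, float]:
    """Return min, average (rounded to 2 decimals), median and max."""
    return {
        "min": min(values),
        "avg": round(mean(values), 2),
        "median": median(values),
        "max": max(values),
    }


def corpus_statistics(articles: List[Dict[str, Any]]) -> Dict[str, Dict[str, float]]:
    """Summarize abstract lengths and DeCS code counts for a list of records."""
    lengths = [len(article["abstractText"]) for article in articles]
    code_counts = [len(article["decsCodes"]) for article in articles]
    return {
        "abstract_length": summarize(lengths),
        "decs_codes": summarize(code_counts),
    }
```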

 

 

CORPUS FORMAT:

The training data sets are distributed as a JSON file with the following format:

{
  "articles": [
    {
      "id": "Id of the article",
      "title": "Title of the article",
      "abstractText": "Content of the abstract",
      "journal": "Name of the journal",
      "year": 2018,
      "db": "Name of the database",
      "decsCodes": [
        "code1",
        "code2",
        "code3"
      ]
    }
  ]
}
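A file in this format can be read with a standard JSON parser. The sketch below is illustrative only: the function names parse_articles and load_articles are hypothetical, and the name of the JSON file inside the ZIP is not stated here, so the path must be supplied by the reader.

```python
import json
from typing import Any, Dict, List


def parse_articles(text: str) -> List[Dict[str, Any]]:
    """Parse JSON text in the format above and return the list of records."""
    return json.loads(text)["articles"]


def load_articles(path: str) -> List[Dict[str, Any]]:
    """Read a JSON file from disk (path chosen by the reader) and parse it."""
    with open(path, encoding="utf-8") as handle:
        return parse_articles(handle.read())


# Made-up record mirroring the placeholder values of the format description.
SAMPLE = """
{"articles": [{"id": "Id of the article",
               "title": "Title of the article",
               "abstractText": "Content of the abstract",
               "journal": "Name of the journal",
               "year": 2018,
               "db": "Name of the database",
               "decsCodes": ["code1", "code2", "code3"]}]}
"""
records = parse_articles(SAMPLE)
```

Note that the whole file is read into memory at once, which for a corpus of this size may require a few gigabytes of RAM; a streaming JSON parser is an alternative if that is a concern.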

Note that the decsCodes field lists the DeCS IDs assigned to a record in the source data. Since the original XML data contain descriptors rather than codes, we provide a DeCS conversion table (https://temu.bsc.es/mesinesp/wp-content/uploads/2019/12/DeCS.2019.v5.tsv.zip) with:

 - DeCS codes
 - Preferred descriptor (the label used in the European DeCS 2019 set)
 - List of synonyms (the descriptors and synonyms from both European and Latin Spanish DeCS 2019 data sets, separated by pipes)
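The conversion table can be turned into a lookup from descriptor text to DeCS code. The sketch below rests on assumptions that should be checked against the actual file: it assumes three tab-separated columns in the order listed above (code, preferred descriptor, pipe-separated synonyms), and the function name build_term_index is hypothetical. If the file starts with a header row, that row should be skipped before calling the function.

```python
import csv
from typing import Dict, Iterable


def build_term_index(lines: Iterable[str]) -> Dict[str, str]:
    """Map each lowercased descriptor or synonym to its DeCS code.

    Assumed column order: code, preferred descriptor, synonyms ("|"-separated).
    When a term appears under several codes, the first occurrence wins.
    """
    index: Dict[str, str] = {}
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        code, preferred = row[0], row[1]
        synonyms = row[2].split("|") if len(row) > 2 and row[2] else []
        for term in [preferred, *synonyms]:
            key = term.strip().lower()
            if key:
                index.setdefault(key, code)
    return index


# Made-up rows, for illustration only.
sample_rows = [
    "code1\tDescriptor Uno\tSinonimo A|Sinonimo B",
    "code2\tDescriptor Dos\t",
]
term_index = build_term_index(sample_rows)
```

Lowercasing the keys is a design choice for case-insensitive lookup; it is not something the table itself prescribes.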

 

For more details on the Latin and European Spanish DeCS codes, see http://decs.bvs.br and http://decses.bvsalud.org/, respectively.

Please cite: Krallinger M, Krithara A, Nentidis A, Paliouras G, Villegas M. BioASQ at CLEF2020: Large-Scale Biomedical Semantic Indexing and Question Answering. In European Conference on Information Retrieval 2020 Apr 14 (pp. 550-556). Springer, Cham.

 

Copyright (c) 2020 Secretaría de Estado de Digitalización e Inteligencia Artificial

Notes

Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).

Files

MESINESP-training.zip

Files (308.8 MB)

MESINESP-training.zip (308.8 MB)
md5:0f11cbdb48e9e406086d0a633c413db0
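After downloading, the archive can be checked against the MD5 checksum listed above. The following is a minimal sketch using Python's standard hashlib module; the function names md5_of_stream and verify_download are hypothetical, and the local path of the downloaded file is whatever the reader chose.

```python
import hashlib
from typing import BinaryIO

# Checksum as listed for MESINESP-training.zip on this page.
EXPECTED_MD5 = "0f11cbdb48e9e406086d0a633c413db0"


def md5_of_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a binary stream, reading it in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file at `path` matches the expected checksum."""
    with open(path, "rb") as handle:
        return md5_of_stream(handle) == expected
```

Reading in chunks keeps memory use low for a file of roughly 300 MB.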