{
  "DOI": "10.5281/zenodo.4612275",
  "abstract": "Annotated corpora for MESINESP2 shared-task (Spanish BioASQ track, see https://temu.bsc.es/mesinesp2). BioASQ 2021 will be held at CLEF 2021 (scheduled in Bucharest, Romania in September)\u00a0http://clef2021.clef-initiative.eu/\u00a0\n\n\nIntroduction:\nThese corpora contain the data for each of the sub-tracks of MESINESP2 shared-task:\n\n\n\n\t\nTrack 1- Medical indexing: \u00a0\n\n\t\n\n\t\t\nTraining set: It contains all spanish records from LILACS and IBECS databases at the Virtual Health Library (VHL) with non-empty abstract written in Spanish.\u00a0We have filtered out empty abstracts and non-Spanish abstracts.\u00a0\u00a0We have built the training dataset with the data crawled on 01/29/2021. This means that the data is a snapshot of that moment and that may change over time since LILACS and IBECS usually add or modify indexes after the first inclusion in the database.\u00a0We distribute two different datasets:\n\n\t\t\n\n\t\t\t\nArticles training set:\u00a0This corpus contains the set of 237574 Spanish scientific papers in VHL that have at least one DeCS code assigned to them.\n\t\t\t\nFull training set: This corpus contains the whole set of 249474 Spanish documents from VHL that have at leas one DeCS code assigned to them.\n\t\t\n\t\t\n\t\t\nDevelopment set:\u00a0We provide a development set manually indexed by expert annotators. This dataset includes 1065 articles annotated with DeCS by three expert indexers in this controlled vocabulary. The articles were initially indexed by 7 annotators, after analyzing the Inter-Annotator Agreement among their annotations we decided to select the 3 best ones, considering their annotations the valid ones to build the test set. From those 1065 records:\n\t\t\n\n\t\t\t\n213 articles were annotated by more than one annotator. 
We have selected the union of their annotations.\n\t\t\t\n852 articles were annotated by only one of the three selected, best-performing annotators.\n\t\t\n\t\t\n\t\t\nTest set: To be published\n\t\n\t\n\t\nTrack 2 - Clinical trials:\n\t\n\n\t\t\nTraining set: The training dataset contains records from Registro Espa\u00f1ol de Estudios Cl\u00ednicos (REEC). REEC does not provide documents with the title/abstract structure needed in BioASQ; for that reason, we have built artificial abstracts from the content available in the data crawled using the REEC API. Clinical trials are not indexed with DeCS terminology, so we have used as training data a set of 3592 clinical trials that were automatically annotated in the first edition of MESINESP and published as a Silver Standard outcome. Because the performance of the models used by the participants was variable, we have only selected predictions from runs with a MiF higher than 0.30, which corresponds to the submissions of the three best teams. We have selected the union of all codes assigned by those teams.\n\t\t\nDevelopment set: We provide a development set manually indexed by expert annotators. This dataset includes 147 clinical trials annotated with DeCS by seven expert indexers in this controlled vocabulary.\n\t\n\t\n\t\nTrack 3 - Patents: To be published\n\n\n\nFile structure:\n\n\nMESINESP2_corpus.zip contains the corpora generated for the shared task. 
Content:\n\n\n\n\t\nSubtrack1:\n\t\n\n\t\t\nTrain\n\t\t\n\n\t\t\t\ntraining_set_track1_all.json: Full training set for sub-track 1.\n\t\t\t\ntraining_set_track1_only_articles.json: Articles training set for sub-track 1.\n\t\t\n\t\t\n\t\t\nTest\n\t\t\n\n\t\t\t\ndevelopment_set_subtrack1.json: Manually annotated development set for sub-track 1.\n\t\t\n\t\t\n\t\n\t\n\t\nSubtrack2:\n\t\n\n\t\t\nTrain\n\t\t\n\n\t\t\t\ntraining_set_subtrack2.json: Training set for sub-track 2.\n\t\t\n\t\t\n\t\t\nTest\n\t\t\n\n\t\t\t\ndevelopment_set_subtrack2.json: Manually annotated development set for sub-track 2.\n\t\t\n\t\t\n\t\n\t\n\t\nSubtrack3: This folder is empty. Data for sub-track 3 will be published soon.\n\n\n\nDeCS2020.tsv contains a DeCS table with the following structure:\n\n\n\n\t\nDeCS code\n\t\nPreferred descriptor (the preferred label in the Latin Spanish DeCS 2020 set)\n\t\nList of synonyms (the descriptors and synonyms from Latin Spanish DeCS 2020, separated by pipes)\n\n\n\nDeCS2020.obo is the *.obo file containing the hierarchical relationships between DeCS descriptors.\n\n\nFor further information, please visit https://temu.bsc.es/mesinesp2/ or email us at encargo-pln-life@bsc.es",
  "author": [
    {
      "family": "Gasco",
      "given": "Luis"
    },
    {
      "family": "Krallinger",
      "given": "Martin"
    }
  ],
  "id": "4612275",
  "issued": {
    "date-parts": [
      [
        "2021",
        "03",
        "17"
      ]
    ]
  },
  "language": "spa",
  "publisher": "Zenodo",
  "title": "MESINESP2 Corpora: Annotated data for medical semantic indexing in Spanish",
  "type": "dataset",
  "version": "1.0.0"
}