Dataset Open Access

MESINESP: Medical Semantic Indexing in Spanish - Development dataset

Martin Krallinger; Aitor Gonzalez-Agirre; Alejandro Asensio


JSON Export

{
  "files": [
    {
      "links": {
        "self": "https://zenodo.org/api/files/be9a0f75-a6ac-43a5-b274-c48ab42ab3cd/mesinesp-development-set.zip"
      }, 
      "checksum": "md5:58da931670a51b078cc5e193aa9d91e1", 
      "bucket": "be9a0f75-a6ac-43a5-b274-c48ab42ab3cd", 
      "key": "mesinesp-development-set.zip", 
      "type": "zip", 
      "size": 1027643
    }
  ], 
  "owners": [
    55928
  ], 
  "doi": "10.5281/zenodo.3746596", 
  "stats": {
    "version_unique_downloads": 44.0, 
    "unique_views": 244.0, 
    "views": 284.0, 
    "version_views": 284.0, 
    "unique_downloads": 44.0, 
    "version_unique_views": 244.0, 
    "volume": 48299221.0, 
    "version_downloads": 47.0, 
    "downloads": 47.0, 
    "version_volume": 48299221.0
  }, 
  "links": {
    "doi": "https://doi.org/10.5281/zenodo.3746596", 
    "conceptdoi": "https://doi.org/10.5281/zenodo.3746595", 
    "bucket": "https://zenodo.org/api/files/be9a0f75-a6ac-43a5-b274-c48ab42ab3cd", 
    "conceptbadge": "https://zenodo.org/badge/doi/10.5281/zenodo.3746595.svg", 
    "html": "https://zenodo.org/record/3746596", 
    "latest_html": "https://zenodo.org/record/3746596", 
    "badge": "https://zenodo.org/badge/doi/10.5281/zenodo.3746596.svg", 
    "latest": "https://zenodo.org/api/records/3746596"
  }, 
  "conceptdoi": "10.5281/zenodo.3746595", 
  "created": "2020-04-09T17:07:22.779528+00:00", 
  "updated": "2020-06-10T12:50:28.004389+00:00", 
  "conceptrecid": "3746595", 
  "revision": 7, 
  "id": 3746596, 
  "metadata": {
    "access_right_category": "success", 
    "doi": "10.5281/zenodo.3746596", 
    "description": "<p><strong>Introduction</strong></p>\n\n<p>The Mesinesp (Spanish BioASQ track, see https://temu.bsc.es/mesinesp) development set has a total of 750 records indexed manually by seven experienced medical literature indexers. Indexing is done using <em>DeCS codes, a sort of Spanish equivalent to MeSH terms</em>. Records were distributed in a way that each article was annotated, at least, by two different human indexers.</p>\n\n<p>The data annotation process consisted in two steps:</p>\n\n<ol>\n\t<li>Manual indexing step. DeCS codes were manually assigned to each record following the DeCS manual indexing guidelines.</li>\n\t<li>Manual validation and consensus. The joined set of manually indexed DeCS codes generated by both indexers were manually revised and corrections were done.</li>\n</ol>\n\n<p>These annotations were analyzed, resulting in an agreement using the Jaccard index.</p>\n\n<p>Records consisted basically in medical literature abstracts and titles from the IBECS and LILACS databases.</p>\n\n<p><strong>Zip structure</strong><br>\nThe zip file contains two different development sets:</p>\n\n<ul>\n\t<li><em>Official development set</em>, which has the union of the annotations, with an agreement of macro = 0.6568 and micro = 0.6819. This set is composed by all the different (unique) DeCS codes that have been added by any annotator for each document; and</li>\n\t<li><em>Core-descriptors development set</em>, which has the intersection of the annotations, with an agreement of macro = 1.0 and micro = 1.0. This set is composed of the common DeCS codes that have been added by two or more annotators for each document.</li>\n</ul>\n\n<p><strong>Corpus format</strong></p>\n\n<p>Each dataset is a JSON object with one single key named &quot;articles&quot;, which contains a list of documents. 
So, the raw format of the file is one line per document plus two additional lines (the first and the last) to enclose that list of documents and the expected type of data is as follows:</p>\n\n<pre><code class=\"language-json\">{\"articles\":[\n{\"abstractText\":str,\"db\":str,\"decsCodes\":list,\"id\":str,\"journal\":str,\"title\":str,\"year\":int},\n...\n]}</code></pre>\n\n<p>To clarify, the order of appearance of the fields in each document is as follows (note that this example it is pretty printed for readability purposes):</p>\n\n<pre><code class=\"language-json\">{\n  \"articles\": [\n    {\n      \"abstractText\": \"Content of the abstract\",\n      \"db\": \"Name of the source database\",\n      \"decsCodes\": [\n        \"code1\",\n        \"code2\",\n        \"code3\"\n      ],\n      \"id\": \"Id of the document\",\n      \"journal\": \"Name of the journal\",\n      \"title\": \"Title of the document\",\n      \"year\": 2019\n    }\n  ]\n}</code></pre>\n\n<p>Note: The fields &quot;db&quot;, &quot;journal&quot; and &quot;year&quot; might&nbsp;be null.</p>", 
    "language": "spa", 
    "title": "MESINESP: Medical Semantic Indexing in Spanish - Development dataset", 
    "license": {
      "id": "CC-BY-4.0"
    }, 
    "notes": "Funded by the Plan de Impulso de las Tecnolog\u00edas del Lenguaje (Plan TL).", 
    "relations": {
      "version": [
        {
          "count": 1, 
          "index": 0, 
          "parent": {
            "pid_type": "recid", 
            "pid_value": "3746595"
          }, 
          "is_last": true, 
          "last_child": {
            "pid_type": "recid", 
            "pid_value": "3746596"
          }
        }
      ]
    }, 
    "communities": [
      {
        "id": "medicalnlp"
      }
    ], 
    "version": "1.0.0", 
    "references": [
      "Krallinger M, Krithara A, Nentidis A, Paliouras G, Villegas M. BioASQ at CLEF2020: Large-Scale Biomedical Semantic Indexing and Question Answering. InEuropean Conference on Information Retrieval 2020 Apr 14 (pp. 550-556). Springer, Cham."
    ], 
    "keywords": [
      "indexing", 
      "decs"
    ], 
    "publication_date": "2020-04-09", 
    "creators": [
      {
        "orcid": "0000-0002-2646-8782", 
        "affiliation": "Barcelona Supercomputing Center", 
        "name": "Martin Krallinger"
      }, 
      {
        "affiliation": "Barcelona Supercomputing Center", 
        "name": "Aitor Gonzalez-Agirre"
      }, 
      {
        "affiliation": "Barcelona Supercomputing Center", 
        "name": "Alejandro Asensio"
      }
    ], 
    "access_right": "open", 
    "resource_type": {
      "type": "dataset", 
      "title": "Dataset"
    }, 
    "related_identifiers": [
      {
        "scheme": "doi", 
        "identifier": "10.5281/zenodo.3746595", 
        "relation": "isVersionOf"
      }
    ]
  }
}
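
Usage sketch

The corpus format and the agreement figures in the description can be illustrated with a short Python sketch. It is not the authors' code. The sample document, the annotator code sets in PAIRS, and the function names are invented for illustration. The record does not define "macro" and "micro"; the sketch assumes macro is the mean of per-document Jaccard scores and micro is the pooled intersection size divided by the pooled union size.

```python
import json

# Hypothetical one-document corpus in the layout given under "Corpus format".
SAMPLE = """{"articles":[
{"abstractText":"Content of the abstract","db":null,"decsCodes":["code1","code2"],"id":"doc-1","journal":null,"title":"Title of the document","year":null}
]}"""

# Hypothetical DeCS code sets from two annotators for two documents.
PAIRS = [
    ({"c1", "c2", "c3", "c4"}, {"c1", "c2", "c3"}),
    ({"c5", "c6"}, {"c6"}),
]


def load_articles(raw):
    """Parse the corpus text and return the list stored under "articles".

    "db", "journal" and "year" may be null, so they come back as None
    (a missing key is also filled in as None).
    """
    articles = json.loads(raw)["articles"]
    for article in articles:
        for field in ("db", "journal", "year"):
            article.setdefault(field, None)
    return articles


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b|; two empty sets count as full agreement."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def agreement(pairs):
    """Return (macro, micro) agreement under the assumed definitions above."""
    pairs = [(set(a), set(b)) for a, b in pairs]
    macro = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    pooled_intersection = sum(len(a & b) for a, b in pairs)
    pooled_union = sum(len(a | b) for a, b in pairs)
    micro = pooled_intersection / pooled_union if pooled_union else 1.0
    return macro, micro


if __name__ == "__main__":
    for article in load_articles(SAMPLE):
        print(article["id"], article["decsCodes"], article["year"])

    print(agreement(PAIRS))

    # Keeping only the codes both annotators chose (the "core-descriptors"
    # idea) makes the two sets identical for every document.
    core = [(a & b, a & b) for a, b in PAIRS]
    print(agreement(core))
```

Under these assumed definitions, restricting each document to the intersection of its annotations always gives an agreement of 1.0, which is consistent with the macro = 1.0 and micro = 1.0 reported for the core-descriptors set.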