Dataset Open Access

DrugProt corpus: Biocreative VII Track 1 - Text mining drug and chemical-protein interactions

Krallinger, Martin; Rabal, Obdulia; Miranda-Escalada, Antonio; Valencia, Alfonso


JSON Export

{
  "files": [
    {
      "links": {
        "self": "https://zenodo.org/api/files/ecaec607-d38a-485e-8a3b-cc5abcd32ac6/drugprot-training-development-test-background.zip"
      }, 
      "checksum": "md5:c706ebf04580c3126154d96a43e64f2e", 
      "bucket": "ecaec607-d38a-485e-8a3b-cc5abcd32ac6", 
      "key": "drugprot-training-development-test-background.zip", 
      "type": "zip", 
      "size": 13418218
    }
  ], 
  "owners": [
    55928
  ], 
  "doi": "10.5281/zenodo.5119892", 
  "stats": {
    "version_unique_downloads": 461.0, 
    "unique_views": 615.0, 
    "views": 697.0, 
    "version_views": 3014.0, 
    "unique_downloads": 180.0, 
    "version_unique_views": 2319.0, 
    "volume": 2777571126.0, 
    "version_downloads": 540.0, 
    "downloads": 207.0, 
    "version_volume": 3952663830.0
  }, 
  "links": {
    "doi": "https://doi.org/10.5281/zenodo.5119892", 
    "conceptdoi": "https://doi.org/10.5281/zenodo.4955410", 
    "bucket": "https://zenodo.org/api/files/ecaec607-d38a-485e-8a3b-cc5abcd32ac6", 
    "conceptbadge": "https://zenodo.org/badge/doi/10.5281/zenodo.4955410.svg", 
    "html": "https://zenodo.org/record/5119892", 
    "latest_html": "https://zenodo.org/record/5119892", 
    "badge": "https://zenodo.org/badge/doi/10.5281/zenodo.5119892.svg", 
    "latest": "https://zenodo.org/api/records/5119892"
  }, 
  "conceptdoi": "10.5281/zenodo.4955410", 
  "created": "2021-07-21T16:05:20.094546+00:00", 
  "updated": "2021-07-22T13:48:21.414491+00:00", 
  "conceptrecid": "4955410", 
  "revision": 2, 
  "id": 5119892, 
  "metadata": {
    "access_right_category": "success", 
    "doi": "10.5281/zenodo.5119892", 
    "description": "<p>Gold Standard annotations of the DrugProt corpus (training and development sets). Also, test and background sets.</p>\n\n<p><br>\n&nbsp;</p>\n\n<p><strong>Introduction</strong></p>\n\n<p>The aim of the DrugProt track (similar to the previous CHEMPROT task of BioCreative VI) is to promote the development and evaluation of systems that are able to automatically detect in relations between chemical compounds/drug and genes/proteins. We have therefore generated a manually annotated corpus, the&nbsp;<em>DrugProt corpus</em>, where domain experts have exhaustively labeled:(a) all chemical and gene mentions, and (b) all binary relationships between them corresponding to a specific set of biologically relevant relation types (<em>DrugProt relation classes</em>). There is also an increasing interested in the integration of chemical and biomedical data understood as curation of relationships between biological and chemical entities from text and storing such information in form of structured annotation databases. Such databases are of key relevance not only for biological but also for pharmacological and clinical research. A range of different types chemical-protein/gene interactions are of key relevance for biology, including metabolic relations (e.g. 
substrates, products), inhibition, binding or induction associations.</p>\n\n<p>The DrugProt track aims to address these needs and to promote the development of systems able to extract chemical-protein interactions that might be of relevance for precision medicine as well as for drug discovery and basic biomedical research.</p>\n\n<p>The DrugProt track in BioCreative VII (BC VII) will explore recognition of chemical-protein entity relations from abstracts.</p>\n\n<p>Teams participating in this track are provided with:</p>\n\n<ul>\n\t<li>PubMed abstracts</li>\n\t<li>Manually annotated chemical compound mentions</li>\n\t<li>Manually annotated gene/protein mentions</li>\n\t<li>Manually annotated chemical compound-protein relations</li>\n</ul>\n\n<p>&nbsp;</p>\n\n<p><strong>Zip structure:</strong></p>\n\n<ul>\n\t<li>Training set folder with\n\t<ul>\n\t\t<li>drugprot_training_abstracts.tsv:&nbsp;PubMed records</li>\n\t\t<li>drugprot_training_entities.tsv:&nbsp;manually labeled mention annotations of chemical compounds and genes/proteins</li>\n\t\t<li>drugprot_training_relations.tsv: chemical-protein relation annotations</li>\n\t</ul>\n\t</li>\n\t<li>Development set folder with\n\t<ul>\n\t\t<li>drugprot_development_abstracts.tsv</li>\n\t\t<li>drugprot_development_entities.tsv</li>\n\t\t<li>drugprot_development_relations.tsv</li>\n\t</ul>\n\t</li>\n</ul>\n\n<ul>\n\t<li>Test+background set folder with\n\t<ul>\n\t\t<li>test_background_abstracts.tsv</li>\n\t\t<li>test_background_entities.tsv</li>\n\t</ul>\n\t</li>\n</ul>\n\n<p>&nbsp;</p>\n\n<p><strong>Data format&nbsp;description</strong></p>\n\n<p>The <strong>input text files</strong> for the DrugProt track are plain-text, UTF-8-encoded PubMed records in a tab-separated format with the following three columns:</p>\n\n<ol>\n\t<li>Article identifier (PMID, PubMed identifier)</li>\n\t<li>Title of the article</li>\n\t<li>Abstract of the article</li>\n</ol>\n\n<p>&nbsp;</p>\n\n<p>DrugProt <strong>entity mention annotation 
files</strong>&nbsp;contain manually labeled mention annotations of chemical compounds and genes/proteins. Such files consist of tab-separated fields containing the following six columns:</p>\n\n<ol>\n\t<li>Article identifier (PMID)</li>\n\t<li>Term number (for this record)</li>\n\t<li>Type of entity mention (CHEMICAL, GENE-Y, GENE-N)</li>\n\t<li>Start character offset of the entity mention</li>\n\t<li>End character offset of the entity mention</li>\n\t<li>Text string of the entity mention</li>\n</ol>\n\n<p>Each line contains one entity, and <em>each entity is uniquely identified by its PMID and the Term Number</em>. Besides, each annotation contains an annotation type, the start-offset -the index of the first character of the annotated span in the text-, the end-offset -the index of the first character after the annotated span- and the text spanned&nbsp;by the annotation.</p>\n\n<p>Example DrugProt <em>training</em> entity mention annotations:</p>\n\n<pre><code>11808879\tT1\tGENE-Y\t1860\t1866\tKIR6.2\n11808879\tT2\tGENE-N\t1993\t2016\tglutamate dehydrogenase\n11808879\tT3\tGENE-Y\t2242\t2253\tglucokinase\n23017395\tT1\tCHEMICAL\t216\t223\tHMG-CoA\n23017395\tT2\tCHEMICAL\t258\t261\tEPA</code></pre>\n\n<p>&nbsp;</p>\n\n<p>Example DrugProt <em>development</em> entity mention annotations (no distinction between GENE-Y and GENE-N):</p>\n\n<pre><code>11808879\tT1\tGENE\t1860\t1866\tKIR6.2\n11808879\tT2\tGENE\t1993\t2016\tglutamate dehydrogenase\n11808879\tT3\tGENE\t2242\t2253\tglucokinase\n23017395\tT1\tCHEMICAL\t216\t223\tHMG-CoA\n23017395\tT2\tCHEMICAL\t258\t261\tEPA</code></pre>\n\n<p><br>\nDrugProt <strong>relation annotations</strong> are distributed as a file that contains the detailed chemical-protein relation annotations prepared for the DrugProt track. There are no relation annotations for the test+background set (the goal of the task is to predict them). 
The relation file consists of tab-separated columns containing:</p>\n\n<ol>\n\t<li>Article identifier (PMID)</li>\n\t<li>DrugProt relation</li>\n\t<li>Interactor argument 1 (<em>of type CHEMICAL</em>)</li>\n\t<li>Interactor argument 2 (<em>of type GENE</em>)</li>\n</ol>\n\n<p>Each line contains one relation, and <em>each relation is identified by the PMID, the relation type and the two related entities</em>. In the example below, to find the entities involved in the first relation, look up the entities with term numbers T1 and T52 <em>within the PMID&nbsp;12488248.</em></p>\n\n<p>Example DrugProt relation&nbsp;annotations:</p>\n\n<pre><code>12488248\tINHIBITOR\tArg1:T1\tArg2:T52\n12488248\tINHIBITOR\tArg1:T2\tArg2:T52\n23220562\tACTIVATOR\tArg1:T12\tArg2:T42\n23220562\tACTIVATOR\tArg1:T12\tArg2:T43\n23220562\tINDIRECT-DOWNREGULATOR\tArg1:T1\tArg2:T14</code></pre>\n\n<p>&nbsp;</p>\n\n<p>Please cite:</p>\n\n<p>@inproceedings{krallinger2017overview,&nbsp;title={Overview of the BioCreative VI chemical-protein interaction Track},&nbsp;author={Krallinger, Martin and Rabal, Obdulia and Akhondi, Saber A and P{\\&#39;e}rez, Mart{\\i}n P{\\&#39;e}rez and Santamar{\\&#39;\\i}a, Jes{\\&#39;u}s and Rodr{\\&#39;\\i}guez, Gael P{\\&#39;e}rez and others},&nbsp;booktitle={Proceedings of the sixth BioCreative challenge evaluation workshop},&nbsp;volume={1},&nbsp;pages={141--146},&nbsp;year={2017}}</p>\n\n<p>&nbsp;</p>\n\n<p><strong>Summary statistics:</strong></p>\n\n<pre><code>\t\t\tTraining set\tDevelopment set\nDocuments\t\t3500\t\t750\nTokens\t\t\t1001168\t\t199620\nAnnotated Entities\t89529\t\t18858\nAnnotated Relations\t17288\t\t3765</code></pre>\n\n<p>&nbsp;</p>\n\n<p>Annotated Entities:</p>\n\n<pre><code class=\"language-html\">\t\t\t\tTraining Entities\tDevelopment Entities\nCHEMICAL\t\t\t46274\t\t\t9853\nGENE-Y [Normalizable]\t\t28421\t\t\t-\nGENE-N [Non-Normalizable]\t14834\t\t\t-\nGene Total 
(N+Y)\t\t43255\t\t\t9005\nTotal\t\t\t\t89529\t\t\t18858</code></pre>\n\n<p>&nbsp;</p>\n\n<p>Annotated Relations:</p>\n\n<pre><code>\t\t\tTraining Relations\tDevelopment Relations\nINDIRECT-DOWNREGULATOR\t1330\t\t\t332\nINDIRECT-UPREGULATOR\t1379\t\t\t302\nDIRECT-REGULATOR\t2250\t\t\t458\nACTIVATOR\t\t1429\t\t\t246\nINHIBITOR\t\t5392\t\t\t1152\nAGONIST\t\t\t659\t\t\t131\nAGONIST-ACTIVATOR\t29\t\t\t10\nAGONIST-INHIBITOR\t13\t\t\t2\nANTAGONIST\t\t972\t\t\t218\nPRODUCT-OF\t\t921\t\t\t158\nSUBSTRATE\t\t2003\t\t\t495\nSUBSTRATE_PRODUCT-OF\t25\t\t\t3\nPART-OF\t\t\t886\t\t\t258\nTotal \t\t\t17288\t\t\t3765</code></pre>\n\n<p>&nbsp;</p>\n\n<p>For further information, please visit&nbsp;<a href=\"https://biocreative.bioinformatics.udel.edu/tasks/biocreative-vii/track-1/\">https://biocreative.bioinformatics.udel.edu/tasks/biocreative-vii/track-1/</a> or email us at krallinger.martin@gmail.com and antoniomiresc@gmail.com</p>\n\n<p>&nbsp;</p>\n\n<p><strong>Related resources:</strong></p>\n\n<ul>\n\t<li><a href=\"https://biocreative.bioinformatics.udel.edu/tasks/biocreative-vii/track-1/\">Web</a></li>\n\t<li><a href=\"https://github.com/tonifuc3m/drugprot-evaluation-library\">Evaluation library</a></li>\n\t<li><a href=\"https://doi.org/10.5281/zenodo.4957137\">Relation annotation guidelines</a></li>\n\t<li><a href=\"https://doi.org/10.5281/zenodo.4957576\">Gene and protein annotation guidelines</a></li>\n\t<li><a href=\"https://doi.org/10.5281/zenodo.4957518\">Chemicals and drugs annotation guidelines</a></li>\n\t<li><a href=\"https://doi.org/10.5281/zenodo.5042178\">FAQ</a></li>\n\t<li><a href=\"https://doi.org/10.5281/zenodo.5119878\">DrugProt Large Scale Additional SubTrack</a></li>\n</ul>", 
    "language": "eng", 
    "title": "DrugProt corpus: Biocreative VII Track 1 - Text mining drug and chemical-protein interactions", 
    "license": {
      "id": "CC-BY-4.0"
    }, 
    "relations": {
      "version": [
        {
          "count": 3, 
          "index": 2, 
          "parent": {
            "pid_type": "recid", 
            "pid_value": "4955410"
          }, 
          "is_last": true, 
          "last_child": {
            "pid_type": "recid", 
            "pid_value": "5119892"
          }
        }
      ]
    }, 
    "communities": [
      {
        "id": "medicalnlp"
      }
    ], 
    "version": "1.2", 
    "keywords": [
      "NLP", 
      "relation extraction", 
      "NER", 
      "biomedical NLP", 
      "biocreative"
    ], 
    "publication_date": "2021-06-29", 
    "creators": [
      {
        "orcid": "0000-0002-2646-8782", 
        "affiliation": "Barcelona Supercomputing Center", 
        "name": "Krallinger, Martin"
      }, 
      {
        "affiliation": "Barcelona Supercomputing Center", 
        "name": "Rabal, Obdulia"
      }, 
      {
        "orcid": "0000-0002-5654-001X", 
        "affiliation": "Barcelona Supercomputing Center", 
        "name": "Miranda-Escalada, Antonio"
      }, 
      {
        "orcid": "0000-0002-8937-6789", 
        "affiliation": "Barcelona Supercomputing Center", 
        "name": "Valencia, Alfonso"
      }
    ], 
    "access_right": "open", 
    "resource_type": {
      "type": "dataset", 
      "title": "Dataset"
    }, 
    "related_identifiers": [
      {
        "scheme": "doi", 
        "identifier": "10.5281/zenodo.4955410", 
        "relation": "isVersionOf"
      }
    ]
  }
}
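
The entity and relation file formats described in the record above can be read with a few lines of Python. The following is a minimal sketch, not the organizers' tooling: the names `Entity`, `Relation`, `parse_entities`, `parse_relations` and `resolve` are hypothetical, and the toy rows and the text `TOY_TEXT` are invented for illustration rather than taken from the corpus. It assumes only what the description states: six tab-separated entity columns, four tab-separated relation columns with `Arg1:`/`Arg2:` prefixes, and an end offset that points to the first character after the span.

```python
import csv
import io
from typing import NamedTuple


class Entity(NamedTuple):
    pmid: str
    term: str    # term number within the record, e.g. "T1"
    type: str    # CHEMICAL, GENE-Y, GENE-N (or GENE)
    start: int   # index of the first character of the span
    end: int     # index of the first character after the span
    text: str


class Relation(NamedTuple):
    pmid: str
    type: str
    arg1: str    # term number of the CHEMICAL argument
    arg2: str    # term number of the GENE argument


def parse_entities(handle):
    """Index entities by (PMID, term number), their unique key."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    entities = {}
    for row in reader:
        if not row:  # skip blank lines
            continue
        pmid, term, etype, start, end, text = row
        entities[(pmid, term)] = Entity(pmid, term, etype, int(start), int(end), text)
    return entities


def parse_relations(handle):
    """Read relation rows, stripping the 'Arg1:' / 'Arg2:' prefixes."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    relations = []
    for row in reader:
        if not row:
            continue
        pmid, rtype, arg1, arg2 = row
        relations.append(Relation(pmid, rtype,
                                  arg1.partition(":")[2],
                                  arg2.partition(":")[2]))
    return relations


def resolve(relation, entities):
    """Look up both arguments of a relation within the same PMID."""
    return (entities[(relation.pmid, relation.arg1)],
            entities[(relation.pmid, relation.arg2)])


# Toy data invented for this sketch (not corpus rows).
TOY_TEXT = "aspirin inhibits COX-1"
TOY_ENTITIES = "1\tT1\tCHEMICAL\t0\t7\taspirin\n1\tT2\tGENE-Y\t17\t22\tCOX-1\n"
TOY_RELATIONS = "1\tINHIBITOR\tArg1:T1\tArg2:T2\n"

entities = parse_entities(io.StringIO(TOY_ENTITIES))
relations = parse_relations(io.StringIO(TOY_RELATIONS))
chemical, gene = resolve(relations[0], entities)

# Because the end offset is exclusive, a plain slice recovers the mention.
assert TOY_TEXT[chemical.start:chemical.end] == chemical.text
print(f"{relations[0].type}: {chemical.text} -> {gene.text}")
```

For the real files, the same functions could be given a file opened with `encoding="utf-8"` and `newline=""`. Which text the offsets index into (for example, how title and abstract are combined) is not stated in this record, so that should be checked against the task documentation before relying on the slice check.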
                  All versions   This version
Views             3,014          697
Downloads         540            207
Data volume       4.0 GB         2.8 GB
Unique views      2,319          615
Unique downloads  461            180
