DrugProt corpus: Biocreative VII Track 1 - Text mining drug and chemical-protein interactions

doi:10.5281/zenodo.5119892

Published June 29, 2021 | Version 1.2

1. Barcelona Supercomputing Center

Gold Standard annotations of the DrugProt corpus (training and development sets). Also, test and background sets.

Please cite if you use any DrugProt resource:

Antonio Miranda-Escalada, Farrokh Mehryary, Jouni Luoma, Darryl Estrada-Zavala, Luis Gasco, Sampo Pyysalo, Alfonso Valencia, Martin Krallinger, Overview of DrugProt task at BioCreative VII: data and methods for large-scale text mining and knowledge graph generation of heterogenous chemical–protein relations, Database, Volume 2023, 2023, baad080

@article{miranda2023overview, title={Overview of DrugProt task at BioCreative VII: data and methods for large-scale text mining and knowledge graph generation of heterogenous chemical--protein relations}, author={Miranda-Escalada, Antonio and Mehryary, Farrokh and Luoma, Jouni and Estrada-Zavala, Darryl and Gasco, Luis and Pyysalo, Sampo and Valencia, Alfonso and Krallinger, Martin}, journal={Database}, volume={2023}, pages={baad080}, year={2023}, publisher={Oxford University Press UK} }

Miranda, Antonio, et al. "Overview of DrugProt BioCreative VII track: quality evaluation and large scale text mining of drug-gene/protein relations." Proceedings of the seventh BioCreative challenge evaluation workshop. 2021.

@inproceedings{miranda2021overview, title={Overview of DrugProt BioCreative VII track: quality evaluation and large scale text mining of drug-gene/protein relations}, author={Miranda, Antonio and Mehryary, Farrokh and Luoma, Jouni and Pyysalo, Sampo and Valencia, Alfonso and Krallinger, Martin}, booktitle={Proceedings of the seventh BioCreative challenge evaluation workshop}, year={2021} }

Introduction

The aim of the DrugProt track (similar to the previous CHEMPROT task of BioCreative VI) is to promote the development and evaluation of systems that are able to automatically detect relations between chemical compounds/drugs and genes/proteins. We have therefore generated a manually annotated corpus, the DrugProt corpus, where domain experts have exhaustively labeled: (a) all chemical and gene mentions, and (b) all binary relationships between them corresponding to a specific set of biologically relevant relation types (DrugProt relation classes). There is also an increasing interest in the integration of chemical and biomedical data, understood as the curation of relationships between biological and chemical entities from text and the storage of such information in the form of structured annotation databases. Such databases are of key relevance not only for biological but also for pharmacological and clinical research. A range of different types of chemical-protein/gene interactions are of key relevance for biology, including metabolic relations (e.g. substrates, products), inhibition, binding, or induction associations.

The DrugProt track aims to address these needs and to promote the development of systems able to extract chemical-protein interactions that might be of relevance for precision medicine as well as for drug discovery and basic biomedical research.

The DrugProt track in BioCreative VII (BC VII) will explore recognition of chemical-protein entity relations from abstracts.

Teams participating in this track are provided with:

PubMed abstracts
Manually annotated chemical compound mentions
Manually annotated gene/protein mentions
Manually annotated chemical compound-protein relations

Zip structure:

Training set folder with
- drugprot_training_abstracts.tsv: PubMed records
- drugprot_training_entities.tsv: manually labeled mention annotations of chemical compounds and genes/proteins
- drugprot_training_relations.tsv: chemical-protein relation annotations
Development set folder with
- drugprot_development_abstracts.tsv
- drugprot_development_entities.tsv
- drugprot_development_relations.tsv
Test+background set folder with
- test_background_abstracts.tsv
- test_background_entities.tsv

Data format description

The input text files for the DrugProt track are plain-text, UTF-8-encoded PubMed records in a tab-separated format with the following three columns:

Article identifier (PMID, PubMed identifier)
Title of the article
Abstract of the article
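
The three-column layout described above can be read with a few lines of Python. This is a minimal sketch, not an official loader: the function name `read_abstracts` and the sample record are hypothetical, and the code assumes every non-empty line holds exactly a PMID, a title, and an abstract separated by tabs.

```python
def read_abstracts(lines):
    """Map each PMID to its (title, abstract) pair.

    `lines` is any iterable of tab-separated text lines, for example
    an open file handle.
    """
    records = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split into at most three fields: PMID, title, abstract.
        pmid, title, abstract = line.split("\t", 2)
        records[pmid] = (title, abstract)
    return records


# Hypothetical record, used only to illustrate the layout.
sample = ["12345\tA hypothetical title\tA hypothetical abstract.\n"]
records = read_abstracts(sample)
print(records["12345"][0])
```

With the real data, the same function could be applied to a file opened with `open(path, encoding="utf-8")`, where `path` points to one of the `*_abstracts.tsv` files after unpacking the zip.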

DrugProt entity mention annotation files contain manually labeled mention annotations of chemical compounds and genes/proteins. Such files consist of tab-separated fields containing the following six columns:

Article identifier (PMID)
Term number (for this record)
Type of entity mention (CHEMICAL, GENE-Y, GENE-N)
Start character offset of the entity mention
End character offset of the entity mention
Text string of the entity mention

Each line contains one entity, and each entity is uniquely identified by its PMID and term number. Each annotation also contains an annotation type, the start offset (the index of the first character of the annotated span in the text), the end offset (the index of the first character after the annotated span), and the text spanned by the annotation.

Example DrugProt training entity mention annotations:

11808879	T1	GENE-Y	1860	1866	KIR6.2
11808879	T2	GENE-N	1993	2016	glutamate dehydrogenase
11808879	T3	GENE-Y	2242	2253	glucokinase
23017395	T1	CHEMICAL	216	223	HMG-CoA
23017395	T2	CHEMICAL	258	261	EPA

Example DrugProt development entity mention annotations (no distinction between GENE-Y and GENE-N):

11808879	T1	GENE	1860	1866	KIR6.2
11808879	T2	GENE	1993	2016	glutamate dehydrogenase
11808879	T3	GENE	2242	2253	glucokinase
23017395	T1	CHEMICAL	216	223	HMG-CoA
23017395	T2	CHEMICAL	258	261	EPA
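
Because the end offset points to the first character after the span, an entity line maps directly onto a Python slice `text[start:end]`. The sketch below parses one of the example lines and checks a span against a text. The names `Entity`, `parse_entity_line`, and `span_matches` are hypothetical, and so is the toy sentence. Which string the offsets index into (for instance, how title and abstract are joined) is not stated here, so the text is passed in explicitly.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    pmid: str
    term_id: str
    entity_type: str
    start: int  # index of the first character of the span
    end: int    # index of the first character after the span
    text: str


def parse_entity_line(line):
    """Parse one six-column, tab-separated entity annotation line."""
    pmid, term_id, entity_type, start, end, text = line.rstrip("\n").split("\t")
    return Entity(pmid, term_id, entity_type, int(start), int(end), text)


def span_matches(document_text, entity):
    """Check that the end-exclusive slice reproduces the mention string."""
    return document_text[entity.start:entity.end] == entity.text


# One of the example training lines shown above.
entity = parse_entity_line("11808879\tT1\tGENE-Y\t1860\t1866\tKIR6.2")

# Entities are unique per (PMID, term number), so that pair works as a key.
index = {(entity.pmid, entity.term_id): entity}

# Hypothetical toy text and annotation, only to illustrate the slice.
toy_text = "EPA lowers HMG-CoA reductase activity."
toy = Entity("0", "T1", "CHEMICAL", 11, 18, "HMG-CoA")
print(span_matches(toy_text, toy))
```

For every line in the examples, `end - start` equals the length of the mention string, which is consistent with the end-exclusive convention.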

DrugProt relation annotations are distributed as files containing the detailed chemical-protein relation annotations prepared for the DrugProt track. There are no relation annotations for the test+background set (the goal of the task is to predict them). Each relation file consists of tab-separated columns containing:

Article identifier (PMID)
DrugProt relation
Interactor argument 1 (of type CHEMICAL)
Interactor argument 2 (of type GENE)

Each line contains one relation, and each relation is identified by the PMID, the relation type, and the two related entities. In the example below, to find the entities involved in the first relation, look up the entities with term identifiers T1 and T52 within PMID 12488248.

Example DrugProt relation annotations:

12488248	INHIBITOR	Arg1:T1	Arg2:T52
12488248	INHIBITOR	Arg1:T2	Arg2:T52
23220562	ACTIVATOR	Arg1:T12	Arg2:T42
23220562	ACTIVATOR	Arg1:T12	Arg2:T43
23220562	INDIRECT-DOWNREGULATOR	Arg1:T1	Arg2:T14
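
The lookup described above (PMID plus term identifier) can be sketched as follows. The function names are hypothetical, and the mention strings in the `entities` dictionary are placeholders: the actual entity lines for PMID 12488248 are not reproduced in this description.

```python
def parse_relation_line(line):
    """Parse one relation line into (pmid, relation, chemical_id, gene_id)."""
    pmid, relation, arg1, arg2 = line.rstrip("\n").split("\t")
    # Arguments look like "Arg1:T1"; keep only the term identifier.
    chemical_id = arg1.split(":", 1)[1]
    gene_id = arg2.split(":", 1)[1]
    return pmid, relation, chemical_id, gene_id


def resolve(relation, entities):
    """Replace the two term identifiers with the matching entity records."""
    pmid, relation_type, chemical_id, gene_id = relation
    return relation_type, entities[(pmid, chemical_id)], entities[(pmid, gene_id)]


# Placeholder mention strings (assumption); real values come from the
# entity annotation file, keyed by (PMID, term identifier).
entities = {
    ("12488248", "T1"): "chemical-mention-placeholder",
    ("12488248", "T52"): "gene-mention-placeholder",
}

relation = parse_relation_line("12488248\tINHIBITOR\tArg1:T1\tArg2:T52")
print(resolve(relation, entities))
```

Keying the entity table by the (PMID, term identifier) pair matters because term identifiers such as T1 repeat across articles.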

Please, cite:

@inproceedings{krallinger2017overview, title={Overview of the BioCreative VI chemical-protein interaction Track}, author={Krallinger, Martin and Rabal, Obdulia and Akhondi, Saber A and P{\'e}rez, Mart{\i}n P{\'e}rez and Santamar{\'\i}a, Jes{\'u}s and Rodr{\'\i}guez, Gael P{\'e}rez and others}, booktitle={Proceedings of the sixth BioCreative challenge evaluation workshop}, volume={1}, pages={141--146}, year={2017}}

Summary statistics:

	Training set	Development set
Documents	3500	750
Tokens	1001168	199620
Annotated Entities	89529	18858
Annotated Relations	17288	3765

Annotated Entities:

	Training Entities	Development Entities
CHEMICAL	46274	9853
GENE-Y [Normalizable]	28421	-
GENE-N [Non-Normalizable]	14834	-
Gene Total (N+Y)	43255	9005
Total	89529	18858

Annotated Relations:

	Training Relations	Development Relations
INDIRECT-DOWNREGULATOR	1330	332
INDIRECT-UPREGULATOR	1379	302
DIRECT-REGULATOR	2250	458
ACTIVATOR	1429	246
INHIBITOR	5392	1152
AGONIST	659	131
AGONIST-ACTIVATOR	29	10
AGONIST-INHIBITOR	13	2
ANTAGONIST	972	218
PRODUCT-OF	921	158
SUBSTRATE	2003	495
SUBSTRATE_PRODUCT-OF	25	3
PART-OF	886	258
Total	17288	3765
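
Per-type counts like those above can be recomputed from a relations file by tallying the second column. The following is a small sketch with a hypothetical function name, run here on the five example relation lines from this description rather than on the full corpus.

```python
from collections import Counter


def count_relation_types(lines):
    """Tally relation types (second column) over relation annotation lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 4:  # PMID, relation type, Arg1, Arg2
            counts[fields[1]] += 1
    return counts


# The five example relation lines given in this description.
example_lines = [
    "12488248\tINHIBITOR\tArg1:T1\tArg2:T52",
    "12488248\tINHIBITOR\tArg1:T2\tArg2:T52",
    "23220562\tACTIVATOR\tArg1:T12\tArg2:T42",
    "23220562\tACTIVATOR\tArg1:T12\tArg2:T43",
    "23220562\tINDIRECT-DOWNREGULATOR\tArg1:T1\tArg2:T14",
]

counts = count_relation_types(example_lines)
for relation_type, count in counts.most_common():
    print(relation_type, count)
```

Passing an open handle on `drugprot_training_relations.tsv` instead of `example_lines` should give totals comparable to the training column, assuming the file follows the four-column layout.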

For further information, please visit https://biocreative.bioinformatics.udel.edu/tasks/biocreative-vii/track-1/ or email us at krallinger.martin@gmail.com and antoniomiresc@gmail.com

Files

Name	Size
drugprot-training-development-test-background.zip (md5:c706ebf04580c3126154d96a43e64f2e)	13.4 MB
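
The MD5 checksum listed with the archive can be used to confirm that a download is intact. Below is a minimal sketch using Python's standard `hashlib`; the function name `md5_of_file` and the local path in the usage comment are assumptions.

```python
import hashlib

# Checksum as listed for the archive on this page.
EXPECTED_MD5 = "c706ebf04580c3126154d96a43e64f2e"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage, assuming the archive was saved in the current directory:
# ok = md5_of_file("drugprot-training-development-test-background.zip") == EXPECTED_MD5
```

Reading in chunks keeps memory use flat regardless of archive size.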