DrugProt corpus: Biocreative VII Track 1 - Text mining drug and chemical-protein interactions

Krallinger, Martin; Rabal, Obdulia; Miranda-Escalada, Antonio; Valencia, Alfonso

doi:10.5281/zenodo.4955411

Published June 15, 2021 | Version 1.0

Dataset Open

1. Barcelona Supercomputing Center

Newer version (1.1) contains the training and the development sets: https://zenodo.org/record/5042151

Gold Standard annotations of the DrugProt corpus (training set)

Introduction

The aim of the DrugProt track (similar to the previous CHEMPROT task of BioCreative VI) is to promote the development and evaluation of systems that are able to automatically detect relations between chemical compounds/drugs and genes/proteins. We have therefore generated a manually annotated corpus, the DrugProt corpus, in which domain experts have exhaustively labeled: (a) all chemical and gene mentions, and (b) all binary relationships between them corresponding to a specific set of biologically relevant relation types (DrugProt relation classes). There is also increasing interest in the integration of chemical and biomedical data, understood as the curation of relationships between biological and chemical entities from text and the storage of such information in the form of structured annotation databases. Such databases are of key relevance not only for biological but also for pharmacological and clinical research. A range of different types of chemical-protein/gene interactions are of key relevance for biology, including metabolic relations (e.g. substrates, products), inhibition, binding, and induction associations.

The DrugProt track aims to address these needs and to promote the development of systems able to extract chemical-protein interactions that might be of relevance for precision medicine as well as for drug discovery and basic biomedical research.

The DrugProt track in BioCreative VII (BC VII) will explore recognition of chemical-protein entity relations from abstracts.

Teams participating in this track are provided with:

PubMed abstracts
Manually annotated chemical compound mentions
Manually annotated gene/protein mentions
Manually annotated chemical compound-protein relations

Zip structure:

Training set folder with
- drugprot_training_abstracts.tsv: PubMed records
- drugprot_training_entities.tsv: manually labeled mention annotations of chemical compounds and genes/proteins
- drugprot_training_relations.tsv: chemical-protein relation annotations

Data format description

The input files for the DrugProt track are plain-text, UTF-8-encoded PubMed records in a tab-separated format with the following three columns:

Article identifier (PMID, PubMed identifier)
Title of the article
Abstract of the article
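
As an illustration of the three-column layout above, the following is a minimal sketch of a reader for the abstracts file. The names Abstract and read_abstracts are hypothetical, the sample record is made up, and the sketch assumes the file has no header row and no embedded tabs inside a field.

```python
import csv
import io
from typing import Dict, Iterable, NamedTuple


class Abstract(NamedTuple):
    pmid: str
    title: str
    abstract: str


def read_abstracts(lines: Iterable[str]) -> Dict[str, Abstract]:
    """Parse tab-separated records with columns PMID, title, abstract."""
    records: Dict[str, Abstract] = {}
    # QUOTE_NONE keeps quote characters in titles and abstracts as plain text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:  # skip blank lines
            continue
        if len(row) != 3:
            raise ValueError(f"expected 3 columns, got {len(row)}")
        pmid, title, abstract = row
        records[pmid] = Abstract(pmid, title, abstract)
    return records


# Made-up record, for illustration only (not taken from the corpus).
sample = io.StringIO('00000001\tA "quoted" title\tAbstract text.\n')
abstracts = read_abstracts(sample)
```

To read the real file, pass an open handle instead of the sample, for example open("drugprot_training_abstracts.tsv", encoding="utf-8", newline="").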

DrugProt entity mention annotation files contain manually labeled mention annotations of chemical compounds and genes/proteins (so-called gene and protein-related objects, GPRO, as defined during BioCreative V). Such files consist of tab-separated fields containing the following six columns:

Article identifier (PMID)
Entity or term number (for this record)
Type of entity mention (CHEMICAL, GENE-Y, GENE-N)
Start character offset of the entity mention
End character offset of the entity mention
Text string of the entity mention

Example DrugProt entity mention annotations:

11808879	T12	GENE-Y	1860	1866	KIR6.2
11808879	T13	GENE-N	1993	2016	glutamate dehydrogenase
11808879	T14	GENE-Y	2242	2253	glucokinase
23017395	T1	CHEMICAL	216	223	HMG-CoA
23017395	T2	CHEMICAL	258	261	EPA
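
The example lines above can be parsed with a short sketch like the one below. Entity and parse_entity_line are hypothetical names. In these five examples the difference between end and start offsets equals the length of the mention string, which suggests the end offset is exclusive; the sketch checks only that property and makes no assumption about which text (title, abstract, or both) the offsets index into.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    pmid: str
    term_id: str
    entity_type: str  # CHEMICAL, GENE-Y or GENE-N
    start: int
    end: int
    text: str


def parse_entity_line(line: str) -> Entity:
    """Split one tab-separated entity annotation line into its six fields."""
    pmid, term_id, entity_type, start, end, text = line.rstrip("\n").split("\t")
    return Entity(pmid, term_id, entity_type, int(start), int(end), text)


# The example annotations shown above.
sample_lines = [
    "11808879\tT12\tGENE-Y\t1860\t1866\tKIR6.2",
    "11808879\tT13\tGENE-N\t1993\t2016\tglutamate dehydrogenase",
    "11808879\tT14\tGENE-Y\t2242\t2253\tglucokinase",
    "23017395\tT1\tCHEMICAL\t216\t223\tHMG-CoA",
    "23017395\tT2\tCHEMICAL\t258\t261\tEPA",
]
entities = [parse_entity_line(line) for line in sample_lines]

# Span length matches the mention length in every example.
spans_consistent = all(e.end - e.start == len(e.text) for e in entities)
```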

DrugProt relation annotations will be distributed as a file that contains the detailed chemical-protein relation annotations prepared for the DrugProt track. It consists of tab-separated columns containing:

Article identifier (PMID)
DrugProt relation
Interactor argument 1
Interactor argument 2

Example DrugProt relation annotations:

12488248	INHIBITOR	Arg1:T1	Arg2:T52
12488248	INHIBITOR	Arg1:T2	Arg2:T52
23220562	ACTIVATOR	Arg1:T12	Arg2:T42
23220562	ACTIVATOR	Arg1:T12	Arg2:T43
23220562	INDIRECT-DOWNREGULATOR	Arg1:T1	Arg2:T14
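
A relation line can be split in a similar way. The sketch below uses hypothetical names (Relation, parse_relation_line) and assumes, based on the examples above, that the two arguments always carry the prefixes Arg1: and Arg2: followed by an entity term number from the same article.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    pmid: str
    relation_type: str
    arg1: str  # term number, e.g. "T1"
    arg2: str  # term number, e.g. "T52"


def parse_relation_line(line: str) -> Relation:
    """Split one tab-separated relation line and strip the Arg1:/Arg2: prefixes."""
    pmid, relation_type, arg1, arg2 = line.rstrip("\n").split("\t")
    if not arg1.startswith("Arg1:") or not arg2.startswith("Arg2:"):
        raise ValueError(f"unexpected argument format: {line!r}")
    return Relation(pmid, relation_type, arg1[len("Arg1:"):], arg2[len("Arg2:"):])


# The example annotations shown above.
sample_lines = [
    "12488248\tINHIBITOR\tArg1:T1\tArg2:T52",
    "12488248\tINHIBITOR\tArg1:T2\tArg2:T52",
    "23220562\tACTIVATOR\tArg1:T12\tArg2:T42",
    "23220562\tACTIVATOR\tArg1:T12\tArg2:T43",
    "23220562\tINDIRECT-DOWNREGULATOR\tArg1:T1\tArg2:T14",
]
relations = [parse_relation_line(line) for line in sample_lines]
```

The pair (pmid, arg1) or (pmid, arg2) can then serve as a lookup key into the entity annotations of the same article.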

Please cite:

@inproceedings{krallinger2017overview, title={Overview of the BioCreative VI chemical-protein interaction Track}, author={Krallinger, Martin and Rabal, Obdulia and Akhondi, Saber A and P{\'e}rez, Mart{\i}n P{\'e}rez and Santamar{\'\i}a, Jes{\'u}s and Rodr{\'\i}guez, Gael P{\'e}rez and others}, booktitle={Proceedings of the sixth BioCreative challenge evaluation workshop}, volume={1}, pages={141--146}, year={2017}}

Summary statistics:

			Training set
Documents		3500
Tokens			1001168
Annotated Entities	89529
Annotated Relations	17288

Annotated Entities:

				Annotated Entities
CHEMICAL			46274
GENE-Y [Normalizable]		28421
GENE-N [Non-Normalizable]	14834
Gene Total (N+Y)		43255
Total				89529

Annotated Relations:

			Annotated Relations
INDIRECT-DOWNREGULATOR	1330
INDIRECT-UPREGULATOR	1379
DIRECT-REGULATOR	2250
ACTIVATOR		1429
INHIBITOR		5392
AGONIST			659
AGONIST-ACTIVATOR	29
AGONIST-INHIBITOR	13
ANTAGONIST		972
PRODUCT-OF		921
SUBSTRATE		2003
SUBSTRATE_PRODUCT-OF	25
PART-OF			886
Total 			17288
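
The totals in the tables above can be checked with simple arithmetic, and the per-class counts can be recomputed from a relations file. The sketch below is one way to do both; count_relation_types is a hypothetical helper, and the five sample lines are the example relation annotations given earlier, not the full training set.

```python
from collections import Counter
from typing import Iterable


def count_relation_types(lines: Iterable[str]) -> Counter:
    """Count relation classes (second column) in tab-separated relation lines."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 4:  # ignore blank or malformed lines
            counts[fields[1]] += 1
    return counts


# Per-class counts copied from the relation table above.
reported = {
    "INDIRECT-DOWNREGULATOR": 1330, "INDIRECT-UPREGULATOR": 1379,
    "DIRECT-REGULATOR": 2250, "ACTIVATOR": 1429, "INHIBITOR": 5392,
    "AGONIST": 659, "AGONIST-ACTIVATOR": 29, "AGONIST-INHIBITOR": 13,
    "ANTAGONIST": 972, "PRODUCT-OF": 921, "SUBSTRATE": 2003,
    "SUBSTRATE_PRODUCT-OF": 25, "PART-OF": 886,
}
relation_total = sum(reported.values())   # reported total: 17288
gene_total = 28421 + 14834                # GENE-Y + GENE-N, reported: 43255
entity_total = 46274 + gene_total         # CHEMICAL + genes, reported: 89529

# Recount on the five example relation lines shown earlier.
sample = [
    "12488248\tINHIBITOR\tArg1:T1\tArg2:T52",
    "12488248\tINHIBITOR\tArg1:T2\tArg2:T52",
    "23220562\tACTIVATOR\tArg1:T12\tArg2:T42",
    "23220562\tACTIVATOR\tArg1:T12\tArg2:T43",
    "23220562\tINDIRECT-DOWNREGULATOR\tArg1:T1\tArg2:T14",
]
sample_counts = count_relation_types(sample)
```

Running count_relation_types over drugprot_training_relations.tsv and comparing the result with the reported dictionary is a quick sanity check after download.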

For further information, please visit https://biocreative.bioinformatics.udel.edu/tasks/biocreative-vii/track-1/ or email us at krallinger.martin@gmail.com and antoniomiresc@gmail.com

Notes

DrugProt corpus is promoted by the Plan de Impulso de las Tecnologías del Lenguaje de la Agenda Digital (Plan TL).

Files

Name	Size
drugprot-gs.zip md5:0c11a875b9066a19571157ece0df6f63	3.1 MB
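
The MD5 checksum listed above can be used to verify a downloaded copy of the archive. The following is a minimal sketch using the Python standard library; md5_of_file is a hypothetical helper name, and the local file path in the usage note is an assumption about where the archive was saved.

```python
import hashlib

# Checksum listed for drugprot-gs.zip in the file table above.
EXPECTED_MD5 = "0c11a875b9066a19571157ece0df6f63"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, md5_of_file("drugprot-gs.zip") == EXPECTED_MD5 should hold for an intact download saved under that name in the current directory.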
