Dataset Open Access

DrugProt corpus: Biocreative VII Track 1 - Text mining drug and chemical-protein interactions

Krallinger, Martin; Rabal, Obdulia; Miranda-Escalada, Antonio; Valencia, Alfonso


Dublin Core Export

<?xml version='1.0' encoding='utf-8'?>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:creator>Krallinger, Martin</dc:creator>
  <dc:creator>Rabal, Obdulia</dc:creator>
  <dc:creator>Miranda-Escalada, Antonio</dc:creator>
  <dc:creator>Valencia, Alfonso</dc:creator>
  <dc:date>2021-06-29</dc:date>
  <dc:description>Gold Standard annotations of the DrugProt corpus (training and development sets), together with the test and background sets.


 

Introduction

The aim of the DrugProt track (similar to the previous CHEMPROT task of BioCreative VI) is to promote the development and evaluation of systems that can automatically detect relations between chemical compounds/drugs and genes/proteins. We have therefore generated a manually annotated corpus, the DrugProt corpus, where domain experts have exhaustively labeled: (a) all chemical and gene mentions, and (b) all binary relationships between them corresponding to a specific set of biologically relevant relation types (DrugProt relation classes). There is also increasing interest in the integration of chemical and biomedical data, understood as the curation of relationships between biological and chemical entities from text and the storage of that information in structured annotation databases. Such databases are of key relevance not only for biological but also for pharmacological and clinical research. A range of different types of chemical-protein/gene interactions are of key relevance for biology, including metabolic relations (e.g. substrates, products), inhibition, binding, and induction associations.

The DrugProt track aims to address these needs and to promote the development of systems able to extract chemical-protein interactions that might be of relevance for precision medicine as well as for drug discovery and basic biomedical research.

The DrugProt track in BioCreative VII (BC VII) will explore recognition of chemical-protein entity relations from abstracts.

Teams participating in this track are provided with:


	PubMed abstracts
	Manually annotated chemical compound mentions
	Manually annotated gene/protein mentions
	Manually annotated chemical compound-protein relations


 

Zip structure:


	Training set folder with
	
		drugprot_training_abstracts.tsv: PubMed records
		drugprot_training_entities.tsv: manually labeled mention annotations of chemical compounds and genes/proteins
		drugprot_training_relations.tsv: chemical-protein relation annotations
	
	
	Development set folder with
	
		drugprot_development_abstracts.tsv
		drugprot_development_entities.tsv
		drugprot_development_relations.tsv
	
	



	Test+background set folder with
	
		test_background_abstracts.tsv
		test_background_entities.tsv
	
	


 

Data format description

The input text files for the DrugProt track are plain-text, UTF-8-encoded PubMed records in a tab-separated format with the following three columns:


	Article identifier (PMID, PubMed identifier)
	Title of the article
	Abstract of the article
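
As a minimal sketch of reading this three-column layout, the snippet below uses Python's standard csv module. The function name read_abstracts and the sample record are made up for illustration and are not part of the corpus or its tooling; the sketch also assumes there is no header row.

```python
import csv
import io
from typing import Dict, Iterable, Tuple


def read_abstracts(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map each PMID to its (title, abstract) pair."""
    records = {}
    # QUOTE_NONE keeps quotation marks inside titles and abstracts as literal text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            raise ValueError(f"expected 3 columns, got {len(row)}")
        pmid, title, abstract = row
        records[pmid] = (title, abstract)
    return records


# Made-up record for illustration only; it is not taken from the corpus.
sample = io.StringIO('00000001\tA "toy" title\tA toy abstract.\n')
abstracts = read_abstracts(sample)
print(abstracts["00000001"][0])
```

For a real file, the same function could be given a handle opened with open(path, encoding="utf-8", newline="").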


 

DrugProt entity mention annotation files contain manually labeled mention annotations of chemical compounds and genes/proteins. Such files consist of tab-separated fields containing the following six columns:


	Article identifier (PMID)
	Term number (for this record)
	Type of entity mention (CHEMICAL, GENE-Y, GENE-N)
	Start character offset of the entity mention
	End character offset of the entity mention
	Text string of the entity mention


Each line contains one entity, and each entity is uniquely identified by its PMID and term number. Each annotation also gives the annotation type, the start offset (the index of the first character of the annotated span in the text), the end offset (the index of the first character after the annotated span), and the text spanned by the annotation.

Example DrugProt training entity mention annotations:

11808879	T1	GENE-Y	1860	1866	KIR6.2
11808879	T2	GENE-N	1993	2016	glutamate dehydrogenase
11808879	T3	GENE-Y	2242	2253	glucokinase
23017395	T1	CHEMICAL	216	223	HMG-CoA
23017395	T2	CHEMICAL	258	261	EPA
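
The six-column layout and the exclusive end offset can be illustrated with a short sketch. The Entity class and parse_entity function are hypothetical names introduced here, and the two input lines are copied from the example above. This description does not say how the title and abstract are joined into the text that the offsets index, so it is worth checking a few spans against the abstracts file before relying on them.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    pmid: str
    term: str    # term number, e.g. "T1"
    etype: str   # CHEMICAL, GENE-Y, or GENE-N
    start: int   # index of the first character of the span
    end: int     # index of the first character after the span
    text: str


def parse_entity(line: str) -> Entity:
    pmid, term, etype, start, end, text = line.rstrip("\n").split("\t")
    return Entity(pmid, term, etype, int(start), int(end), text)


lines = [
    "11808879\tT1\tGENE-Y\t1860\t1866\tKIR6.2",
    "23017395\tT1\tCHEMICAL\t216\t223\tHMG-CoA",
]
entities = [parse_entity(line) for line in lines]

# Because the end offset is exclusive, end - start equals the mention length,
# so a plain slice document_text[start:end] would select the mention.
for e in entities:
    assert e.end - e.start == len(e.text)

# (PMID, term number) is the unique key for an entity.
by_key = {(e.pmid, e.term): e for e in entities}
print(by_key[("11808879", "T1")].text)  # KIR6.2
```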

 

Example DrugProt development entity mention annotations (no distinction between GENE-Y and GENE-N):

11808879	T1	GENE	1860	1866	KIR6.2
11808879	T2	GENE	1993	2016	glutamate dehydrogenase
11808879	T3	GENE	2242	2253	glucokinase
23017395	T1	CHEMICAL	216	223	HMG-CoA
23017395	T2	CHEMICAL	258	261	EPA


DrugProt relation annotations are distributed as a file that contains the detailed chemical-protein relation annotations prepared for the DrugProt track. There are no relation annotations for the test+background set (the goal of the task is to predict them). Each relation file consists of four tab-separated columns:


	Article identifier (PMID)
	DrugProt relation
	Interactor argument 1 (of type CHEMICAL)
	Interactor argument 2 (of type GENE)


Each line contains one relation, and each relation is identified by the PMID, the relation type and the two related entities. In the example below, to find the entities involved in the first relation, look up the entities with term numbers T1 and T52 within PMID 12488248 in the entity annotation file.

Example DrugProt relation annotations:

12488248	INHIBITOR	Arg1:T1	Arg2:T52
12488248	INHIBITOR	Arg1:T2	Arg2:T52
23220562	ACTIVATOR	Arg1:T12	Arg2:T42
23220562	ACTIVATOR	Arg1:T12	Arg2:T43
23220562	INDIRECT-DOWNREGULATOR	Arg1:T1	Arg2:T14
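
The lookup described above can be sketched as follows: each relation line is split into its four fields, and the Arg1:/Arg2: prefixes are stripped so that every argument becomes a (PMID, term number) key into the entity annotations. Relation and parse_relation are hypothetical names used only for this illustration; the input line is the first example relation.

```python
from typing import NamedTuple, Tuple


class Relation(NamedTuple):
    pmid: str
    rel_type: str
    arg1: Tuple[str, str]  # (PMID, term number) of the CHEMICAL
    arg2: Tuple[str, str]  # (PMID, term number) of the GENE


def parse_relation(line: str) -> Relation:
    pmid, rel_type, arg1, arg2 = line.rstrip("\n").split("\t")
    if not (arg1.startswith("Arg1:") and arg2.startswith("Arg2:")):
        raise ValueError(f"unexpected argument fields: {arg1!r}, {arg2!r}")
    # Drop the "Arg1:" / "Arg2:" prefixes to recover the bare term numbers.
    return Relation(
        pmid,
        rel_type,
        (pmid, arg1[len("Arg1:"):]),
        (pmid, arg2[len("Arg2:"):]),
    )


rel = parse_relation("12488248\tINHIBITOR\tArg1:T1\tArg2:T52")
print(rel.rel_type, rel.arg1, rel.arg2)
```

The two tuples have the same shape as the (PMID, term number) pair that uniquely identifies an entity, so they can be used directly as dictionary keys over parsed entity lines.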

 

Please cite:

@inproceedings{krallinger2017overview, title={Overview of the BioCreative VI chemical-protein interaction Track}, author={Krallinger, Martin and Rabal, Obdulia and Akhondi, Saber A and P{\'e}rez, Mart{\i}n P{\'e}rez and Santamar{\'\i}a, Jes{\'u}s and Rodr{\'\i}guez, Gael P{\'e}rez and others}, booktitle={Proceedings of the sixth BioCreative challenge evaluation workshop}, volume={1}, pages={141--146}, year={2017}}

 

Summary statistics:

			Training set	Development set
Documents		3500		750
Tokens			1001168		199620
Annotated Entities	89529		18858
Annotated Relations	17288		3765

 

Annotated Entities:

				Training Entities	Development Entities
CHEMICAL			46274			9853
GENE-Y [Normalizable]		28421			-
GENE-N [Non-Normalizable]	14834			-
Gene Total (N+Y)		43255			9005
Total				89529			18858

 

Annotated Relations:

			Training Relations	Development Relations
INDIRECT-DOWNREGULATOR	1330			332
INDIRECT-UPREGULATOR	1379			302
DIRECT-REGULATOR	2250			458
ACTIVATOR		1429			246
INHIBITOR		5392			1152
AGONIST			659			131
AGONIST-ACTIVATOR	29			10
AGONIST-INHIBITOR	13			2
ANTAGONIST		972			218
PRODUCT-OF		921			158
SUBSTRATE		2003			495
SUBSTRATE_PRODUCT-OF	25			3
PART-OF			886			258
Total 			17288			3765

 

For further information, please visit https://biocreative.bioinformatics.udel.edu/tasks/biocreative-vii/track-1/ or email us at krallinger.martin@gmail.com and antoniomiresc@gmail.com

 

Related resources:


	Web
	Evaluation library
	Relation annotation guidelines
	Gene and protein annotation guidelines
	Chemicals and drugs annotation guidelines
	FAQ
	DrugProt Large Scale Additional SubTrack
</dc:description>
  <dc:identifier>https://zenodo.org/record/5119892</dc:identifier>
  <dc:identifier>10.5281/zenodo.5119892</dc:identifier>
  <dc:identifier>oai:zenodo.org:5119892</dc:identifier>
  <dc:language>eng</dc:language>
  <dc:relation>doi:10.5281/zenodo.4955410</dc:relation>
  <dc:relation>url:https://zenodo.org/communities/medicalnlp</dc:relation>
  <dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
  <dc:rights>https://creativecommons.org/licenses/by/4.0/legalcode</dc:rights>
  <dc:subject>NLP</dc:subject>
  <dc:subject>relation extraction</dc:subject>
  <dc:subject>NER</dc:subject>
  <dc:subject>biomedical NLP</dc:subject>
  <dc:subject>biocreative</dc:subject>
  <dc:title>DrugProt corpus: Biocreative VII Track 1 - Text mining drug and chemical-protein interactions</dc:title>
  <dc:type>info:eu-repo/semantics/other</dc:type>
  <dc:type>dataset</dc:type>
</oai_dc:dc>