DrugProt Large-Scale Text Mining corpus: Biocreative VII Track 1 - Text mining drug and chemical-protein interactions

Miranda-Escalada, Antonio; Jouni Luoma; Farrokh Mehryary; Sampo Pyysalo; Krallinger, Martin

doi:10.5281/zenodo.5119879

Published July 21, 2021 | Version 1.0

Journal article Open

1. Barcelona Supercomputing Center
2. University of Turku

This Zenodo record contains the BioCreative VII Large Scale DrugProt Additional Subtrack abstracts and entity annotations.

Please cite if you use any DrugProt resource:

Antonio Miranda-Escalada, Farrokh Mehryary, Jouni Luoma, Darryl Estrada-Zavala, Luis Gasco, Sampo Pyysalo, Alfonso Valencia, Martin Krallinger, Overview of DrugProt task at BioCreative VII: data and methods for large-scale text mining and knowledge graph generation of heterogenous chemical–protein relations, Database, Volume 2023, 2023, baad080

@article{miranda2023overview, title={Overview of DrugProt task at BioCreative VII: data and methods for large-scale text mining and knowledge graph generation of heterogenous chemical--protein relations}, author={Miranda-Escalada, Antonio and Mehryary, Farrokh and Luoma, Jouni and Estrada-Zavala, Darryl and Gasco, Luis and Pyysalo, Sampo and Valencia, Alfonso and Krallinger, Martin}, journal={Database}, volume={2023}, pages={baad080}, year={2023}, publisher={Oxford University Press UK} }

Miranda, Antonio, et al. "Overview of DrugProt BioCreative VII track: quality evaluation and large scale text mining of drug-gene/protein relations." Proceedings of the seventh BioCreative challenge evaluation workshop. 2021.

@inproceedings{miranda2021overview, title={Overview of DrugProt BioCreative VII track: quality evaluation and large scale text mining of drug-gene/protein relations}, author={Miranda, Antonio and Mehryary, Farrokh and Luoma, Jouni and Pyysalo, Sampo and Valencia, Alfonso and Krallinger, Martin}, booktitle={Proceedings of the seventh BioCreative challenge evaluation workshop}, year={2021} }

Abstracts

large_scale_abstracts.tsv. This file contains plain-text, UTF-8 encoded, NFC-normalized DrugProt PubMed records in tab-separated format. In total, 2366081 records are provided; each line in the file contains a single PMID, title and abstract separated by tabs. Due to PubMed inconsistencies, a small percentage of records are duplicated: we have identified 222 records with different PMIDs but the same abstract title and body.
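The layout described above can be read and checked for duplicates with a few lines of Python. This is a minimal sketch, not the organizers' own tooling: it assumes the columns appear in the order PMID, title, abstract, that the file has no header row, and that lines with any other number of fields can be skipped. The function names `iter_records` and `find_duplicate_groups` are hypothetical.

```python
import hashlib
import io
from collections import defaultdict


def iter_records(handle):
    """Yield (pmid, title, abstract) from a tab-separated text stream.

    Assumption: three columns per line, no header row.
    """
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip lines that do not match the assumed layout
        yield fields[0], fields[1], fields[2]


def find_duplicate_groups(records):
    """Group PMIDs whose title and abstract text are identical."""
    groups = defaultdict(list)
    for pmid, title, abstract in records:
        # Hash the text so that only short digests are kept in memory.
        key = hashlib.sha256((title + "\t" + abstract).encode("utf-8")).hexdigest()
        groups[key].append(pmid)
    return [pmids for pmids in groups.values() if len(pmids) > 1]


# Tiny in-memory sample standing in for the real file.
sample = io.StringIO(
    "1\tTitle A\tBody A\n"
    "2\tTitle B\tBody B\n"
    "3\tTitle A\tBody A\n"
)
duplicates = find_duplicate_groups(iter_records(sample))
print(duplicates)  # [['1', '3']]
```

For the real file, the sample stream would be replaced by something like `open("large_scale_abstracts.tsv", encoding="utf-8")`, where the local path is an assumption about where the archive was extracted.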

Entity mention annotations

large_scale_entities.tsv. This file contains the automatically labeled mention annotations of chemical compounds and genes/proteins (so-called gene and protein-related objects as defined during BioCreative V) generated for the Large Scale records. There are 53993602 entity annotations.

Files (1.9 GB)

Name	Size
large-scale-drugprot.zip md5:ffafa66950258ef831817b26a286e79c	1.9 GB
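The MD5 checksum listed for the archive can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_md5` are hypothetical, and the local file path is whatever location the archive was saved to.

```python
import hashlib

# Checksum as listed for large-scale-drugprot.zip on this record.
EXPECTED_MD5 = "ffafa66950258ef831817b26a286e79c"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

Calling `matches_md5("large-scale-drugprot.zip")` after the download should return `True` for an intact copy, assuming the file was saved under that name in the working directory.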
