Published July 21, 2021 | Version 1.0
Journal article Open

DrugProt Large-Scale Text Mining corpus: Biocreative VII Track 1 - Text mining drug and chemical-protein interactions

  • 1. Barcelona Supercomputing Center
  • 2. University of Turku

Description

This Zenodo record contains the abstracts and entity annotations for the BioCreative VII DrugProt Large Scale additional subtrack.

Please cite if you use any DrugProt resource:

Antonio Miranda-Escalada, Farrokh Mehryary, Jouni Luoma, Darryl Estrada-Zavala, Luis Gasco, Sampo Pyysalo, Alfonso Valencia, Martin Krallinger, Overview of DrugProt task at BioCreative VII: data and methods for large-scale text mining and knowledge graph generation of heterogenous chemical–protein relations, Database, Volume 2023, 2023, baad080

@article{miranda2023overview,  title={Overview of DrugProt task at BioCreative VII: data and methods for large-scale text mining and knowledge graph generation of heterogenous chemical--protein relations},  author={Miranda-Escalada, Antonio and Mehryary, Farrokh and Luoma, Jouni and Estrada-Zavala, Darryl and Gasco, Luis and Pyysalo, Sampo and Valencia, Alfonso and Krallinger, Martin},  journal={Database},  volume={2023},  pages={baad080},  year={2023},  publisher={Oxford University Press UK} }

Miranda, Antonio, et al. "Overview of DrugProt BioCreative VII track: quality evaluation and large scale text mining of drug-gene/protein relations." Proceedings of the seventh BioCreative challenge evaluation workshop. 2021.

@inproceedings{miranda2021overview,  title={Overview of DrugProt BioCreative VII track: quality evaluation and large scale text mining of drug-gene/protein relations},  author={Miranda, Antonio and Mehryary, Farrokh and Luoma, Jouni and Pyysalo, Sampo and Valencia, Alfonso and Krallinger, Martin},  booktitle={Proceedings of the seventh BioCreative challenge evaluation workshop},  year={2021} }

Abstracts

  • large_scale_abstracts.tsv: This file contains the DrugProt PubMed records as plain text, UTF-8 encoded and NFC normalized, in a tab-separated format. In total, 2366081 records are provided; each line in the file contains a single PMID, title, and abstract, separated by tabs. Due to PubMed inconsistencies, a small percentage of records are duplicated: we have identified 222 records with different PMIDs but the same abstract title and body.
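The record layout described above (one PMID, title, and abstract per line, separated by tabs) can be read in a streaming fashion, and the duplicated records can be located by grouping PMIDs on a digest of title plus abstract. The following is a minimal sketch, not the organizers' own procedure: the function names `iter_records` and `find_duplicate_texts` are hypothetical, and it assumes the file has no header row and that titles and abstracts contain no embedded tabs or newlines.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def iter_records(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (pmid, title, abstract) tuples from tab-separated lines.

    Assumption: exactly three tab-separated fields per line and no header.
    Lines that do not match this layout are skipped.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # malformed line under the assumed layout
        yield parts[0], parts[1], parts[2]


def find_duplicate_texts(
    records: Iterable[Tuple[str, str, str]]
) -> Dict[str, List[str]]:
    """Group PMIDs by a digest of title + abstract.

    Returns only the groups in which more than one distinct PMID shares
    the same title and abstract. Storing digests instead of full texts
    keeps memory use modest for millions of records.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for pmid, title, abstract in records:
        key = hashlib.sha1((title + "\t" + abstract).encode("utf-8")).hexdigest()
        groups[key].append(pmid)
    return {k: v for k, v in groups.items() if len(set(v)) > 1}
```

To run it on the corpus, open the file with `open("large_scale_abstracts.tsv", encoding="utf-8")` and pass the file object to `iter_records`, then pass the resulting iterator to `find_duplicate_texts`. Whether the number of PMIDs flagged this way matches the 222 duplicated records reported above depends on how those were counted.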

Entity mention annotations

  • large_scale_entities.tsv: This file contains the automatically labeled mention annotations of chemical compounds and genes/proteins (so-called gene and protein-related objects, as defined during BioCreative V) generated for the Large Scale records. There are 53993602 entity annotations.

Files

large-scale-drugprot.zip (1.9 GB)
md5:ffafa66950258ef831817b26a286e79c
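Since the archive is large, it can be worth checking the download against the MD5 checksum listed for it. The sketch below uses Python's standard `hashlib` and reads the file in chunks so that the 1.9 GB archive is never held in memory at once; the helper name `md5_of_file` and the chunk size are illustrative choices, not part of the record.

```python
import hashlib

# Checksum as listed for large-scale-drugprot.zip in this record.
EXPECTED_MD5 = "ffafa66950258ef831817b26a286e79c"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of the file at `path`.

    The file is read in `chunk_size`-byte blocks (1 MiB by default).
    """
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

After downloading, `md5_of_file("large-scale-drugprot.zip") == EXPECTED_MD5` should evaluate to `True` for an intact copy; a mismatch suggests a truncated or corrupted download.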