Journal article Open Access

DrugProt Large-Scale Text Mining corpus: Biocreative VII Track 1 - Text mining drug and chemical-protein interactions

Miranda-Escalada, Antonio; Jouni Luoma; Farrokh Mehryary; Sampo Pyysalo; Krallinger, Martin

This Zenodo record contains the abstracts and entity annotations for the BioCreative VII Large-Scale DrugProt Additional Subtrack.

 

Abstracts

  • large_scale_abstracts.tsv. This file contains plain-text, UTF-8-encoded, NFC-normalized DrugProt PubMed records in a tab-separated format. In total, 2366081 records are provided; each line in the file contains a single PMID, title, and abstract separated by tabs. Due to PubMed inconsistencies, a small percentage of records are duplicated: we have identified 222 records with different PMIDs but the same title and abstract body.
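The three-column layout and the duplicate records described above can be handled with a short script. The following is a minimal sketch in Python, not part of the dataset's tooling: the function name `find_duplicate_records` is hypothetical, and the code assumes the file has no header line and exactly three tab-separated fields per line. It hashes each title and abstract pair so that the full text of all records does not need to stay in memory.

```python
import hashlib
from collections import defaultdict


def find_duplicate_records(lines):
    """Return lists of PMIDs that share an identical title and abstract.

    `lines` is any iterable of text lines in the assumed layout:
    PMID <tab> title <tab> abstract (no header line).
    """
    groups = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip lines that do not match the assumed layout
        pmid, title, abstract = fields
        # Hash title + abstract so only a short digest is kept per record.
        key = hashlib.sha256((title + "\t" + abstract).encode("utf-8")).digest()
        groups[key].append(pmid)
    return [pmids for pmids in groups.values() if len(pmids) > 1]


# Invented three-line sample standing in for large_scale_abstracts.tsv.
sample = [
    "1\tTitle A\tAbstract A\n",
    "2\tTitle B\tAbstract B\n",
    "3\tTitle A\tAbstract A\n",
]
print(find_duplicate_records(sample))  # [['1', '3']]

# With the real file (path is an assumption about where it was unpacked):
# with open("large_scale_abstracts.tsv", encoding="utf-8") as handle:
#     duplicate_groups = find_duplicate_records(handle)
```

Whether to keep one PMID per group or drop the duplicated records altogether depends on the downstream task; the sketch only reports the groups.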

 

Entity mention annotations

  • large_scale_entities.tsv. This file contains the automatically labeled mention annotations of chemical compounds and genes/proteins (so-called gene and protein-related objects, as defined during BioCreative V) generated for the large-scale records. There are 53993602 entity annotations.

 

Related resources:

Files (1.9 GB)
Name Size
large-scale-drugprot.zip 1.9 GB
md5:ffafa66950258ef831817b26a286e79c
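
The md5 checksum listed above can be used to confirm that the archive downloaded intact. The following is a minimal sketch in Python using only the standard library; the helper names `md5_of_file` and `verify_download` are hypothetical, and the local path is an assumption about where the file was saved.

```python
import hashlib

# Checksum as listed for large-scale-drugprot.zip on this record.
EXPECTED_MD5 = "ffafa66950258ef831817b26a286e79c"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected md5 digest."""
    return md5_of_file(path) == expected


# Example call (assumes the archive sits in the current directory):
# print(verify_download("large-scale-drugprot.zip"))
```

Reading in chunks keeps memory use flat, which matters for a 1.9 GB archive.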