NLMChem, a new resource for chemical entity recognition in PubMed full-text literature

Islamaj, Rezarta; Leaman, Robert; Lu, Zhiyong

doi:10.5061/dryad.3tx95x6dz

Published March 22, 2021 | Version v1

Dataset Open

1. United States National Library of Medicine

Automatically identifying chemical and drug names in scientific publications advances information access for this important class of entities in a variety of biomedical disciplines by enabling improved retrieval and linkage to related concepts. While current methods for tagging chemical entities were developed for the article title and abstract, their performance in the full article text is substantially lower. However, the full text frequently contains more detailed chemical information, such as the properties of chemical compounds, their biological effects, and interactions with diseases, genes, and other chemicals.

We therefore present the NLM-Chem corpus, a full-text resource to support the development and evaluation of automated chemical entity taggers. The NLM-Chem corpus consists of 150 full-text articles, doubly annotated by ten expert NLM indexers, with ~5000 unique chemical name annotations mapped to ~2000 MeSH identifiers. Using this corpus, we built a substantially improved chemical entity tagger; its automated annotations for all of PubMed and PMC are freely accessible through the PubTator web-based interface and API.

Notes

We include the annotation guidelines document, which makes clear that the corpus can be combined with the ChemDNER corpus (chemical name annotations) and the BC5CDR corpus (chemical name and MeSH annotations) to further improve chemical NER in biomedical literature.

The corpus has been divided into train/dev/test to facilitate benchmarking and comparisons.

The data annotations are inline in the BioC XML format, which is a minimalistic approach to facilitate text mining. We also maintain a copy here: https://www.ncbi.nlm.nih.gov/research/bionlp/
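As a rough illustration of reading inline BioC XML annotations, the sketch below parses a small made-up document with Python's standard library. The sample text, the document id, and the infon keys `type` and `identifier` are assumptions for illustration; check the files in the corpus archive for the exact keys and values they use.

```python
import xml.etree.ElementTree as ET

# Hypothetical BioC fragment written for this example; it is NOT taken
# from the corpus. The infon keys "type" and "identifier" are assumed.
SAMPLE_BIOC = """<collection>
  <source>illustrative</source>
  <document>
    <id>0000000</id>
    <passage>
      <infon key="type">abstract</infon>
      <offset>0</offset>
      <text>Aspirin inhibits cyclooxygenase.</text>
      <annotation id="1">
        <infon key="type">Chemical</infon>
        <infon key="identifier">MESH:D001241</infon>
        <location offset="0" length="7"/>
        <text>Aspirin</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def iter_annotations(xml_text):
    """Yield one dict per annotation found in a BioC XML string."""
    root = ET.fromstring(xml_text)
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for passage in document.iter("passage"):
            for ann in passage.findall("annotation"):
                # Collect the key/value metadata attached to the annotation.
                infons = {i.get("key"): i.text for i in ann.findall("infon")}
                location = ann.find("location")
                yield {
                    "doc_id": doc_id,
                    "text": ann.findtext("text"),
                    "type": infons.get("type"),
                    "identifier": infons.get("identifier"),
                    "offset": int(location.get("offset")),
                    "length": int(location.get("length")),
                }


annotations = list(iter_annotations(SAMPLE_BIOC))
for ann in annotations:
    print(ann["doc_id"], ann["offset"], ann["length"], ann["text"], ann["identifier"])
```

To process a real file, read its contents and pass them to `iter_annotations`, or swap `ET.fromstring` for `ET.parse(path).getroot()`.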

Funding provided by: U.S. National Library of Medicine
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000092
Award Number: Intramural Research Program

Files

Files (2.6 MB)

Name	Size	MD5
NLM-Chem-Annotation-Guidelines.docx	42.2 kB	ed69f8c5c543f60cbc695bffd6b76766
NLM-Chem-corpus.zip	2.5 MB	5400a07d69b211a02c9f026f5eb54ebe
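
The MD5 checksums listed for the files can be used to check that a download is intact. The following is a minimal sketch using Python's `hashlib`; the helper names `md5_of_file` and `verify_download` are hypothetical, and the sketch assumes the files were saved locally under the names shown above.

```python
import hashlib

# Checksums copied from the file listing on this page.
EXPECTED_MD5 = {
    "NLM-Chem-Annotation-Guidelines.docx": "ed69f8c5c543f60cbc695bffd6b76766",
    "NLM-Chem-corpus.zip": "5400a07d69b211a02c9f026f5eb54ebe",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file at `path` matches the expected MD5 digest."""
    return md5_of_file(path) == expected_md5.lower()
```

For example, `verify_download("NLM-Chem-corpus.zip", EXPECTED_MD5["NLM-Chem-corpus.zip"])` should return `True` for an uncorrupted copy of the archive.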

Additional details

Is cited by: https://academic.oup.com/nar/article/48/W1/W5/5834578 (URL)