Published July 9, 2021 | Version v1
Dataset Open

NLM-Gene, a richly annotated gold standard dataset for gene entities that addresses ambiguity and multi-species gene recognition

  • 1. United States National Library of Medicine

Description

The automatic recognition of gene names and their corresponding database identifiers in biomedical text is an important first step for many downstream text-mining applications. The NLM-Gene corpus is a high-quality, manually annotated corpus for genes. It covers ambiguous gene names, with an average of 29 gene mentions (10 unique identifiers) per article, and represents a broader range of species (including Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Arabidopsis thaliana, Danio rerio, etc.) than previous gene annotation corpora. NLM-Gene consists of 550 PubMed articles from 156 biomedical journals, doubly annotated by six experienced NLM indexers who were randomly paired for each article to control for bias. The annotators worked in three annotation rounds until they reached complete agreement. Using the new resource, we developed a new gene-finding algorithm based on deep learning that improved on both the precision and the recall of existing tools. The NLM-Gene annotated corpus is freely available at Dryad and at https://www.ncbi.nlm.nih.gov/research/bionlp/. The gene-finding results of applying this tool to the entire PubMed/PMC are freely accessible through our web-based tool PubTator.

Notes

NLM-Gene consists of 550 PubMed articles from 156 journals and contains more than 15 thousand unique gene names, corresponding to more than five thousand gene identifiers (NCBI Gene taxonomy). The corpus contains gene annotation data from 28 organisms. On average, each annotated article contains 29 gene names and 10 gene identifiers. These characteristics make this article set an important benchmark dataset for testing the accuracy of gene recognition algorithms on both multi-species and ambiguous data. The NLM-Gene corpus will be invaluable for advancing text-mining techniques for gene identification tasks in biomedical text.
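The per-article averages quoted above (gene names and gene identifiers per article) can be recomputed from any flat list of annotations. The following is a minimal sketch only: it assumes the annotations have already been loaded into `(article_id, mention_text, gene_id)` tuples, which is a hypothetical in-memory layout and not the corpus file format, and the function name `per_article_averages` and the toy values are illustrative.

```python
from collections import defaultdict


def per_article_averages(annotations):
    """Return (average mentions, average unique identifiers) per article.

    `annotations` is an iterable of (article_id, mention_text, gene_id)
    tuples. This tuple layout is an assumption made for illustration.
    """
    mention_counts = defaultdict(int)   # article_id -> number of mentions
    identifier_sets = defaultdict(set)  # article_id -> distinct gene ids

    for article_id, _mention_text, gene_id in annotations:
        mention_counts[article_id] += 1
        identifier_sets[article_id].add(gene_id)

    article_total = len(mention_counts)
    if article_total == 0:
        return 0.0, 0.0

    average_mentions = sum(mention_counts.values()) / article_total
    average_identifiers = (
        sum(len(ids) for ids in identifier_sets.values()) / article_total
    )
    return average_mentions, average_identifiers


# Toy data with placeholder ids, not taken from the corpus.
toy_annotations = [
    ("A1", "gene alpha", "G1"),
    ("A1", "alpha", "G1"),
    ("A1", "gene beta", "G2"),
    ("A2", "gene beta", "G2"),
]
toy_averages = per_article_averages(toy_annotations)
```

Counting mentions and distinct identifiers separately mirrors the distinction the description draws between gene mentions and unique identifiers per article.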

To achieve robust gene entity recognition that could translate to real-life applications, we upgraded the GNormPlus system with a deep learning component for named entity recognition and with several features that improve the accuracy of species recognition and the detection of false positive predictions. The new results are superior, and the system identifies genes in the NLM-Gene test dataset at a level close to human inter-annotator agreement. The upgraded system has been streamlined to process all PubMed articles in daily updates: https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/.

Funding provided by: U.S. National Library of Medicine
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000092

Files

NLM-Gene-Corpus.zip

Files (1.1 MB)

md5:366e6424f3d07a571105d9ca76cdaf07 (117.7 kB)
md5:e3edfb2c84c9e4572b5d285c5dc2f015 (952.2 kB)
md5:c61123961143e068780af54294a622e4 (900 Bytes)
md5:7de2ffb355476d474daee895c443df36 (4.0 kB)
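The MD5 checksums listed above can be used to confirm that a downloaded file arrived intact. The sketch below uses only Python's standard `hashlib`; the helper names `md5_of_file` and `matches_listing` are illustrative. Because the listing does not show which file name belongs to which checksum, the check only tests membership in the set of listed values.

```python
import hashlib

# Checksums copied from the file listing on this page. The listing does not
# pair them with file names, so they are kept as an unordered set.
LISTED_MD5 = {
    "366e6424f3d07a571105d9ca76cdaf07",
    "e3edfb2c84c9e4572b5d285c5dc2f015",
    "c61123961143e068780af54294a622e4",
    "7de2ffb355476d474daee895c443df36",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path):
    """Return True if the file's MD5 digest is one of the listed checksums."""
    return md5_of_file(path) in LISTED_MD5
```

For example, `matches_listing("NLM-Gene-Corpus.zip")` would return `True` only if a local file of that name hashes to one of the four values above; the local path is an assumption about where the download was saved.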

Additional details

Related works

Is cited by
10.1016/j.jbi.2021.103779 (DOI)