NLM-Gene, a richly annotated gold standard dataset for gene entities that addresses ambiguity and multi-species gene recognition

Islamaj, Rezarta; Lu, Zhiyong

doi:10.5061/dryad.dv41ns1wt

Published July 9, 2021 | Version v1

Dataset Open

1. United States National Library of Medicine

The automatic recognition of gene names and their corresponding database identifiers in biomedical text is an important first step for many downstream text-mining applications. The NLM-Gene corpus is a high-quality, manually annotated corpus for genes. It covers ambiguous gene names, with an average of 29 gene mentions (10 unique identifiers) per article, and represents a broader range of species (including Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Arabidopsis thaliana, Danio rerio, etc.) than previous gene annotation corpora. NLM-Gene consists of 550 PubMed articles from 156 biomedical journals, doubly annotated by six experienced NLM indexers who were randomly paired for each article to control for bias. The annotators worked in three annotation rounds until they reached complete agreement. Using the new resource, we developed a new gene finding algorithm based on deep learning that improved on both the precision and the recall of existing tools. The NLM-Gene annotated corpus is freely available at Dryad and at https://www.ncbi.nlm.nih.gov/research/bionlp/. The gene finding results of applying this tool to the entire PubMed/PMC are freely accessible through our web-based tool PubTator.

Notes

NLM-Gene consists of 550 PubMed articles from 156 journals and contains more than 15 thousand unique gene names, corresponding to more than five thousand gene identifiers (NCBI Gene taxonomy). The corpus contains gene annotation data from 28 organisms. The annotated articles contain, on average, 29 gene names and 10 gene identifiers per article. These characteristics make this article set an important benchmark dataset for testing the accuracy of gene recognition algorithms on both multi-species and ambiguous data. The NLM-Gene corpus will be invaluable for advancing text-mining techniques for gene identification tasks in biomedical text.

To achieve robust gene entity recognition that translates to real-life applications, we upgraded the GNormPlus system with a deep learning component for named entity recognition and with several features that improve the accuracy of species recognition and the detection of false positive predictions. The upgraded system gives superior results and identifies genes in the NLM-Gene test dataset at a level close to human inter-annotator agreement. The system has been streamlined to process all PubMed articles in daily updates: https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/.

Funding provided by: U.S. National Library of Medicine
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000092

Files

Files (1.1 MB)

Name	MD5	Size
NLM-Gene-Annotation-Guidelines.docx	366e6424f3d07a571105d9ca76cdaf07	117.7 kB
NLM-Gene-Corpus.zip	e3edfb2c84c9e4572b5d285c5dc2f015	952.2 kB
Pmidlist.Test.txt	c61123961143e068780af54294a622e4	900 Bytes
Pmidlist.Train.txt	7de2ffb355476d474daee895c443df36	4.0 kB
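
The MD5 digests listed with each file can be used to check that a download is intact. The following is a minimal sketch in Python using only the standard library; it assumes the four files were saved under their listed names in one local directory. The function names `md5_of_file` and `verify_downloads` are illustrative and are not part of the dataset or of any NLM tool.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the file listing on this page.
EXPECTED_MD5 = {
    "NLM-Gene-Annotation-Guidelines.docx": "366e6424f3d07a571105d9ca76cdaf07",
    "NLM-Gene-Corpus.zip": "e3edfb2c84c9e4572b5d285c5dc2f015",
    "Pmidlist.Test.txt": "c61123961143e068780af54294a622e4",
    "Pmidlist.Train.txt": "7de2ffb355476d474daee895c443df36",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each listed file name to True (match), False (mismatch),
    or None (file not found in the assumed download directory)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = md5_of_file(path) == expected if path.is_file() else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "missing"}
    for name, ok in verify_downloads().items():
        print(f"{name}: {labels[ok]}")
```

A mismatch usually indicates a truncated or corrupted download, in which case fetching the file again is the simplest remedy.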

Additional details

Is cited by: 10.1016/j.jbi.2021.103779 (DOI)

	All versions	This version
Views	663	659
Downloads	721	717
Data volume	662.2 MB	658.3 MB
