NLM-Gene, a richly annotated gold standard dataset for gene entities that addresses ambiguity and multi-species gene recognition
Description
The automatic recognition of gene names and their corresponding database identifiers in biomedical text is an important first step for many downstream text-mining applications. The NLM-Gene corpus is a high-quality, manually annotated corpus for genes. It covers ambiguous gene names, averages 29 gene mentions (10 unique identifiers) per article, and represents a broader range of species (including Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Arabidopsis thaliana, and Danio rerio) than previous gene annotation corpora. NLM-Gene consists of 550 PubMed articles from 156 biomedical journals, each doubly annotated by two of six experienced NLM indexers, who were randomly paired for each article to control for bias. The annotators worked through three annotation rounds until they reached complete agreement. Using the new resource, we developed a new gene-finding algorithm based on deep learning that improved on both the precision and the recall of existing tools. The NLM-Gene annotated corpus is freely available at Dryad and at https://www.ncbi.nlm.nih.gov/research/bionlp/. The results of applying this tool to all of PubMed/PMC are freely accessible through our web-based tool PubTator.
Files
NLM-Gene-Corpus.zip
Additional details
Related works
- Is cited by: 10.1016/j.jbi.2021.103779 (DOI)