Published March 19, 2024 | Version v1

Archive of PubTator 3.0 source code and trained models

  • 1. National Center for Biotechnology Information
  • 2. Montreal Neurological Institute and Hospital
  • 3. Dalian University of Technology
  • 4. Yale School of Medicine

Description

PubTator 3.0 is a web-based system that uses advanced natural language processing and AI methods to help researchers explore the biomedical literature. The online interface is at https://www.ncbi.nlm.nih.gov/research/pubtator3/, the API is at https://www.ncbi.nlm.nih.gov/research/pubtator3/api, and bulk data downloads are at https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator3/.

This repository archives the source code and models used by PubTator 3.0 to annotate PubMed and PMC articles, as of its initial release in early 2024.

The AIONER named entity recognizer annotates genes/proteins, variants, chemicals, diseases, species, and cell lines. GNorm2 normalizes gene mentions to NCBI Gene identifiers and species mentions to NCBI Taxonomy. tmVar3 normalizes genetic variants; it uses dbSNP identifiers for variants listed in dbSNP and HGVS format otherwise. The NLM-Chem tagger normalizes chemicals to MeSH identifiers. TaggerOne normalizes diseases to MeSH and cell lines to Cellosaurus.
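The division of labor described above can be summarized as a small lookup table. This is only an illustrative sketch and is not taken from the archived code: the names NORMALIZERS and describe are hypothetical, and the entries simply restate the paragraph above.

```python
# Hypothetical summary table; the names here are illustrative and do not
# come from the archived PubTator 3.0 code.
# Maps each entity type to (normalization tool, target vocabulary).
NORMALIZERS = {
    "gene": ("GNorm2", "NCBI Gene"),
    "species": ("GNorm2", "NCBI Taxonomy"),
    "variant": ("tmVar3", "dbSNP, or HGVS format when not in dbSNP"),
    "chemical": ("NLM-Chem tagger", "MeSH"),
    "disease": ("TaggerOne", "MeSH"),
    "cell line": ("TaggerOne", "Cellosaurus"),
}


def describe(entity_type: str) -> str:
    """Return a one-line summary of how an entity type is handled."""
    tool, target = NORMALIZERS[entity_type]
    return f"{entity_type}: recognized by AIONER, normalized by {tool} to {target}"


if __name__ == "__main__":
    for entity_type in NORMALIZERS:
        print(describe(entity_type))
```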

The BioREx relation extraction system simultaneously identifies 12 types of relations across eight entity type pairs: chemical-chemical, chemical-disease, chemical-gene, chemical-variant, disease-gene, disease-variant, gene-gene, and variant-variant.
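As a minimal sketch of how the eight entity type pairs might be represented when filtering candidate relations, the snippet below stores them as unordered pairs. Treating the pairs as unordered is an assumption made for illustration, and SUPPORTED_PAIRS and is_supported are hypothetical names rather than part of BioREx.

```python
# Hypothetical representation of the entity type pairs listed above.
# Each pair is stored as a frozenset so that lookup ignores order
# (an assumption for this sketch, not a statement about BioREx internals).
SUPPORTED_PAIRS = {
    frozenset(pair)
    for pair in [
        ("chemical", "chemical"),
        ("chemical", "disease"),
        ("chemical", "gene"),
        ("chemical", "variant"),
        ("disease", "gene"),
        ("disease", "variant"),
        ("gene", "gene"),
        ("variant", "variant"),
    ]
}


def is_supported(type_a: str, type_b: str) -> bool:
    """Return True if the two entity types form one of the listed pairs."""
    return frozenset((type_a, type_b)) in SUPPORTED_PAIRS


if __name__ == "__main__":
    print(is_supported("gene", "disease"))
    print(is_supported("species", "gene"))
```

Under this representation, a pair such as gene and disease is accepted in either order, while a pair that is not listed, such as species and gene, is rejected.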

The source code and trained model for each tool are archived in a separate file, each containing instructions. The specific models used in PubTator 3.0 are as follows:

  • AIONER: AIONER_trained_models/AIONER/Bioformer-Softmax-BEST-AIO_tmvar3.20230416.h5
  • GNorm2 (gene names): gnorm_trained_models/geneNER/GeneNER-Bioformer-AllSARS.h5
  • GNorm2 (species names): gnorm_trained_models/SpeAss/SpeAss-Bioformer-SG-Allset.h5
  • BioREx: pubtator_rel/pre_biorex_model_biolinkbert

tmVar3 uses a large dictionary, available at https://ftp.ncbi.nlm.nih.gov/pub/lu/tmVar3/tmVar3.Database.tar.gz.

This repository also archives the instructions for augmenting ChatGPT with PubTator APIs, as of early 2024. These instructions are found in the file PubTator-ChatGPT.txt.gz

DISCLAIMER
Although all reasonable efforts have been taken to ensure the accuracy and reliability of the software and data, the NLM and the U.S. Government do not and cannot warrant the performance or results that may be obtained by using this software or data. The NLM and the U.S. Government disclaim all warranties, express or implied, including warranties of performance, merchantability or fitness for any particular purpose.

These tools are the result of research conducted in the Computational Biology Branch, NCBI. The information produced is not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not change their health behavior solely on the basis of information produced by these tools. NIH does not independently verify the validity or utility of the information produced by these tools. If you have questions about the information produced by these tools, please see a health care professional.

LICENSE
This software/database is a "United States Government Work" under the terms of the United States Copyright Act.  It was written as part of the author's official duties as a United States Government employee and thus cannot be copyrighted.  This software/database is freely available to the public for use. The National Library of Medicine and the U.S. Government have not placed any restriction on its use or reproduction.

FUNDING
This research was supported by the Intramural Research Program of the National Library of Medicine (NLM), National Institutes of Health.

Files

Files (12.8 GB)

MD5 checksum                           Size
md5:3b5e922c2124896dfc1656caebb0fdcb   5.9 GB
md5:714fd6f37a21af5269a1ceb750efabd7   1.1 GB
md5:151b9d89f5fe4948637abf7e9695b64c   4.6 GB
md5:e687bfc27848cbd1d650da85ef226897   714.4 MB
md5:60fcc5caa2bf7c883f8ad2be19db3201   2.9 kB
md5:c05bab324cb1eb255e8a16988018ccf3   217.0 MB
md5:73e867cac75e53368f96e23655615106   368.6 MB
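
Because several of the archives are multiple gigabytes, it can be worth checking a download against its listed MD5 value before unpacking it. The following is a minimal sketch using Python's standard hashlib module; the function names and the local file name in the usage note are placeholders, not part of the archive.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that multi-gigabyte archives do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path: str, listed: str) -> bool:
    """Compare a file against a listed checksum such as 'md5:3b5e...'.
    The optional 'md5:' prefix is dropped before comparing."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listing("downloaded_archive.tar.gz", "md5:3b5e922c2124896dfc1656caebb0fdcb")` would return True only if the local file (the name here is a placeholder) has that checksum.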

Additional details