Dataset Open Access

NLMChem a new resource for chemical entity recognition in PubMed full-text literature

Islamaj, Rezarta; Leaman, Robert; Lu, Zhiyong


JSON Export

{
  "files": [
    {
      "links": {
        "self": "https://zenodo.org/api/files/9f34778c-83d5-4936-bba1-3f59cb582314/NLM-Chem-Annotation-Guidelines.docx"
      }, 
      "checksum": "md5:ed69f8c5c543f60cbc695bffd6b76766", 
      "bucket": "9f34778c-83d5-4936-bba1-3f59cb582314", 
      "key": "NLM-Chem-Annotation-Guidelines.docx", 
      "type": "docx", 
      "size": 42154
    }, 
    {
      "links": {
        "self": "https://zenodo.org/api/files/9f34778c-83d5-4936-bba1-3f59cb582314/NLM-Chem-corpus.zip"
      }, 
      "checksum": "md5:5400a07d69b211a02c9f026f5eb54ebe", 
      "bucket": "9f34778c-83d5-4936-bba1-3f59cb582314", 
      "key": "NLM-Chem-corpus.zip", 
      "type": "zip", 
      "size": 2533277
    }
  ], 
  "owners": [
    90070
  ], 
  "doi": "10.5061/dryad.3tx95x6dz", 
  "stats": {
    "version_unique_downloads": 7.0, 
    "unique_views": 31.0, 
    "views": 34.0, 
    "version_views": 34.0, 
    "unique_downloads": 7.0, 
    "version_unique_views": 31.0, 
    "volume": 12835001.0, 
    "version_downloads": 9.0, 
    "downloads": 9.0, 
    "version_volume": 12835001.0
  }, 
  "links": {
    "doi": "https://doi.org/10.5061/dryad.3tx95x6dz", 
    "latest_html": "https://zenodo.org/record/4628233", 
    "bucket": "https://zenodo.org/api/files/9f34778c-83d5-4936-bba1-3f59cb582314", 
    "badge": "https://zenodo.org/badge/doi/10.5061/dryad.3tx95x6dz.svg", 
    "html": "https://zenodo.org/record/4628233", 
    "latest": "https://zenodo.org/api/records/4628233"
  }, 
  "created": "2021-03-22T19:48:16.453720+00:00", 
  "updated": "2021-03-23T12:27:23.939150+00:00", 
  "conceptrecid": "4628232", 
  "revision": 2, 
  "id": 4628233, 
  "metadata": {
    "access_right_category": "success", 
    "doi": "10.5061/dryad.3tx95x6dz", 
    "description": "<p>Automatically identifying chemical and drug names in scientific publications advances information access for this important class of entities in a variety of biomedical disciplines by enabling improved retrieval and linkage to related concepts. While current methods for tagging chemical entities were developed for the article title and abstract, their performance in the full article text is substantially lower. However, the full text frequently contains more detailed chemical information, such as the properties of chemical compounds, their biological effects, and interactions with diseases, genes, and other chemicals.\u00a0</p>\n\n<p>We, therefore, present the NLM-Chem corpus, a full-text resource to support the development and evaluation of automated chemical entity taggers. The NLM-Chem corpus consists of 150 full-text articles, doubly annotated by ten expert NLM indexers, with ~5000 unique chemical name annotations, mapped to ~2000 MeSH identifiers. Using this corpus, we built\u00a0a substantially improved chemical entity tagger, with automated annotations for all of PubMed and PMC freely accessible through the PubTator web-based interface and API.\u00a0</p>", 
    "license": {
      "id": "CC0-1.0"
    }, 
    "title": "NLMChem a new resource for chemical entity recognition in PubMed full-text literature", 
    "notes": "<p>We include the document of annotation guidelines, which makes it clear that the corpus can be combined with ChemDNER and BC5CDR corpora, which contain chemical name annotations, and name and MeSH annotations for chemicals respectively to further improve Chemical NER in biomedical literature.\u00a0</p>\n\n<p>The corpus has been divided into\u00a0train/dev/test to facilitate benchmarking and comparisons.\u00a0</p>\n\n<p>The data annotations are inline in the BioC XML format, which is a minimalistic approach to facilitate text mining.\u00a0We also maintain a copy here:\u00a0<a href=\"https://www.ncbi.nlm.nih.gov/research/bionlp/\">https://www.ncbi.nlm.nih.gov/research/bionlp/</a></p>\n<p>Funding provided by: U.S. National Library of Medicine<br>Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000092<br>Award Number: Intramural Research Program</p>", 
    "relations": {
      "version": [
        {
          "count": 1, 
          "index": 0, 
          "parent": {
            "pid_type": "recid", 
            "pid_value": "4628232"
          }, 
          "is_last": true, 
          "last_child": {
            "pid_type": "recid", 
            "pid_value": "4628233"
          }
        }
      ]
    }, 
    "access_right": "open", 
    "communities": [
      {
        "id": "dryad"
      }
    ], 
    "keywords": [
      "Natural Language Processing", 
      "chemical named entity recognition", 
      "named entity recognition", 
      "Text mining", 
      "literature mining", 
      "chemical informatics", 
      "cheminformatics", 
      "bioinformatics"
    ], 
    "publication_date": "2021-03-22", 
    "creators": [
      {
        "orcid": "0000-0001-5651-1860", 
        "affiliation": "United States National Library of Medicine", 
        "name": "Islamaj, Rezarta"
      }, 
      {
        "orcid": "0000-0003-3296-5766", 
        "affiliation": "United States National Library of Medicine", 
        "name": "Leaman, Robert"
      }, 
      {
        "affiliation": "United States National Library of Medicine", 
        "name": "Lu, Zhiyong"
      }
    ], 
    "method": "<p>NLM-Chem\u00a0corpus consists of 150 full-text articles from the PubMed Central Open Access dataset, comprising 67 different chemical journals, aiming to cover a general distribution of usage of chemical names in the biomedical literature. Articles were selected so that human annotation was most valuable (meaning that they were rich in bio-entities, and current state-of-the-art named entity recognition systems disagreed on bio-entity recognition.\u00a0</p>\n\n<p>Ten indexing experts at the National Library of Medicine manually annotated the corpus using the TeamTat annotation system that allows swift annotation project management. The corpus was annotated in three batches and each batch of articles was annotated in three annotation rounds. Annotators were randomly paired for each article, and pairings were randomly shuffled for\u00a0each subsequent batch. In this manner, the workload was distributed fairly. To control for bias, annotator identities were hidden the first two annotation rounds. In the final annotation rounds, annotators worked collaboratively to resolve the final few annotation disagreements and reach a 100% consensus.\u00a0</p>\n\n<p>The full-text articles were fully annotated for all chemical name occurrences in text, and the chemicals were mapped to Medical Subject Heading (MeSH) entries to facilitate indexing and other downstream article processing tasks at the National Library of Medicine. MeSH is part of the UMLS and as such, chemical entities can be\u00a0mapped to other standard vocabularies.\u00a0</p>\n\n<p>The data has been evaluated for high annotation quality, and its use as training data has already improved chemical named entity recognition in PubMed.\u00a0The newly improved system has already been\u00a0incorporated in the PubTator API tools (https://www.ncbi.nlm.nih.gov/research/pubtator/api.html).</p>", 
    "resource_type": {
      "type": "dataset", 
      "title": "Dataset"
    }, 
    "related_identifiers": [
      {
        "scheme": "url", 
        "identifier": "https://academic.oup.com/nar/article/48/W1/W5/5834578", 
        "relation": "cites"
      }
    ]
  }
}
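
Each entry under "files" in the export above carries a "key", a "size" in bytes, and a "checksum" of the form "md5:<hex digest>". A minimal sketch, assuming the two files have already been downloaded into a local directory and the JSON export has been loaded into a Python dict, of how those fields could be used to check a download. The function names `md5_of_file` and `verify_record_files` and the directory layout are hypothetical and are not part of the record or of any Zenodo tooling.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_record_files(record, download_dir):
    """Compare local files against the "files" list of a record export.

    Hypothetical helper: `record` is the parsed JSON export and
    `download_dir` is wherever the files were saved under their "key" names.
    Returns a dict mapping each key to "ok", "missing", "size mismatch",
    or "checksum mismatch".
    """
    results = {}
    for entry in record["files"]:
        algorithm, _, expected = entry["checksum"].partition(":")
        if algorithm != "md5":
            raise ValueError(f"unsupported checksum algorithm: {algorithm}")
        local = Path(download_dir) / entry["key"]
        if not local.is_file():
            results[entry["key"]] = "missing"
        elif local.stat().st_size != entry["size"]:
            results[entry["key"]] = "size mismatch"
        elif md5_of_file(local) != expected:
            results[entry["key"]] = "checksum mismatch"
        else:
            results[entry["key"]] = "ok"
    return results
```

With the export saved to a file, `verify_record_files(json.load(open(path_to_export)), download_dir)` would report one status per listed file. The size check runs first because it is cheap and catches truncated downloads before any hashing is done.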