Published November 12, 2023 | Version v1
Conference proceeding Open

The biomedical relationship corpus of the BioRED track at the BioCreative VIII challenge and workshop

  • 1. National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
  • 2. School of Computer Science and Technology, Dalian University of Technology, 116024, Dalian, China

Description

Abstract

The automatic recognition of biomedical relationships is an important step in the semantic understanding of the information contained in the unstructured text of the published literature. The BioRED track at BioCreative VIII aimed to foster the development of such methods by providing participants with the BioRED training corpus, a collection of 600 PubMed documents manually curated for diseases, genes/proteins, chemicals, cell lines, gene variants, and species, as well as for the following pairwise relationships between them: disease-gene, chemical-gene, disease-variant, gene-gene, chemical-disease, chemical-chemical, chemical-variant, and variant-variant. Furthermore, relationships are categorized into the following semantic categories: positive correlation, negative correlation, binding, conversion, drug interaction, comparison, co-treatment, and association. Unlike the previous publicly available corpora, all relationships are expressed at the document level as opposed to the sentence level, and as such they are marked by their corresponding database concept identifiers. Accordingly, diseases and chemicals are normalized to MeSH, genes (and proteins) to NCBI Gene, species to NCBI Taxonomy, cell lines to Cellosaurus, and gene/protein variants to dbSNP. Finally, each annotated relationship is marked as novel if it is a novel finding or an experimental verification in the publication in which it is expressed, so that it is distinguished from other relationships in the same text that provide known facts and/or background knowledge. The BioRED track at BioCreative VIII further provided 400 newly published articles annotated as above, to serve as the test data for the challenge. All articles were manually annotated by expert biocurators at the National Library of Medicine (NLM); each article was doubly annotated in a three-round annotation process until full agreement was reached among all curators.
This document details the characteristics of this novel resource for biomedical entity and relationship recognition. Using this new resource, we have demonstrated improvements in biomedical entity and relationship recognition algorithms.
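
As an illustration only, the annotation scheme described in the abstract could be modeled roughly as follows. This is a minimal sketch and not the corpus file format or the authors' tooling: the names `Relation`, `validate`, `NORMALIZATION`, `RELATION_PAIRS`, and `RELATION_TYPES` are hypothetical, as are the placeholder identifiers. Only the entity types, entity-pair types, semantic categories, and normalization resources are taken from the abstract.

```python
from __future__ import annotations

from dataclasses import dataclass

# Entity types and the resource each is normalized to (from the abstract).
NORMALIZATION = {
    "disease": "MeSH",
    "chemical": "MeSH",
    "gene": "NCBI Gene",
    "species": "NCBI Taxonomy",
    "cell line": "Cellosaurus",
    "variant": "dbSNP",
}

# Entity-type pairs that may hold a relationship (from the abstract).
# Stored as frozensets so that the order of the two entities does not matter;
# a same-type pair such as gene-gene collapses to a one-element set.
RELATION_PAIRS = {
    frozenset(pair)
    for pair in [
        ("disease", "gene"), ("chemical", "gene"), ("disease", "variant"),
        ("gene", "gene"), ("chemical", "disease"), ("chemical", "chemical"),
        ("chemical", "variant"), ("variant", "variant"),
    ]
}

# Semantic categories of a relationship (from the abstract).
RELATION_TYPES = {
    "positive correlation", "negative correlation", "binding", "conversion",
    "drug interaction", "comparison", "co-treatment", "association",
}


@dataclass(frozen=True)
class Relation:
    """A document-level relationship between two normalized concepts (hypothetical layout)."""
    id1: str            # database concept identifier of the first entity
    type1: str          # entity type of the first entity
    id2: str            # database concept identifier of the second entity
    type2: str          # entity type of the second entity
    relation_type: str  # one of RELATION_TYPES
    novel: bool         # True for a novel finding, False for background knowledge


def validate(rel: Relation) -> list[str]:
    """Return a list of problems; an empty list means the relation fits the scheme."""
    problems = []
    for entity_type in (rel.type1, rel.type2):
        if entity_type not in NORMALIZATION:
            problems.append(f"unknown entity type: {entity_type}")
    if frozenset((rel.type1, rel.type2)) not in RELATION_PAIRS:
        problems.append(f"entity pair not annotated: {rel.type1}-{rel.type2}")
    if rel.relation_type not in RELATION_TYPES:
        problems.append(f"unknown relation type: {rel.relation_type}")
    return problems


# Placeholder identifiers, not real database entries.
ok = Relation("GENE_ID_PLACEHOLDER", "gene", "DISEASE_ID_PLACEHOLDER", "disease",
              "positive correlation", novel=True)
bad = Relation("SPECIES_ID_PLACEHOLDER", "species", "GENE_ID_PLACEHOLDER", "gene",
               "inhibition", novel=False)
```

Here `validate(ok)` returns an empty list, while `validate(bad)` reports both the unsupported species-gene pair and the unknown relation type. Keeping the pair types as unordered sets reflects the assumption that the entity-pair types in the abstract are not directional.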

 

This article is part of the Proceedings of the BioCreative VIII Challenge and Workshop: Curation and Evaluation in the era of Generative Models.

Files

BioRED-Corpus-paper-biocreativeVIII.pdf

Files (613.4 kB)

md5:bbd43f97c1e90d981ef65b2cc7bab2ac

Additional details

Related works

Is published in
Conference proceeding: 10.5281/zenodo.10103190 (DOI)