Published April 24, 2025 | Version v1
Dataset Restricted

BIOMAT-NER: A Domain-Specific Corpus for Named Entity Recognition of Chemical Substances and Biomaterials

  • 1. Barcelona Supercomputing Center
  • 2. Tampere University
  • 3. West Pomeranian University of Technology
  • 4. Universitat Politècnica de Catalunya

Description

BIOMAT-NER Corpus

BIOMAT-NER is a corpus developed within the Horizon Europe BIOMATDB project to support the extraction and classification of biomaterials-related concepts from the scientific literature. It focuses on the annotation of chemical substances, compounds, and material types (including trade names) relevant to the field of biomaterials. The corpus was created in collaboration with domain experts, who established comprehensive annotation guidelines for the manual annotation of the final gold standard corpus. PubMed abstracts were selected using MeSH (Medical Subject Headings) categories associated with biomaterials and related disciplines, so that the corpus reflects the terminology commonly used in biomaterials research. The selected abstracts were then manually annotated according to these guidelines.

The BIOMAT-NER corpus is one of four corpora developed within the project. It is divided into three subsets: a training set (4,553 documents), a test set (911 documents), and a validation set (607 documents). The corpus is available in multiple formats, including brat, CSV, and CoNLL.
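
For readers who plan to work with the brat version, the following is a minimal sketch of how entity annotations in the generic brat standoff format could be read in Python. It is not taken from the corpus documentation: the function name `parse_brat_entities`, the `Entity` class, the sample sentence, and the `MATERIAL` label are all hypothetical, and the actual label set and file layout of BIOMAT-NER may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Entity:
    ann_id: str                   # e.g. "T1"
    label: str                    # entity type (hypothetical here)
    spans: List[Tuple[int, int]]  # character offsets into the .txt file
    text: str                     # surface string covered by the spans


def parse_brat_entities(ann_text: str) -> List[Entity]:
    """Parse text-bound ("T") annotations from brat standoff content.

    Each such line has three tab-separated fields:
    ID, "LABEL start end[;start end...]", and the covered text.
    Other annotation kinds (relations, notes, ...) are skipped.
    """
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # malformed line; ignored in this sketch
        ann_id, type_and_offsets, text = parts[0], parts[1], parts[2]
        label, _, offsets = type_and_offsets.partition(" ")
        spans = []
        # Discontinuous entities list several fragments separated by ";".
        for fragment in offsets.split(";"):
            start, end = fragment.split()
            spans.append((int(start), int(end)))
        entities.append(Entity(ann_id, label, spans, text))
    return entities


# Invented example; not a document from the corpus.
sample_txt = "Scaffolds were made of polycaprolactone and hydroxyapatite."
sample_ann = (
    "T1\tMATERIAL 23 39\tpolycaprolactone\n"
    "T2\tMATERIAL 44 58\thydroxyapatite\n"
)

for ent in parse_brat_entities(sample_ann):
    print(ent.ann_id, ent.label, ent.spans, ent.text)
```

Checking that each span, sliced from the accompanying text file, matches the annotated string is a quick way to catch offset or encoding problems before training a model.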

This corpus is part of a broader initiative to support the development of an advanced, searchable biomaterials database with integrated analytical tools and digital advisors. It is also intended for use in training Named Entity Recognition (NER) models, enabling the automatic identification and extraction of biomaterials-related concepts from scientific texts.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.