BIOMAT-NER: A Domain-Specific Corpus for Named Entity Recognition of Chemical Substances and Biomaterials

Rosell, Judith; Veiranto, Minna; Juusela, Maiju; Piegat, Agnieszka; Uribe-Gomez, Juan; Mas, Carles; Pegueroles, Marta; Rodríguez Miret, Jan; Rodríguez Ortega, Miguel; Krallinger, Martin

doi:10.5281/zenodo.15276149

Published April 24, 2025 | Version v1

Dataset Restricted

1. Barcelona Supercomputing Center
2. Tampere University
3. West Pomeranian University of Technology
4. Universitat Politècnica de Catalunya

BIOMAT-NER Corpus

BIOMAT-NER is a corpus developed within the Horizon Europe BIOMATDB project to support the extraction and classification of biomaterials-related concepts from the scientific literature. It focuses on the annotation of chemical substances, compounds, and material types, including trade names, relevant to the field of biomaterials. The corpus was created collaboratively with domain experts, who established comprehensive annotation guidelines for the manual annotation of the final gold standard corpus. PubMed abstracts were selected using MeSH (Medical Subject Headings) categories associated with biomaterials and related disciplines, so that they reflect the terminology commonly used in biomaterials research, and were then manually annotated according to the rules in the annotation guidelines.

The BIOMAT-NER corpus is one of four corpora developed within the project. It is divided into three subsets: a training set (4,553 documents), a test set (911 documents), and a validation set (607 documents). It is available in multiple formats, including brat, CSV and CoNLL.
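
For users who obtain access to the files, entity annotations in the brat format can be read with a few lines of Python. The following is a minimal sketch, assuming the files follow the standard brat standoff convention, in which each text-bound annotation is a line of the form `T1<TAB>LABEL start end<TAB>text`. The label name `CHEMICAL`, the sample lines, and the helper `parse_brat_entities` are hypothetical illustrations and are not taken from the corpus or its guidelines.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    ann_id: str  # annotation identifier, e.g. "T1"
    label: str   # entity type as written in the .ann file
    start: int   # first character offset of the mention
    end: int     # offset just past the last character of the mention
    text: str    # mention text as stored in the .ann file


def parse_brat_entities(ann_text: str) -> List[Entity]:
    """Parse text-bound ("T") annotations from brat standoff content."""
    entities = []
    for line in ann_text.splitlines():
        # Only text-bound annotations start with "T"; skip notes, relations, etc.
        if not line.startswith("T"):
            continue
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # malformed line, ignore
        ann_id, type_and_span, text = parts[0], parts[1], parts[2]
        label, span = type_and_span.split(" ", 1)
        # Discontinuous spans look like "0 4;10 15"; keep the outer bounds.
        fragments = [fragment.split() for fragment in span.split(";")]
        start = int(fragments[0][0])
        end = int(fragments[-1][1])
        entities.append(Entity(ann_id, label, start, end, text))
    return entities


# Hypothetical .ann content, for illustration only.
sample_ann = (
    "T1\tCHEMICAL 0 11\tpolylactide\n"
    "T2\tCHEMICAL 26 40\thydroxyapatite\n"
    "#1\tAnnotatorNotes T1\tsome note\n"
)
entities = parse_brat_entities(sample_ann)
for entity in entities:
    print(entity.label, entity.start, entity.end, entity.text)
```

The character offsets refer to the accompanying `.txt` file of each document, so slicing that text with `start` and `end` should recover the mention for continuous spans.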

This corpus is part of a broader initiative to support the development of an advanced, searchable biomaterials database with integrated analytical tools and digital advisors. It is also intended for use in training Named Entity Recognition (NER) models, enabling the automatic identification and extraction of biomaterials-related concepts from scientific texts.

Resources

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

