BIOMAT-AnatNER: A Biomaterials Domain-Specific Corpus for Named Entity Recognition of Anatomical Structures
Description
BIOMAT-AnatNER Corpus
BIOMAT-AnatNER stands for BIOMATerials Anatomical Structure Named Entity Recognition. It is a corpus developed as part of the Horizon Europe BIOMATDB project to support the extraction and classification of anatomical structure mentions in scientific literature related to biomaterials. The corpus focuses specifically on the annotation of entities encompassing tissues, organs and body parts that appear in the context of biomaterials applications, such as mentions of tissues involved in implantation studies, or organs relevant to biocompatibility testing.
The corpus was created through a collaborative effort involving domain experts, who established comprehensive annotation guidelines for the manual annotation of the final gold standard corpus. PubMed abstracts were selected based on MeSH (Medical Subject Headings) terms associated with relevant disciplines, such as regenerative medicine, orthopedics, dentistry and cardiology, to reflect the terminology commonly used in biomaterials research. The selected abstracts were then manually annotated according to the rules defined in the annotation guidelines.
The BIOMAT-AnatNER corpus is one of four corpora developed within the project. It is divided into three subsets: a training set (750 documents), a test set (150 documents), and a validation set (100 documents). It is available in multiple formats, including brat, CSV and CoNLL.
This corpus is part of a broader initiative to support the development of an advanced, searchable biomaterials database with integrated analytical tools and digital advisors. It is also intended for use in training Named Entity Recognition (NER) models, enabling the automatic identification and extraction of anatomical structure mentions relevant to biomaterials research and development.
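As an illustration of how the CoNLL files might be loaded before NER training, the following is a minimal sketch. It assumes a common CoNLL-style layout that this description does not confirm: one token per line, the tag in the last whitespace-separated column, BIO tagging, and blank lines between sentences. The label name `ANAT` and the sample sentences are hypothetical and are not taken from the corpus; check the released files for the actual column order and label set.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # list of (token, tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-style lines into sentences of (token, tag) pairs.

    Assumes the token is the first column and the tag is the last column.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collect (mention text, label) pairs from BIO tags.

    An I- tag that does not continue an open entity is ignored.
    """
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label = None
    for token, tag in sentence:
        if tag.startswith("B-"):
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-") and tokens and tag[2:] == label:
            tokens.append(token)
        else:
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [], None
    if tokens:
        entities.append((" ".join(tokens), label))
    return entities


# Hypothetical sample in the assumed layout (not real corpus data).
sample = """The O
scaffold O
was O
implanted O
in O
the O
femoral B-ANAT
condyle I-ANAT
. O

Bone B-ANAT
tissue I-ANAT
regenerated O
. O
""".splitlines()

sentences = read_conll(sample)
for sentence in sentences:
    print(extract_entities(sentence))
```

To read a real file, pass an open file handle to `read_conll` instead of the sample lines, since the function accepts any iterable of strings.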