Published September 10, 2025 | Version 1.0
Dataset Open

OAMED-XMLC: a Two-Taxonomy Dataset for Benchmarking Extreme Multi-Label Classification on Medical Documents

Description

The OAMED-XMLC dataset comprises 869,402 scientific publications related to surgery. It includes annotations such as surgery categories, document domains, authors, year of publication, and references to other documents. These annotations support training tasks such as:
  • Document tagging or classification among a large number of categories (extreme multi-label classification, or XMLC)
  • Author prediction
  • Year of publication prediction
  • Reference/link prediction
 
Note that this is an extension of the OAXMLC dataset: https://zenodo.org/records/15309916
Importantly, this dataset comes with two independent taxonomies and sets of labels, which opens up several possibilities, including:
  • Principled investigation of the influence of taxonomies on XMLC algorithms
  • Transfer learning in XMLC (from one taxonomy to the other)
Each taxonomy is provided both in Turtle/SKOS format and in JSON/TXT format for easier XMLC usage.
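
As an illustration of how a taxonomy can be used alongside the labels, the sketch below expands a document's label set with every ancestor in the hierarchy, a preprocessing step sometimes applied in hierarchical XMLC. It is only a sketch: the `PARENTS` mapping, its toy label names, and the `with_ancestors` helper are hypothetical, and the actual layout of the JSON/TXT taxonomy files is described in README.md and may differ.

```python
from typing import Dict, Iterable, Set

# Hypothetical child -> parents mapping with toy label names.
# The real taxonomy files may be organized differently (see README.md).
PARENTS: Dict[str, Set[str]] = {
    "grandchild": {"child"},
    "child": {"root"},
    "root": set(),
}


def with_ancestors(labels: Iterable[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Return the given labels plus every ancestor reachable through `parents`."""
    expanded: Set[str] = set()
    stack = list(labels)
    while stack:
        label = stack.pop()
        if label in expanded:
            # Already visited; this also protects against cycles.
            continue
        expanded.add(label)
        # Labels missing from the mapping are treated as having no parents.
        stack.extend(parents.get(label, ()))
    return expanded
```

Calling `with_ancestors(["grandchild"], PARENTS)` returns the label together with `child` and `root`. Running the same expansion against each of the two taxonomies is one simple way to compare how the hierarchy changes the effective label space.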

The dataset was built with data from the OpenAlex (https://openalex.org/) open catalog.
More details can be found in the README.md file and on the page of the original dataset: https://zenodo.org/records/15309916

Files

concepts.zip

Files (2.0 GB)

Checksum | Size
md5:039a9c06156f2d961e8187f63c7ba3ad | 209.8 kB
md5:b9ff13279daf5b2d715d72b137f85b04 | 2.0 GB
md5:2f3c016b3cacc54c90cf70eff1ce462d | 11.0 kB
md5:7191c1c35893686dc741948fe4bdb2c6 | 1.5 MB
md5:15ca5788ce37c692e6f298a42aa0294c | 151.2 kB
md5:4d06a0d31f18f8688fca64339508cc78 | 15.9 kB
md5:c89594f11b245d3f0f3da47052886f47 | 28.4 kB
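
The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch using only Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_checksum` are illustrative, and the local file path is whatever name the file was saved under.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read fixed-size chunks so multi-gigabyte archives fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: Path, expected: str) -> bool:
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.strip().lower()
    if expected_hex.startswith("md5:"):
        expected_hex = expected_hex[len("md5:"):]
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum(Path("downloaded_file.zip"), "md5:b9ff13279daf5b2d715d72b137f85b04")` would check the 2.0 GB file, where `downloaded_file.zip` is a placeholder for the local file name.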