OAXMLC: a Two-Taxonomy Dataset for Benchmarking Extreme Multi-Label Classification

Broillet, Christophe; Cudre-Mauroux, Philippe; Audiffren, Julien

doi:10.5281/zenodo.15695796

Published June 19, 2025 | Version v3

Dataset Open


1. University of Fribourg

The OAXMLC dataset comprises 3'725'870 scientific publications related to Computer Science. Each document is annotated with computer science categories, related domains, authors, year of publication, and references to other documents. With these annotations, the dataset can be used to train models for tasks such as:

Document tagging or classification over a large number of categories (extreme multi-label classification, or XMLC)
Author prediction
Year of publication prediction
Reference/link prediction

Beyond XMLC, which is the first use case of OAXMLC, the references field of the documents can be used to build a citation graph. This graph can in turn be used to, e.g., predict missing citations, improve the labeling of documents, or identify clusters of papers, which may help detect trends and emerging topics in computer science research.
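As an illustration of the citation-graph use case, the following minimal sketch builds adjacency maps from a references field and ranks candidate missing citations by co-citation counts. It assumes each document is a dictionary with an `id` key and a `references` list of ids. These key names, the toy records, and the function names are hypothetical, so check the README.md for the actual schema of documents.json. The co-citation heuristic is only one simple option and is not the method used by the dataset authors.

```python
from collections import defaultdict


def build_citation_graph(documents):
    """Return (cites, cited_by) maps, keeping only edges inside the corpus."""
    known_ids = {doc["id"] for doc in documents}
    cites = defaultdict(set)
    cited_by = defaultdict(set)
    for doc in documents:
        for ref in doc.get("references", []):
            # Drop references to works outside the corpus and self-citations.
            if ref in known_ids and ref != doc["id"]:
                cites[doc["id"]].add(ref)
                cited_by[ref].add(doc["id"])
    return cites, cited_by


def missing_citation_candidates(doc_id, cites, cited_by):
    """Score works cited by papers that share at least one reference with doc_id."""
    own_refs = cites.get(doc_id, set())
    scores = defaultdict(int)
    for ref in own_refs:
        for other in cited_by.get(ref, set()):
            if other == doc_id:
                continue
            for candidate in cites.get(other, set()):
                if candidate != doc_id and candidate not in own_refs:
                    scores[candidate] += 1
    # Highest score first, ties broken by id for a deterministic order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy records with hypothetical ids and keys, for illustration only.
toy_documents = [
    {"id": "W1", "references": ["W3", "W4"]},
    {"id": "W2", "references": ["W3", "W4", "W5"]},
    {"id": "W3", "references": []},
    {"id": "W4", "references": []},
    {"id": "W5", "references": ["W9"]},  # W9 is not part of the toy corpus
]
cites, cited_by = build_citation_graph(toy_documents)
candidates = missing_citation_candidates("W1", cites, cited_by)
```

Here `W5` is proposed for `W1` because `W2` shares both of `W1`'s references and also cites `W5`. For the full 6.3 GB file, a streaming reader would be preferable to loading all records into a list.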

Importantly, this dataset comes with two independent taxonomies and sets of labels (see below), which opens up multiple possibilities, including:

Principled investigation of the influence of taxonomies on XMLC algorithms
Transfer learning in XMLC (from one taxonomy to the other)

Each taxonomy is provided both in Turtle/SKOS format and in JSON/TXT format for easier XMLC usage.

The dataset was built from data in the OpenAlex open catalog. OpenAlex aggregates entities including works, authors, and institutions, as well as topics and concepts. See the official OpenAlex documentation for more information. Since the OpenAlex database is constantly evolving, we downloaded a snapshot of the database on 20 January 2025. Entities added to OpenAlex after that date are therefore not included in this dataset.

In addition to the documents, we created two label taxonomies. These taxonomies represent hierarchically organized categories from the Computer Science domain, to which documents are assigned. One taxonomy is built from the OpenAlex topics and OpenAlex keywords and is hereafter referred to as the topics taxonomy. It contains 776 categories split into 3 levels. The second taxonomy is built from the OpenAlex concepts, is referred to as the concepts taxonomy, and consists of 8'927 categories split into 5 levels. The technical details of the construction of these taxonomies are given below.
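To make the notion of taxonomy levels concrete, here is a minimal sketch that assigns a level to every label by breadth-first traversal and counts labels per level. It assumes the hierarchy is available as a mapping from a parent label to its list of children. This representation, the toy labels, and the function name are hypothetical and do not describe the actual layout of the provided JSON/TXT files. Whether the root itself counts as a level in the figures above is also an assumption left open here.

```python
from collections import Counter, deque


def label_levels(children, root):
    """Map each label reachable from root to its depth (root has depth 0)."""
    levels = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            # With several parents, breadth-first order keeps the shallowest depth.
            if child not in levels:
                levels[child] = levels[node] + 1
                queue.append(child)
    return levels


# Toy hierarchy with made-up labels, for illustration only.
toy_children = {
    "Computer science": ["Machine learning", "Databases"],
    "Machine learning": ["Deep learning"],
}
levels = label_levels(toy_children, "Computer science")
labels_per_level = Counter(levels.values())
```

Applied to a real taxonomy, `labels_per_level` offers a quick sanity check against the reported number of categories and levels.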

More information can be found in the README.md file.

Examples of how to load the dataset can be found in the OAXMLC_examples.ipynb file.

The code used to benchmark this dataset is available on GitHub.

OAXMLC is currently restricted to documents pertaining to the field of Computer Science, resulting in 3.7 M documents and a concepts taxonomy with almost 9'000 labels. Future versions will include additional fields, with Medicine next in line. Additionally, OAXMLC only contains the English version of documents and labels. This choice is motivated by the fact that the overwhelming majority of scientific papers are only available in this language. Finally, although the dataset was manually curated with great care, some errors may have slipped through, so user discretion is advised.

Files (6.3 GB)

Name	MD5	Size
concepts.zip	8a6f61a0cf3b4b2f51c4971a14fd8694	576.3 kB
documents.json	6ec209c9092012eaa574bf10f4e04c17	6.3 GB
OAXMLC_examples.ipynb	fa8bca411e2443e3303c1f2f60bf7c32	12.3 kB
ontology_concepts.ttl	384b4207765f8d5d946a777ac32a60e4	4.0 MB
ontology_topics.ttl	823ab38a95d38bb15a152569d9f5985a	505.7 kB
README.md	d2a6b66a0e727cd786a6e1d8b25ceced	29.7 kB
taxonomy.py	804ce51572c43e5d1b91092554071b27	20.4 kB
topics.zip	c1e1aa3d29778aadb4bb6cbbb04aeefd	82.3 kB
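
The MD5 digests listed for each file can be used to check that a download completed without corruption. The sketch below hashes a file in chunks, so that the 6.3 GB documents.json never has to fit in memory, and compares the result with the listed digest. The function names and the assumption that files were saved under their listed names are illustrative and not part of the dataset tooling.

```python
import hashlib

# Digests copied from the file listing of this record (subset shown).
EXPECTED_MD5 = {
    "documents.json": "6ec209c9092012eaa574bf10f4e04c17",
    "concepts.zip": "8a6f61a0cf3b4b2f51c4971a14fd8694",
    "topics.zip": "c1e1aa3d29778aadb4bb6cbbb04aeefd",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading chunk_size bytes at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, expected_md5):
    """Return True if the file at path has the expected MD5 digest."""
    return md5_of_file(path) == expected_md5.lower()


# Example call, assuming documents.json sits in the working directory:
# matches_listing("documents.json", EXPECTED_MD5["documents.json"])
```

MD5 is adequate here for detecting transfer errors, but it is not a safeguard against deliberate tampering.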

Additional details

Submitted: 2025-05-08

Submitted to ISWC 2025

	All versions	This version
Views	315	148
Downloads	919	614
Data volume	1.8 TB	1.6 TB
