Published June 19, 2025 | Version v3
Dataset Open

OAXMLC: a Two-Taxonomy Dataset for Benchmarking Extreme Multi-Label Classification

Description

The OAXMLC dataset comprises 3'725'870 scientific documents, all of them publications related to Computer Science. Each document is annotated with computer science categories, related domains, authors, year of publication and references to other documents. Example tasks that can be trained with these annotations include:

  • Document tagging or classification among a large number of categories (extreme multi-label classification, or XMLC)
  • Author prediction
  • Year of publication prediction
  • Reference/link prediction

Beyond XMLC, which is the primary use case of OAXMLC, the references field of the documents can be leveraged to build a citation graph. This graph can then be used to predict missing citations, improve the labeling of documents, or identify clusters of papers, which may help detect trends and the emergence of new topics in computer science research.
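As a rough illustration of the citation-graph idea, the sketch below builds adjacency maps from a handful of made-up records. The record layout (the `id` and `references` keys and the `W…` identifiers) is an assumption made for this example and may not match the actual file schema; the README.md and OAXMLC_examples.ipynb files describe the real one.

```python
from collections import defaultdict

# Hypothetical records: the key names and identifiers are assumptions,
# not the documented OAXMLC schema.
documents = [
    {"id": "W1", "references": ["W2", "W3"]},
    {"id": "W2", "references": ["W3"]},
    {"id": "W3", "references": []},
    {"id": "W4", "references": ["W3", "W9"]},  # W9 is not part of the collection
]


def build_citation_graph(docs):
    """Return (cites, cited_by) adjacency maps restricted to known documents."""
    known = {d["id"] for d in docs}
    cites = defaultdict(set)
    cited_by = defaultdict(set)
    for d in docs:
        for ref in d["references"]:
            if ref in known:  # drop references that point outside the collection
                cites[d["id"]].add(ref)
                cited_by[ref].add(d["id"])
    return cites, cited_by


cites, cited_by = build_citation_graph(documents)

# In-collection citation count per document.
citation_counts = {d["id"]: len(cited_by.get(d["id"], ())) for d in documents}
```

Dropping references that point outside the collection keeps the graph closed over the documents at hand; whether that is the right choice depends on the downstream task.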

Importantly, this dataset comes with two independent taxonomies and sets of labels (see below), which opens up multiple possibilities, including:

  • Principled investigation of the influence of taxonomies on XMLC algorithms
  • Transfer learning in XMLC (from one taxonomy to the other)

Each taxonomy is provided both in Turtle/SKOS format and in JSON/TXT format for easier XMLC usage.

The dataset was built with data from the OpenAlex open catalog. OpenAlex groups entities including works, authors and institutions, as well as topics or concepts. See the official OpenAlex documentation for more information. Since the OpenAlex database is constantly evolving, we downloaded a snapshot of the database on 20 January 2025. Entities added to OpenAlex after that date are therefore not included in this dataset.

In addition to the documents, we created two label taxonomies. These taxonomies consist of hierarchically organized categories from the Computer Science domain, to which documents are assigned. One taxonomy is built from the OpenAlex topics and OpenAlex keywords and is hereafter referred to as the topics taxonomy. It contains 776 categories split across 3 levels. The second taxonomy is built from the OpenAlex concepts, is referred to as the concepts taxonomy, and consists of 8'927 categories split across 5 levels. The technical details of how these taxonomies were constructed are given below.
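To illustrate what a hierarchical label taxonomy allows, the sketch below expands a document's labels with all of their ancestors and computes the level of a label. The `broader` mapping and the category names are invented for this example (in a SKOS file the same relation would be expressed with `skos:broader`); they are not taken from the OAXMLC taxonomy files, and whether the dataset's labels already include ancestors should be checked in the README.md file.

```python
# Hypothetical child -> parents mapping; names are illustrative only.
broader = {
    "Neural networks": ["Machine learning"],
    "Machine learning": ["Artificial intelligence"],
    "Artificial intelligence": [],
    "Databases": [],
}


def with_ancestors(labels, broader):
    """Return the given labels plus every ancestor reachable through `broader`."""
    closed = set()
    stack = list(labels)
    while stack:
        label = stack.pop()
        if label in closed:
            continue
        closed.add(label)
        stack.extend(broader.get(label, []))
    return closed


def level(label, broader):
    """Level of a label: 1 for a root, else 1 + its deepest parent (assumes no cycles)."""
    parents = broader.get(label, [])
    if not parents:
        return 1
    return 1 + max(level(p, broader) for p in parents)


expanded = with_ancestors(["Neural networks"], broader)
```

Expanding labels this way is one common convention in hierarchical classification; it is shown here as an option, not as the procedure used to build the dataset.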

More information can be found in the README.md file.

Examples to load the dataset can be found in the OAXMLC_examples.ipynb file.

The code used to benchmark this dataset is available on GitHub.

OAXMLC is currently restricted to documents pertaining to the field of Computer Science, resulting in 3.7 M documents and a concepts taxonomy with almost 9'000 labels. Future versions will include additional fields, with Medicine next in line. Additionally, OAXMLC only contains the English version of documents and labels. This choice is motivated by the fact that the overwhelming majority of scientific papers are available only in this language. Finally, although the dataset was manually curated with great care, some errors may have slipped through, so user discretion is advised.

Files

concepts.zip

Files (6.3 GB)

Checksum                                Size
md5:8a6f61a0cf3b4b2f51c4971a14fd8694    576.3 kB
md5:6ec209c9092012eaa574bf10f4e04c17    6.3 GB
md5:fa8bca411e2443e3303c1f2f60bf7c32    12.3 kB
md5:384b4207765f8d5d946a777ac32a60e4    4.0 MB
md5:823ab38a95d38bb15a152569d9f5985a    505.7 kB
md5:d2a6b66a0e727cd786a6e1d8b25ceced    29.7 kB
md5:804ce51572c43e5d1b91092554071b27    20.4 kB
md5:c1e1aa3d29778aadb4bb6cbbb04aeefd    82.3 kB
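
A downloaded file can be checked against the MD5 checksums listed above. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_checksum` are introduced here for illustration, and the local path passed to them is whatever location the file was saved to.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 with an expected value such as 'md5:8a6f...'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

Reading in chunks avoids loading a multi-gigabyte archive into memory at once.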

Additional details

Dates

Submitted
2025-05-08
submitted to ISWC 2025