OAXMLC: a Two-Taxonomy Dataset for Benchmarking Extreme Multi-Label Classification
Description
The OAXMLC dataset comprises 3'725'870 scientific publications related to Computer Science. Each document is annotated with computer science categories, related domains, authors, year of publication, and references to other documents. These annotations support training tasks such as:
- Document tagging or classification among a large number of categories (extreme multi-label classification, or XMLC)
- Author prediction
- Year of publication prediction
- Reference/link prediction
For example, beyond XMLC, which is the primary use case of OAXMLC, the references field of the documents can be used to build a citation graph. This graph can in turn be used to predict missing citations, improve the labeling of documents, or identify clusters of papers, which may help detect trends and emerging topics in computer science research.
Importantly, this dataset comes with two independent taxonomies and sets of labels (see below), which opens up multiple possibilities, including:
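As a rough illustration of the citation-graph idea, the sketch below builds forward and reverse citation edges from a handful of toy records. The field names `id` and `references`, and the record values themselves, are assumptions made for this example and are not the confirmed OAXMLC schema; see README.md and OAXMLC_examples.ipynb for the actual layout.

```python
from collections import defaultdict

# Toy records. The field names "id" and "references" are assumptions,
# not the confirmed OAXMLC schema.
documents = [
    {"id": "W1", "references": ["W2", "W3"]},
    {"id": "W2", "references": ["W3"]},
    {"id": "W3", "references": []},
    {"id": "W4", "references": ["W3", "W9"]},  # W9 is not in the toy dataset
]


def build_citation_graph(docs):
    """Return (out_edges, in_edges), keeping only references to known documents."""
    known = {d["id"] for d in docs}
    # Forward edges: document -> documents it cites.
    out_edges = {
        d["id"]: [ref for ref in d["references"] if ref in known] for d in docs
    }
    # Reverse edges: document -> documents that cite it.
    in_edges = defaultdict(list)
    for source, targets in out_edges.items():
        for target in targets:
            in_edges[target].append(source)
    return out_edges, dict(in_edges)


out_edges, in_edges = build_citation_graph(documents)

# Number of incoming citations per document within the toy dataset.
citation_counts = {doc_id: len(in_edges.get(doc_id, [])) for doc_id in out_edges}
```

References that point outside the dataset are dropped here for simplicity; a link-prediction setup might instead keep them as candidate nodes.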
- Principled investigation of the influence of taxonomies on XMLC algorithms
- Transfer learning in XMLC (from one taxonomy to the other)
Each taxonomy is provided in Turtle/SKOS format and in JSON/TXT format for easier XMLC usage.
The dataset was built from the OpenAlex open catalog. OpenAlex groups entities including works, authors, and institutions, as well as topics and concepts. See the official OpenAlex documentation for more information. Because the OpenAlex database is constantly evolving, we downloaded a snapshot on 20 January 2025. Entities added to OpenAlex after that date are therefore not included in this dataset.
In addition to the documents, we created two label taxonomies. These taxonomies represent hierarchically organized categories from the Computer Science domain, to which documents are assigned. One taxonomy is built from the OpenAlex topics and OpenAlex keywords and is hereafter referred to as the topics taxonomy. It contains 776 categories split across 3 levels. The second taxonomy is built from the OpenAlex concepts and is referred to as the concepts taxonomy. It consists of 8'927 categories split across 5 levels. The technical details about the construction of those taxonomies are given below.
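To make the notion of taxonomy levels concrete, here is a minimal sketch that assigns a level to every category by breadth-first traversal from the root. The parent-to-children dictionary and the category names are hypothetical toy data; the actual structure of the provided JSON/TXT taxonomy files may differ, and whether the root counts as a level in OAXMLC is not assumed here.

```python
from collections import deque

# Hypothetical toy taxonomy, expressed as parent -> list of children.
# This layout is an assumption, not the format of the OAXMLC taxonomy files.
taxonomy = {
    "Computer Science": ["Artificial Intelligence", "Databases"],
    "Artificial Intelligence": ["Machine Learning"],
    "Machine Learning": [],
    "Databases": [],
}


def category_levels(tree, root):
    """Map each category reachable from root to its distance from the root."""
    levels = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in tree.get(node, []):
            if child not in levels:  # guard against repeated children
                levels[child] = levels[node] + 1
                queue.append(child)
    return levels


levels = category_levels(taxonomy, "Computer Science")
depth = max(levels.values())  # deepest level below the root
```

Grouping categories by level in this way is one possible starting point for comparing how XMLC algorithms behave on the shallower topics taxonomy versus the deeper concepts taxonomy.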
More information can be found in the README.md file.
Examples to load the dataset can be found in the OAXMLC_examples.ipynb file.
The code used to benchmark this dataset is available on GitHub.
OAXMLC is currently restricted to documents in the field of Computer Science, resulting in 3.7 M documents and a concepts taxonomy with almost 9'000 labels. Future versions will include additional fields, with Medicine next in line. In addition, OAXMLC only contains the English version of documents and labels. This choice is motivated by the fact that the overwhelming majority of scientific papers are available only in English. Finally, although the dataset was manually curated with great care, some errors may have slipped through, so user discretion is advised.
Files
concepts.zip
Files (6.3 GB)
| MD5 checksum | Size |
|---|---|
| 8a6f61a0cf3b4b2f51c4971a14fd8694 | 576.3 kB |
| 6ec209c9092012eaa574bf10f4e04c17 | 6.3 GB |
| fa8bca411e2443e3303c1f2f60bf7c32 | 12.3 kB |
| 384b4207765f8d5d946a777ac32a60e4 | 4.0 MB |
| 823ab38a95d38bb15a152569d9f5985a | 505.7 kB |
| d2a6b66a0e727cd786a6e1d8b25ceced | 29.7 kB |
| 804ce51572c43e5d1b91092554071b27 | 20.4 kB |
| c1e1aa3d29778aadb4bb6cbbb04aeefd | 82.3 kB |
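The MD5 checksums listed with the files can be used to check that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `verify_md5` are illustrative and are not part of any OAXMLC tooling.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for multi-gigabyte archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected_hex):
    """Return True if the file's MD5 digest matches the expected hex string."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

To use it, pass the path of a downloaded file together with the checksum shown next to that same file on the download page, for example `verify_md5(path_to_file, checksum_from_page)`.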
Additional details
Dates
- Submitted: 2025-05-08 (submitted to ISWC 2025)