Hierarchical Text Classification corpora

doi:10.5281/zenodo.7319519

Published December 14, 2022 | Version 1.0

Dataset Open


1. Ca' Foscari University of Venice, Department of Environmental Sciences, Informatics and Statistics

A set of three datasets for Hierarchical Text Classification (HTC), each with its samples divided into training and test splits. The label hierarchy in every dataset has depth 2.

The Amazon5x5 dataset contains 500,000 user reviews tagged with the reviewed product's categories. There are 5 product categories with 100,000 examples each, and each category has 5 sub-categories.
The Bugs dataset contains 30,050 bugs of the Linux kernel, labeled with exactly two categories identifying the affected component.
Finally, the Web of Science dataset contains 46,960 abstracts of scientific papers, labeled with the article's domain (see the original repository for more details).

Datasets are published in JSONL format: each line is a single JSON object, as in the example below.

{ "text": <article text>, "labels": [<label1>, <label2>, ...] }
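A minimal Python sketch of reading this format follows. It assumes only the two keys shown above ("text" and "labels"); the function name iter_samples, the sample texts, and the label names are hypothetical and used purely for illustration, not taken from the datasets.

```python
import json
from typing import Iterable, Iterator


def iter_samples(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one {"text", "labels"} record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        yield {"text": record["text"], "labels": list(record["labels"])}


# Hypothetical lines in the documented format (not real dataset content).
example_lines = [
    '{"text": "kernel panics on boot", "labels": ["drivers", "usb"]}',
    '{"text": "great blender", "labels": ["home", "kitchen"]}',
]
samples = list(iter_samples(example_lines))
```

Because an open file object iterates over its lines, the same function can be applied to an extracted file, for example with `open(path, encoding="utf-8")`, where the path depends on how the archive was unpacked.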

The hierarchical structure of labels in each dataset is documented in the repository: https://gitlab.com/distration/dsi-nlp-publib/-/tree/main/htc-survey-22

These datasets have been presented in this paper:

"Hierarchical Text Classification and its Foundations: a Review of Current Research" - DOI: 10.3390/electronics13071199

Some of these datasets have also been used in:

"Ticket Automation: an Insight into Current Research with Applications to Multi-level Classification Scenarios" - DOI: 10.1016/j.eswa.2023.119984
"A multi-level approach for hierarchical Ticket Classification", accepted at WNUT 2022 - https://aclanthology.org/2022.wnut-1.22

These datasets are partially derived from previous work, namely:

[Amazon] J. Ni, J. Li, J. McAuley, "Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects", EMNLP 2019, doi: 10.18653/v1/D19-1018
[WOS] K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber and L. E. Barnes, "HDLTex: Hierarchical Deep Learning for Text Classification," 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), 2017, pp. 364-371, doi: 10.1109/ICMLA.2017.0-134
[Linux Bugs] V. Lyubinets, T. Boiko and D. Nicholas, "Automated Labeling of Bugs and Tickets Using Attention-Based Mechanisms in Recurrent Neural Networks," 2018 IEEE Second International Conference on Data Stream Mining & Processing (DSMP), 2018, pp. 271-275, doi: 10.1109/DSMP.2018.8478511

Notes (English)

Please consider citing these papers if you use this data:

10.3390/electronics13071199 (DOI)
10.1016/j.eswa.2023.119984 (DOI)

Files

Files (462.4 MB)

Name	Size	MD5
amazon5x5.tar.gz	413.2 MB	45f86f5d9bfb4990003ea1146da98a51
bugs.tar.gz	26.6 MB	a66446eb3d70df071d0089a3e6585f25
wos.tar.gz	22.5 MB	840ae579c70e8e62675020f01d332c2e
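
The MD5 checksums listed above can be used to check a downloaded archive for corruption. The following is a minimal sketch using only the Python standard library; the helper names md5_of_file and verify_archive are hypothetical, and the sketch assumes the archives keep the file names listed above.

```python
import hashlib
from pathlib import Path

# Checksums as listed in the file table of this record.
EXPECTED_MD5 = {
    "amazon5x5.tar.gz": "45f86f5d9bfb4990003ea1146da98a51",
    "bugs.tar.gz": "a66446eb3d70df071d0089a3e6585f25",
    "wos.tar.gz": "840ae579c70e8e62675020f01d332c2e",
}


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path) -> bool:
    """Return True if the file's MD5 matches the listed value for its name."""
    path = Path(path)
    expected = EXPECTED_MD5.get(path.name)
    return expected is not None and md5_of_file(path) == expected
```

For example, `verify_archive("bugs.tar.gz")` returns True only if the local file's digest equals the listed value; a file whose name is not in the table is reported as unverified.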

Additional details

Is derived from: Journal article: 10.1016/j.eswa.2023.119984 (DOI)
Is described by: Journal article: 10.3390/electronics13071199 (DOI)
Is referenced by: Conference paper: https://aclanthology.org/2022.wnut-1.22 (URL); Software: https://gitlab.com/distration/dsi-nlp-publib/-/tree/main/htc-survey-22 (URL)

Repository URL: https://gitlab.com/distration/dsi-nlp-publib/-/tree/main/htc-survey-22

