Published December 14, 2022 | Version 1.0
Dataset Open

Hierarchical Text Classification corpora

  • 1. Ca' Foscari University of Venice, Department of Environmental Sciences, Informatics and Statistics

Description

A set of 3 datasets for Hierarchical Text Classification (HTC), with samples divided into training and testing splits. The hierarchies of labels within all datasets have depth 2.

  • The Amazon5x5 dataset contains 500,000 user reviews tagged with the reviewed product's categories. There are 5 product categories with 100,000 examples each, and each category has 5 sub-categories.
  • The Bugs dataset contains 30,050 bug reports for the Linux kernel, each labeled with exactly two categories identifying the affected component.
  • Finally, the Web Of Science dataset contains 46,960 abstracts of scientific papers, each labeled with the article's domain (see the original repo for more details).

Datasets are published in JSONL format: each line is a single JSON object, as in the example below.

{ "text": <article text>, "labels": [<label1>, <label2>, ...] }
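As a minimal sketch of how such a file could be read in Python, the snippet below parses one JSON object per line and counts how often each label list occurs. The helper names `read_jsonl` and `count_label_paths` are illustrative, not part of the published data or code. UTF-8 encoding and the file name in the usage note are assumptions.

```python
import json
from collections import Counter
from pathlib import Path


def read_jsonl(path):
    """Yield one record (a dict with "text" and "labels") per non-empty line.

    Assumes the file is UTF-8 encoded; this is not stated by the record.
    """
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)


def count_label_paths(records):
    """Count how often each full label list occurs across the records."""
    return Counter(tuple(record["labels"]) for record in records)
```

For example, `count_label_paths(read_jsonl("train.jsonl"))` would return the frequency of each label combination, where `train.jsonl` is a placeholder file name. Reading lazily with a generator avoids holding a whole split in memory, which may matter for the largest file.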

The hierarchical structure of labels in each dataset is documented in this repository.

 

These datasets have been presented in this paper:

Some of these datasets have also been used in:

  • "Ticket Automation: an Insight into Current Research with Applications to Multi-level Classification Scenarios" - DOI: 10.1016/j.eswa.2023.119984
  • "A multi-level approach for hierarchical Ticket Classification", accepted at WNUT 2022 - https://aclanthology.org/2022.wnut-1.22

 

These datasets are partially derived from previous work, namely:

  • [Amazon] J. Ni, J. Li, J. McAuley, "Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects", EMNLP 2019, doi: 10.18653/v1/D19-1018
  • [WOS] K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber and L. E. Barnes, "HDLTex: Hierarchical Deep Learning for Text Classification," 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), 2017, pp. 364-371, doi: 10.1109/ICMLA.2017.0-134
  • [Linux Bugs] V. Lyubinets, T. Boiko and D. Nicholas, "Automated Labeling of Bugs and Tickets Using Attention-Based Mechanisms in Recurrent Neural Networks," 2018 IEEE Second International Conference on Data Stream Mining & Processing (DSMP), 2018, pp. 271-275, doi: 10.1109/DSMP.2018.8478511

Notes (English)

Please consider citing these papers if you use this data:

Files

Files (462.4 MB)

Size        Checksum
413.2 MB    md5:45f86f5d9bfb4990003ea1146da98a51
26.6 MB     md5:a66446eb3d70df071d0089a3e6585f25
22.5 MB     md5:840ae579c70e8e62675020f01d332c2e
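The md5 values listed above can be used to check that a download completed intact. The following is a small sketch using Python's standard `hashlib`. The function names `md5_of_file` and `matches_checksum` and the archive name in the usage note are hypothetical, since the file names are not shown in this listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a listed value such as 'md5:45f8...'."""
    expected = listed.removeprefix("md5:").strip().lower()
    return md5_of_file(path) == expected
```

A call such as `matches_checksum("downloaded_file", "md5:45f86f5d9bfb4990003ea1146da98a51")` returns `True` only when the local file's digest equals the listed one. Note that `str.removeprefix` requires Python 3.9 or later.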

Additional details

Related works

Is derived from
Journal article: 10.1016/j.eswa.2023.119984 (DOI)
Is described by
Journal article: 10.3390/electronics13071199 (DOI)
Is referenced by
Conference paper: https://aclanthology.org/2022.wnut-1.22 (URL)
Software: https://gitlab.com/distration/dsi-nlp-publib/-/tree/main/htc-survey-22 (URL)