Published December 14, 2022 | Version 1.0
Dataset
Open
Hierarchical Text Classification corpora
- 1. Ca' Foscari University of Venice, Department of Environmental Sciences, Informatics and Statistics
Description
A set of 3 datasets for Hierarchical Text Classification (HTC), with samples divided into training and testing splits. The hierarchies of labels within all datasets have depth 2.
- The Amazon5x5 dataset contains 500,000 user reviews tagged with the reviewed product's categories. There are 5 product categories with 100,000 examples each, and each category has 5 sub-categories.
- The Bugs dataset contains 30,050 bug reports for the Linux kernel, each labeled with exactly two categories identifying the affected component.
- Finally, the Web Of Science dataset contains 46,960 abstracts of scientific papers, labeled with the article's domain (see original repo for more details).
Datasets are published in JSONL format: each line is a JSON object holding the sample text and its list of labels, as in the example below.
{ "text": <article text>, "labels": [<label1>, <label2>, ...] }
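As a minimal sketch of how records in this format could be read, the Python snippet below parses JSONL lines and counts how often each label list occurs. The sample lines, label names, and function names are made up for illustration and are not taken from the datasets; only the "text" and "labels" fields come from the format shown above.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_samples(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed sample per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_label_lists(samples: Iterable[dict]) -> Counter:
    """Count occurrences of each label list, treated as an opaque tuple."""
    return Counter(tuple(sample["labels"]) for sample in samples)


# Hypothetical lines, invented for this example (not real dataset content).
example_lines = [
    '{"text": "first example text", "labels": ["parent_a", "child_a1"]}',
    '{"text": "second example text", "labels": ["parent_a", "child_a2"]}',
    '{"text": "third example text", "labels": ["parent_a", "child_a1"]}',
]

samples = list(iter_samples(example_lines))
label_counts = count_label_lists(samples)
```

Because `iter_samples` accepts any iterable of strings, an open file handle for one of the downloaded splits (opened with `encoding="utf-8"`) can be passed in place of `example_lines`, so a large file is read line by line rather than loaded at once.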
The hierarchical structure of labels in each dataset is documented in this repository.
These datasets have been presented in this paper:
- "Hierarchical Text Classification and its Foundations: a Review of Current Research" - DOI: 10.3390/electronics13071199
Some of these datasets have also been used in:
- "Ticket Automation: an Insight into Current Research with Applications to Multi-level Classification Scenarios" - DOI: 10.1016/j.eswa.2023.119984
- "A multi-level approach for hierarchical Ticket Classification", accepted at WNUT 2022 - https://aclanthology.org/2022.wnut-1.22
These datasets are partially derived from previous work, namely:
- [Amazon] J. Ni, J. Li, J. McAuley, "Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects", EMNLP 2019, doi: 10.18653/v1/D19-1018
- [WOS] K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber and L. E. Barnes, "HDLTex: Hierarchical Deep Learning for Text Classification," 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), 2017, pp. 364-371, doi: 10.1109/ICMLA.2017.0-134
- [Linux Bugs] V. Lyubinets, T. Boiko and D. Nicholas, "Automated Labeling of Bugs and Tickets Using Attention-Based Mechanisms in Recurrent Neural Networks," 2018 IEEE Second International Conference on Data Stream Mining & Processing (DSMP), 2018, pp. 271-275, doi: 10.1109/DSMP.2018.8478511
Files
Total size: 462.4 MB
MD5 checksum | Size |
---|---|
md5:45f86f5d9bfb4990003ea1146da98a51 | 413.2 MB |
md5:a66446eb3d70df071d0089a3e6585f25 | 26.6 MB |
md5:840ae579c70e8e62675020f01d332c2e | 22.5 MB |
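The MD5 checksums listed for each file can be used to check that a download completed intact. The sketch below shows one way to do this with Python's standard `hashlib` module; the function names are hypothetical, and the demonstration runs on a small temporary file rather than on an actual dataset archive.

```python
import hashlib
import os
import tempfile


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, listed: str) -> bool:
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Demonstration on a throwaway file (stand-in for a downloaded archive).
payload = b"stand-in content for a downloaded file"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    demo_path = tmp.name

demo_listed = "md5:" + hashlib.md5(payload).hexdigest()
demo_ok = matches_checksum(demo_path, demo_listed)
demo_bad = matches_checksum(demo_path, "md5:" + "0" * 32)
os.remove(demo_path)
```

For a real download, `matches_checksum` would be called with the local file path and the corresponding `md5:` string from the list above. Reading in chunks keeps memory use low for the largest file.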
Additional details
Related works
- Is derived from
- Journal article: 10.1016/j.eswa.2023.119984 (DOI)
- Is described by
- Journal article: 10.3390/electronics13071199 (DOI)
- Is referenced by
- Conference paper: https://aclanthology.org/2022.wnut-1.22 (URL)
- Software: https://gitlab.com/distration/dsi-nlp-publib/-/tree/main/htc-survey-22 (URL)