Product Reviews for Ordinal Quantification
- 1. TU Dortmund University
- 2. Consiglio Nazionale delle Ricerche
Description
This data set comprises a labeled training set, validation samples, and testing samples for ordinal quantification. The goal of quantification is not to predict the class label of each individual instance, but the distribution of labels in unlabeled sets of data.
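To make the distinction concrete, the following minimal sketch estimates a label distribution with the simplest baseline, often called "classify and count": it turns per-instance predictions into class prevalences. The function name `classify_and_count` and the toy predictions are illustrative assumptions, not part of the data set or of the authors' code.

```python
from collections import Counter

def classify_and_count(predicted_labels, classes):
    """Estimate class prevalences as relative frequencies of predicted labels."""
    counts = Counter(predicted_labels)
    n = len(predicted_labels)
    return [counts[c] / n for c in classes]

# Hypothetical predictions of a 5-star rating classifier for one sample of reviews.
predictions = [5, 4, 5, 3, 5, 1, 4, 5, 2, 5]
prevalences = classify_and_count(predictions, classes=[1, 2, 3, 4, 5])
print(prevalences)
```

A quantification method is evaluated by comparing such an estimated vector with the true label distribution of the sample, not by checking individual predictions.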
The data is extracted from the McAuley data set of Amazon product reviews, where the goal is to predict the 5-star rating of each textual review. We have sampled this data according to three protocols designed for the evaluation of quantification methods.
The first protocol is the artificial prevalence protocol (APP), in which every possible distribution of labels is drawn with equal probability. The second protocol, APP-OQ(50%), is a variant of APP that keeps only the smoothest 50% of all APP samples. This variant targets ordinal quantification, where classes are ordered and neighboring classes can be assumed to be similar. The 5-star ratings of product reviews lie on an ordinal scale and hence pose such an ordinal quantification task. The third protocol considers "real" distributions of labels, which stem from actual products in the original data set.
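The first two protocols can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' sampling code: the helper names are hypothetical, and the roughness score (sum of squared second differences) is only one plausible way to measure smoothness. The exact criterion is defined in the referenced paper and extraction scripts.

```python
import random

def sample_prevalences(n_classes, rng):
    """Draw one label distribution uniformly from the probability simplex."""
    # Normalized i.i.d. exponential variates follow a flat Dirichlet(1, ..., 1).
    draws = [rng.expovariate(1.0) for _ in range(n_classes)]
    total = sum(draws)
    return [d / total for d in draws]

def roughness(p):
    """Assumed smoothness score: sum of squared second differences (lower is smoother)."""
    return sum((-p[i - 1] + 2 * p[i] - p[i + 1]) ** 2 for i in range(1, len(p) - 1))

rng = random.Random(0)  # fixed seed for reproducibility

# APP: every label distribution over the 5 ratings is equally likely.
app_samples = [sample_prevalences(5, rng) for _ in range(1000)]

# APP-OQ(50%): keep only the smoothest half of the APP samples.
app_oq_samples = sorted(app_samples, key=roughness)[: len(app_samples) // 2]
```

Each sampled vector would then determine how many reviews of each rating are drawn into one evaluation sample.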
The data is represented by a RoBERTa embedding. In our experience, logistic regression classifiers work well with this representation.
You can extract our data sets yourself, for instance if you require a raw textual representation. The original McAuley data set is already public, and we provide all of our extraction scripts.
Extraction scripts and experiments: https://github.com/mirkobunse/regularized-oq
Original data by McAuley: https://jmcauley.ucsd.edu/data/amazon/
Files
Name | Size | Checksum |
---|---|---|
amazon-oq-bk.zip | 41.9 GB | md5:d20f31fa7ec7ce93c5c90f1143a7ca49 |
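Given the size of the archive, it may be worth verifying the download against the MD5 checksum listed above. The following is a minimal sketch; the local path `amazon-oq-bk.zip` and the helper names are assumptions.

```python
import hashlib

EXPECTED_MD5 = "d20f31fa7ec7ce93c5c90f1143a7ca49"  # checksum listed for amazon-oq-bk.zip

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path="amazon-oq-bk.zip"):  # assumed local path of the download
    """Return True if the file at `path` matches the listed checksum."""
    return md5_of_file(path) == EXPECTED_MD5
```

Reading in chunks keeps memory use constant, which matters for a 41.9 GB file.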
Additional details
References
- M. Bunse, A. Moreo, F. Sebastiani, M. Senz (2022). Ordinal Quantification through Regularization.
- J. McAuley, C. Targett, Q. Shi, A. van den Hengel (2015). Image-based recommendations on styles and substitutes.