Product Reviews for Ordinal Quantification

doi:10.5281/zenodo.8405476

Published October 4, 2023 | Version v0.2.1

Dataset Open

1. TU Dortmund University
2. Consiglio Nazionale delle Ricerche

This data set comprises a labeled training set, validation samples, and testing samples for ordinal quantification. The goal of quantification is not to predict the class label of each individual instance, but the distribution of labels in unlabeled sets of data.
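To make the target quantity concrete, the following minimal sketch computes the distribution of labels (the class prevalences) of one sample. The function name `label_prevalences` and the example ratings are hypothetical and only serve as an illustration; in practice the labels of a test sample are unknown and a quantifier has to estimate this vector.

```python
from collections import Counter


def label_prevalences(labels, n_classes=5):
    """Return the fraction of items carrying each star rating 1..n_classes."""
    if not labels:
        raise ValueError("cannot compute prevalences of an empty sample")
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(c, 0) / total for c in range(1, n_classes + 1)]


# Hypothetical star ratings of one sample (made up for illustration).
sample = [5, 4, 5, 3, 5, 1, 4, 5]
prevalences = label_prevalences(sample)
```

A quantification method is evaluated by how close its estimate comes to this vector, not by how many individual ratings it predicts correctly.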

The data is extracted from the McAuley data set of Amazon product reviews, where the goal is to predict the 5-star rating of each textual review. We have sampled this data according to three protocols that are designed for the evaluation of quantification methods.

The first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with equal probability. The second protocol, APP-OQ(50%), is a variant of APP in which only the smoothest 50% of all APP samples are considered. This variant is targeted at ordinal quantification, where classes are ordered and neighboring classes can be assumed to be similar. The 5-star ratings of product reviews lie on an ordinal scale and hence pose such an ordinal quantification task. The third protocol considers "real" distributions of labels. These distributions stem from actual products in the original data set.
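The first two protocols can be sketched as follows. This is a minimal illustration and not the authors' extraction code: it only draws label distributions, whereas the actual protocols also draw data items according to each distribution. The smoothness criterion used here, the sum of squared second-order differences between neighboring classes, is an assumption, and all function names are hypothetical; the linked extraction scripts define the exact procedure.

```python
import random


def sample_app_distribution(n_classes, rng):
    """Draw one label distribution uniformly at random from the probability simplex."""
    # Normalizing i.i.d. Exponential(1) draws yields a Dirichlet(1, ..., 1) sample,
    # which is uniform over the simplex.
    draws = [rng.expovariate(1.0) for _ in range(n_classes)]
    total = sum(draws)
    return [d / total for d in draws]


def jaggedness(p):
    """Sum of squared second-order differences; smaller values mean smoother.

    Assumed smoothness criterion, chosen for illustration.
    """
    return sum((p[i - 1] - 2 * p[i] + p[i + 1]) ** 2 for i in range(1, len(p) - 1))


def select_smoothest(distributions, fraction=0.5):
    """Keep the given fraction of distributions with the lowest jaggedness."""
    n_keep = int(len(distributions) * fraction)
    return sorted(distributions, key=jaggedness)[:n_keep]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
app = [sample_app_distribution(5, rng) for _ in range(1000)]  # APP
app_oq = select_smoothest(app, fraction=0.5)  # APP-OQ(50%)
```

Under this criterion, a distribution that rises and falls gradually across the five ratings is kept, while one that alternates between high and low neighboring classes is discarded.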

Each review is represented by a RoBERTa embedding. In our experience, logistic regression classifiers work well with this representation.

You can extract our data sets yourself, for instance, if you require a raw textual representation. The original McAuley data set is already public and we provide all of our extraction scripts.

Extraction scripts and experiments: https://github.com/mirkobunse/regularized-oq

Original data by McAuley: https://jmcauley.ucsd.edu/data/amazon/

Files

Name	Size
amazon-oq-bk.zip (md5:d20f31fa7ec7ce93c5c90f1143a7ca49)	41.9 GB
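
Because the archive is large, it can be worth checking a download against the MD5 checksum listed above before unpacking it. The following is a minimal sketch; the helper names and the local path `amazon-oq-bk.zip` are assumptions about where the file was saved.

```python
import hashlib

# Checksum listed for amazon-oq-bk.zip on this page.
EXPECTED_MD5 = "d20f31fa7ec7ce93c5c90f1143a7ca49"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Chunked reading keeps memory use constant for a multi-gigabyte archive.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected
```

Calling `is_intact("amazon-oq-bk.zip")` after the download finishes returns `False` if the file is truncated or corrupted.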

Additional details

Funding

European Commission: SoBigData-PlusPlus – SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics (grant 871042)
European Commission: AI4Media – A European Excellence Centre for Media, Society and Democracy (grant 951911)

References

M. Bunse, A. Moreo, F. Sebastiani, M. Senz (2022). Ordinal Quantification through Regularization.
J. McAuley, C. Targett, Q. Shi, A. van den Hengel (2015). Image-based recommendations on styles and substitutes.
