Multi-Label Quantification

Quantification, variously called "supervised prevalence estimation" or "learning to quantify", is the supervised learning task of generating predictors of the relative frequencies (a.k.a. "prevalence values") of the classes of interest in unlabelled data samples. While many quantification methods have been proposed in the past for binary problems and, to a lesser extent, for single-label multiclass problems, the multi-label setting (i.e., the scenario in which the classes of interest are not mutually exclusive) remains by and large unexplored. A straightforward solution to the multi-label quantification problem could simply consist of recasting the problem as a set of independent binary quantification problems. Such a solution is simple but naïve, since the independence assumption upon which it rests is, in most cases, not satisfied. In these cases, knowing the relative frequency of one class could be of help in determining the prevalence of other, related classes. We propose the first truly multi-label quantification methods, i.e., methods for inferring estimators of class prevalence values that strive to leverage the stochastic dependencies among the classes of interest in order to predict their relative frequencies more accurately. We show empirical evidence that natively multi-label solutions outperform the naïve approaches by a large margin. The code to reproduce all our experiments is available online.


Introduction
Many fields, such as the social sciences, political science, market research, or epidemiology (to name a few), are inherently interested in aggregate data, i.e., in how populations of individuals are distributed according to one or more indicators of interest. Researchers who operate in these areas are little interested in the individuals per se, since in these fields the individuals are relevant only inasmuch as they are members of the population of interest; in other words, disciplines such as the above are interested "not in the needle, but in the haystack" [1].
Sometimes, researchers active in these fields use supervised learning to obtain the data they need. For instance, epidemiologists interested in the distribution of the causes of death across different geographical regions may sometimes need to infer the cause of death of each person by classifying, via a machine-learned text classifier, a verbal description of the symptoms that affected a deceased person [2]. However, these researchers are not specifically interested in the class (representing a given cause of death) to which an individual belongs; rather, their final goal is estimating the prevalence (i.e., relative frequency, or prior probability) of each class in the unlabelled data.
At first glance, estimating these prevalence values via supervised learning looks like a direct application of classification, since one could simply (i) train a classifier on labelled data, (ii) use this classifier to issue label predictions for each unlabelled datapoint (i.e., individual) in the population of interest, (iii) count how many datapoints have been attributed to each of the classes of interest, and (iv) normalize the counts by the total number of unlabelled datapoints, thus obtaining the estimated relative frequencies of the classes. However, there is by now abundant evidence [3,4,5,6,7] that such an approach, known in the literature as the "Classify and Count" (CC) method, yields poor class prevalence estimates when the distribution of the unlabelled datapoints across the classes differs substantially from the analogous distribution observed during training. This latter condition is typically known as prior probability shift [8,9]; in the aforementioned disciplines this condition is ubiquitous since, quite obviously, there is interest in inferring a distribution only when we assume this unknown distribution to be possibly different from the known distribution that characterizes the training data. The main reason why CC tends to fail in the presence of prior probability shift is that, in this case, the IID assumption on which most classifiers based on supervised learning rest does not hold.
Quantification (variously called learning to quantify, supervised prevalence estimation, or class prior estimation) is the research field concerned with obtaining accurate estimators of class prevalence values via machine learning [5]. Given its obvious relationship with classification, quantification has been extensively studied in the two main settings typical of classification, i.e., the binary setting (binary quantification – BQ), in which the codeframe (i.e., the set of classes of interest) contains only two classes [10,4,11,12,13], and the single-label multiclass setting (single-label quantification – SLQ), which involves three or more mutually exclusive classes [14,15,16,6,7]. Quantification has also been studied in other (less popular) settings, such as the ordinal case (ordinal quantification – OQ), in which there is a total ordering among the classes (see, e.g., [17,3,18]).
One important setting which remains to a large extent unexplored in the quantification literature is multi-label quantification (MLQ), the scenario in which every datapoint may belong to zero, one, or several classes at the same time. Multi-label data arise naturally in many applicative contexts, including the medical domain [19], the legal domain [20], industry [21], or cybersecurity [22], among many others, and have been thoroughly studied under the lens of classification in past literature [23,24,25], especially in the field of text classification [26,27,28,29]; see [30,31] for an overview of this topic. Despite the ubiquity of multi-label data, we are aware of only one previous attempt to cope with the MLQ problem [32]; in this paper we set out to analyze MLQ systematically.
We start by noting that, since quantification systems are expected to be robust to prior probability shift, we need to test them against datasets exhibiting substantial amounts of shift. Our first contribution is the first experimental protocol specifically designed for multi-label quantification, a protocol that guarantees that the data MLQ systems are tested against comply with the above desideratum.
We carry on by noting that a trivial solution to MLQ could simply consist of training one independent binary quantifier for each of the classes in the codeframe. However, such a solution is arguably a "naïve" one, as it assumes the classes to be independent of each other, and thus does not attempt to leverage the class-class correlations, i.e., the stochastic dependencies that may exist among different classes. We show empirical evidence that multi-label quantifiers constructed according to this naïve intuition yield suboptimal performance, and that this happens independently of the method used for training the binary quantifiers.
We then move on to studying different possible strategies for tackling MLQ, and subdivide these strategies into four groups, based on the way they address (if at all) the multi-label nature of the problem. While the first two groups can be instantiated by using already available techniques, the other two cannot, since this would require "aggregation" techniques (see Section 5) that leverage the stochastic relations between classes, and no such method has been proposed before. We indeed propose two such methods, called RQ and LPQ. By means of extensive experiments carried out on 15 publicly available datasets, we show that, when working in combination with a classifier that itself leverages the above-mentioned stochastic relations, LPQ and (especially) RQ outperform all other MLQ techniques.
The rest of this article is structured as follows. After devoting Section 2 to notation and preliminaries, in Section 3 we survey related work on quantification and on coping with multi-label data. In Section 4 we move on to propose an evaluation protocol specifically devised for MLQ. In Section 5 we characterize the four groups of MLQ systems according to the stages in which the multi-label nature of the problem is tackled, and propose the first methods (Section 5.2) that allow doing this also at the "aggregation" stage. In Section 6 we present the experiments we have carried out and discuss the results, while Section 7 wraps up. The code to reproduce all our experiments is available at https://github.com/manuel-francisco/quapy-ml/ .

Notation and Definitions
In this paper we use the following notation. By x we indicate a datapoint drawn from a domain X of datapoints, while by y we indicate a class drawn from a finite, predefined set of classes (also known as a codeframe) Y = {y_1, ..., y_n}, with n the number of classes of interest. Symbol σ denotes a sample, i.e., a non-empty set of (labelled or unlabelled) datapoints drawn from X. By p_σ(y) we indicate the true prevalence of class y in sample σ, by p̂_σ(y) we indicate an estimate of this prevalence, and by p̂^q_σ(y) we indicate the estimate of this prevalence obtained by means of quantification method q. We will denote by p = (p_1, ..., p_n) a real-valued vector. When p is a vector of class prevalence values, then p_i is short for p_σ(y_i).
We first formalize the SLQ problem (Section 2.1) and then propose a definition of the MLQ problem (Section 2.2).

Single-Label Codeframes
In single-label problems, each datapoint x belongs to one and only one class in Y. We denote a datapoint with its true class label as a pair (x, y), indicating that y ∈ Y is the true label of x ∈ X. We represent a set of k datapoints as {(x^(i), y^(i))}_{i=1}^{k}, with x^(i) ∈ X and y^(i) ∈ Y. By L we denote a collection of labelled datapoints, that we typically use as a training set, while by U we denote a collection of unlabelled datapoints, that we typically use for testing purposes.
We define a single-label hard classifier as a function h : X → Y, i.e., a predictor of the class attributed to a datapoint. We will instead take a single-label soft classifier to be a function s : X → ∆^{n−1}, with ∆^{n−1} the unit (n−1)-simplex (a.k.a. probability simplex, or standard simplex), defined as

∆^{n−1} = {(p_1, ..., p_n) ∈ [0,1]^n : Σ_{i=1}^{n} p_i = 1}

i.e., as the domain of all vectors representing probability distributions over Y. We define a single-label quantifier as a function q : 2^X → ∆^{n−1}, i.e., a function mapping samples drawn from X into probability distributions over Y.
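To make the three signatures concrete, the following toy sketch contrasts a hard classifier, a soft classifier, and a quantifier; the "classifiers" are dummy rules (not learned models), and all names are illustrative.

```python
import numpy as np

# Toy codeframe with n = 3 classes; the "classifiers" below are dummy rules,
# not learned models, and serve only to illustrate the three signatures
classes = ["y1", "y2", "y3"]

def hard_classifier(x):
    # h : X -> Y, returns exactly one class label per datapoint
    return classes[len(x) % 3]

def soft_classifier(x):
    # s : X -> Delta^{n-1}, returns a posterior distribution over Y
    scores = np.array([len(x) % 3 + 1.0, 1.0, 2.0])
    return scores / scores.sum()

def quantifier(sample):
    # q : 2^X -> Delta^{n-1}, maps a *set* of datapoints to class prevalences
    counts = np.array([sum(hard_classifier(x) == y for x in sample)
                       for y in classes], dtype=float)
    return counts / len(sample)

sample = ["doc-a", "doc-bb", "doc-ccc", "doc-dddd"]
```

Note that, although `soft_classifier` and `quantifier` both return points on the simplex, the former describes a single datapoint while the latter describes a whole sample, mirroring the distinction drawn in the text.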
Note that, despite the fact that the codomains of soft classifiers and quantifiers are the same, in the former case the i-th component of s(x) denotes the posterior probability Pr(y i |x), i.e., the probability that x belongs to class y i as estimated by s, while in the latter case it denotes the class prevalence value p σ (y i ) as estimated by q.
By d(p, p̂) we denote an evaluation measure for SLQ; these measures are typically divergences, i.e., functions that measure the amount of discrepancy between two probability distributions. Everything we say for single-label problems applies to the binary case as well, since the latter is the special case of the former in which n = 2, with one class typically acting as the "positive class" and the other as the "negative class".

Multi-Label Codeframes
In multi-label problems each datapoint x can belong to zero, one, or more than one class in Y. We denote a datapoint with its true labels as a pair (x, Y), in which Y ⊆ Y is the set of true labels assigned to x ∈ X. A multi-label collection with k datapoints is represented as {(x^(i), Y^(i))}_{i=1}^{k}, with x^(i) ∈ X and Y^(i) ⊆ Y, while we define a multi-label soft classifier as a function s : X → [0,1]^n. Note that, unlike in the single-label case, the codomain of function s is not a probability simplex, but the set of all real-valued vectors (p_1, ..., p_n) such that p_i ∈ [0,1], since the constraint Σ_{i=1}^{n} p_i = 1 is not enforced. We define a multi-label quantifier as a function q : 2^X → [0,1]^n, i.e., a function mapping samples from X into vectors of n class prevalence values, where the class prevalence values in a vector need not sum to 1.
Related Work

Multi-Label Quantification
The task of quantification was first proposed by Forman in 2005 [33], and emerges from the observation that, in some applications of classification, the real goal is the estimation of class prevalence values, and the prediction of individual class labels is nothing but an intermediate step towards achieving it. Forman [33] observed that the so-called "classify and count" method (discussed in the introduction) tends to deliver biased estimators of class prevalence when confronted with situations characterized by prior probability shift. Since then, many contributions to this field have been published, describing new quantification methods (see [5] for an overview), measures for evaluating the accuracy of quantifiers [34,35], and protocols for the experimental evaluation of quantification systems [12,33,36]. However, most of this work has focused on the binary [12,37,33,4,38], single-label multiclass [14,16,15,6], or ordinal [3,17,35,18] versions of the problem, with essentially no attention devoted to the multi-label case.
To the best of our knowledge, the only previously proposed method dealing with multi-label (text) quantification is [32]. The method, dubbed PCC-PAV, mainly consists of performing a refined calibration of the posterior probabilities returned by a set of binary classifiers, each trained to issue soft predictions for a specific class in the codeframe, followed by the application of the (binary version of the) "probabilistic classify and count" (PCC) quantification method (explained in Section 5.1). The refinement amounts to constraining the calibration algorithm (Pair Adjacent Violators – PAV [39]) to generate posterior probabilities whose expected value equals the class prevalence values observed in the training set. This modification results in a complex quadratic programming problem that the authors try to alleviate by imposing a "document unification" heuristic (thus giving rise to the variant PCC-EPAV). However, aside from its substantial computational cost, the method presents two important limitations. The first limitation is that the calibration process encourages the quantifier to stick to the class prevalence values encountered in the training set (hereafter: the training prevalence), and thus makes it inadequate for tackling prior probability shift. Indeed, in [32] the authors test this method by means of an experimental protocol that offers no guarantee that the system is confronted with substantial amounts of prior probability shift; this is witnessed by the fact that the trivial baseline (one that simply returns the training prevalence for every test sample) achieves reasonably high scores. A second limitation is that the method uses independently trained binary classifiers; to put it another way, the method does not in any way address the multi-label nature of the problem, and is billed "multi-label" only because it is tested on multi-label datasets. This method thus squarely falls within the class of "naïve methods" that we have discussed in the introduction, and that we will more formally define as the simplest group of MLQ methods in the scheme that we propose in Section 5.2.

Multi-Label Classification
Apart from PCC-PAV and PCC-EPAV, no other published method explicitly addresses MLQ. However, the field of quantification is closely related to the field of classification, and many of the concepts and principles adopted in quantification have been borrowed from the classification literature. We will thus devote most of this section to reviewing existing approaches for multi-label classification (MLC), since some of the methods we propose in Section 5.2 for MLQ are inspired by them.
In their survey [30], Tsoumakas and Katakis group the approaches to MLC into two main families: "problem transformation" approaches (Section 3.2.1) and "algorithm adaptation" techniques (Section 3.2.2). Later on, a third family of MLC approaches, based on "ensemble methods" (Section 3.2.3), was introduced by Madjarov et al. [40].

Problem Transformation Approaches
The "problem transformation" approach consists of recasting the original multi-label classification problem as a single-label one, so that existing techniques for the single-label case can be applied directly.
The simplest approach relying on this principle is the so-called "binary relevance" (BR) approach [41,25]. BR consists of treating each label independently, thus tackling the multi-label problem as a set of n binary classification problems. BR is a simplistic approach, since the stochastic dependencies among the classes are not taken into account. Despite its simplicity, BR has proven to work well in previous studies [40], and has always been, by far, the most frequently adopted approach for tackling multi-label classification problems.
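The BR decomposition can be sketched in a few lines; here the per-class binary learner is a nearest-centroid rule, purely for illustration (any binary classifier could be plugged in), and all names are ours.

```python
import numpy as np

def train_br(X, Y):
    # One independent binary model per class; here a nearest-centroid rule,
    # purely illustrative of the BR decomposition
    models = []
    for j in range(Y.shape[1]):
        models.append((X[Y[:, j] == 1].mean(axis=0),    # positive centroid
                       X[Y[:, j] == 0].mean(axis=0)))   # negative centroid
    return models

def predict_br(models, X):
    # Each class is predicted independently of the others
    preds = np.zeros((X.shape[0], len(models)), dtype=int)
    for j, (mu_pos, mu_neg) in enumerate(models):
        preds[:, j] = (np.linalg.norm(X - mu_pos, axis=1)
                       < np.linalg.norm(X - mu_neg, axis=1)).astype(int)
    return preds

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
models = train_br(X, Y)
```

Note that nothing in the loop over classes inspects any other column of Y, which is exactly the independence assumption that the text criticizes.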
The "label powerset" (LP) approach consists instead of transforming each unique assignment of labels, as found in the training datapoints, in a new label.For example, if a training datapoint x has labels Y = {y 1 , y 5 , y 6 }, with y 1 , y 5 , y 6 ∈ Y, then the datapoint is relabelled as Y = y 1:5:6 , with y 1:5:6 a new "synthetic" class belonging to a single-label codeframe Y ⊇ Y. Once this relabelling has been performed for all unique label assignments, the problem is treated as a single-label problem using Y as the codeframe.Although this approach models label dependencies [42], the number of possible classes in the new codeframe Y is exponential in the number of classes in the original codeframe, since there are actually 2 n possible assignments of a datapoint to classes in Y (hence the name of the approach).Even if not all label combinations occur in practice in a single dataset, the total number of different assignments that could be found when large multi-label codeframes are used could easily make the problem intractable.Some heuristics based on ensembles (discussed in Section 3.2.3)have been proposed to make the problem tractable.A further limitation of this approach is its low statistical robustness, since, once a training datapoint x with labels Y = {y 1 , y 5 , y 6 } is given the synthetic label y 1:5:6 ∈ Y , it "loses" the individual labels y 1 , y 5 , y 6 , which means that these latter classes end up having fewer training examples than in the standard BR approach.In other words, having more classes means having, on average, fewer training examples per class, which may bring about lower accuracy for the individual (non-synthetic) classes.Yet another problem of the LP approach is the fact that combinations of labels never found in the training set cannot be predicted at all.
The "classifier chains" (CCHAIN) approach [28,43] consists instead of training a sequence of n binary classifiers, one for each class in n, chained so that each classifier receives, as additional inputs, the predictions made by the previous classifiers in the chain.That is, the first classifier h 1 in the chain is trained to predict label y 1 using the original vector x as input, while the i-th classifier in the chain is trained to predict label y i using as inputs the original vector x concatenated with a vector of predictions (h 1 (x), . . ., h i−1 (x)) for labels y 1 , . . ., y i−1 , respectively.While this has turned out to be one of the best-performing MLC methods [40], CCHAIN presents a number of limitations.The first has to do with the fact that CCHAIN is sensitive to the order of the classes; in other words, Y is treated as an ordered sequence (rather than a set) of classes, and reshuffling the sequence would bring about different results, which is undesirable.Some approaches to counter this limitation consist of training an ensemble of CCHAINs using different label orderings in each of them, and then implementing some aggregation policy such as majority voting [28].Yet another limitation of CCHAIN is that the classification phase is (unlike for the simpler BR strategy) not amenable to parallelization, since the application of classifier h i to unlabelled data must occur strictly after the application of classifiers h 1 , ..., h i−1 .

Algorithm Adaptation Approaches
"Algorithm adaptation" approaches consist of adapting single-label classifiers to return multi-label predictions.Schapire and Singer proposed a multi-label adaptation of ADABOOST, called ADABOOST.MH [44], which is based on choosing, for a given iteration of the boosting process, a unique "pivot term" around which all the binary classifiers for the individual classes hinge.ML-KNN [24] is an adaptation of the well-known (lazy) KNN algorithm, which treats each class independently, and relies on the maximum a posteriori (MAP) principle to return multi-label predictions, where the prior and posterior probabilities needed to compute the MAP rule are estimated on the training set.Some attempts to adapt SVMs to the multi-label problem include RANKSVM [23,45], which solves an empirical risk minimization problem framed as a task of ranking involving n hyperplanes, and ML-TSVM [46,47], based on Twin SVMs (TSVMs) [48], i.e., SVMs that, instead of looking for one separation hyperplane, look for a pair of non-parallel separating hyperplanes.Rastogi et al. [49] noted that ML-TSVM does not actually take into account the class-class correlations, and propose a method called Multi-Label Minimax Probability Machine (ML-MPM), inspired by the "kernel trick" used in SVMs.ML-MPM deals with class-class correlations by imposing a regularization term on the predicted sets of classes ("labelsets"), where this term takes into account the label co-occurrence matrix as found in the training set.Other classical methods that have been used within algorithm adaptation approaches include decision trees (DT -see, e.g., [50]) and random forests (RF -see, e.g., [40]).
Benites et al. [51] proposed ML-ARAM and ML-FAM, two multi-label extensions of neural networks based on Adaptive Resonance Theory (ART).The methods work by inferring hierarchical relationships between classes in the codeframe.
Although these systems outperformed ML-KNN in the experimental evaluation, it was later noted that they suffer when facing datasets characterized by large codeframes and high-dimensional feature spaces [52], a common setting in text classification endeavours. A variant of ML-ARAM, called Hierarchical ARAM [52], was then proposed to mitigate this problem; Hierarchical ARAM works by clustering the prototypes that ML-ARAM learns for the input datapoints, generating larger clusters that can be more efficiently accessed at inference time.

Ensemble-Based Approaches
Ensemble-based approaches to MLC aim at improving model performance by combining different base classifiers. Some such ensembles have been proposed as a response to specific problems encountered in other MLC systems.
A first class of such ensembles is based on a divide-and-conquer approach, according to which the set of classes is first partitioned (using any clustering method), and the smaller, cluster-specific multi-label problems are then solved independently of each other. Different instances of this principle exist in the literature. For example, the authors of [53] create a tree in which the internal nodes are associated with sets of classes (called meta-classes) and their children are associated with subsets of the classes of the parent node. The root represents the set of all classes, while the leaves represent single classes. For each node, a multi-label classifier is trained to predict (zero, one, or more) meta-classes, each associated with one of the children nodes. The final set of labels returned is the union of all labels in the leaf nodes reached. HOMER [53] implements the partitioning as balanced clustering, while HOMER-R relies instead on k-means; in both cases the smaller multi-label problems are tackled via the binary relevance approach discussed in Section 3.2.1. RAKEL [54] instead relies on random clustering (with the number of clusters a hyperparameter of the model) for partitioning the classes, and then runs an instance of the label powerset technique (LP – discussed in Section 3.2.1) for each cluster. Since the number of classes in each cluster is smaller than the total number of classes, the combinatorial problem affecting the LP approach is mitigated. In a similar vein, Label Space Clustering (LSC) [55] uses k-means for clustering, followed by an instance of ML-KNN local to each cluster.
Other methods rely instead on label embeddings, i.e., on representing each class by means of a low-dimensional dense vector in a continuous space, so that similar classes (i.e., classes that tend to co-occur with each other) tend to be close to each other in the embedding space.The Cost-Sensitive Label Embedding with Multidimensional Scaling (CLEMS) method [56] is one such example, and one that has proven to be among the best performers in the label embedding arena (see e.g., [57]).
Stacked Generalization (SG) [58] has often been employed to carry out multi-label classification. The idea is to train an ensemble of binary classifiers, one for each class (somewhat similarly to the BR approach), and to use their classification predictions as additional features for training a meta-classifier. Some variants following this intuition include the FUN-TAT and FUN-KFCV [59] approaches for cross-lingual MLC, in which a language-agnostic meta-classifier receives as input the predictions returned by language-specific multi-label classifiers. An advantage of SG over CCHAIN is that the former can be easily parallelized, since it does not depend on the order of presentation of the classes.
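The two-stage idea can be sketched as follows; the base learner is again a toy nearest-centroid rule, and, for brevity, the meta stage is trained on resubstitution predictions (a real SG setup would obtain the stage-1 predictions via cross-validation). All names are illustrative.

```python
import numpy as np

def centroid_learner(X, y):
    # Trivial binary learner used for both stages (illustrative only):
    # predict 1 when a point is closer to the positive-class centroid
    mu1, mu0 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    return lambda Z: (np.linalg.norm(Z - mu1, axis=1)
                      < np.linalg.norm(Z - mu0, axis=1)).astype(float)

def fit_stacked(X, Y):
    # Stage 1: independent binary classifiers, one per class (as in BR)
    base = [centroid_learner(X, Y[:, j]) for j in range(Y.shape[1])]
    stage1 = np.stack([b(X) for b in base], axis=1)
    # Stage 2: meta-classifiers see the original features plus the stage-1
    # predictions, and can therefore pick up class-class correlations
    Xm = np.hstack([X, stage1])
    meta = [centroid_learner(Xm, Y[:, j]) for j in range(Y.shape[1])]
    return base, meta

def predict_stacked(model, X):
    base, meta = model
    Xm = np.hstack([X, np.stack([b(X) for b in base], axis=1)])
    return np.stack([m(Xm) for m in meta], axis=1)

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
model = fit_stacked(X, Y)
```

Unlike in the CCHAIN sketch, all base models (and all meta models) can be trained and applied in parallel, since none depends on the output of another model in the same stage.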

An Evaluation Protocol for Testing Multi-Label Quantifiers
For the evaluation of quantifiers, researchers often use the same datasets that are elsewhere used for testing classifiers. On one hand this looks natural, because both classification and quantification deal with datapoints that belong to classes in a given codeframe. On the other hand it is problematic, since classification deals with estimating class labels for individual datapoints, while quantification deals with estimating class prevalence values for samples (sets) of such datapoints. Simply estimating the accuracy of a quantifier on the entire test set of a dataset used for classification purposes (hereafter: a "classification dataset") would not be enough, since this would amount to a single prediction only, which would be akin to testing a classifier on a single datapoint. As a result, it is customary to generate a dataset to be used for quantification purposes (a "quantification dataset") from a classification dataset by extracting from the test set of the latter a number of samples that will form the test set of the quantification dataset. Exactly how these samples are extracted is specified by an evaluation protocol. Different evaluation protocols for the binary case [12,33], for the single-label multiclass case [36], and for the ordinal case [3], have been proposed in the quantification literature.
For the binary case, the most widely adopted protocol is the so-called artificial prevalence protocol (APP) [33]. The APP consists of extracting many samples from a set of test datapoints at controlled prevalence values. The APP takes four parameters as input: the unlabelled collection U, the sample size k, the number of samples m to draw for each predefined vector of prevalence values, and a grid of prevalence values g (e.g., g = (0.0, 0.1, ..., 0.9, 1.0)). We then generate all the vectors p = (p(⊕), p(⊖)) of n = 2 prevalence values consisting of combinations of values from the grid g that represent valid distributions (i.e., such that the elements in p sum up to 1). For each such prevalence vector, we then draw m different samples of k elements each. The APP thus confronts the quantifier with scenarios characterized by class prevalence values very different from the ones seen during training. This protocol is, by far, the most popular one in the quantification literature (see, e.g., [33,37,38,60,10,61,62,6,7,63]).
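A toy sketch of the binary APP sampling loop (pool and parameter names are ours, chosen for illustration):

```python
import random

def app_binary(U_pos, U_neg, k, m, grid, rng):
    # For each prevalence p in the grid, draw m samples of size k containing
    # round(p * k) positives and k - round(p * k) negatives
    samples = []
    for p in grid:
        n_pos = round(p * k)
        for _ in range(m):
            s = rng.sample(U_pos, n_pos) + rng.sample(U_neg, k - n_pos)
            samples.append((p, s))
    return samples

rng = random.Random(0)
U_pos = [f"pos{i}" for i in range(50)]
U_neg = [f"neg{i}" for i in range(50)]
out = app_binary(U_pos, U_neg, k=20, m=2, grid=[0.0, 0.5, 1.0], rng=rng)
```

Each returned pair carries the target prevalence alongside the sample, so that the quantifier's estimate can later be compared against it.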
For the single-label multiclass case (which is the closest to our concerns) the APP needs to take a slightly different form, since the number of vectors p = (p(y_1), ..., p(y_n)) representing valid distributions for arbitrary n is combinatorially high, for any reasonable grid of class prevalence values. As a solution, one can generate a number of random points on the probability simplex, with the individual class prevalence values not even constrained to lie on a predetermined grid; when this number is high enough, it probabilistically covers the entire spectrum of valid combinations. However, even this form of the APP is not directly applicable to the multi-label scenario, because in this latter case the class prevalence values in a valid vector do not necessarily sum up to 1. One could attempt to simply treat the multi-label problem as a set of independent binary problems, and then apply the APP independently to each of the classes. Unfortunately, such a solution is impractical, for at least three reasons.
• The first reason is that the number of samples thus generated is exponential in n, since there are m|g|^n such combinations. Note that n (the number of classes in the codeframe) cannot be set at will, and thus, in order to keep the number of combinations tractable in cases in which n is large (in our experiments we use datasets with up to n = 983 classes), one would be compelled to set m = 1 and choose a very coarse grid g of values (and this would anyway prove insufficient when dealing with large codeframes).
• The second, and perhaps most problematic, reason is that, in any case, many of the combinations are not even realisable. That is, there may be prevalence vectors for which no sample could be drawn at all. To see why, assume that, among others, we have classes y_1, y_2, y_3 in our codeframe, and assume that in our test set U, every time a datapoint is labelled with y_1 it is also labelled with either y_2 or y_3, but not with both. This means that all samples σ for which prevalence values p_σ(y_1) > p_σ(y_2) + p_σ(y_3) are requested cannot be generated.
• Yet another reason why applying the APP would be, in any case, undesirable, is that the classes in most multi-label datasets typically follow a power-law distribution, i.e., there are few very popular classes and a long tail of many rare, or extremely rare, classes. The APP will sometimes impose high prevalence values for what in reality are very rare labels, which means that the sampling must be carried out with replacement; this generates samples consisting of many replicas of the same few datapoints, which is clearly undesirable.
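For the multiclass form of the APP mentioned above (random points on the probability simplex), a standard way to draw uniformly from the simplex is the sorted-uniforms (Kraemer) construction; a minimal sketch, with illustrative names:

```python
import numpy as np

def random_prevalence_vectors(n, m, seed=0):
    # Kraemer method: sort n-1 uniform draws in [0,1] and take the consecutive
    # gaps; the resulting vectors are uniformly distributed on the (n-1)-simplex
    rng = np.random.default_rng(seed)
    u = np.sort(rng.random((m, n - 1)), axis=1)
    bounds = np.hstack([np.zeros((m, 1)), u, np.ones((m, 1))])
    return np.diff(bounds, axis=1)   # gaps are non-negative and sum to 1

P = random_prevalence_vectors(n=4, m=1000)
```

Note that no grid is involved: the prevalence values can take any value in [0,1], which is what allows this form of the protocol to scale with n.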
For all these reasons we have designed a new protocol for MLQ, that we call ML-APP since it is an adaptation of the APP to multi-label datasets. The protocol amounts to performing rounds of the APP, each targeting a specific class, but with the range of prevalence values explored for each class limited by the number of available positive examples. This allows all samples to be drawn without replacement. In each round, a class y_i is actively sampled at controlled prevalence values, while the prevalence values for the remaining classes are not predetermined. Pseudocode describing the ML-APP protocol is shown as Algorithm 1.
The ML-APP covers the entire spectrum of class prevalence values, by drawing without replacement, for every single class. This means that, for large enough classes, there will be samples for which the prevalence of the class exhibits a large prior probability shift with respect to the training prevalence, while for rare classes the amount of shift will be limited by the availability of positive examples. Note that, when actively sampling a class y_i, any other class y_j will co-occur with it with a probability that depends on the correlation between y_i and y_j. For cases in which the class y_i being sampled is completely independent of the class y_j, the samples generated will display a class prevalence for y_j that is approximately equal to the prevalence of y_j in U. In other words, samples generated via the ML-APP preserve the stochastic correlations between the classes while also exhibiting widely varying degrees of prior probability shift. Finally, note that the total number of samples that can be generated via the ML-APP can vary from dataset to dataset (even for datasets with the same number of classes), and depends on the actual number of positive instances for each class contained in the dataset. In any case, the maximum number of samples that can be generated via the ML-APP is bounded by mn|g|.
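Following the description above, the core sampling loop of the ML-APP can be sketched as follows (a toy version with names of our own choosing; the full implementation is in the repository linked in the introduction):

```python
import random

def ml_app(U, labels_of, classes, k, m, grid, seed=0):
    # One round of the APP per class: split U into positives and negatives for
    # the target class, and skip prevalence values that cannot be realised
    # without sampling with replacement
    rng = random.Random(seed)
    samples = []
    for y in classes:
        U_pos = [x for x in U if y in labels_of[x]]
        U_neg = [x for x in U if y not in labels_of[x]]
        for p in grid:
            n_pos = round(p * k)
            if n_pos > len(U_pos) or k - n_pos > len(U_neg):
                continue  # prevalence not realisable without replacement
            for _ in range(m):
                s = rng.sample(U_pos, n_pos) + rng.sample(U_neg, k - n_pos)
                samples.append((y, p, s))
    return samples

U = list(range(10))
labels_of = {x: ({"a"} if x < 3 else {"b"}) for x in U}
out = ml_app(U, labels_of, classes=["a", "b"], k=4, m=1, grid=[0.0, 0.5, 1.0])
```

Only the target class y is sampled at a controlled prevalence; the prevalence of every other class in a sample emerges from its co-occurrence with y, which is what preserves the class-class correlations.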

Performing Multi-Label Quantification
In this section we present the multi-label quantification methods that we will experimentally compare in Section 6.
Throughout this paper we will focus on aggregative quantification methods, i.e., methods that require all unlabelled datapoints to be classified (by a hard or a soft classifier, depending on the method) as an intermediate step, and that aggregate the individual (hard or soft) predictions in some way to generate the class prevalence estimates. The reason why we focus on aggregative methods is that they are by far the most popular quantification methods in the literature, and that this focus allows for an easier exposition. We will later show how the most interesting intuitions for performing MLQ that we discuss in this paper also apply to the non-aggregative case.
Algorithm 1 ML-APP protocol for multi-label data.
Input: U, a test collection
Input: k, the sample size
Input: m, the number of samples to draw for each prevalence value
Input: g, the grid of prevalence values
S = {}
for y_i ∈ Y do
  // Split U in two sets, with U_{y_i} containing all datapoints with label y_i
  // and U_{ȳ_i} containing the others

Before presenting truly multi-label quantifiers, though, we will introduce a number of single-label (aggregative) multiclass quantification methods from the literature, that will form the basis for our extensions to the MLQ case.

Single-Label Multiclass Quantification Methods
Classify and Count (CC), already hinted at in the introduction, is the naïve quantification method, and the one that is used as a baseline that all genuine quantification methods are supposed to beat. Given a hard classifier h and a sample σ, CC is formally defined as

$$\hat{p}^{\mathrm{CC}}_{\sigma}(y_i) = \frac{|\{\mathbf{x} \in \sigma : h(\mathbf{x}) = y_i\}|}{|\sigma|}$$

In other words, the prevalence of a class y_i is estimated by classifying all unlabelled datapoints, counting the number of datapoints that have been assigned to y_i, and dividing the result by the total number of datapoints.
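As a concrete illustration, CC amounts to a few lines of code (a sketch; here `h` is any hard classifier, represented simply as a function):

```python
def classify_and_count(h, sample, classes):
    """CC: classify every datapoint in the sample and return, for each
    class, the fraction of datapoints assigned to it."""
    preds = [h(x) for x in sample]
    return [preds.count(y) / len(sample) for y in classes]
```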
The Adjusted Classify and Count (ACC) method (see [4,63]) attempts to correct the estimates returned by CC by relying on the law of total probability, according to which

$$\hat{p}^{\mathrm{CC}}_{\sigma}(y_i) = \sum_{y_j \in \mathcal{Y}} m_{ij} \cdot p_{\sigma}(y_j)$$

which can be more conveniently rewritten using matrix notation as

$$\mathbf{p}^{\mathrm{CC}}_{\sigma} = M_h \, \mathbf{p}^{\mathrm{ACC}}_{\sigma} \qquad (3)$$

where p^CC_σ is the vector representing the distribution across Y of the datapoints as estimated via CC, and matrix M_h contains the misclassification rates of h, i.e., m_ij is the probability that h will assign class y_i to a datapoint whose true label is y_j. Matrix M_h is unknown, but can be estimated via k-fold cross-validation, or on a validation set. Vector p^ACC_σ is the true distribution; it is unknown, and the ACC method consists of estimating it by solving the system of linear equations of Equation 3 (see [64] for more on the multiclass version of ACC).
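The ACC correction can be sketched as follows (a minimal numpy sketch; the misclassification matrix M is assumed to have already been estimated, e.g., via cross-validation):

```python
import numpy as np

def adjusted_classify_and_count(p_cc, M):
    """ACC: solve p_CC = M @ p_ACC for p_ACC, where m_ij is the
    probability that h assigns class y_i to an item of true class y_j."""
    p_acc, *_ = np.linalg.lstsq(M, p_cc, rcond=None)
    # clip to [0, 1] and renormalize to obtain a valid distribution
    p_acc = np.clip(p_acc, 0.0, 1.0)
    return p_acc / p_acc.sum()
```

Solving via least squares rather than direct inversion keeps the sketch usable even when M is ill-conditioned.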
While CC and ACC rely on the crisp counts returned by a hard classifier h, it is possible to define variants of them that instead use the expected counts computed from the posterior probabilities returned by a calibrated probabilistic classifier s [11]. This is the core idea behind Probabilistic Classify and Count (PCC) and Probabilistic Adjusted Classify and Count (PACC). PCC is defined as

$$\hat{p}^{\mathrm{PCC}}_{\sigma}(y_i) = \frac{1}{|\sigma|} \sum_{\mathbf{x} \in \sigma} \Pr(y_i|\mathbf{x})$$

while PACC is defined as

$$\mathbf{p}^{\mathrm{PCC}}_{\sigma} = M_s \, \mathbf{p}^{\mathrm{PACC}}_{\sigma} \qquad (5)$$

Equation 5 is identical to Equation 3, but for the fact that the leftmost part is replaced with the prevalence values estimated via PCC, and for the fact that the misclassification rates of the soft classifier s (i.e., the rates computed as expectations using the posterior probabilities) are used.
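A PCC sketch in the same vein (here `posteriors` is the |σ| × n matrix of posterior probabilities returned by the soft classifier; PACC would then apply the ACC-style correction to this estimate):

```python
import numpy as np

def probabilistic_classify_and_count(posteriors):
    """PCC: estimate each class prevalence as the expected count, i.e.,
    the mean posterior probability over the sample."""
    return np.asarray(posteriors).mean(axis=0)
```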
Methods CC, ACC, PCC, and PACC are sometimes collectively referred to as the "CC variants", and are all (as is easy to verify) aggregative quantification methods. Although more sophisticated quantification systems have been proposed in the literature, the CC variants have recently been found to be competitive contenders when properly optimized [65]. This, along with their simplicity, has motivated us to focus on the four CC variants as a first step towards devising multi-label quantifiers.
A further, very popular (aggregative) quantification method is the one proposed in [66], which is often called SLD, from the names of its proposers, and which was called EMQ in [16]. SLD was the best performer in a recent data challenge centred on quantification [36], and consists of training a probabilistic classifier and then using the EM algorithm (i) to update the posterior probabilities that the classifier returns, and (ii) to re-estimate the class prevalence values of the test set. Steps (i) and (ii) are carried out in an iterative, mutually recursive way, until mutual consistency, defined as the situation in which

$$\hat{p}_{\sigma}(y_i) = \frac{1}{|\sigma|} \sum_{\mathbf{x} \in \sigma} \Pr(y_i|\mathbf{x})$$

is achieved for all y_i ∈ Y.
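The EM loop of SLD can be sketched as follows (a simplified numpy sketch under our own naming; `posteriors` are the classifier's outputs on the test sample and `train_prev` the training class prevalences):

```python
import numpy as np

def sld(posteriors, train_prev, n_iter=1000, eps=1e-6):
    """SLD/EMQ: iteratively (i) rescale the posteriors by the ratio of the
    current prevalence estimate to the training prevalence, and (ii)
    re-estimate the prevalences from the rescaled posteriors."""
    P = np.asarray(posteriors, dtype=float)
    train_prev = np.asarray(train_prev, dtype=float)
    q = train_prev.copy()
    for _ in range(n_iter):
        W = P * (q / train_prev)           # step (i): update posteriors
        W /= W.sum(axis=1, keepdims=True)  # renormalize per datapoint
        q_new = W.mean(axis=0)             # step (ii): update prevalences
        if np.abs(q_new - q).max() < eps:  # mutual consistency reached
            break
        q = q_new
    return q
```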

Multi-Label Quantification
In this paper we will describe and compare many different (aggregative) MLQ methods. In order to better assess their relative merits, we subdivide them into four different groups, depending on whether the correlations between different classes are brought to bear in the classification phase (i.e., by the classifier on which an aggregative quantifier rests), or in the aggregation phase (i.e., in the phase in which the individual predictions are aggregated), or in both phases, or in neither of the two phases.
The first and simplest such group is that of MLQ methods that treat each class as completely independent, and thus solve n independent binary quantification problems. We call such an approach BC+BA ("binary classification followed by binary aggregation"), since in both the classification phase and the aggregation phase it looks at the multi-label task as n independent binary tasks, thus disregarding, in both phases, the correlations among classes when predicting their relative frequencies. This is akin to the binary relevance (BR) problem transformation described in Section 3.2.1 for classification, and consists of transforming the multi-label dataset L into a set of binary datasets L_1, . . ., L_n, in which the datapoints are relabelled using the indicator function 1[z], which returns 1 (the "positive class") if z is true and 0 (the "negative class") otherwise. BC+BA methods then train one quantifier q_i for each training set L_i. At inference time, the prevalence vector for a given sample σ is computed as p̂^{BC+BA}_σ = (p̂^{q_1}_σ(1), p̂^{q_2}_σ(1), . . ., p̂^{q_n}_σ(1)). Although this is technically a multi-label quantification method, BC+BA is actually the trivial solution that we expect any truly multi-label quantifier to improve upon.
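The BR-style problem transformation underlying BC+BA can be sketched as follows (a toy sketch, with datapoints paired with their label sets):

```python
def binary_relevance_datasets(L, classes):
    """Transform a multi-label training set L = [(x, labels), ...] into
    one binary dataset per class via the indicator 1[y in labels]."""
    return {y: [(x, int(y in labels)) for x, labels in L] for y in classes}
```

One binary quantifier is then trained on each of the resulting datasets, independently of the others.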
A second, less trivial group is that of MLQ methods based on the use of binary aggregative quantifiers working on top of (truly) multi-label classifiers. Methods in this group consist of n independent binary aggregative quantifiers (e.g., built via one of the methods described in Section 5.1) that rely on the (hard or soft) predictions returned by a classifier natively designed to tackle the multi-label problem (e.g., built via one of the methods described in Section 3). Each binary quantifier takes into account only the predictions for its associated class, disregarding the predictions for the other classes. This represents a straightforward solution to the MLQ problem, as it simply combines already existing technologies (binary aggregative quantifiers built via off-the-shelf methods and (truly) multi-label classifiers built via off-the-shelf methods). In such a setting, the classification stage is informed by the class-class correlations, but the quantification methods in charge of producing the class prevalence estimates for each class do not pay attention to any such correlation, and are disconnected from each other. Since methods in this group will consist of a (truly) multi-label classification phase followed by a binary quantification phase, we will refer to this group of methods as MLC+BA.
We next propose a third group of MLQ systems, i.e., ones consisting of natively multi-label quantification methods relying on the outputs of n independent binary classifiers. Methods like these represent a non-trivial novel solution for the field of quantification, because no natively multi-label quantification method has been proposed so far in the literature; in Section 5.2.1 we propose some such methods. In order to clearly evaluate the merits of such a multi-label aggregation phase, as the underlying classifiers we use independent binary classifiers only. For this reason, we will call this group of methods BC+MLA.
The fourth and last group of methods we consider amounts to combining any (truly) multi-label classification method with any (truly) multi-label quantification method among our newly proposed ones, thus bringing to bear the class dependencies both at the classification stage and at the aggregation stage.We call this group of methods MLC+MLA.
Figure 1 illustrates in diagrammatic form the four types of multi-label quantification methods we study in this paper.
In order to generate members of these four classes, we already have off-the-shelf components for implementing the binary classification, multi-label classification, and binary aggregation phases, but we have no known method from the literature to implement multi-label aggregation; Sections 5.2.1 and 5.2.2 are devoted to proposing two novel such methods.

Bringing to Bear Class-Class Correlations at the Aggregation Stage via Regression
Let us assume we have a multi-label quantifier q of type BC+BA or MLC+BA. Our idea is to detect how quantifier q fails in capturing the correlations between classes, and correct q accordingly. This is somewhat similar to the type of correction implemented in ACC and PACC. However, we will formalize this intuition as a general regression problem, thus not necessarily assuming this correction to be linear (as ACC and PACC instead do).
More concretely, we split our training set L into two parts, L_Q (that we use for training our quantifier q) and L_R (that we use for training a regressor r, i.e., a function r : R^n → R^n). We then use the ML-APP protocol described in Section 4 to extract, from set L_R, a set R = {σ_i ∼ ML-APP(L_R, k, m, g)} of l samples, where k (the sample size), m (the number of samples to draw for each prevalence value on the grid), and g (the grid of prevalence values) are the parameters of the ML-APP protocol. Having done this, we first train our quantifier q on L_Q; since q is a multi-label quantifier, it is a function that, given a sample σ, returns a vector p̂^q_σ of n class prevalence values, not necessarily summing up to 1. We then apply q to all the samples in R; as a result, for each sample σ_i ∈ R we have a pair (p̂^q_σi, p_σi), where p̂^q_σi is the vector of the n prevalence values estimated by q, and p_σi is the vector of the n true prevalence values. We use this set of l pairs as the training set for a multi-output regressor r : R^n → R^n that takes as input a vector of n "uncorrected" prevalence values and returns a vector of n "corrected" prevalence values; for training the regressor we use any off-the-shelf multi-output regression algorithm. Note that the regressor indeed captures the correlations between classes, since it receives as input, for each sample, the class prevalence estimates for all the n classes altogether. At inference time, given an (unlabelled) sample σ, we first obtain a preliminary estimate of the class prevalence values p̂^q_σ via q, and then apply the correction learned by r, thus computing p̂^r_σ = r(p̂^q_σ). We then normalize, via clipping, every prevalence value in p̂^r_σ so that it falls in the [0, 1] interval, and return the estimate. The method (which we here call RQ, for "regression-based quantification") is described succinctly as Algorithm 2 (training phase) and Algorithm 3 (inference phase).

Algorithm 2 MLQ correction via regression: Training
Input: L, a training collection
Input: q, a multi-label quantifier
Input: r, a multi-output regressor
Input: k, m, g, the parameters of the ML-APP

Algorithm 3 MLQ correction via regression: Inference
Input: σ, an unlabelled sample
Input: q, a trained multi-label quantifier
Input: r, a trained multi-output regressor
Return: p̂^r_σ

As noted above, the regressor exploits the class-class correlations during the aggregation phase. This means that, according to the subdivision of MLQ methods illustrated in Figure 1, the addition of a regression layer on top of an existing quantifier q has the effect of transforming a BC+BA method into a BC+MLA method, or of transforming a MLC+BA method into a MLC+MLA method.
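A minimal sketch of the RQ correction (here with a plain least-squares linear regressor standing in for the SVR used in the paper; the helper names are our own):

```python
import numpy as np

def train_rq(P_hat, P_true):
    """RQ training: learn a multi-output correction r : R^n -> R^n from
    pairs of (estimated, true) prevalence vectors computed on validation
    samples drawn via the ML-APP (least squares with intercept)."""
    X = np.hstack([P_hat, np.ones((len(P_hat), 1))])  # add intercept column
    W, *_ = np.linalg.lstsq(X, P_true, rcond=None)
    return W

def apply_rq(W, p_hat):
    """RQ inference: correct a vector of n prevalence estimates and clip
    every corrected value to the [0, 1] interval."""
    x = np.append(p_hat, 1.0)
    return np.clip(x @ W, 0.0, 1.0)
```

Because the regressor sees all n estimates at once, the learned correction can exploit class-class correlations that the underlying binary quantifiers ignore.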

Bringing to Bear Class-Class Correlations at the Aggregation Stage via Label Powersets
We investigate an alternative way of modelling class-class correlations at the quantification level, this time by drawing inspiration from label powersets (LPs; see Section 3.2.1) and the heuristics for making their application tractable (Section 3.2.3).
LP is a problem transformation technique devised for transforming any multi-label classification problem into a single-label one by replacing the original codeframe with another one that encodes subsets of this codeframe into "synthetic" classes (see Section 3.2.1 for details). This problem transformation is directly applicable to the case of quantification as well. Of course, the combinatorial explosion of the number of synthetic classes has to be controlled somehow but, fortunately enough, the same heuristics investigated for MLC can come to the rescue here.
Our method (which we here call LPQ, for "label powerset -based quantification") consists of generating, by means of any existing clustering algorithm, a set C of (non-overlapping) clusters consisting of few classes each, before applying the LP strategy, so that the number of possible synthetic classes remains under reasonable bounds. For example, if our codeframe has n = 100 classes, extracting 25 clusters of 4 classes each results in the maximum possible number of synthetic classes being 25 · 2^4 = 400, which is much smaller than the original 2^100. We perform this clustering by treating classes in Y as instances and training datapoints as features, so that a class is represented by a binary vector of datapoints, where 1 indicates that the datapoint belongs to the class and 0 that it does not. The clustering algorithm is thus expected to put classes displaying similar assignment patterns (i.e., that tend to label the same documents) in the same cluster.
Once we have performed the clustering, given the set of classes Y_c ⊆ Y contained in each cluster c ∈ C, we need to take the single-label codeframe Ȳ_c determined by the multi-label-to-single-label mapping (a mapping that, e.g., would attribute to the set of classes {y_1, y_5, y_6} ⊆ Y_c the synthetic class y_{1:5:6} ∈ Ȳ_c) and train a single-label quantifier on it; this needs to be repeated for each cluster. At inference time, in order to obtain class prevalence estimates for the classes in Y_c from the predictions made for the classes in Ȳ_c by the above-mentioned quantifier, we have to "reverse" the multi-label-to-single-label mapping, so that the estimated prevalence value of y_i ∈ Y_c is the sum of the estimated prevalence values of all synthetic classes ȳ ∈ Ȳ_c that involve y_i; performing this for each cluster c ∈ C returns prevalence estimates for all classes y_i ∈ Y.
More formally, let us define a matrix A that records the label assignments in cluster c, so that a_ij = 1 if the set of classes represented by the synthetic class ȳ_i ∈ Ȳ_c contains class y_j ∈ Y_c, and a_ij = 0 if this is not the case. Note that A has as many rows as there are classes in Ȳ_c and as many columns as there are classes in Y_c. Once our single-label quantifier q produces an output p̂^q_σ, we only need to compute the product (p̂^q_σ)⊤A to obtain the vector of prevalence estimates for the classes in Y_c. Performing all this for each cluster c ∈ C returns prevalence estimates for all classes y_i ∈ Y. The example shown in Figure 2 illustrates this for a cluster with the three classes {y_1, y_2, y_3}:

p̂_σ(y_1) = p̂^q_σ(y_1) + p̂^q_σ(y_{1:2}) + p̂^q_σ(y_{1:3}) + p̂^q_σ(y_{1:2:3}) = 0.30
p̂_σ(y_2) = p̂^q_σ(y_2) + p̂^q_σ(y_{1:2}) + p̂^q_σ(y_{2:3}) + p̂^q_σ(y_{1:2:3}) = 0.57
p̂_σ(y_3) = p̂^q_σ(y_3) + p̂^q_σ(y_{1:3}) + p̂^q_σ(y_{2:3}) + p̂^q_σ(y_{1:2:3}) = 0.43

In principle, the disadvantage of this method is that it cannot learn the correlations between classes that belong to different clusters. However, the method is based on the intuition that classes that are indeed correlated tend to end up in the same cluster, and that the inability to model correlations between classes that belong to different clusters will be more than compensated by the reduction in the number of combinations that one needs to take into account.
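The reversal of the mapping is a single matrix product, which can be sketched as follows (a numpy sketch; the synthetic prevalence values in the usage below are made up by us so that the per-class totals match the 0.30/0.57/0.43 of the example):

```python
import numpy as np

def lp_to_class_prevalences(p_synth, A):
    """LPQ: map the prevalence estimates of the synthetic (powerset)
    classes back to the cluster's original classes via (p^q)^T A,
    where a_ij = 1 iff synthetic class i contains original class j."""
    return np.asarray(p_synth) @ np.asarray(A)
```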
In the experiments of Section 6.5 we explore different configurations of this approach, in which we combine different clustering strategies.

Experiments
In this section we turn to describing the experiments we have carried out in order to evaluate the performance of the different methods for MLQ that we have presented in the previous sections. In Section 6.1 we discuss the evaluation measures we adopt, while in Section 6.2 we describe the datasets on which we perform our experiments. In Section 6.3 we report experiments aimed at comparing the four groups of methods discussed in Section 5.2 and illustrated in Figure 1. In Sections 6.4 and 6.5 we then move on to exploring further instances of methods belonging to those four groups.

Evaluation Measures
Any evaluation measure for binary quantification can easily be turned into an evaluation measure for multi-label quantification, since evaluating a multi-label quantifier can be done by evaluating how well the prevalence value p(y_i) of each class y_i ∈ Y is approximated by the prediction p̂(y_i). As a result, it is natural to take a standard measure d(p, p̂) for the evaluation of binary quantification, and turn it into a measure for the evaluation of multi-label quantification. (This is exactly what we do in multi-label classification, in which we take F_1, a standard measure for the evaluation of binary classification, and turn it into macroaveraged F_1, which is the standard measure for the evaluation of multi-label classification.) The study of evaluation measures for binary (and single-label multiclass) quantification performed in [34] concludes that the most satisfactory such measures are absolute error and relative absolute error; these are the two measures that we are going to use in this paper. In the binary case, absolute error is defined as

$$\mathrm{ae}(p, \hat{p}) = |\hat{p} - p|$$

which yields the multi-label version

$$\mathrm{AE}(\mathbf{p}, \hat{\mathbf{p}}) = \frac{1}{n} \sum_{y_i \in \mathcal{Y}} |\hat{p}(y_i) - p(y_i)|$$

In the binary case, relative absolute error is instead defined as

$$\mathrm{rae}(p, \hat{p}) = \frac{1}{2} \left( \frac{|\hat{p} - p|}{p} + \frac{|(1-\hat{p}) - (1-p)|}{1-p} \right)$$

which yields the multi-label version

$$\mathrm{RAE}(\mathbf{p}, \hat{\mathbf{p}}) = \frac{1}{n} \sum_{y_i \in \mathcal{Y}} \mathrm{rae}(p(y_i), \hat{p}(y_i))$$

Since RAE is undefined when p(y_i) = 0 or p(y_i) = 1, we smooth the probability distributions p and p̂ via additive smoothing; in the binary case, this maps a distribution p = (p, (1-p)) into

$$p_{\epsilon} = \frac{p + \epsilon}{2\epsilon + 1}$$

with ε the smoothing factor, that we set, following [4], to ε = (2|σ|)^{-1}.
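The two measures can be sketched as follows (a numpy sketch, with the smoothing applied per class as in the binary case):

```python
import numpy as np

def ae(p, p_hat):
    """Multi-label absolute error: mean absolute difference between the
    true and estimated prevalence values over the n classes."""
    return np.abs(np.asarray(p) - np.asarray(p_hat)).mean()

def rae(p, p_hat, n_sample):
    """Multi-label relative absolute error with additive smoothing,
    using eps = 1 / (2 * |sigma|)."""
    eps = 1.0 / (2 * n_sample)
    p = (np.asarray(p) + eps) / (1 + 2 * eps)        # smooth each class
    p_hat = (np.asarray(p_hat) + eps) / (1 + 2 * eps)
    per_class = 0.5 * (np.abs(p_hat - p) / p
                       + np.abs((1 - p_hat) - (1 - p)) / (1 - p))
    return per_class.mean()
```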
Note that we do not use, as an evaluation measure, concordance ratio (CR), despite the fact that it is the measure used in [32], the only paper in the literature that addresses multi-label quantification. The reason is that, as later shown in [34], the mathematical properties of CR do not make it (similarly to other measures used in the quantification literature in the past, such as the Kullback-Leibler Divergence) a satisfactory measure for quantification; see [34, pp. 272-273] for details.
In the experiments we describe in Section 6, the trends we observe and the conclusions we draw for AE hold for RAE as well. We will thus report our results in terms of AE only, deferring the results in terms of RAE to Appendix A.

Datasets
For our experiments we use 15 popular MLC datasets, including 3 datasets specific to text classification (Reuters-21578, Ohsumed [69], and RCV1-v2), plus all the datasets linked from the SCIKIT-MULTILEARN package [57] with the exception of the RCV1-v2 subsets (we omit them since we already include the much larger collection from which they were extracted). We refer to the original sources for detailed descriptions of these datasets. For the three textual datasets, we pre-process the text by applying lowercasing, stop word removal, and punctuation removal, as implemented in SCIKIT-LEARN, and by masking numbers with a special token. We retain all terms appearing at least 5 times in the training set, and convert the resulting set of words into (sparse) tfidf-weighted vectors using SCIKIT-LEARN's default vectorizer. For all datasets, we remove very rare classes (i.e., those with fewer than 5 training examples) from consideration, since they pose a problem when it comes to generating validation (i.e., held-out) sets. Indeed, since we optimize the hyperparameters for all the methods we use (as explained below), we need validation sets, and it is sometimes impossible to have positive examples for these classes in both the training and validation sets (let us remember that pure stratification in multi-label datasets is not always achievable, as argued in [67,68]). Note that all this only concerns the training set, and has nothing to do with the test set, which can include (and indeed includes, for most datasets) extremely rare classes, since removing classes that are rare in the test set would lead to an unrealistic experimentation. Note also that removing classes that are rare in the training set is "fair", i.e., it equally affects all methods that we experimentally compare, since all of them involve hyperparameter optimization. Finally, note that, whenever a method requires generating additional (and maybe nested) validation sets, it is inevitably exposed to the
problems mentioned above, and can thus be at a disadvantage with respect to other methods that do not require additional validation data. Table 1 shows a complete description of the datasets we use (after deleting rare classes), along with some useful statistics proposed in [70,71], while Figure 3 shows the distribution of prevalence values for each dataset. Note that, in most datasets, this distribution obeys a power law.
We set the parameters of the ML-APP for generating test samples (see Section 4) as follows. We fix the sample size to k = 100 in all cases. We set the grid of prevalence values to g = {0.00, 0.01, . . ., 0.99, 1.00} in all cases but for dataset Delicious, since in the latter the number of combinations thus generated would be intractable, given that this is a dataset with no fewer than 983 classes; for Delicious we use the coarser-grained grid g = {0.00, 0.05, . . ., 0.95, 1.00}. We set m (the number of samples to be drawn for each prevalence value) independently for each dataset, to the smallest number that yields more than 10,000 test samples (m ranges from 1 in Delicious to 40 in Birds).
We break down the results into three groups, each corresponding to a different amount of shift. The rationale behind this choice is to allow for a more meaningful analysis of the quantifiers' performance, since the APP (and, by extension, the ML-APP) has often been the subject of criticism for generating samples exhibiting degrees of shift that are judged unrealistic and unlikely to occur in real cases [12,72]. We instead believe that general-purpose quantification methods should be tested in widely varying situations, from low-shift to high-shift ones, and we thus prefer to test all such scenarios, but split the corresponding results into groups characterized by more or less homogeneous amounts of shift.
More specifically, for each test sample generated via the ML-APP, we compute its prior probability shift with respect to the training set in terms of AE between the vectors of training and test class prevalence values. We then bring together all the resulting shift values and split the range of such values into three equally sized intervals (that we dub low shift, mid shift, and high shift). The accuracy values we report are thus not averages across the same number of experiments, since the ML-APP often tends to generate more samples in the low-shift region than samples in the mid-shift region and (above all) in the high-shift region. The number of samples, as well as the distribution of shift values, depends on each dataset.
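The binning of test samples by amount of shift can be sketched as follows (a numpy sketch; bins 0/1/2 correspond to low/mid/high shift):

```python
import numpy as np

def shift_bins(train_prev, sample_prevs, n_bins=3):
    """Compute each test sample's prior probability shift (AE w.r.t. the
    training prevalences) and split the observed range of shift values
    into n_bins equally sized intervals."""
    shifts = np.abs(np.asarray(sample_prevs)
                    - np.asarray(train_prev)).mean(axis=1)
    edges = np.linspace(shifts.min(), shifts.max(), n_bins + 1)
    # digitize against the interior edges assigns bin indices 0..n_bins-1
    return shifts, np.digitize(shifts, edges[1:-1])
```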
Figure 4 shows the distributions of shift values that the ML-APP generates (blue) along with the distributions of shift values that we would obtain via uniform sampling (red). Note that the ML-APP succeeds in generating larger amounts of shift, and that most of the samples generated via uniform sampling would fall within what we call the "low shift" region.

Figure 4: Shifts generated via the proposed ML-APP (blue) and via uniform sampling (red), as computed in terms of AE between the training set and the test samples.

Testing Instances of the Four Types of Multi-Label Quantification Methods
The goal of this section is to provide an answer to the question: "Which among the four groups of multi-label quantification methods tends to perform best?" To this aim, we choose one representative instance from each group, and carry out the experiments using all the datasets. We perform this choice by combining the following components:
• As the binary classification method, we choose logistic regression (LR), and use the implementation available from SCIKIT-LEARN. We consider LR a good choice, given that it is a probabilistic classifier that already provides fairly well calibrated posterior probabilities (which is of fundamental importance in PCC, PACC, and SLD), and given that, as indicated by previously reported results [73], it tends to perform well. A set of LR classifiers is used when testing the binary relevance (BR) method described in Section 3.2.2.
• As the multi-label classification method, we adopt stacked generalization [58] (SG; see Section 3.2.3). We use our own implementation (since the implementation of stacked generalization available from SCIKIT-LEARN only caters for the single-label case), which relies on 5-fold cross-validation to generate the intermediate representations (in the form of posterior probabilities) given as input to the meta-learner, concatenated with the original input features. The base members of the ensemble consist of binary logistic regression classifiers as implemented in SCIKIT-LEARN.
• As the binary aggregation method, we experiment with all the methods covered in Section 5.1, i.e., CC, PCC, ACC, PACC, SLD. For all these methods we use the implementations made available in the QUAPY open-source library [73].
• As the multi-label aggregation method, we use the regressor-based strategy for quantification (that we dub RQ) described in Section 5.2.1. We implement this method as part of the QUAPY framework. For training the base quantifier q we experiment again with all the methods covered in Section 5.1, i.e., CC, PCC, ACC, PACC, SLD, while as the internal regressor that receives its input from the base quantifier q we use linear support vector regression (SVR), for which we use the SCIKIT-LEARN implementation. As the held-out validation set L_R needed for training the regressor we use a set consisting of 40% of the training datapoints, chosen via iterative stratification [67,68] as implemented in SCIKIT-MULTILEARN. We call this aggregation method SVR-RQ.
The methods we use in this experiment thus amount to the combinations illustrated in Table 2. For model selection, we split the training set L into a part L_tr and a held-out part L_va. We use L_tr to train the quantifiers with different combinations of hyperparameters, while from L_va we extract, via the ML-APP, validation samples on which we assess, via AE (the same measure we use in the evaluation phase), the quality of the hyperparameter combinations. We explore the hyperparameters via grid-search optimization, and use the best configuration to retrain the quantifier on the entire training set L after model selection. During the model selection phase, for the ML-APP we use the same parameters k and g that we use in the test phase, but we reduce the number of repetitions m to 5 in the datasets with fewer than 90 classes, and to 1 in the other datasets, in order to keep the computational burden under reasonable bounds.
The hyperparameters we explore for LR include C, the inverse of the regularization strength, in the range {10^{-1}, 10^0, 10^1, 10^2, 10^3}, and ClassWeight, which takes values in Balanced (which reweights the importance of the examples so as to equate the overall contribution of each class) or None (which gives the same weight to all datapoints, irrespective of the prevalence of the class they belong to). In cases in which the class-specific classifiers are independent of each other (i.e., for methods belonging to the BC+BA and BC+MLA types) we optimize the hyperparameters independently for each class. For SG, we only optimize the hyperparameters of the meta-classifier, leaving the hyperparameters of the base members at their default values. In particular, we explore the parameters C and ClassWeight as before, plus the hyperparameter Normalize, which takes values in True (which has the effect of standardizing the inputs of the meta-classifier so that every dimension has zero mean and unit variance) and False (which does not standardize the inputs). For RQ we only explore the regularization hyperparameter C in the range {10^{-1}, 10^0, 10^1, 10^2, 10^3}. Note that the base quantifiers (i.e., CC, PCC, ACC, PACC, SLD) have no specific internal hyperparameters to be tuned.
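The model-selection loop can be sketched generically as follows (a minimal sketch; `train_fn` and `eval_fn` are our placeholders for training a quantifier with a given hyperparameter combination and measuring its average AE on the validation samples):

```python
from itertools import product

def grid_search(train_fn, eval_fn, grid):
    """Grid search: train one model per hyperparameter combination and
    keep the combination with the lowest validation error."""
    best, best_score = None, float("inf")
    combos = (dict(zip(grid, values)) for values in product(*grid.values()))
    for combo in combos:
        model = train_fn(**combo)
        score = eval_fn(model)
        if score < best_score:
            best, best_score = combo, score
    return best, best_score
```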
The results we have obtained for the different choices of the base quantifier are reported in Table 3 for CC, Table 4 for PCC, Table 5 for ACC, Table 6 for PACC, and Table 7 for SLD. The results clearly show (see especially the last two rows of each table) that there is an ordering BC+BA ≺ MLC+BA ≺ BC+MLA ≺ MLC+MLA, in which ≺ means "performs worse than", which holds, independently of the base quantifier of choice, in almost all cases.
The same experiments also indicate that there is a substantial improvement in performance that derives from simply replacing the binary classifiers with one multi-label classifier (moving from BC+BA to MLC+BA or from BC+MLA to MLC+MLA), i.e., from bringing to bear the class-class correlations at the classification stage, and that there is an equally substantial improvement when binary aggregation is replaced by multi-label aggregation (switching from BC+BA to BC+MLA or from MLC+BA to MLC+MLA), i.e., when the class-class correlations are exploited at the aggregation stage.What also emerges from these results is that, consistently with the above observations, the best-performing group of methods is MLC+MLA, i.e., methods that explicitly take class dependencies into account both at the classification stage and at the aggregation stage.
Note that methods that learn from the stochastic correlations among the classes perform much better than methods that do not, even in the low shift regime.Overall, the best-performing method on average is MLC+MLA when equipped with PCC as the base quantifier.
The reader might wonder why we do not use as a baseline the system presented in the only paper in the literature that tackles multi-label quantification, i.e., [32].There are several reasons for this: (a) the authors do not make the code available; (b) the method is, as already discussed in Section 3.1, computationally expensive, and as a result the authors test it on a single dataset whose codeframe consists of 16 classes only; using this method on our 15 datasets, whose codeframes count up to 983 classes, and 125 classes on average, would be prohibitive; (c) the method is essentially a calibration strategy for binary classification, which means that it falls in the group of "naive" BC+BA methods since it does not tackle at all, as already mentioned in Section 3.1, the multi-label nature of the MLQ problem.In the following sections, we turn to explore other instances of methods of the four groups in Figure 1 beyond the ones we choose in Table 2.

Testing Additional Instances of MLC+BA
In this section we explore other methods relying on different multi-label classifiers, with the aim of studying the extent to which the results we have obtained in the previous experiments depend on the choice of the classifier being employed.
To this aim, we focus on the MLC+BA group of methods, so that the aggregation stage plays only a minimal role. As the quantification method we adopt PCC, since this is the base quantifier that has yielded the best performance overall in the experiments of Section 6.3. The methods we study here thus consist of genuinely multi-label classifiers that generate posterior probabilities, which are then aggregated by computing the expected value for each class. The aim of this experiment is not to provide an exhaustive evaluation of existing multi-label classifiers, but rather to study other MLC+BA configurations in action, and hopefully to pinpoint interesting performance trends.
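As an illustration of this aggregation step (a minimal sketch with our own names, not the paper's notation), PCC reduces to a column-wise mean of the matrix of posterior probabilities returned by the multi-label classifier:

```python
import numpy as np

def pcc_aggregate(posteriors: np.ndarray) -> np.ndarray:
    """Probabilistic Classify & Count: estimate class prevalences as the
    expected value (mean) of the per-datapoint posterior probabilities.

    posteriors: array of shape (n_datapoints, n_classes); entry (i, j) is the
    posterior probability that datapoint i belongs to class j (for multi-label
    data the rows need not sum to 1, since classes are not mutually exclusive).
    """
    return posteriors.mean(axis=0)

# Toy example: 4 datapoints, 2 (non-exclusive) classes
P = np.array([[0.9, 0.2],
              [0.8, 0.1],
              [0.1, 0.7],
              [0.2, 0.4]])
prev = pcc_aggregate(P)  # array([0.5, 0.35])
```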
With this in mind, we choose some representative instances from the main families of multi-label classifiers discussed in Section 3. The multi-label classifiers we consider here include (a) multi-label versions of KNN (ML-KNN), decision trees (DT), and random forests (RF) as representatives of the family of "algorithm adaptation" methods (Section 3.2.2); (b) classifier chains (CCHAINS) as a representative of the family of "problem transformation" methods (Section 3.2.1); and (c) CLEMS and label space clustering (LSC) as representatives of the family of "ensemble" methods (Section 3.2.3). For the sake of comparison, we also include stacked generalization (SG, another ensemble method, and the multi-label classifier we chose for the experiments of Section 6.3), and BC+BA (with PCC as the binary quantifier and LR as the binary classifier) acting as a lower-bound baseline. Values for SG and BC+BA are taken from Table 4. We carry out model selection by optimizing the hyperparameters listed in Table 8. The results we have obtained are presented in Table 9.
These results reveal that, although SG is the best-performing method, other multi-label classifiers work comparably well and could be used to yield multi-label aggregative quantifiers with similar performance levels. In particular, CCHAINS tends to fare very well in all cases, followed by RF and DT; these results are, by and large, consistent with those reported in [40]. The methods ML-KNN, CLEMS, and LSC, however, prove inferior, sometimes performing even worse than the BC+BA baseline.
Although the results we report in Table 9 are obtained on the test set, we confirm that they are strongly correlated with the performance levels we measured on the held-out validation set. Indeed, we chose SG as our multi-label classifier for the experiments of Section 6.3 because it was the model yielding the lowest AE during model selection.

Testing Additional Instances of BC+MLA
In this section we compare the different multi-label aggregation strategies proposed in Sections 5.2.1 and 5.2.2. In order to do so, we focus on the BC+MLA group of methods (i.e., those relying on binary classifiers for the label predictions), so that all the label dependencies are modelled exclusively at the aggregation stage.
For the label powerset-based strategy (LPQ) we consider two different ways of generating the clusters, after which SLQ is applied to the resulting label powerset of each cluster. In particular, we investigate:
• RAKEL-LPQ: inspired by RAKEL [54]; generates k disjoint random clusters;
• KMEANS-LPQ: inspired by LSC [55]; generates clusters via k-means.
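A minimal sketch of the RAKEL-style clustering step (function and variable names are ours, for illustration only): the codeframe is shuffled and partitioned into k disjoint random clusters, and SLQ would then be applied to the label powerset of each cluster independently.

```python
import random

def random_disjoint_clusters(labels, k, seed=0):
    """Partition a list of class labels into k disjoint clusters chosen
    uniformly at random (RAKEL-style). Illustrative sketch, not the
    paper's implementation.
    """
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    # Round-robin assignment yields clusters of (almost) equal size
    clusters = [shuffled[i::k] for i in range(k)]
    return [c for c in clusters if c]

clusters = random_disjoint_clusters(["y1", "y2", "y3", "y4", "y5", "y6"], 3)
# e.g., 3 disjoint clusters of 2 labels each
```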
For the regression-based (RQ) strategy we consider two alternative regressors (results exploring further regression algorithms can be found in Appendix B):
• RIDGE-RQ: using ridge regression;
• RF-RQ: using random-forest regression.
For the sake of comparison, we add the regression-based strategy SVR-RQ (Section 5.2.1), which corresponds to our configuration of choice for BC+MLA in Section 6.3, and the BC+BA system (PCC+LR) as a lower-bound baseline.
The results for SVR-RQ and BC+BA are taken from Table 4. Model selection is carried out by exploring, via grid-search optimization, the hyperparameters indicated in Table 10. The results we have obtained are shown in Table 11.
These results show that all the multi-label aggregation methods perform comparably in the low-shift regime, although the LP-based methods tend to perform slightly better; in the mid-shift and high-shift regimes, the regression-based strategies tend to fare better. These results, obtained on the test set, are well correlated with those we obtained during model selection on the validation set; our choice of SVR-RQ as the representative method for BC+MLA in Section 6.3 was indeed based on the validation-phase performance of the different multi-label aggregation strategies. The most important observation we can draw from these results is that all these methods tend to outperform not only the BC+BA system (as expected) but also all the variants from the MLC+BA group explored in Section 6.4, which may be an indication that, in MLQ, bringing to bear the stochastic correlations among classes at the aggregation stage is more effective than doing so at the classification stage. Since the methods of type MLC+MLA that we have proposed in this paper have proven to be the most effective in all our experiments, we want to add an important observation about them.
Concerning our regression-based RQ method described in Section 5.2.1: although we have assumed, for ease of exposition, that the quantifier q is an aggregative one, this assumption is not strictly necessary, since the regressor r does not look at predicted class labels for individual datapoints, but only at the class prevalence estimates returned by the underlying quantifier q. A similar observation can be made for our label powerset-based LPQ method described in Section 5.2.2: this method leverages a single-label multiclass quantifier q and uses its class prevalence estimates, and does not require any prediction at the level of the individual datapoint, which means that aggregative and non-aggregative methods are equally suitable for training q. In other words, both RQ and LPQ can use any type of quantification method, aggregative or non-aggregative.
The reasons why in this paper we have focused on aggregative quantifiers are (i) ease of exposition, and (ii) the fact that, as a recent large-scale experimental study has confirmed [74], non-aggregative quantification methods (such as the HDx method of [75]) are, from the point of view of sheer performance, not yet on a par with aggregative methods. However, the above observations indicate that, should high-performance non-aggregative quantification methods spring up in the future, RQ and LPQ could be used in connection with them straightaway.

Conclusions
In this paper we have investigated MLQ, a quantification task which had remained, since the origins of quantification research, essentially unexplored. We have proposed the first protocol for the evaluation of MLQ systems that confronts these systems with samples exhibiting from low to high levels of prior probability shift. For ease of exposition we have particularly focused on multi-label quantifiers that work by aggregating the predictions for individual datapoints issued by a classifier ("aggregative" multi-label quantifiers), and have subdivided them into four groups, based on whether the correlations between classes are brought to bear at the classification stage (MLC+BA), at the aggregation stage (BC+MLA), at both stages (MLC+MLA), or at neither stage (BC+BA).
We have also described and experimentally compared a number of MLQ methods; some of them (specifically, those in the BC+BA and MLC+BA groups) are straightforward combinations of existing classification and quantification methods, while others (specifically, those in the BC+MLA and MLC+MLA groups) are non-obvious, and are proposed here for the first time. The thorough experimentation we have carried out on an extensive number of datasets has clearly shown that there is a substantial improvement in performance deriving from simply replacing binary classifiers with truly multi-label classifiers (i.e., from switching from BC to MLC), and an equally substantial improvement when binary aggregation is replaced by truly multi-label aggregation (i.e., when switching from BA to MLA). Consistently with these two observations, MLC+MLA methods unequivocally prove the best of the lot; of the two MLC+MLA methods we have proposed, RQ proves clearly superior to LPQ. In the light of this superiority of MLA over BA, it is also interesting that both RQ and LPQ can be straightforwardly used in association with non-aggregative quantifiers too.

A Evaluation in Terms of Relative Absolute Error
While the tables presented in the main body of the paper report the results of our experiments in terms of the absolute error (AE) measure, in this section we present, for the sake of completeness, the results in terms of relative absolute error (RAE); these are reported in Tables 12 to 16. We do not comment on these results, since the trends that emerge from them are essentially the same as for the AE measure.

B Exploring other Regressors in RQ
We report here additional experiments that extend those presented in Section 6.5; the present experiments concern the use of regression algorithms other than ridge regression (RIDGE-RQ), random-forest regression (RF-RQ), and linear SVR (SVR-RQ), which were the only regression algorithms we considered in Section 6.5. Tables 19 and 20 report the results of these experiments in terms of AE and RAE, respectively, obtained by optimizing the hyperparameters shown in Table 21. These results show that there are no substantial differences in performance in the low-shift regime, while the differences become noticeable in the mid-shift and (especially) high-shift regimes. These results are well correlated with those we obtained in the validation phase, on which we relied for choosing ridge regression, random-forest regression, and linear SVR as representative regression models.

Figure 1 :
Figure 1: The four groups of multi-label quantification methods. Dotted lines connecting class labels with a model (classifier or quantifier) indicate that the model learns from (or has access to) the class labels of the training datapoints. Solid lines connecting classifiers with quantifiers indicate a transfer of outputs from the classifier to the quantifier. With a slight deviation from our notation, here h denotes any classifier, hard or soft.

Figure 2 :
Figure 2: An example considering a cluster made of three classes only (left), and the computations carried out for reconstructing the prevalence values for the original multi-label codeframe (right).
// Compute the number of positives and negatives to extract
POS ← k · g_j
NEG ← k − POS
// Generate samples only if the number of datapoints in U_{y_i} allows
// the sampling to be performed without replacement
if |U_{y_i}| ≥ POS then
    for m repetitions do
        draw σ_{y_i} from U_{y_i}, with |σ_{y_i}| = POS, uniformly at random w/o replacement
        draw σ_{ȳ_i} from U_{ȳ_i}, with |σ_{ȳ_i}| = NEG, uniformly at random w/o replacement
        // Members of σ_{y_i} and σ_{ȳ_i} are not removed from U_{y_i} or U_{ȳ_i}
        σ ← σ_{y_i} ∪ σ_{ȳ_i}
        // Note that p_σ(y_i) = g_j; the prevalence of the other
        // classes is not predetermined
        S ← S ∪ {σ}
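A runnable sketch of this sampling step (function and variable names are ours, for illustration): a sample of size k is drawn so that the target class has the predetermined prevalence g, by taking k·g positives and k·(1−g) negatives without replacement.

```python
import random

def sample_with_prevalence(pos_pool, neg_pool, k, g, rng=random.Random(0)):
    """Draw one sample of size k in which the target class has prevalence g.
    pos_pool / neg_pool play the roles of U_{y_i} and its complement; the
    prevalence of the other classes is not predetermined. Returns None if a
    pool is too small to sample without replacement.
    """
    pos = round(k * g)
    neg = k - pos
    if len(pos_pool) < pos or len(neg_pool) < neg:
        return None
    return rng.sample(pos_pool, pos) + rng.sample(neg_pool, neg)

# Toy pools: positives are ids 0..99, negatives are ids 100..299
sample = sample_with_prevalence(list(range(100)), list(range(100, 300)), 50, 0.2)
```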

Table 1 :
Description of the datasets. Columns #Classes, #Train, and #Test indicate the number of classes, training datapoints, and test datapoints, respectively. Label cardinality (Card) reports the mean number of labels per datapoint. Label density (Dens) is the result of dividing the label cardinality by the total number of labels. Label diversity (Div) is the number of unique labelsets present in the dataset. Normalised label diversity (NormDiv) reports the ratio between the label diversity and the total number of labels. The proportion of unique label combinations (PUniq) is the total number of labelsets that are unique in the dataset, divided by the number of examples. PMax reports the fraction of datapoints having the most frequent labelset, out of the total number of datapoints.
Histograms of class prevalence values, one per dataset, sorted from the most to the least populated dataset; values on the x axis indicate intervals [α_k, β_k] of class prevalence values, while values on the y axis indicate the fraction of classes in the dataset whose prevalence values fall in the [α_k, β_k] interval.
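As an illustration, the cardinality, density, and diversity statistics above can be computed from a binary label matrix as follows (a sketch under our own naming, not the code used for the paper):

```python
import numpy as np

def label_stats(Y):
    """Compute basic multi-label dataset statistics from a binary label
    matrix Y of shape (n_datapoints, n_classes)."""
    n, c = Y.shape
    card = Y.sum(axis=1).mean()                      # mean labels per datapoint
    dens = card / c                                  # cardinality / #classes
    labelsets = {tuple(row) for row in Y.astype(int)}
    div = len(labelsets)                             # number of unique labelsets
    return {"Card": card, "Dens": dens, "Div": div, "NormDiv": div / c}

# Toy dataset: 3 datapoints, 3 classes
Y = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 0, 0]])
stats = label_stats(Y)   # Card = 4/3, Dens = 4/9, Div = 2, NormDiv = 2/3
```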

Table 2 :
Methods we use as instances of the four types of methods illustrated in Figure 1. For all methods, we perform model selection by using, as the loss function to minimize, a quantification-oriented error measure (and not a classification-oriented one), and by adopting the same protocol used for the evaluation of our quantifiers. That is, model selection is carried out by first splitting the training set L into two disjoint sets, i.e., (a) a proper training set L_tr and (b) a held-out validation set L_va consisting of 40% of the labelled datapoints. For splitting the training set, we again rely on the iterative stratification routine of SCIKIT-MULTILEARN (see Footnote 1).

Table 3 :
Values of AE obtained in our experiments for different amounts of shift, using CC as the base quantifier. The number of test samples generated for each dataset exceeds 10,000, though a variable number of samples is allocated to each region of shift. Boldface indicates the best method for a given dataset and shift region. Superscripts † and ‡ denote the methods (if any) whose scores are not statistically significantly different from the best one according to a Wilcoxon signed-rank test at different confidence levels: symbol † indicates 0.001 < p-value < 0.05, while symbol ‡ indicates 0.05 ≤ p-value. For ease of readability, for each pair {dataset, shift} we colour-code cells with intense green for the best result, intense red for the worst result, and an interpolated tone for the scores in-between.
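For reference, such a significance test can be run with scipy.stats.wilcoxon on the paired per-sample error scores of two methods; the data below is synthetic and for illustration only:

```python
import numpy as np
from scipy.stats import wilcoxon

# Synthetic paired error scores for two methods on 100 test samples;
# 'other' is systematically (slightly) worse than 'best'.
rng = np.random.default_rng(0)
best = rng.uniform(0.00, 0.05, size=100)
other = best + rng.uniform(0.001, 0.02, size=100)

stat, p = wilcoxon(best, other)
# A p-value below 0.05 would lead us NOT to mark 'other' with † or ‡
```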

Table 4 :
Values of AE obtained in our experiments for different amounts of shift using PCC as the base quantifier. Notational conventions are as in Table 3.

Table 5 :
Values of AE obtained in our experiments for different amounts of shift using ACC as the base quantifier. Notational conventions are as in Table 3.

Table 6 :
Values of AE obtained in our experiments for different amounts of shift using PACC as the base quantifier. Notational conventions are as in Table 3.

Table 7 :
Values of AE obtained in our experiments for different amounts of shift using SLD as the base quantifier. Notational conventions are as in Table 3.

Table 8 :
Hyperparameters we explore during model selection for the different multi-label classifiers.

Table 13 :
Values of RAE obtained in our experiments for different amounts of shift using PCC as the base quantifier. Notational conventions are as in Table 3.

Table 14 :
Values of RAE obtained in our experiments for different amounts of shift using ACC as the base quantifier. Notational conventions are as in Table 3.

Table 15 :
Values of RAE obtained in our experiments for different amounts of shift using PACC as the base quantifier. Notational conventions are as in Table 3.

Table 16 :
Values of RAE obtained in our experiments for different amounts of shift using SLD as the base quantifier. Notational conventions are as in Table 3.