QuaPy: A Python-Based Framework for Quantification

QuaPy is an open-source framework for performing quantification (a.k.a. supervised prevalence estimation), written in Python. Quantification is the task of training quantifiers via supervised learning, where a quantifier is a predictor that estimates the relative frequencies (a.k.a. prevalence values) of the classes of interest in a sample of unlabelled data. While quantification can be trivially performed by applying a standard classifier to each unlabelled data item and counting how many data items have been assigned to each class, it has been shown that this "classify and count" method is outperformed by methods specifically designed for quantification. QuaPy provides implementations of a number of baseline methods and advanced quantification methods, of routines for quantification-oriented model selection, of several broadly accepted evaluation measures, and of robust evaluation protocols routinely used in the field. QuaPy also makes available datasets commonly used for testing quantifiers, and offers visualization tools for facilitating the analysis and interpretation of the results. The software is open-source and publicly available under a BSD-3 licence via https://github.com/HLT-ISTI/QuaPy, and can be installed via pip (https://pypi.org/project/QuaPy/).


Introduction
Quantification (variously called learning to quantify, or supervised prevalence estimation, or class prior estimation) is the task of training models ("quantifiers") that estimate the relative frequencies (a.k.a. prevalence values) of the classes of interest in a sample of unlabelled data items [15]. For instance, in a sample of 100,000 unlabelled tweets known to express opinions about Donald Trump, such a model may be tasked to estimate the percentage of these 100,000 tweets which display a Positive stance towards Trump (and to do the same for classes Neutral and Negative). In other words, quantification stands to classification as aggregate data stand to individual data. Quantification is of special interest in fields such as the social sciences [17], epidemiology [19], market research [9], and ecological modelling [2], since these fields are inherently concerned with aggregate data; however, quantification is also useful in applications outside these fields, such as in enforcing the fairness of classifiers [4], performing word sense disambiguation [5], allocating resources [13], and improving the accuracy of classifiers [28]. The following fragment (in which `model` stands for a quantifier previously fitted on `data.training`) shows how a quantifier is used to estimate the class prevalence values of a test set, and how this estimate is compared against the true prevalence values:

```python
# `model` is a quantifier previously fitted on data.training
estim_prevalence = model.quantify(data.test.instances)
true_prevalence = data.test.prevalence()

error = qp.error.ae(true_prevalence, estim_prevalence)
print('Absolute Error (AE)', error)
```

As mentioned above, quantification is particularly useful in scenarios where distribution shift may occur. Any quantification model should thus be tested across different data samples characterized by different class prevalence values. QuaPy implements sampling procedures and evaluation protocols that automate this endeavour.
The paper is structured as follows. In Section 2 we briefly describe the quantifier training methods included in QuaPy, while in Section 3 we present a number of datasets that have been previously used in quantification research and that we include in the QuaPy suite. Section 4 is devoted to quantifier evaluation, and discusses the evaluation measures and evaluation protocols that we make available within QuaPy. Section 5 turns to model selection, discussing the hyperparameter optimization protocols implemented within QuaPy, while Section 6 illustrates the tools that we make available for visualizing the results of quantification experiments. Section 7 discusses some experiments that we have carried out in order to showcase some among the features of QuaPy. In Section 8 we give some concluding remarks.


Quantification methods
In QuaPy, every quantifier implements the interface of the abstract class BaseQuantifier. The meaning of its functions should be familiar to anybody accustomed to the scikit-learn environment [24], since the class structure of QuaPy is directly inspired by scikit-learn's "estimators". Functions fit and quantify are used to train the model and to return class prevalence estimates, respectively, while functions set_params and get_params allow a model-selecting routine (see Section 5) to automate the process of hyperparameter optimization.
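A minimal sketch of this interface, with method names as given above (the actual class may declare additional helpers), is the following:

```python
from abc import ABCMeta, abstractmethod

class BaseQuantifier(metaclass=ABCMeta):
    """Sketch of QuaPy's base interface, following the description in the text."""

    @abstractmethod
    def fit(self, data):                 # data: a LabelledCollection (see Section 3)
        ...

    @abstractmethod
    def quantify(self, instances):       # returns a vector of class prevalence estimates
        ...

    @abstractmethod
    def set_params(self, **parameters):  # used by model selection (see Section 5)
        ...

    @abstractmethod
    def get_params(self):
        ...
```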
Quantification methods can be classified as belonging to the aggregative, non-aggregative, or meta classes. Aggregative methods are characterized by the fact that quantification is obtained as an aggregation of the outputs returned by a classification process for the individual documents. Non-aggregative methods analyse instead the sample of unlabelled documents as a whole, without resorting to the classification of individual data items. Finally, meta-quantifiers are built on top of other quantifiers, and generate their predictions by analysing the predictions made by the underlying quantifiers. We will briefly present these three classes in the next three subsections.

Aggregative methods
Most of the methods proposed in the literature and included in QuaPy are aggregative. QuaPy models aggregative quantifiers by means of the abstract class AggregativeQuantifier. This class extends BaseQuantifier, providing a default implementation of the quantify method based on the aggregate function, which has to be implemented by the subclass, i.e.:

```python
def quantify(self, instances):
    # sketch of the default implementation: classify each instance, then aggregate
    classif_predictions = self.classify(instances)
    return self.aggregate(classif_predictions)
```

Implementing an aggregative method thus only requires overriding the aggregate method.
The AggregativeQuantifier class implements the rest of the process, and is designed to work with any scikit-learn estimator. Working with packages or machine learning tools other than scikit-learn only requires overriding the classify method, which takes as input the individual data items in the sample and returns the corresponding classification predictions (see Section 2.1.5).
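As an illustration, the following is a minimal sketch of a custom aggregative quantifier (a bare-bones Classify & Count); the exact hooks of AggregativeQuantifier are as described above, and should be double-checked against the API:

```python
import numpy as np
from quapy.method.aggregative import AggregativeQuantifier

class NaiveCC(AggregativeQuantifier):
    # a hypothetical, minimal Classify & Count built on the hooks described above
    def __init__(self, learner):
        self.learner = learner  # any scikit-learn classifier

    def fit(self, data):
        # data is a LabelledCollection (see Section 3); Xy is assumed to expose
        # the (instances, labels) pair
        self.learner.fit(*data.Xy)
        return self

    def aggregate(self, classif_predictions):
        # fraction of items assigned to each class
        counts = np.bincount(classif_predictions, minlength=len(self.learner.classes_))
        return counts / counts.sum()
```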
Probabilistic aggregative methods are a subclass of aggregative methods which, instead of the "crisp" decisions returned by a categorical classifier, use the posterior probabilities returned by a probabilistic classifier. Probabilistic aggregative methods inherit from the abstract class AggregativeProbabilisticQuantifier, which extends AggregativeQuantifier by providing a default implementation of the quantify method as follows:

```python
def quantify(self, instances):
    # sketch of the default implementation: aggregate posterior probabilities
    classif_posteriors = self.posterior_probabilities(instances)
    return self.aggregate(classif_posteriors)
```

The method posterior_probabilities, similarly to the more general case, is designed to work together with the predict_proba method of any probabilistic classifier in scikit-learn. QuaPy also allows using scikit-learn's crisp estimators that do not come with an implementation of the predict_proba method (e.g., LinearSVC). In this case, the estimator is converted into a probabilistic classifier by means of a calibration method [25]. Packages other than scikit-learn can be used as well by providing a custom implementation of the posterior_probabilities method (see Section 2.1.5).
One advantage of aggregative methods (probabilistic or not) is that the evaluation according to any sampling procedure (e.g., the artificial prevalence protocol; see Section 4) can be carried out very efficiently, since the entire set of unlabelled items can be pre-classified once and for all at the beginning, and the estimation of class prevalence values for different samples can directly reuse these predictions, without reclassifying each individual data item every time. QuaPy takes advantage of this property to drastically speed up any routine that has to do with quantification on multiple samples drawn from the same set, as is customarily the case in quantification, both in the performance evaluation phase (Section 4) and in the model selection phase (Section 5).

Classify & Count and its variants
QuaPy provides implementations for Classify & Count (CC) and its variants, i.e.:

• CC (Classify & Count), the simplest aggregative quantifier, which simply relies on the label predictions of a classifier to deliver class prevalence estimates;
• ACC (Adjusted Classify & Count) [13], the "adjusted" variant of CC, which corrects the predictions of CC according to the "misclassification rates" (see below) of the classifier;
• PCC (Probabilistic Classify & Count) [3], the probabilistic variant of CC, which relies on the posterior probabilities returned by a probabilistic classifier;
• PACC (Probabilistic Adjusted Classify & Count) [3], which stands to PCC as ACC stands to CC.
Note that the adjusted variants (ACC and PACC) need to estimate the parameters (the "misclassification rates") required for performing the adjustment; the estimation uses a validation set carved out of the labelled set. How this validation set is obtained can be specified at construction time or at fitting time via the argument val_split, either by indicating a float in (0,1) specifying the fraction of the training data to be used as a held-out validation set, or by indicating an int specifying the number of folds to be used in a k-fold cross-validation (k-FCV) process, or by explicitly passing the set of instances to be used as the validation set (i.e., an instance of LabelledCollection; see Section 3), as in the sketches below.
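For instance, assuming ACC is imported from QuaPy's aggregative module, the three accepted forms of val_split look as follows (`val_data` is a hypothetical LabelledCollection held out by the user):

```python
from quapy.method.aggregative import ACC
from sklearn.linear_model import LogisticRegression

acc = ACC(LogisticRegression(), val_split=0.3)       # hold out 30% of the training data
acc = ACC(LogisticRegression(), val_split=5)         # estimate via 5-fold cross-validation
acc = ACC(LogisticRegression(), val_split=val_data)  # use an explicit validation collection
```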

Forman's variants of ACC
QuaPy also provides implementations of a series of binary quantification methods, proposed by Forman in [12,13] as variations of ACC, whose goal is to bring improved stability to the denominator of the adjustment. In the binary case, the ACC adjustment comes down to computing

$\hat{p}^{\mathrm{ACC}}(y) = \frac{\hat{p}^{\mathrm{CC}}(y) - \widehat{\mathrm{fpr}}(y)}{\widehat{\mathrm{tpr}}(y) - \widehat{\mathrm{fpr}}(y)}$

in which $\hat{p}^{\mathrm{CC}}(y)$ is the prevalence of class $y$ as estimated by CC, and $\widehat{\mathrm{tpr}}(y)$ and $\widehat{\mathrm{fpr}}(y)$ stand for the true positive rate and the false positive rate of the classifier, as estimated in the validation phase; numerical instability arises when $\widehat{\mathrm{tpr}}(y) \approx \widehat{\mathrm{fpr}}(y)$. The methods are based on different heuristics for choosing a decision threshold that would allow for more true positives and many more false positives, on the grounds that this would deliver larger denominators.
QuaPy implements the methods X (which looks for the threshold that yields tpr(y) = 1 − fpr(y)), MAX (which looks for the threshold that maximizes tpr(y) − fpr(y)), and T50 (which looks for the threshold that makes tpr(y) closest to 0.5). QuaPy also implements MS (Median Sweep), a method that generates class prevalence estimates for all decision thresholds and returns the median of them all, and MS2, a variant that computes the median only over the thresholds for which tpr(y) − fpr(y) > 0.25.

The Saerens-Latinne-Decaestecker algorithm
The Saerens-Latinne-Decaestecker (SLD) algorithm [28,7] (sometimes also called EMQ, for Expectation Maximization Quantifier) is a probabilistic quantifier-generating method. SLD consists of using the well-known Expectation Maximization algorithm to iteratively update the posterior probabilities generated by a probabilistic classifier and the class prevalence estimates obtained via maximum-likelihood estimation, in a mutually recursive way, until convergence. Although this method was originally proposed for improving the quality of the posterior probabilities returned by a probabilistic classifier, and not for improving its class prevalence estimates, SLD has proven to be among the most effective quantifiers in many experiments [22,21,29].
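The core of the method can be sketched in a few lines (a simplified rendition based on the description above, not QuaPy's actual implementation):

```python
import numpy as np

def sld(posteriors, train_prevalence, epsilon=1e-4, max_iter=1000):
    """Simplified SLD/EMQ loop. posteriors: an (n_items, n_classes) array of P(y|x)
    values from a probabilistic classifier; train_prevalence: the vector of class
    prevalence values observed in the training set."""
    prev_estimate = np.copy(train_prevalence)
    for _ in range(max_iter):
        # E-step: rescale the posteriors by the ratio of current to training prevalence
        rescaled = posteriors * (prev_estimate / train_prevalence)
        rescaled /= rescaled.sum(axis=1, keepdims=True)
        # M-step: re-estimate the prevalence as the mean of the updated posteriors
        new_estimate = rescaled.mean(axis=0)
        if np.abs(new_estimate - prev_estimate).sum() < epsilon:  # convergence check
            break
        prev_estimate = new_estimate
    return prev_estimate
```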

The HDy method
HDy [16] is a probabilistic method for training binary quantifiers that models quantification as the problem of minimizing the divergence (in terms of the Hellinger Distance) between two cumulative distributions of posterior probabilities returned by the classifier. One of the distributions is generated from the unlabelled examples and the other is generated from a validation set. This latter distribution is defined as a mixture of the class-conditional distributions of the posterior probabilities returned for the positive and negative validation examples, respectively. The parameters of the mixture thus represent the estimates of the class prevalence values.
Since the method requires a validation set for estimating the parameters of the mixture model, the constructor and the fit method of HDy receive as input the argument val_split, whose semantics is the same as in ACC and PACC.
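The search that HDy performs can be sketched as follows (a simplified rendition using histograms of posterior probabilities; function and variable names are illustrative):

```python
import numpy as np

def hellinger(p, q):
    # Hellinger Distance between two discrete distributions
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

def hdy(val_pos_post, val_neg_post, test_post, n_bins=10):
    """val_pos_post, val_neg_post: posteriors for the positive and negative
    validation examples; test_post: posteriors for the unlabelled examples."""
    edges = np.linspace(0., 1., n_bins + 1)
    hist = lambda post: np.histogram(post, bins=edges)[0] / len(post)
    P_pos, P_neg, P_test = hist(val_pos_post), hist(val_neg_post), hist(test_post)
    # search for the mixture parameter (the prevalence of the positive class)
    # that minimizes the divergence from the unlabelled distribution
    alphas = np.linspace(0., 1., 101)
    divergences = [hellinger(a * P_pos + (1 - a) * P_neg, P_test) for a in alphas]
    return alphas[int(np.argmin(divergences))]
```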

Quantifiers based on Explicit Loss Minimization
The quantifiers based on Explicit Loss Minimization (ELM) represent a family of methods based on structured output learning; these quantifiers rely on classifiers that have been optimized using a quantification-oriented loss measure. QuaPy implements the following ELM-based methods, all relying on Joachims' SVMperf structured output learning algorithm [18]:

• SVM(Q), which attempts to minimize the Q loss, a combination of a classification-oriented loss and a quantification-oriented loss, as proposed in [1];
• SVM(KLD), which attempts to minimize the Kullback-Leibler Divergence, as proposed in [10] and as first used in [11];
• SVM(NKLD), which attempts to minimize a version of the Kullback-Leibler Divergence normalized via the logistic function, as first used in [11];
• SVM(AE), which uses Absolute Error as the loss, as first used in [22];
• SVM(RAE), which uses Relative Absolute Error as the loss, as first used in [22].
All ELM-based methods can train binary quantifiers only, since they rely on SVMperf, which is an inherently binary system. However, QuaPy allows the conversion of binary quantifiers into multi-class quantifiers (see Section 2.3).

Methods for training meta-quantifiers
Meta-quantifiers base their estimates on the estimates produced by other quantifiers, and are defined in the qp.method.meta module.

Ensembles:
A quantification ensemble receives as input any quantification method (any instance of BaseQuantifier). QuaPy implements some among the "ensemble" variants proposed in [27,26], which train the different members of the ensemble using different samples of the original training set; in particular:

• Averaging (policy='ave', default): computes class prevalence estimates as the average of the estimates returned by the base quantifiers.
• Training Prevalence (policy='ptr'): applies a dynamic selection to the ensemble's members by retaining only those members such that the class prevalence values in the samples they use as training set are closest to preliminary class prevalence estimates computed as the average of the estimates of all the members. The final estimate is recomputed by considering only the selected members.
• Distribution Similarity (policy='ds'): performs a dynamic selection of the base members by retaining the members trained on samples whose distribution of posterior probabilities is closest, in terms of the Hellinger Distance, to the distribution of posterior probabilities in the test sample.
• Performance (policy='<any-error-metric>'): performs a static selection of the ensemble members by retaining those that minimize a quantification error measure, which is passed as an argument.
When using either dynamic or static selection policies, one has to set the red_size parameter, which defines the number of members to be retained, as in the sketch below.
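For instance, the following sketch (class and argument names follow the qp.method.meta module as described above, and should be double-checked against the API) builds an ensemble of 30 HDy quantifiers and dynamically retains the 15 most suitable ones via distribution similarity:

```python
from quapy.method.meta import Ensemble
from quapy.method.aggregative import HDy
from sklearn.linear_model import LogisticRegression

# 30 base quantifiers; the 'ds' policy dynamically retains the red_size=15 members
# trained on samples most similar (Hellinger-wise) to the test sample
ensemble = Ensemble(quantifier=HDy(LogisticRegression()),
                    size=30, policy='ds', red_size=15)
```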

The QuaNet recurrent quantifier:
QuaPy provides an implementation of QuaNet, a deep-learning-based method for performing quantification on samples of textual documents, presented in [8]. QuaNet takes as input a list of document embeddings (see below), one for each unlabelled document, along with the posterior probabilities generated for these documents by a probabilistic classifier. The list is processed by a bidirectional LSTM that generates a sample embedding (i.e., a dense representation of the entire sample), which is then concatenated with a vector of class prevalence estimates produced by an ensemble of simpler quantification methods (CC, ACC, PCC, PACC, SLD). This vector is then transformed by a set of feed-forward layers, followed by ReLU activations and dropout, to compute the final estimates.
QuaNet thus requires a probabilistic classifier that can provide embedded representations of the inputs. QuaPy offers a basic implementation of such a classifier, based on convolutional neural networks, that returns its next-to-last representation as the document embedding.
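Instantiating QuaNet may thus look as follows (a sketch; module and class names follow QuaPy's documentation at the time of writing and may have changed):

```python
import quapy as qp
from quapy.method.meta import QuaNet
from quapy.classification.neural import NeuralClassifierTrainer, CNNnet

qp.environ['SAMPLE_SIZE'] = 100  # illustrative sample size

data = qp.datasets.fetch_reviews('kindle', pickle=True)
data = qp.data.preprocessing.index(data, min_df=5)  # map words to numeric indexes

# a CNN-based classifier that also returns document embeddings (see above)
cnn = NeuralClassifierTrainer(CNNnet(data.vocabulary_size, data.n_classes))

model = QuaNet(cnn, qp.environ['SAMPLE_SIZE'], device='cpu')
model.fit(data.training)
```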

Using binary quantifiers in multi-class quantification
QuaPy allows a set of binary quantifiers, one for each class, to be assembled into a single-label multi-class quantifier by adopting a "one-vs-all" strategy. This takes the form of computing prevalence estimates for each class via independently trained binary quantifiers, and then L1-normalizing the resulting vector of prevalence values so that these values sum up to one. In QuaPy this is accomplished by wrapping any binary quantifier within a OneVsAll object. For example, a quantifier defined as model = OneVsAll(SVMQ()) will allow SVMQ to work with single-label multi-class datasets.

Datasets
QuaPy makes available a number of datasets that have been used for experimentation purposes in the quantification literature, and specifically:

• Reviews: a collection of 3 datasets of customer reviews about (1) Kindle devices (KINDLE) and (2) the Harry Potter book series (HP), both already used in [11], and (3) the well-known IMDB movie reviews dataset (IMDB) [20]. All reviews are classified according to (binary) sentiment polarity. The number of training documents ranges from 3,821 (KINDLE) to 25,000 (IMDB), and the datasets present examples in which labelled data are balanced (IMDB, 50% positives), imbalanced (KINDLE, 92% positives), and severely imbalanced (HP, 98% positives).
• Twitter Sentiment: 11 datasets of tweets labelled by sentiment, as used in [14]. The raw text of the tweets is not available due to Twitter's Terms of Service, and tweets are instead provided as tf-idf-weighted vectors. Similarly to the Reviews datasets, these are high-dimensional datasets, with dimensionalities ranging from 199,151 to 1,215,742. These datasets use three sentiment labels (Positive, Neutral, Negative), and are thus useful for testing non-binary quantification methods.
• UCI: 33 binary datasets from the UCI Machine Learning repository [6], as used in [27]. Differently from the previous datasets, these non-textual datasets are low-dimensional (with dimensionalities ranging from 3 to 256), thus providing diversity, in terms of type of data, with respect to the previous two sets of datasets.
QuaPy defines a simple Dataset interface that allows importing any custom dataset into the QuaPy environment. A Dataset object in QuaPy is essentially a pair of LabelledCollection objects, playing the role of the training set and of the test set, respectively. LabelledCollection is a data class consisting of the instances and their labels. This class implements the core sampling functionality of QuaPy, which is then exploited by the evaluation tools (Section 4.2) and by the model selection tools (Section 5). QuaPy supports the definition of samples that are consistent across runs, in order to allow testing different quantification methods on the very same samples.
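A sketch of how custom data can be wrapped (class names as given above; the exact signature of the sampling call is an assumption based on the described functionality):

```python
import numpy as np
from quapy.data import Dataset, LabelledCollection

# random placeholder data standing in for a user's own instances and labels
X_train, y_train = np.random.rand(1000, 10), np.random.randint(0, 2, 1000)
X_test, y_test = np.random.rand(500, 10), np.random.randint(0, 2, 500)

data = Dataset(LabelledCollection(X_train, y_train),
               LabelledCollection(X_test, y_test))

# draw a sample of 200 test items exhibiting a prescribed prevalence (30% positives)
sample = data.test.sampling(200, 0.3)
```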

Evaluation
Evaluating a quantifier requires measuring how good it is at predicting the class prevalence values of a test sample, which may have different class prevalence values than those observed on the training data.
The evaluation of quantifiers is a complex task, since it depends on many aspects.
For example, the same difference, in absolute value, between the true and the predicted prevalence values may have a different "cost" depending on the original true prevalence value: predicting 0.5 prevalence when the true prevalence is 0.49 can be considered, in some application contexts, a less blatant error than predicting a prevalence of 0.01 when the true prevalence is 0.00. In some other application contexts, though, the two above-mentioned estimation errors may be considered equally serious [30]. This means that sometimes we may want to use a certain evaluation measure and some other times we may want to use a different one.
Additionally, for some application contexts we may be interested in measuring the quantification error only on samples whose class prevalence values do not differ too much from those of the training set, because we assume distribution shift, in practice, to always be limited in magnitude. Conversely, in some other application contexts, we may want to test our quantifiers also in situations characterized by extreme values of distribution shift, because we expect our environment to be characterized by high variability, and because we want our quantifiers to be robust also to possibly extreme amounts of shift.
As a result, an environment for experimenting with quantification must not only be endowed with several evaluation measures, but it also must allow the experimentation to be carried out according to different evaluation protocols.

Error measures
Several error measures have been proposed in the literature [30], and QuaPy implements a rich set of them, including:

• ae: absolute error;
• rae: relative absolute error;
• kld: the Kullback-Leibler Divergence;
• nkld: a version of the Kullback-Leibler Divergence normalized via the logistic function;

together with the corresponding versions averaged across the samples of an evaluation run (e.g., mae, mrae, mkld).
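For instance (a sketch; function names as listed above, from the qp.error module):

```python
import numpy as np
import quapy as qp

true_prev = np.asarray([0.20, 0.30, 0.50])
estim_prev = np.asarray([0.25, 0.30, 0.45])

# point-wise error between a true and an estimated prevalence vector
print('AE :', qp.error.ae(true_prev, estim_prev))
print('RAE:', qp.error.rae(true_prev, estim_prev))
print('KLD:', qp.error.kld(true_prev, estim_prev))
```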

Evaluation protocols
QuaPy implements both the natural prevalence protocol (NPP) and the artificial prevalence protocol (APP).
In the NPP, the test set is sampled randomly, so that most samples exhibit class prevalence values not too different from those of the test set.
In the APP, the test set is instead sampled in a controlled way, in order to generate samples characterized by different, pre-specified prevalence values, so as to cover, with uniform probability, the full spectrum of class prevalence values.
In the APP the user specifies the number of equidistant points to be generated from the interval [0,1]. For example, if n_prevpoints=11 then, for each class, the prevalence values [0.0, 0.1, ..., 0.9, 1.0] will be used. This means that, for two classes, the number of different sampled prevalence combinations will be 11 (since, once the prevalence of one class is determined, the other one is also). For 3 classes, the number of valid combinations can be obtained as 11 + 10 + ... + 1 = 66; in general, for $n$ classes and $p$ prevalence points, the number of valid combinations equals $\binom{p+n-2}{n-1}$. The number of valid combinations (i.e., those that sum up to one) that will be produced for a given value of n_prevpoints across n classes can be determined by invoking quapy.functional.num_prevalence_combinations, e.g.:

```python
import quapy.functional as F
n = F.num_prevalence_combinations(n_prevpoints=21, n_classes=4, n_repeats=1)
```

In this example, n = 1771. The last argument, n_repeats, sets the number of samples that will be generated for each valid combination (typical values are 10 or higher, in order to support the computation of standard deviations and to perform statistical significance tests).
One can instead work the other way around, i.e., set an evaluation budget so as to obtain the number of prevalence points that generates a number of samples close to, but not higher than, the fixed budget; this is the role of the function get_nprevpoints_approximation, illustrated in the sketch below (the budget value of 5000 is illustrative). For this budget and 4 classes, the function determines that, by setting n_prevpoints=30, the number of samples will be n = 4960:
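```python
import quapy.functional as F

budget = 5000  # illustrative value: the maximum number of samples we can afford
n_prevpoints = F.get_nprevpoints_approximation(budget, n_classes=4, n_repeats=1)
n = F.num_prevalence_combinations(n_prevpoints, n_classes=4, n_repeats=1)
print(n_prevpoints, n)  # 30, 4960
```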
QuaPy implements evaluation functions that allow the user to specify either the n_prevpoints value or an evaluation budget. The following script shows a full example in which a PACC model, relying on a classifier trained via logistic regression, is tested on the HP dataset by means of the APP protocol on samples of size 500, setting a budget of 1000 test samples, and evaluated in terms of various metrics (mae, mrae, mkld). The resulting report is a pandas dataframe:
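A sketch of such a script follows (function and argument names reflect QuaPy's evaluation module at the time of writing, and should be double-checked against the current API):

```python
import quapy as qp
from quapy.method.aggregative import PACC
from sklearn.linear_model import LogisticRegression

qp.environ['SAMPLE_SIZE'] = 500

data = qp.datasets.fetch_reviews('hp', tfidf=True, min_df=5)

model = PACC(LogisticRegression())
model.fit(data.training)

report = qp.evaluation.artificial_sampling_report(
    model,                      # the quantifier to evaluate
    data.test,                  # the set from which APP samples are drawn
    sample_size=qp.environ['SAMPLE_SIZE'],
    eval_budget=1000,           # at most 1000 test samples
    error_metrics=['mae', 'mrae', 'mkld'],
    verbose=True)
print(report)
```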

Model selection
Quantification has long been regarded as a by-product of classification, which means that the model selection (i.e., hyperparameter optimization) strategies customarily adopted in quantification have simply been borrowed from classification. It has been argued in [22] that specific model selection strategies should be adopted for quantification. That is, model selection strategies for quantification should minimize quantification-oriented loss measures, and be carried out on a variety of scenarios exhibiting different degrees of distribution shift.

QuaPy supports quantification-oriented model selection by implementing, in the class qp.model_selection.GridSearchQ, a grid-search exploration over the space of hyperparameter combinations that evaluates each such combination by means of a given quantification-oriented error metric (see Section 4.1), and according to either the APP (the default) or the NPP.
The following is an example of quantification-oriented model selection using GridSearchQ. In this example, model selection is performed with a fixed budget of 1000 evaluations for each combination of hyperparameters. The loss function to minimize is MAE, a quantification-oriented error measure, as evaluated on randomly drawn samples at equidistant prevalence values covering the entire spectrum (APP protocol) on a stratified held-out portion consisting of 40% of the training set. A sketch of the script (class and argument names should be double-checked against the current API):
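```python
import quapy as qp
from quapy.method.aggregative import PACC
from sklearn.linear_model import LogisticRegression

qp.environ['SAMPLE_SIZE'] = 500

# `data` is a Dataset previously loaded (see Section 3)
model = qp.model_selection.GridSearchQ(
    model=PACC(LogisticRegression()),
    param_grid={'C': [1e-3, 1e-2, 1e-1, 1, 10, 100, 1000],
                'class_weight': ['balanced', None]},
    sample_size=qp.environ['SAMPLE_SIZE'],
    eval_budget=1000,   # at most 1000 evaluations per hyperparameter combination
    error='mae',        # the quantification-oriented loss to minimize
    protocol='app',     # the artificial prevalence protocol (the default)
    val_split=0.4,      # stratified held-out portion of the training set
    verbose=True).fit(data.training)
```

In this example, the system returns:

```
best hyper-params={'C': 0.1, 'class_weight': 'balanced'}
MAE=0.20342
```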

Result visualization
QuaPy implements some plotting functions that can be useful for displaying the performance of the tested quantification methods (see the sketch after this list):

• Diagonal plot: the diagonal plot offers a very insightful view of a quantifier's performance, i.e., it plots the predicted class prevalence (on the y-axis) against the true class prevalence (on the x-axis), averaging across all samples characterized by the same true prevalence. Unfortunately, this visualization device is inherently limited to binary quantification (one can simply generate as many diagonal plots as there are classes, though, by indicating which class should be considered the target of the plot).
• Error-by-Shift plot: this plot displays the quantification error made by a quantifier as a function of the distribution shift between the training set and the test sample, averaging across all samples characterized by the same amount of distribution shift. Both quantification error and distribution shift can be measured in terms of any measure among those described in Section 4, and can be computed and plotted both in the binary case and in the non-binary case.
• Bias-Box plot: this plot aims at displaying, by means of box plots, the bias that any quantifier exhibits with respect to the training class prevalence values. The bias can be broken down into different bins, e.g., distinguishing the bias in cases of low, medium, and high prevalence shift.
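A sketch of how these plots can be generated (function names follow the quapy.plot module as documented at the time of writing; method_names, true_prevs, estim_prevs, and tr_prevs are assumed to be the outputs collected from an APP evaluation run):

```python
import quapy as qp

# diagonal plot: predicted vs. true prevalence (binary only)
qp.plot.binary_diagonal(method_names, true_prevs, estim_prevs,
                        savepath='diagonal.png')

# error as a function of the prior shift between training set and test samples
qp.plot.error_by_drift(method_names, true_prevs, estim_prevs, tr_prevs,
                       error_name='ae', savepath='error_by_shift.png')

# box plots of bias, broken down into bins
qp.plot.binary_bias_bins(method_names, true_prevs, estim_prevs,
                         savepath='bias_bins.png')
```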
In Figure 1 we show examples of each of the above types of plot, as resulting from the experiments that we will discuss in Section 7.

Experiments
In this section we present some experiments that we have carried out in order to showcase some among the features of QuaPy. The code to replicate all these experiments, and to generate the relative tables and plots, can be accessed via GitHub. As datasets, we consider the set of UCI Machine Learning datasets used in [27], consisting of 33 binary datasets (see Section 3). Following [27], we remove the "frustratingly easy" datasets acute.a, acute.b, and iris.1, where even a trivial CC approach manages to yield zero quantification error. The datasets do not come with a predefined train/test split; we thus carry out an evaluation based on 5-fold cross-validation and report the average quantification error across the 5 test folds. Each iteration thus defines a training set L (4 folds) and a test set U (1 fold). We choose AE as our error metric and adopt the APP protocol for evaluation. For each method and test set U we generate m = 100 different random samples of q = 100 instances each, at prevalence values in the range [0.00, 0.05, ..., 0.95, 1.00], via selective undersampling, and report the resulting MAE value. Each MAE value we report thus corresponds to the average of 10,500 experiments (100 samples × 21 class prevalence values × 5 folds).
For model selection, we split the training set L into a proper training set L Tr (consisting of 60% of L) and a held-out validation set L Va (the remaining 40%) in a stratified way. For each combination of hyperparameters we train the model using L Tr and evaluate the performance on L Va in terms of MAE by following the APP protocol [22]; in this case we use q = 100 and m = 25. Once the best values of the hyperparameters have been identified, we re-train the method using the entire training set.
All quantifiers we consider in this demonstration are either aggregative quantifiers or ensembles of aggregative base quantifiers, which means that all of them rely on an underlying classifier. We consider Logistic Regression (LR) as our default classifier-training algorithm in all cases, except for the methods from the "explicit loss minimization" camp, which instead natively rely on SVMperf. The set of hyperparameters to optimize includes the regularization parameter C (common to LR and SVMs), taking values in {10^-3, 10^-2, ..., 10^2, 10^3}, and the parameter class_weight (only for LR), which may take values 'balanced' (which has the effect of giving more weight to training examples from the less frequent classes) or None (which has the effect of giving the same weight to all training examples).
As the learning methods we consider CC; its variants PCC, ACC, and PACC; Forman's variants MAX, MS, and MS2 (to avoid clutter, we report only the three Forman's variants that have worked best in most of the experiments reported in [12]; additional experiments that we have run, and that we do not report in this paper, confirm that T50 and X perform much worse than the other methods); the expectation-maximization-based SLD method; the mixture model HDy; SVM(AE) as the representative of the "explicit loss minimization" family (among all ELM-based methods, we choose the one that minimizes the same loss that we adopt for evaluating the results; we do not consider the other variants SVM(Q), SVM(KLD), SVM(NKLD), and SVM(RAE) since, in recent evaluations (see, e.g., [21,22]), they have consistently underperformed the other competitors); and E(HDy)-DS as the representative of ensemble methods (since it is the one which fared best in the experiments of [26]). Despite the fact that classifiers trained by LR are considered inherently well-calibrated (see, e.g., https://scikit-learn.org/stable/modules/calibration.html), it has been found [?] that re-calibrating LR brings additional benefits to SLD; in our experiments we thus instantiate SLD with a re-calibrated version of LR, and we indeed observe this to improve results noticeably. However, re-calibration does not deliver any improvement for any other probabilistic quantifier that we test here, and instead shows a tendency to deteriorate the results; for this reason, we use a re-calibrated LR only for SLD, and a "standard" LR in all other cases. For E(HDy)-DS we set the number of base quantifiers to size=30 and the number of members to be selected dynamically to red_size=15, and perform model selection independently for each base member.

Table 1 reports the MAE results of this experimentation. Our results are fairly consistent with those reported in [21,22], and seem to indicate that the strongest method of all is SLD, which obtains the best average MAE result, the best average rank, and is the best method on 13 datasets out of 30. Methods E(HDy)-DS (8 times the best method), PACC (4 times the best method), and (to a lesser extent) ACC (2 times the best method) also seem to perform very well, obtaining average ranks not statistically significantly different from the best average rank (obtained by SLD). Method SVM(AE) tends to produce results that are markedly worse than those of the other competitors. In line with the observations of [29], none of the variants MAX, MS, MS2 manages to improve over ACC. Also in line with the findings of [26], the ensemble E(HDy)-DS clearly outperforms the base quantifier HDy it is built upon.

The third plot of Figure 1 displays, by means of box plots, the bias of each quantifier across all combinations of training and test samples. This diagram reveals that PACC, SLD, and E(HDy)-DS are the methods displaying the lowest bias overall, given that their boxes (delimiting the first and third quartiles) are the most squashed, and given that their whiskers (maximum and minimum, disregarding outliers) are the shortest. One interesting fact that is clearly revealed by this box plot is, in line with what is reported in [26], the ability of the ensemble method E(HDy)-DS to reduce the variance of the base quantifier it is built upon (HDy). It is also interesting to note how the heuristic implemented in MS2 drastically reduces the variance produced by MS.
The last plot (bottom right) displays bias trends with samples binned according to their true prevalence; it clearly shows how the "unadjusted" methods (e.g., CC, PCC) display positive bias for low prevalence values (thus overestimating the true prevalence) and negative bias for high prevalence values (thus underestimating the true prevalence), while the "adjusted" versions (ACC and PACC) reduce this effect, since they tend to display box plots centred at zero bias in those cases. This plot also clearly shows that MS tends to display a huge positive bias in the low-prevalence regime, while SVM(AE) displays a huge negative bias in the high-prevalence regime.
Note that the results presented here are meant only to illustrate the functionality of QuaPy, and should not be taken as an absolute statement on the relative merits of the different quantification methods. For instance, a different batch of experiments (those reported in [21], dealing with sentiment quantification on datasets of tweets) tells a slightly different story, since it reports a much larger difference in accuracy between the top-performing methods (SLD, PACC, ACC) and lesser-performing ones (CC, PCC, SVM(AE), and others). One of the main differences between the experiments in this paper and those in [21] is that here we work on binary quantification only, while [21] tackled single-label multiclass quantification (since all datasets used there were ternary). As always, a complete understanding of the relative merits of different learning methods can only be obtained through multiple, varied sets of experiments (see also [29]).

Conclusions
Quantification is a research topic of growing interest in the areas of machine learning, data mining, and information retrieval. We have presented QuaPy, an open-source, Python-based package that makes available a rich set of quantification methods, tools, experimental protocols, and datasets, with the goal of supporting efficient and scientifically sound experimentation with quantification methods. We think that QuaPy will be of help to machine learning researchers who work on developing new quantification algorithms, as it provides them with many baselines to compare against, datasets to test their methods on, and tools that implement all the typical steps of quantification-based experimentation, from data preparation to the visualization of results. We think that QuaPy will be of help also to researchers and practitioners in other disciplines who simply need to apply quantification in their own work, as it provides them with a streamlined workflow, a wide choice of different approaches, and quick access to the package thanks to the support of installation based on pip.
QuaPy is an open-source project, licensed under the BSD-3 licence; its repository will be updated following the advances in quantification research, and it is open to contributions of new methods, tools, and datasets.