A Critical Reassessment of the Saerens-Latinne-Decaestecker Algorithm for Posterior Probability Adjustment

We critically re-examine the Saerens-Latinne-Decaestecker (SLD) algorithm, a well-known method for estimating class prior probabilities (“priors”) and adjusting posterior probabilities (“posteriors”) in scenarios characterized by distribution shift, i.e., by a difference between the distribution of the priors in the training documents and their distribution in the unlabelled documents. Given a machine-learned classifier and a set of unlabelled documents for which the classifier has returned posterior probabilities and estimates of the prior probabilities, SLD updates them both in an iterative, mutually recursive way, with the goal of making both more accurate; this is of key importance in downstream tasks such as single-label multiclass classification and cost-sensitive text classification. Since its publication, SLD has become the standard algorithm for improving the quality of the posteriors in the presence of distribution shift, and SLD is still considered a top contender when we need to estimate the priors (a task that has become known as “quantification”). However, its real effectiveness in improving the quality of the posteriors has been questioned. We here present the results of systematic experiments conducted on a large, publicly available dataset, across multiple amounts of distribution shift and multiple learners. Our experiments show that SLD improves the quality of the posterior probabilities and of the estimates of the prior probabilities, but only when the number of classes in the classification scheme is very small and the classifier is calibrated. As the number of classes grows, or as we use non-calibrated classifiers, SLD converges more slowly (and often does not converge at all), performance degrades rapidly, and the impact of SLD on the quality of the prior estimates and of the posteriors becomes negative rather than positive.


INTRODUCTION
Single-label text classification is the task of training a text classifier h : X → Y that labels each document x_i ∈ X with a class h(x_i) ∈ Y; X is a (possibly infinite) set of documents (the domain), while Y = {y_1, . . . , y_|Y|} is a finite set of classes (the codeframe, or classification scheme).
The classifiers trained by means of modern machine learning methods usually return, together with the class assigned to the document, a vector (s(x_i, y_1), . . . , s(x_i, y_|Y|)) of confidence scores, where s(x_i, y_j) by and large represents the confidence (or the strength of belief) that the classifier has in the fact that x_i belongs to y_j; the class h(x_i) assigned to document x_i is thus the one with the highest confidence score, i.e.,

$$h(x_i) = \arg\max_{y_j \in \mathcal{Y}} s(x_i, y_j) \quad (1)$$

Classifiers that return confidence scores are sometimes called scoring classifiers [12]. Without loss of generality,¹ we may assume that these confidence scores are actual probabilities (if so, these are called posterior probabilities, or simply posteriors), i.e., we may assume that the vector being returned has the form (Pr(y_1|x_i), . . . , Pr(y_|Y||x_i)), where $\sum_{j=1}^{|\mathcal{Y}|} \Pr(y_j|x_i) = 1$ and Pr(y_j|x_i) represents the probability that the classifier "subjectively" attributes to the fact that x_i belongs to class y_j. Rather than simple classifiers, these models are full-blown probability estimators.
The posteriors play an important role in several tasks, a role that goes beyond allowing a classification decision to be taken by means of Equation (1). One of these tasks is document ranking, as when the documents are ranked in decreasing order of the probability Pr(y_j|x_i) that they belong to a certain class y_j; ranking is useful, for instance, when performing active learning by means of relevance sampling [21], or when one needs to choose the best k documents for a certain class, or when one needs to choose the best k classes for a certain document. Another such task is cost-sensitive classification, where classification is performed in such a way that

$$h(x_i) = \arg\min_{y_j \in \mathcal{Y}} \sum_{y_l \in \mathcal{Y}} \lambda_{jl} \cdot \Pr(y_l|x_i)$$

where λ_jl represents the "cost" of classifying a document in class y_j when it should have been classified in class y_l (this cost is equal to 0 when j = l and higher than 0 when j ≠ l); in other words, x_i is assigned to the class such that the expected cost (i.e., the risk) of assigning x_i to it is minimum. Example applications of cost-sensitive text classification may be found, for instance, in spam filtering [5] or in technology-assisted review [30].
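As an illustration of this decision rule, here is a minimal sketch (the array names posteriors and cost_matrix are our own, hypothetical choices) of how the expected-cost-minimizing class can be computed with NumPy:

```python
import numpy as np

def cost_sensitive_decision(posteriors, cost_matrix):
    """Assign each document to the class that minimizes the expected cost (risk).

    posteriors  : array of shape (n_docs, n_classes); each row sums to 1
    cost_matrix : array of shape (n_classes, n_classes), where cost_matrix[j, l]
                  is the cost lambda_jl of assigning class y_j to a document
                  whose true class is y_l (0 on the diagonal)
    """
    # expected_cost[i, j] = sum_l cost_matrix[j, l] * Pr(y_l | x_i)
    expected_cost = posteriors @ cost_matrix.T
    return expected_cost.argmin(axis=1)
```

With a 0/1 cost matrix, minimizing the expected cost reduces to the standard decision rule of Equation (1), i.e., to picking the class with the highest posterior.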
Of course, to guarantee that single-label multiclass classification, ranking, cost-sensitive classification, and other such tasks are executed with high accuracy, the posteriors must be accurate, too. An intuition of what "accurate posteriors" means can be provided by the following example: if 10% (respectively, 90%) of all the documents x_i for which Pr(y_j|x_i) = 0.5 indeed belong to y_j, we can say that the classifier has overestimated (respectively, underestimated) the probability that these documents belong to y_j, and that their posteriors are thus inaccurate. Indeed, we say (see for instance Reference [15]) that the posteriors Pr(y_j|x_i), where x_i belongs to a set S = {x_1, . . . , x_|S|}, are (perfectly) calibrated (i.e., accurate) when, for all a ∈ [0, 1], it holds that²

$$\frac{|\{x_i \in S \cap y_j \mid \Pr(y_j|x_i) = a\}|}{|\{x_i \in S \mid \Pr(y_j|x_i) = a\}|} = a \quad (3)$$
The classifiers trained by means of some learners (such as logistic regression) are known to return reasonably well-calibrated probabilities. Those trained by means of some other learners (such as Naïve Bayes) return probabilities that are known to be not well calibrated [9]. Yet other learners (such as SVMs or AdaBoost) train classifiers that return confidence scores that are not probabilities (i.e., that do not range on [0,1] and/or that do not sum up to 1). To address these two latter cases, probability calibration mechanisms exist (see, e.g., References [28, 29, 31, 40, 41]) that convert the outputs of these classifiers into well-calibrated probabilities. However, even when using text classifiers that tend to return well-calibrated probabilities, or even when using the probability calibration methods mentioned above, the accuracy of the posteriors tends to be low if the problem setting exhibits dataset shift (see, e.g., Reference [33]), i.e., if the joint distribution p_L(x, y) in the training set is different from the joint distribution p_U(x, y) in the unlabelled set. One particular type of dataset shift that interests us is distribution shift [2], which occurs when the distribution p_L(y) of the prior probabilities Pr(y_j) (or simply priors) in the training set and the distribution p_U(y) of the prior probabilities in the unlabelled set differ. To see why this is the case, note that a probabilistic classifier computes the posteriors via Bayes' theorem, i.e., as

$$\Pr(y_j|x_i) = \frac{\Pr(x_i|y_j)\Pr(y_j)}{\Pr(x_i)} \quad (4)$$

where the posterior Pr(y_j|x_i) on the left-hand side directly depends on the prior Pr(y_j) on the right-hand side. Since the prior Pr(y_j) has been estimated on the training set (i.e., its value has been set to Pr_L(y_j), which is distributed as p_L(y)), if Pr_L(y_j) is higher (respectively, lower) than Pr_U(y_j) (which is distributed as p_U(y)), then the posteriors Pr(y_j|x_i) of the documents in U will be overestimated (respectively, underestimated).³ Ideally, to have well-calibrated posteriors even in the presence of distribution shift, we would need to set Pr(y_j) in Equation (4) to Pr_U(y_j), and not to Pr_L(y_j). But this is impossible, since Pr_U(y_j) is unknown at training time. The only known way out of this conundrum is provided by the Saerens-Latinne-Decaestecker algorithm (that we here call SLD for brevity),⁴ an algorithm that iteratively re-estimates the priors Pr_U(y_j) of the unlabelled set and adjusts the posteriors Pr(y_j|x_i) in a mutually recursive way [34]. This algorithm is essentially unique in its kind and, to the best of our knowledge, no other algorithm that attempts to adjust the posteriors in the presence of distribution shift has been proposed since its publication. (An exception is the algorithm described in Reference [38]; in Section 6, we discuss why we do not consider it a contender.) As a result, SLD has become a standard, and it is frequently used in scenarios characterized by distribution shift, either when the goal is improving the accuracy of the posteriors or when the goal is obtaining estimates of the priors more accurate than those obtained by the trivial "classify and count" method (the latter task is known as supervised prevalence estimation, or quantification [19]).

2 Perfect calibration is usually unattainable on any non-trivial dataset; however, calibration comes in degrees (and the quality of calibration can indeed be measured; see Section 3.1.2), so efforts can be made to obtain posteriors that are as close as possible to their perfectly calibrated counterparts.
3 In other words, the fact that distribution shift brings about a low quality of the posteriors is due to the fact that distribution shift, like all types of dataset shift, invalidates the iid assumption (according to which the training examples and the unlabelled examples are drawn from the same distribution) on which probability calibration methods rely.
4 In a number of other publications [17, 22, 23] the same algorithm was called EMQ, standing for "Expectation Maximization for Quantification"; in yet other publications [3] it is called RS, standing for "rescaling algorithm."

However, in recent experiments aimed at improving the quality of cost-sensitive text classification in technology-assisted review [22, 23], SLD has not delivered any measurable improvement in the quality of the posteriors. Since these experiments were limited in scope, we decided to engage in a large-scale experimentation of SLD, with the goal of reassessing its true ability to (i) accurately re-estimate the priors Pr_U(y_j) of the unlabelled set and (ii) improve the quality of the posteriors Pr(y_j|x_i) of the unlabelled documents. Note that goal (ii) is more important than goal (i), since the ability of SLD to estimate the priors has been systematically tested in previous works (e.g., Reference [10]), and since (as mentioned before) SLD is essentially the only known algorithm for improving the quality of already calibrated posteriors, while there are many alternatives to it (see the extensive review by González et al. [19]) when it comes to estimating the priors. We thus present systematic experiments involving different learners, different datasets, and different amounts of distribution shift, in which we try to assess the real benefits of using SLD.
The rest of the article is structured as follows: Section 2 introduces and discusses the SLD algorithm in detail. In Section 3, we present the systematic experimentation to which we have subjected SLD, and in Section 4, we present its results; in Section 5, we discuss exactly which kinds of distribution shift we target in our experiments. Section 6 discusses some related work, while Section 7 concludes.

THE SLD ALGORITHM
We assume a training set L of labelled examples and a set U of unlabelled examples, i.e., examples whose true labels t(x_i) ∈ Y = {y_1, . . . , y_|Y|} are unknown to the system. SLD, proposed by Saerens et al. [34], is an instance of Expectation Maximization [8], a well-known iterative algorithm for finding maximum-likelihood estimates of parameters (in our case: the class prior probabilities) for models that depend on unobserved variables (in our case: the class labels). Pseudocode of the SLD algorithm is here included as Algorithm 1.
Essentially, SLD iteratively updates (Line 11) the class priors by using the posterior probabilities computed in the previous iteration and updates (Line 13) the posterior probabilities by using the class priors computed in the present iteration in a mutually recursive fashion.
The main goal is to adjust the posteriors and re-estimate the priors in such a way that they are consistent with each other, where this "mutual consistency" means that they should be such that

$$\hat{\Pr}_U(y_j) = \frac{1}{|U|} \sum_{x_i \in U} \Pr(y_j|x_i) \quad (5)$$

In Appendix A, we show that Equation (5) is a necessary (albeit not sufficient) condition for the posteriors Pr(y_j|x_i) of the documents x_i ∈ U to be calibrated. SLD may thus be viewed as making a step towards calibrating these posteriors. The algorithm iterates until convergence, i.e., until the class priors become stable and Equation (5) is satisfied. The convergence of SLD may be tested by computing how much the distribution of the priors at iteration (s − 1) and that at iteration s still diverge; this can be evaluated, for instance, in terms of absolute error, i.e.,⁵

$$\mathrm{AE}(\hat{p}^{(s-1)}_U, \hat{p}^{(s)}_U) = \frac{1}{|\mathcal{Y}|} \sum_{y_j \in \mathcal{Y}} \left| \hat{\Pr}^{(s)}_U(y_j) - \hat{\Pr}^{(s-1)}_U(y_j) \right|$$

ALGORITHM 1: The SLD algorithm [34]. Input: class priors Pr_L(y_j) on L, for all y_j ∈ Y; posterior probabilities Pr(y_j|x_i), for all y_j ∈ Y and all x_i ∈ U. Output: estimates Pr_U(y_j) of the class prevalence values on U, for all y_j ∈ Y; updated posterior probabilities Pr(y_j|x_i), for all y_j ∈ Y and all x_i ∈ U.

In the experiments of Section 3, we decree that convergence has been reached when AE(p^(s−1)_U, p^(s)_U) < 10^−6; we stop SLD when we have reached either convergence or the maximum number of iterations (which we set to 1,000).
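To make the update rules concrete, the following is a minimal re-implementation sketch of Algorithm 1 in Python/NumPy (this is our own illustrative paraphrase, not the code we release with this article), using the same stopping rule just described:

```python
import numpy as np

def sld(posteriors, train_priors, epsilon=1e-6, max_iter=1000):
    """EM-based SLD adjustment of posteriors and priors.

    posteriors   : array (n_docs, n_classes) with the posteriors Pr(y_j|x_i)
                   returned by a classifier trained with class priors train_priors
    train_priors : array (n_classes,) with the class priors Pr_L(y_j) on L
    Returns the re-estimated priors on U and the adjusted posteriors.
    """
    priors = train_priors.copy()                  # initialization: Pr_U^(0) = Pr_L
    adjusted = posteriors.copy()
    for _ in range(max_iter):
        previous = priors
        # E-step: rescale each posterior by the ratio (current prior / training prior),
        # then renormalize so that the posteriors of each document sum to 1
        scaled = posteriors * (priors / train_priors)
        adjusted = scaled / scaled.sum(axis=1, keepdims=True)
        # M-step: re-estimate the priors as the mean of the adjusted posteriors,
        # which enforces the mutual consistency of Equation (5)
        priors = adjusted.mean(axis=0)
        if np.abs(priors - previous).mean() < epsilon:   # convergence test (AE)
            break
    return priors, adjusted
```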
At each iteration s of the algorithm, all the posteriors relative to class y_j are multiplied by the same amount. As a consequence, the net effect of SLD is to multiply all these posteriors by the same amount Pr_U(y_j)/Pr^(0)_U(y_j), so that the resulting posteriors Pr(y_j|x_i) are consistent with the resulting class prior Pr_U(y_j), i.e., so that Equation (5) is satisfied; in other words, SLD is an iterative rescaling algorithm. The posteriors for different classes, though, do not get multiplied by the same amount; this is somehow obvious, since at the end of the process the posteriors for document x_i must all sum up to 1, which means that if the posteriors for a class y′ all end up increasing, there must be at least one class y′′ whose posteriors all end up decreasing. SLD, as proposed by Saerens et al. [34] and as described here, addresses single-label classification, i.e., the task in which exactly 1 out of |Y| classes must be assigned to each document. This means that SLD can be used for binary classification (which is single-label classification with |Y| = 2), for single-label multiclass classification (which is single-label classification with |Y| > 2), and for multi-label classification (which is the task in which any number of classes in Y can be assigned to a document), since multi-label classification can be trivially recast into |Y| independent binary classification tasks.
It is worth pointing out something that Saerens et al. [34] did not observe, i.e., that the combination of (i) a learner that trains classifiers to return posterior probabilities, and (ii) the SLD algorithm that improves the quality of the posterior probabilities for a given set of unlabelled documents U, might be called a transductive algorithm [39], since it uses training documents to infer posterior probabilities only for a specific, finite set of unlabelled documents known at training time. This is different from standard inductive algorithms, which use training documents to infer a general-purpose hypothesis that can later be applied to the entire domain. One aspect of this transductive nature is that SLD must operate "holistically," i.e., on entire sets of unlabelled documents, and cannot, for instance, update the posteriors of individual unlabelled documents in isolation from each other; another aspect is that, as Saerens et al. [34, p. 35] put it, "the model has to be completely refitted each time it is applied to a new data set."

Interestingly enough, SLD was originally designed with the goal of improving the posteriors in order to improve the accuracy of classification (by means of Equation (1)) in the presence of distribution shift. The fact that it also allows estimating the priors in a more accurate way than by just "classifying and counting" was considered a by-product by its authors. However, in the years that followed, thanks to the increased interest in the "quantification" task (see Section 1), SLD became a popular baseline for algorithms whose goal is the estimation of the priors, and in recent extensive experimentation it has been found to be a top-notch performer for this task [25].
In Reference [34], the quality of the posteriors generated by means of SLD was measured in terms of error rate, i.e., the fraction of classification decisions that are wrong. However, a major difference between error rate and the measures we will instead use for the same purpose (see Section 3.1.2) is that the former, unlike the latter, evaluates not the posterior probabilities per se but the classification decisions that are based on them. Error rate is thus only an "indirect" measure of the quality of the posteriors, and a coarse one, too. To see this, let us assume we are dealing with binary classification, and let us consider a document x i such that its true class is y 1 . According to Equation (1), posteriors Pr(y 1 |x i ) = .51 and Pr(y 2 |x i ) = .49 would lead to x i being correctly classified into y 1 , and so would posteriors Pr(y 1 |x i ) = .99 and Pr(y 2 |x i ) = .01. The former set of posteriors is equivalent to the latter set as far as error rate is concerned; however, we intuitively consider the latter set "better" than the former set, and the measures we discuss in Section 3.1.2 indeed consider it as such. Note also that classification (as implemented by means of Equation (1)) is just a downstream application of the posteriors, and there are many such potential applications, such as (as already recalled in the introduction) ranking and cost-sensitive classification; rather than evaluating the posteriors by evaluating one of their potential applications, it seems more sensible to evaluate them directly, which can be done by means of the measures of Section 3.1.2.
In Reference [34], SLD was subjected to a small-scale experimentation, which involved the binary case only. The experiments we conduct in this article are instead carried out on a very large scale and involve both binary and multiclass classification.

EXPERIMENTS
In this section, we report systematic experiments in which, using a variety of datasets, learners, and amounts of distribution shift, we compare the quality of the priors and (above all) of the posteriors before the application of SLD with that after the application of SLD. This allows us to see when and in what conditions the application of SLD is beneficial.

Evaluation Measures
We evaluate SLD in terms of two main criteria, i.e., (i) the ability to improve the accuracy of the estimated class priors with respect to the trivial "Classify & Count" estimator, and (ii) the ability to improve the accuracy of the posterior probabilities with respect to the ones originally returned by the classifier.

Evaluating the Priors.
For evaluating the quality of the estimated class priors we use normalized absolute error (NAE) (see, e.g., Reference [35, §4.2]), defined as

$$\mathrm{NAE}(p_U, \hat{p}_U) = \frac{\sum_{y_j \in \mathcal{Y}} |\hat{p}_U(y_j) - p_U(y_j)|}{2 \left( 1 - \min_{y_j \in \mathcal{Y}} p_U(y_j) \right)}$$

where p_U and p̂_U indicate the true class distribution and the predicted class distribution, respectively, on the set U of unlabelled documents. The reason we use NAE is that, besides its simplicity, it is also (as argued in Reference [35]) one of the theoretically most satisfying measures for evaluating the quality of class priors; NAE ranges between 0 (best) and 1 (worst). In all the tables of results that we include in Section 4, we compare the estimates of the class priors before applying SLD, computed by "classifying and counting," i.e., as

$$\hat{\Pr}_U(y_j) = \frac{|\{x_i \in U \mid h(x_i) = y_j\}|}{|U|}$$

with the same estimates after applying SLD (which are the values of Pr_U(y_j) resulting from Line 18 of Algorithm 1).
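A minimal sketch of how NAE can be computed follows (this is our own rendering of the definition above; the function and argument names are hypothetical):

```python
import numpy as np

def nae(true_priors, predicted_priors):
    """Normalized absolute error between two class distributions.

    Both arguments are arrays of shape (n_classes,) summing to 1. The absolute
    error is divided by the maximum value it can attain given the true
    distribution, so that NAE ranges between 0 (best) and 1 (worst).
    """
    ae = np.abs(predicted_priors - true_priors).sum()
    max_ae = 2 * (1 - true_priors.min())
    return ae / max_ae
```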

Evaluating the Posteriors.
For evaluating the quality of the posterior probabilities, the measure we use is the Brier score [4]. Given a set U = {(x_1, t(x_1)), . . . , (x_|U|, t(x_|U|))} of unlabelled documents to be labelled according to codeframe Y, the Brier score is defined as

$$\mathrm{BS} = \frac{1}{|U| \cdot |\mathcal{Y}|} \sum_{x_i \in U} \sum_{y_j \in \mathcal{Y}} \left( I(t(x_i) = y_j) - \Pr(y_j|x_i) \right)^2$$

where I(·) is a function that returns 1 if its argument is true and 0 otherwise. The Brier score ranges between 0 (best) and 1 (worst), i.e., it is a measure of error and not of accuracy. It rewards classifiers that return a high posterior for the true class of x_i and low posteriors for all classes other than the true class of x_i. The Brier score is an example of the so-called strictly proper scoring rules [18], defined as loss functions that are minimized only when Pr(y_j|x_i) equals 1 for y_j = t(x_i).
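As a concrete reference, here is a minimal sketch of the Brier score computation (our own code; it averages over documents and classes as in the formula above, but the exact normalization used in our tables may differ):

```python
import numpy as np

def brier_score(posteriors, true_labels):
    """Brier score of a set of posteriors.

    posteriors  : array (n_docs, n_classes); each row sums to 1
    true_labels : array (n_docs,) of class indices, true_labels[i] = index of t(x_i)
    The squared differences between indicators and posteriors are averaged
    over documents and classes; 0 is the best possible value.
    """
    n_docs, n_classes = posteriors.shape
    indicators = np.zeros_like(posteriors)
    indicators[np.arange(n_docs), true_labels] = 1.0   # I(t(x_i) = y_j)
    return float(np.mean((indicators - posteriors) ** 2))
```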
It is useful to analyze the Brier score in a more fine-grained way. For class y_j, let the [0,1] interval be partitioned into an ordered sequence of b intervals I_1j, . . . , I_bj, and let us define bins B_1j, . . . , B_bj such that x_i ∈ B_kj iff Pr(y_j|x_i) ∈ I_kj. DeGroot and Fienberg [7, §4] show that the Brier score can be decomposed into the sum of two components, a calibration error (CE) and a refinement error (RE), both computed on the bins B_1j, . . . , B_bj: CE aggregates, across the bins, the squared differences between ρ(y_j, B_kj) (the fraction of documents in bin B_kj that actually belong to y_j) and π(B_kj) (the average posterior probability that the classifier assigns to the documents in B_kj), each bin being weighted by ν(B_kj, U), the fraction of documents of U it contains; RE instead aggregates, across the same bins and with the same weights, terms of the form ρ(y_j, B_kj)(1 − ρ(y_j, B_kj)). Here, CE is a measure of the calibration error of the posterior probabilities; in fact, it is easy to see that its value is 0 if and only if Equation (3) is verified for each S ∈ {B_1j, . . . , B_bj}. RE is instead a measure of what DeGroot and Fienberg [7] call the refinement error of the classifier, i.e., of the lack of confidence of its predictions; its value is 0 if and only if all the posteriors it returns have a value of 0 or 1, while it takes its maximum value if and only if the classifier always "sits on the fence," i.e., if all the posteriors it returns have a value equal to the prevalence of y_j in U.⁷ As an example, in a binary setting consider a perfectly balanced unlabelled set U, consider a ("perfect") classifier h that returns Pr(y_j|x_i) = 1 for all x_i whose true class is y_j and Pr(y_j|x_i) = 0 for all x_i whose true class is not y_j, and consider a classifier h′ that returns Pr(y_j|x_i) = .50 for all x_i ∈ U. Classifiers h and h′ are equivalent as far as CE is concerned (they both get a score of 0), but they are not for RE, which is equal to 0 for h and to .50 for h′. Conversely, consider the same set U and the same ("perfect") classifier h of the previous example, and consider a ("perverse") classifier h′′ that returns Pr(y_j|x_i) = 0 for all x_i whose true class is y_j and Pr(y_j|x_i) = 1 for all x_i whose true class is not y_j. Classifiers h and h′′ are equivalent as far as RE is concerned (they both get a score of 0), but they are not for CE, which is equal to 0 for h and to 1 for h′′. Prediction power (which, in this case, manifests itself in the form of good-quality posteriors) thus requires both calibration and refinement. Another way of saying this is that BS measures the classifier's knowledge, which is a combination of the classifier's introspection, or self-awareness (which is measured by CE), and of the classifier's confidence (which is measured by RE).
In this article, we define and use two variants of the Brier score, i.e.,
• the Isometric Brier Score (here shortened as BS_L, where L stands for "length"), which is obtained by partitioning the [0,1] interval into intervals I_1j, . . . , I_bj of equal length; for instance, if b = 10, the intervals are [0, .1), [.1, .2), . . . , [.9, 1];
• the Isomerous Brier Score (here shortened as BS_N, where N stands for "number"), which is obtained by partitioning the [0,1] interval into intervals I_1j, . . . , I_bj such that the corresponding bins B_1j, . . . , B_bj have equal size, i.e., are such that (a) x′ ∈ B_sj and x′′ ∈ B_tj with s < t implies that Pr(y_j|x′) ≤ Pr(y_j|x′′), and (b) |B_sj| = |B_tj| for any s, t ∈ {1, . . . , b}. Note that, when partitioning U this way, ν(B_kj, U) is the same for all 1 ≤ k ≤ b.
The advantage of BS_N over BS_L is that all bins are guaranteed to have a high enough number of elements, which reduces the risk that the difference between ρ(y_j, B_kj) and π(B_kj) is extreme due to sparsity. In this article, we use the BS_L variant for compatibility with previous literature (which mostly uses the BS_L variant; see, e.g., References [3, 37]), and the BS_N variant because, as argued, it seems to have superior formal properties.
In the experiments reported in this article, we use b = 10.
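The following sketch shows how the binned calibration and refinement components can be computed for a single class, under both the isometric and the isomerous binning schemes; it follows the standard DeGroot-Fienberg decomposition, and the exact weighting/normalization used in our tables may differ slightly (function and variable names are ours):

```python
import numpy as np

def ce_re_for_class(posteriors_j, is_member_j, b=10, isomerous=False):
    """Calibration error (CE) and refinement error (RE) for a single class y_j,
    along the lines of the DeGroot-Fienberg decomposition of the Brier score.

    posteriors_j : array (n_docs,) of posteriors Pr(y_j|x_i)
    is_member_j  : boolean array (n_docs,), True iff t(x_i) = y_j
    isomerous    : if False, bins are intervals of equal length (isometric);
                   if True, bins contain an equal number of documents (isomerous)
    """
    n = len(posteriors_j)
    if isomerous:
        order = np.argsort(posteriors_j)
        bins = np.array_split(order, b)          # b (nearly) equally populated bins
    else:
        edges = np.linspace(0, 1, b + 1)
        which = np.clip(np.digitize(posteriors_j, edges[1:-1]), 0, b - 1)
        bins = [np.where(which == k)[0] for k in range(b)]
    ce, re = 0.0, 0.0
    for idx in bins:
        if len(idx) == 0:
            continue
        nu = len(idx) / n                        # fraction of documents in the bin
        rho = is_member_j[idx].mean()            # fraction of bin members in y_j
        pi = posteriors_j[idx].mean()            # average posterior in the bin
        ce += nu * (rho - pi) ** 2
        re += nu * rho * (1 - rho)
    return ce, re
```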

Dataset
As the dataset, in our experiments, we use RCV1-v2, a dataset comprising 804,414 news stories published by Reuters from August 20, 1996, to August 19, 1997. 8 We here consider the "Topic" hierarchy, consisting of 101 classes. For text classification purposes, RCV1-v2 is traditionally split into a training set consisting of the (chronologically) first 23,149 documents (the ones written in August 1996) and a test set consisting of the last 781,265 documents (the ones written from September 1996 onwards). RCV1-v2 is multi-label, i.e., a document may belong to several classes at the same time; since, in this article, we are interested in single-label classification, we select its "single-label fragment," i.e., the subset of RCV1-v2 documents that have exactly 1 label. To do so, (a) we remove all "derived" labels, leaving only "primitive" labels, 9 and (b) we remove from the collection all documents that do not have exactly one "primitive" label.
For reasons that will be clear in Section 3.2.1, in our experiments, we consider only the 37 classes with at least 2,000 (training or test) positive examples; of these, 31 are "leaf" classes while the remaining 6 classes correspond to internal nodes of the hierarchy. 10 We also remove all documents that do not belong to any of these 37 classes, which leaves us with 517,978 documents.

Generating Samples with Controlled Amounts of Distribution Shift.
RCV1-v2 exhibits very little distribution shift between training set and test set. In fact, if we compute the normalized absolute error (as defined in Section 3.1.1) between p_L (the class distribution in the training set) and p_U (the class distribution in the unlabelled documents), for RCV1-v2 we obtain NAE = .0026, which is an extremely low value (since NAE always ranges between 0, indicating no shift, and 1, indicating maximum shift).
We instead want to test the SLD algorithm on a variety of distribution shift values, thus simulating a variety of possible application scenarios.¹¹ To do so, by using the protocol described below, we extract from RCV1-v2 k different samples, each consisting of a training set and a test set sampled from different class distributions; all the results of our experiments will thus be average values across these k samples.

8 Available from http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm.
9 The RCV1-v2 codeframe has a hierarchical structure. As a result, when a document is labelled with class y_j, it is also labelled with all classes that are ancestors of y_j in the RCV1-v2 tree. Whenever a document has two labels y′ and y′′ such that y′ is an ancestor of y′′, we remove this "derived" label y′ from its labels; we are thus left with "primitive" labels (i.e., labels y_j such that the document has no label that is a descendant of y_j).
10 Each of these latter 6 classes has at least 2,000 positive examples "of its own," i.e., such that none of its descendant classes has any of these examples.
11 Testing SLD against different values of distribution shift is of key importance also because one of the claims made by Saerens et al. [34, p. 31] is that their algorithm improves the prior estimates when distribution shift is substantial, while it may actually worsen them in a "zero shift" setting.
We run binary, "one-against-the-rest" classification experiments, i.e., experiments in which, for each class y j , all the examples not belonging to y j are considered negative examples of y j . For these experiments: (1) We generate two random vectors Π L = (π L 1 , π L 2 ) and Π U = (π U 1 , π U 2 ) of class priors, i.e., two vectors such that 0 ≤ π L j , π U j ≤ 1 for each 1 ≤ j ≤ 2 and such that 2 j=1 π L j = 2 j=1 π U j = 1; 12 (2) We generate a training set σ L by drawing m L = |σ L | different documents (with m L a parameter to be fixed beforehand), where at each draw, we pick with probability π L j a random document among those belonging to class y j , and with probability (1 − π L j ) a random document among those not belonging to y j . We then generate a test set σ U by first removing from the pool the documents drawn for σ L , and then by drawing m U = |σ U | different documents (with m U a parameter to be fixed beforehand), where at each draw, we pick with probability π U j a random document among those belonging to class y j , and with probability (1 − π U j ) a random document among those not belonging to y j . We thus obtain a sample σ = (σ L , σ U ) with which we run a train-and-test experiment.
(3) We repeat the two steps above k times for each class y j ∈ Y and average the results across these 37 × k train-and-test experiments.
We also run single-label multiclass classification experiments, using varying numbers of classes. For these experiments:
(1) given a desired number n of classes, we randomly choose n of our 37 RCV1-v2 classes, thus obtaining codeframe Y, with |Y| = n;
(2) we generate two random vectors Π_L = (π^L_1, . . . , π^L_|Y|) and Π_U = (π^U_1, . . . , π^U_|Y|) of class priors, i.e., two vectors such that 0 ≤ π^L_j, π^U_j ≤ 1 for each 1 ≤ j ≤ |Y| and such that the π^L_j values sum to 1 and the π^U_j values sum to 1;
(3) we generate a training set σ_L (respectively, a test set σ_U) by drawing m_L = |σ_L| (respectively, m_U = |σ_U|) different documents (with m_L and m_U two parameters to be fixed beforehand), where at each draw we pick with probability π^L_j (respectively, π^U_j) a document belonging to class y_j. We thus obtain a sample σ = (σ_L, σ_U) with which we run a train-and-test experiment;
(4) we repeat the three steps above k times and average the results across these k train-and-test experiments.
In the experiments we run in this article, we use m_L = m_U = 1,000, and k = 500. The fact that, as previously specified, we only consider classes with at least 2,000 positive examples allows us to use m_L = m_U = 1,000, i.e., there would be enough positive training examples even if, in some of the k draws, π^L_j and π^U_j were both 1 for some y_j.¹³ We run multiclass experiments for all values of |Y| ∈ {5, 10, 20, 37}.
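As an illustration of step (2) of the binary protocol, the following sketch draws a training sample and a test sample with prescribed priors; pos_idx and neg_idx are hypothetical placeholders for the indices of the positive and negative examples of a class y_j in our single-label fragment of RCV1-v2:

```python
import numpy as np

def draw_sample(pos_pool, neg_pool, m, prior, rng):
    """Draw m distinct documents: at each draw, a (not yet drawn) positive example
    is picked with probability `prior`, a negative one otherwise."""
    pos = list(rng.permutation(pos_pool))
    neg = list(rng.permutation(neg_pool))
    drawn = [pos.pop() if rng.random() < prior else neg.pop() for _ in range(m)]
    return drawn, pos, neg               # the sample and the remaining pools

rng = np.random.default_rng(0)
pos_idx, neg_idx = np.arange(2000), np.arange(2000, 16000)    # placeholder pools
pi_L, pi_U = rng.random(), rng.random()            # random training / test priors
sigma_L, pos_left, neg_left = draw_sample(pos_idx, neg_idx, 1000, pi_L, rng)
sigma_U, _, _ = draw_sample(pos_left, neg_left, 1000, pi_U, rng)
```

Requiring at least 2,000 positive examples per class is exactly what guarantees that the two draws above never exhaust the positive pool, even when both priors are close to 1.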
Thanks to the use of randomly generated drawing probabilities, the class distributions of both the training set and the test set of each sample are random, each class distribution is equiprobable, and the value of distribution shift (as measured by NAE) between the training set and the test set of each sample we generate is also random. The set of samples that we generate with this method is, since k is large enough, fairly representative of the entire spectrum of shift values.
Note that this strategy for generating samples characterized by random values of distribution shift is radically different from the one adopted, for instance, in References [10, 16]. In these latter works there is no random component in picking class distributions or distribution shift values, and an equal number of samples is generated for all possible class distributions such that each class prior belongs to a finite set of values (e.g., {.00, .01, . . . , .99, 1.00}). However, those works deal only with the binary case, where the number of all possible such class distributions is small. In the general multiclass case (i.e., when |Y| > 2) this number is much higher, since it grows exponentially with |Y|; therefore, generating even a single sample for each possible class distribution in which every class prior is in {.00, .01, . . . , .99, 1.00} would be prohibitive even for small values of |Y|. The random strategy we adopt in this article thus allows us to avoid this pitfall.

Representing Text
We preprocess text by using stop word removal and no stemming. As the weighting criterion, we use a version of the well-known tfidf method, computed as a function of #(f, x_i), the raw number of occurrences of feature f in document x_i; weights are then normalized by means of cosine normalization.
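A minimal sketch of this preprocessing pipeline with scikit-learn is given below; this is one plausible instantiation (the exact tfidf variant and stop word list used in our experiments may differ), and the names train_texts and test_texts are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# English stop word removal, no stemming, (sublinear) tfidf weighting,
# cosine (L2) normalization of the resulting weight vectors.
vectorizer = TfidfVectorizer(stop_words='english', sublinear_tf=True, norm='l2')

# train_texts, test_texts: lists of raw document strings (placeholders)
# X_train = vectorizer.fit_transform(train_texts)
# X_test = vectorizer.transform(test_texts)
```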

Learners
In our experiments, we use four different learners, i.e., support vector machines (SVMs), logistic regression (LR), multinomial naive Bayes (MNB), and random forests (RFs). For all of them, we rely on the implementations available from the scikit-learn package.¹⁴ For all of them, we use the default parameters of the scikit-learn implementation, since the possible accuracy improvements resulting from a parameter optimization based on k-fold cross-validation would be obtained at the expense of a very large computational cost.¹⁵ This possible accuracy improvement would bring about no evident benefit to our study, since the goal of this work is not squeezing every possible drop of accuracy from our classifiers, but comparing the pre-SLD results with the post-SLD results in the same experimental conditions. The default values are as follows:
• SVMs: we use soft-margin SVMs with linear kernel and L2 regularization with C = 1;
• LR: we use L2 regularization with a regularization coefficient C = 1;
• MNB: we use Laplace smoothing, with α = 1 as the additive factor;
• RFs: we use 100 trees per forest, Gini impurity as the splitting function, no max depth, no pruning.
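For reference, the sketch below instantiates the four learners with these settings in scikit-learn; note that the linear kernel for SVC must be requested explicitly (the library default is an RBF kernel), and that default values may vary slightly across scikit-learn versions:

```python
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

learners = {
    # soft-margin SVM with linear kernel and C = 1 (L2-regularized)
    'SVM': SVC(kernel='linear', C=1.0),
    # L2-regularized logistic regression with C = 1
    'LR': LogisticRegression(penalty='l2', C=1.0),
    # multinomial naive Bayes with Laplace (add-one) smoothing
    'MNB': MultinomialNB(alpha=1.0),
    # 100 trees, Gini impurity, unlimited depth, no pruning
    'RF': RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=None),
}
```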
For each of these learners but SVMs, we use two versions, one with post-calibration of the posteriors that the learner returns (Calib), and the other without calibration (NoCalib). SVMs are an exception because, as is well known, the confidence scores they return are not probabilities, and the only way to have SVMs return probabilities in scikit-learn is to invoke a calibration routine; as a result, the only version of SVMs we experiment with is one with post-calibration. We perform calibration using the method proposed by Platt [31], sometimes known as "Platt scaling."¹⁶ Given a confidence score s(x_i, y_j) produced by a classifier, either in the form of a non-probabilistic score or of a non-calibrated probability, we transform it into a calibrated probability Pr(y_j|x_i) by applying the logistic transformation

$$\Pr(y_j|x_i) = \frac{1}{1 + e^{\alpha \, s(x_i, y_j) + \beta}} \quad (14)$$

where the parameters α and β are determined by fitting a maximum-likelihood model on a set of scores S_Calib = {s(x, y_j) | x ∈ Tr_Calib} produced by the classifier on some training documents Tr_Calib.
If the same training documents that are used to train the classifier are also used for calibration, overfitting may happen. Held-out documents may be used, but this requires additional labelled documents. To avoid overfitting without requiring held-out documents, Platt suggests collecting the set of scores S_Calib by performing cross-validation on the training documents. We have implemented this k-fold cross-validation procedure, performing 10-fold cross-validation on the training documents using the same learning algorithm that is separately used on the entire training set to learn the actual classifier. We obtain the scores S^f_Calib for each validation fold f ∈ {1, . . . , 10} and then optimize the parameters of Equation (14) on the resulting set of scores S_Calib = ∪_f S^f_Calib. We then apply the optimized Equation (14) to the scores of the classifier trained on the entire training set; we refer to this process as the Calib version of the learner.
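The sketch below illustrates this cross-validated calibration procedure for a binary task and for a learner exposing a decision_function (such as a linear SVM). It is a simplified rendition, not the implementation we actually used: it fits the sigmoid of Equation (14) by logistic regression on the cross-validated scores and omits the target-smoothing details of Platt's original algorithm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

def platt_calibrated_posteriors(X_train, y_train, X_test, k=10):
    """Cross-validated Platt scaling, sketched for a binary task (labels 0/1).

    Scores S_Calib are collected by k-fold cross-validation on the training set;
    a sigmoid 1 / (1 + exp(alpha * s + beta)) is then fitted on them and applied
    to the scores of the classifier retrained on the whole training set."""
    learner = SVC(kernel='linear', C=1.0)
    cv_scores = cross_val_predict(learner, X_train, y_train, cv=k,
                                  method='decision_function')
    sigmoid = LogisticRegression()                   # fits alpha and beta
    sigmoid.fit(cv_scores.reshape(-1, 1), y_train)
    learner.fit(X_train, y_train)                    # classifier on the full training set
    test_scores = learner.decision_function(X_test).reshape(-1, 1)
    return sigmoid.predict_proba(test_scores)[:, 1]  # calibrated Pr(y=1 | x)
```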

RESULTS
This section presents the results of our experiments. The code for reproducing them is available at https://github.com/HLT-ISTI/SLD-reassessment. At https://hlt-isti.github.io/SLD-visualization/, we also make available a visualization tool that displays the results for various combinations of (number of classes, sample, learner, class) from our experiments.

Results of Binary Classification Experiments
The upper half of Table 1¹⁷ reports the results of the binary classification experiments. Note that, for error reduction, a positive value indicates an improvement (i.e., that SLD had a beneficial effect), while a negative value indicates a deterioration. The rows of the table each correspond to one of the learners of Section 3.4, grouped into learners with post-calibration of the posteriors that the learner returns (Calib) and ones without such calibration (NoCalib). As indicated in Section 3.2.1, every row of this table is the result of 37×500=18,500 train-and-test runs; given that each of the 7 rows accounts for a different learner, this is a total of 18,500×7=129,500 train-and-test runs.

There are a number of observations that can be derived from the top part of Table 1:
• In terms of the quality of the estimated priors (as measured by NAE), there is a very substantial difference between the performance of non-calibrated learners and that of calibrated learners: for the former, the application of SLD brings about an extremely large deterioration (an average of 72.7% across all tested learners), while it brings about a very good improvement (an average of 43.5% across all tested learners) for the latter.¹⁸ This adds to the fact that calibrated learners have, on average, a much better NAE right from the start (.009, instead of .046 for the non-calibrated ones); this means that calibrating one's learner is a win-win move, given that it brings about much better posterior probabilities and that these posteriors have much larger margins of improvement by means of the application of SLD.¹⁹
• The very large magnitude of these improvements/deteriorations is not mirrored by analogous magnitudes when it comes to the quality of the posteriors. It is still true that deteriorations are observed in the case of non-calibrated classifiers (7.8% on average) and improvements are instead observed for the calibrated classifiers (33.0% on average), but the magnitudes of these variations are smaller.
• SLD seems to have a much more beneficial effect in terms of calibration than in terms of refinement; in fact, improvements in BS, when present, are largely the responsibility of CE, while deteriorations in BS, when present, are largely the responsibility of RE.

16 We have implemented Platt scaling ourselves, since the version available from scikit-learn turns out not to be a faithful implementation of Platt's algorithm; see https://github.com/scikit-learn/scikit-learn/issues/16145 for a discussion. The code of our implementation is available at https://github.com/aesuli/scikit-learn/tree/platt.
17 In this table and in all the other tables in this article, some values of error reduction might not appear to be completely justified; for instance, the transition from a Pre-SLD value of .001 to a Post-SLD value of .000 is indicated as corresponding to an error reduction of +79.3%. Of course, this value derives from using the real Pre-SLD and Post-SLD values, which have much higher precision. We use the standard notation (e.g., .027) rather than the more precise E notation (e.g., 2.7E-3) for higher legibility.
18 This confirms an observation of Saerens et al. [34], according to whom "In order to obtain good a priori probability estimates [by means of SLD], it is necessary that the a posteriori probabilities relative to the training set are reasonably well approximated (i.e., sufficiently well estimated by the model)."
19 Niculescu-Mizil and Caruana [29] state that "For learning methods that make well calibrated predictions such as neural nets, bagged trees, and logistic regression, neither Platt Scaling nor Isotonic Regression yields much improvement in performance even when the calibration set is very large. With these methods calibration is not beneficial, and actually hurts performance when the calibration sets are small." Our large-scale experimentation indicates that, while this might be true in the absence of distribution shift, when distribution shift is present any calibrated learner works better than its non-calibrated counterpart. In fact, note that the Pre-SLD values of NAE, BS, and CE for the calibrated learners are always substantially better than the values of the corresponding non-calibrated learners, and this also includes logistic regression, a learner that is known to return well-calibrated probabilities.
• While there are (even substantial) quantitative differences among learners belonging to the same category (non-calibrated or calibrated), there are very few qualitative differences, i.e., when a learner exhibits a deterioration in one of the measures, all other learners (with few exceptions) also exhibit a deterioration for the same measure. This seems to indicate that the results derive from inherent properties of the SLD algorithm, rather than from peculiarities of the individual learning algorithms.
The lower half of Table 1 presents the analogous results for the isomerous variants of BS, CE, RE.
(The results for NAE are the same as in the upper half, since the distinction isometric/isomerous does not apply to NAE.) The observations that can be made by looking at the lower half of the table are essentially the same as those derived from the upper half, since the results are qualitatively similar. There is one important difference, though, i.e., the fact that, when measured by means of the isomerous variant, RE is always 0 or very close to 0, which is far from being the case when using isometric RE. That this should be so is an obvious consequence of the definition of RE. In fact, since all bins are equally populated, it is clear from that definition that RE only depends, for all bins B_kj (1 ≤ k ≤ b) and for all classes y_j ∈ Y, on the fraction ρ(y_j, B_kj) of documents in the bin that belong to the class. However, for all bins, that fraction is the same in the Pre-SLD and Post-SLD distributions, because, as observed in Section 2, SLD is just a rescaling algorithm, which multiplies all the posteriors for a given class by the same constant but does not change the composition of the bins. That RE is not 0 when using the isometric variant is thus due to the fact that rescaling changes the composition of the bins; for instance, a document that was in the bin corresponding to the [.9, 1.0] interval before the application of SLD might, after SLD has been applied, be in the [.8, .9) bin if SLD has multiplied all the posteriors for that class by a factor smaller than 1. In this case, rescaling not only changes the composition of the bins, but also changes the number of documents they contain, thus potentially generating very sparse bins. Interestingly, the fact that SLD could not reduce RE is reminiscent of an observation by DeGroot and Fienberg [7]: "We then study the question of when an observer can use a forecaster's predictions to obtain a better score than the forecaster himself, and show that such an improvement can be achieved by the observer essentially if and only if the forecaster is not well calibrated."
Here, the forecaster is the classifier and the observer is SLD, which tries to obtain a better score (in terms of Brier score) than the classifier by "piggybacking" on the classifier's predictions. In the light of DeGroot and Fienberg's [7] result, what SLD can at most hope for is to improve on the classifier's calibration error, but not on its refinement error. This shows that SLD is, in essence, a re-calibration algorithm, i.e., an algorithm for re-calibrating the posterior probabilities of documents belonging to an unlabelled set U, where these posteriors have been returned by a classifier already calibrated on a training set L, and where the re-calibration is made necessary by the fact that a prior probability shift between L and U has occurred.

Results of Multiclass Classification Experiments
Tables 2 to 5 report the results of our experiments on multiclass classification. As indicated in Section 3.2.1, we run multiclass experiments with varying numbers of classes, starting from |Y| = 5 classes (Table 2) and moving up to |Y| = 10 (Table 3), |Y| = 20 (Table 4), and |Y| = 37 (Table 5), which is the total number of classes in our dataset. For |Y| ∈ {5, 10, 20}, the classes are randomly sampled from the entire set of 37 RCV1-v2 classes. As indicated in Section 3.2.1, every row of these four tables is the result of 500 train-and-test runs; given that each of the 7 rows accounts for a different learner, this is a total of 4×500×7=14,000 train-and-test runs.
There are several observations we can make by looking at these tables:
• The main fact that emerges is that all quality indicators of SLD (i.e., the values of error reduction, for each of the four error measures we consider) drastically deteriorate when |Y| grows, for all learners, calibrated or not. Table 5, which reports results for |Y| = 37, indicates disastrous performance on the part of SLD on all counts.
• Concerning SLD's impact on the priors, while the binary experiments had indicated a very positive impact (at least for the calibrated learners), the multiclass experiments indicate a negative impact for |Y| = 5 (12.7% average deterioration across all calibrated learners) and an even more negative impact for |Y| ∈ {10, 20, 37}, with the average deterioration across all calibrated learners reaching up to 251.0% for |Y| = 37.
• Concerning SLD's impact on the posteriors, while the binary experiments had indicated a very positive impact (at least for the calibrated learners), the multiclass experiments indicate that this impact is still mildly positive for |Y| = 5 but becomes negative for |Y| = 10 and deteriorates even more for |Y| ∈ {20, 37}. For instance, BS in the isomerous variant has, thanks to SLD, an average improvement across the calibrated learners of 28.0% for |Y| = 2 and 6.4% for |Y| = 5, but this improvement becomes a deterioration for |Y| = 10 (28.2%) and for |Y| ∈ {20, 37} (e.g., 72.2% for |Y| = 37). This trend is even more marked for CE, which shows an improvement for |Y| = 2 (80.3%, average across the calibrated learners) and |Y| = 5 (19.7%) but a deterioration for higher values of |Y|, with the amount of deterioration reaching up to 937.2% for |Y| = 37.

Analyzing the Results by Amount of Shift
In this section, we analyze the relations between error and distribution shift. Our goal is that of highlighting, in the results of the experiments discussed in Sections 4.1 and 4.2, any noteworthy correlation between error reduction, for any of our four measures, and distribution shift.
In our analysis of the results, we have not been able to detect any significant correlation between NAE and distribution shift or between RE and distribution shift. As a result, from here onwards, we only concentrate on discussing BS and CE. Figure 2 plots the values of relative error reduction for the BS and CE measures (we here use the isomerous variants; the isometric variants return similar results) for the same experiments as discussed in Sections 4.1 and 4.2, for each of our four calibrated classifiers, 20 but with the samples binned into four quartiles according to how much distribution shift between the training set and test set the sample exhibits. The first quartile contains the samples characterized by the lowest amounts of distribution shift, and the fourth contains the samples characterized by the highest such amounts. The actual values of distribution shift (expressed in terms of NAE) that characterize the samples in each quartile are reported in Table 6. Each result reported in the plots is the average across all samples that belong to the bin.
A clear pattern emerges from the analysis of BS and CE values: for both measures, for both the binary and the multiclass cases, and for all numbers of classes considered in the multiclass experiments (|Y| ∈ {5, 10, 20, 37}), performance tends to improve monotonically with the amount of distribution shift. (Exceptions do exist for individual classifiers, but the average values across the four classifiers exhibit strict monotonicity.) This happens both for the cases (the binary case and the multiclass case with |Y| = 5) in which SLD has a positive impact (i.e., error diminishes as a result of its application), and for the cases (the multiclass cases with |Y| > 5) in which the impact of SLD is negative (i.e., error increases as a result of its application); in the former cases the magnitude of error reduction increases with the increase in shift, while in the latter cases the magnitude of error amplification diminishes with the increase in shift. The case |Y| = 5 seems to be the threshold here, with SLD yielding an increase in error for the two quartiles representing low shift and a decrease in error for the two quartiles representing high shift. Together with the analyses presented in Sections 4.1 and 4.2, this observation suggests that SLD should be used to improve the quality of the prior probability estimates and of the posterior probabilities only (a) when the classifier has been calibrated, (b) when the number of classes in the codeframe Y is low (say, |Y| ≤ 5), and (c) when the amount of distribution shift is high enough.²¹

Analyzing the Distributions Produced by SLD
In the previous sections, we have evaluated, among other things, the impact of SLD on the difference between the predicted class distribution and the true class distribution by using the NAE measure. In this section, we instead look at the impact of SLD on two intrinsic characteristics of class distributions, i.e., their entropy and their shape. To do so, we compare, for a given train-andtest run, the four class distributions involved: (a) the true class distribution of L, (b) the true class distribution of U , (c) the class distribution of U predicted by the classifier (which SLD receives as input), and (d) the class distribution of U returned by SLD. This allows us to better understand the impact of SLD and the reasons behind some of the patterns highlighted in Sections 4.1 through 4.3.

Average Entropy of Class Distributions.
For each of the 129,500+14,000=143,500 train-and-test runs we have discussed in Sections 4.1 and 4.2, we measure the entropy of the four class distributions (a) to (d) mentioned in the previous paragraph. In each case, we set the base of the logarithm to the number |Y| of classes of the distribution being observed, so entropy values always range between 0 and 1. A low entropy value means that most of the documents in the sample belong to one or few classes, while a high entropy value means that the documents are spread fairly evenly across the entire set of classes. Table 7 shows the average value of the entropy of the four class distributions (a) to (d) (which in this and in the following tables will be indicated as L, U, Pre-SLD, and Post-SLD, respectively) across all the 143,500 train-and-test runs.
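Concretely, the normalized entropy we refer to can be sketched as follows (our own code; the argument is a class distribution represented as an array of prevalence values summing to 1):

```python
import numpy as np

def normalized_entropy(priors):
    """Entropy of a class distribution, computed with the logarithm base set to
    the number of classes, so that the result always lies in [0, 1]."""
    p = np.asarray(priors, dtype=float)
    p = p[p > 0]                               # 0 * log(0) is taken to be 0
    return float(-(p * np.log(p)).sum() / np.log(len(priors)))
```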
From this table, we can observe that the values for L and U are the same. This is intuitive, since, even though the training set and the test set of a given sample have in general two different class distributions, the sampling method for generating training sets and test sets is the same and the pool of documents from which to sample is the same, so training sets and test sets will exhibit, on average, the same class distribution. In the following, we will not report the entropy values for L, since those for U are always practically identical.
The average entropy value of Pre-SLD class distributions is slightly higher than those for L and U, but not substantially so. However, what immediately jumps to the eye is that the average entropy value for Post-SLD is much lower than the values for L, U, and Pre-SLD. Similar distributions exhibit similar entropy values, so a difference in entropy values is a clear indicator of a dissimilarity between the two observed distributions. The sharp difference between the Pre-SLD and Post-SLD average entropy values thus unequivocally indicates that SLD substantially alters the Pre-SLD class distribution, and the sharp difference between the U and Post-SLD values indicates that this alteration is detrimental. Since a high entropy value indicates a highly uniform distribution, the above results indicate that SLD has a tendency to sharply diminish this uniformity and label most of the documents with one or few classes.

Table 8 reports again entropy values of class distributions, but averaged across all runs characterized by the same number of classes in the codeframe. An analysis of this table shows that the values for U tend to increase as the number |Y| of classes in the codeframe increases. This is due to the sampling method, which generates prior probabilities with mean equal to 1/|Y| and variance equal to 1/|Y|^2 (as will also be evident from Figures 4 to 7). The values for Pre-SLD follow the same trend as the values for U, but the values for Post-SLD have an almost opposite trend: as the number |Y| of classes in the codeframe increases, SLD decreases the uniformity of class distributions.

Table 9 reports again entropy values of class distributions, but averaged across all runs that use the same learning algorithm.
The application of SLD following the use of the No-Calib learners brings about an even stronger divergence between the U values and the Post-SLD values than the corresponding Calib versions, thus confirming that non-calibrated classifiers are not fit for use with SLD. The application of SLD drastically reduces the average entropy for all learners, thus indicating that the decrease in uniformity of the distributions is less related to the chosen learning algorithm than to the number |Y | of classes in the codeframe (see also Section 4.4.2). Table 10 shows the average entropy values of the class distributions for all the possible combinations of number of classes and learners.
From this table, we can identify the very few cases in which the U class distributions have an average entropy value closer to the Post-SLD value than to the Pre-SLD value: this happens only for |Y | = 2 with calibrated SVMs, MNB, and RF.

Histogram-based Representations of Class Distributions.
In this section, we display and comment on histograms that indicate how class prevalence values are distributed in the U, Pre-SLD, and Post-SLD class distributions resulting from the use of specific learners and with specific numbers of classes. Figure 3 does this for |Y| = 2, i.e., the binary classification case. As an example, the histogram in its bottom left subfigure ("Pre-SLD - Random Forests") shows that, if we pool together all the 37×500=18,500 train-and-test runs where RF was used as the learner, the results returned by the classifier (i.e., Pre-SLD) are such that a high number of classes have a prevalence of about 50% (i.e., Pr(y) = .5), a slightly lower number of classes have a prevalence of 40%, . . . , and a very small number of classes have a prevalence of 0%. Every subfigure of Figure 3 is, of course, bilaterally symmetric, since we are in the binary case, in which Pr(y) = α entails that the prevalence of the other class is (1 − α). The top row of the figure (orange color) refers to the U class distributions (the left and right histograms are the same); the other histograms in the left column refer to Pre-SLD class distributions, one for each of the seven learning algorithms, while the other histograms in the right column refer to Post-SLD class distributions for the same algorithms. Figures 4 to 7 do the same for |Y| ∈ {5, 10, 20, 37}; these histograms are, of course, not bilaterally symmetric.²²

Figure 3 shows that all the methods produce class prevalence values that are distributed more uniformly than the true ones, i.e., many Pre-SLD or Post-SLD distributions generate many class prevalence values with very low or very high values. What is more important, though, is that for each learning method the difference between the U histogram and the Post-SLD histogram is larger than the difference between the U histogram and the Pre-SLD histogram; in other words, this confirms that SLD alters the Pre-SLD class distribution and that this alteration is detrimental. However, what we learn from these histograms, and what we had not learned from the entropy study of the previous section, is how SLD alters this distribution: it does so by generating fewer class priors with mid values, i.e., close to 50%, and more class priors with extreme values, i.e., close to 0 or to 1 (to see this better, note that the Y axes of the left subfigures and the right subfigures are often not on the same scale).
It is evident from Figure 3 that SLD's impact in altering the distribution is substantial for each of the four calibrated learners (5th to 8th rows), and even more so for the non-calibrated ones (2nd to 4th rows). When SLD is run on the posteriors generated by these latter learners, all class priors except 0 and 1 become much less frequent, and class priors equal to 0 and 1 increase dramatically with respect to the Pre-SLD case.
In Figures 4 to 7, which represent the multiclass case with |Y| ∈ {5, 10, 20, 37}, these trends are increasingly evident, and the deterioration introduced by SLD reaches disastrous levels for |Y| = 37. The Post-SLD average class distribution becomes increasingly skewed as |Y| grows, and this concerns both calibrated and non-calibrated learners (although the latter are impacted to a much higher degree). While in the |Y| = 2 case class priors equal to 0 and class priors equal to 1 were both prevalent, in the multiclass cases class priors equal to 1 are practically absent and, as |Y| grows, the histogram becomes increasingly skewed and class priors equal to 0 become the overwhelming majority. Overall, what all these histograms show aligns very well with our experimental findings of Sections 4.1 and 4.2, i.e., with the fact that SLD works better with calibrated than with non-calibrated classifiers, and with the fact that it works better for small values of |Y| and (much) worse for high values of |Y|. They also show something more, i.e., that the reason for this bad behavior is that SLD has, especially when |Y| is high and/or when classifiers are non-calibrated, a tendency to return many class priors equal to 0 and few class priors different from 0.

On the Speed of Convergence of SLD
As indicated in Section 2, in our experiments we stop SLD when we have reached either convergence (which we take to mean that AE(p_U^(s−1), p_U^(s)) < 10^−6, where p_U^(s) denotes the vector of class prior estimates at iteration s) or the maximum number of iterations (which we set to 1,000). Table 11 reports, for each learner and for each number |Y| of classes,
• the average number of iterations (column "#") that SLD required to reach convergence, where the average is computed across all the train-and-test runs we have performed (the value 1,000 is used for the runs in which convergence was not reached);
• the percentage of cases (column "%") in which convergence was not reached, and processing had to be stopped after 1,000 iterations.
There are three conclusions that can be reached from this table:
(1) For a given number of classes, convergence tends to be quicker when the Pre-SLD posteriors have been obtained by calibrated learners; this is always true for LR and RF, although it is always false for MNB. The difference between the two versions (non-calibrated and calibrated) of LR is somewhat surprising, since LR is often presented as an algorithm that naturally returns calibrated probabilities, i.e., a classifier that does not need post-calibration; our results throughout this article instead show that post-calibration is beneficial for LR, too.
(2) For a given learner, the number of iterations required to reach convergence grows monotonically with the number |Y| of classes considered.
(3) For a given learner, the percentage of cases in which convergence is not reached grows monotonically with the number |Y| of classes considered.
These findings constitute yet another argument in favor of calibrated learners and yet another reason why the use of SLD should be contemplated only when the number |Y | of classes is small.
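To make the iteration and the stopping rule concrete, here is a minimal sketch of the SLD adjustment together with the convergence check described above. It assumes that the posteriors are available as a NumPy array, takes AE to be the mean absolute difference between consecutive prior estimates (an illustrative reading of ours), and uses names of our own choosing rather than those of Algorithm 1.

```python
import numpy as np

def sld(posteriors, train_priors, epsilon=1e-6, max_iter=1000):
    # Sketch of the SLD (EM-based) adjustment of Saerens et al. [34].
    # posteriors:   (|U|, |Y|) array of posteriors Pr(y|x) on the unlabelled set U
    # train_priors: (|Y|,) array of (non-zero) class priors estimated on L
    # Returns the adjusted posteriors, the prior estimates for U, and the number
    # of iterations performed (max_iter means convergence was not reached).
    priors = train_priors.copy()
    adjusted = posteriors
    for s in range(1, max_iter + 1):
        # E-step: rescale each posterior by the ratio between the current prior
        # estimate and the training prior, then renormalize per document.
        adjusted = posteriors * (priors / train_priors)
        adjusted /= adjusted.sum(axis=1, keepdims=True)
        # M-step: re-estimate the priors of U as the average adjusted posterior.
        new_priors = adjusted.mean(axis=0)
        # Stopping rule: AE between consecutive prior estimates below epsilon.
        if np.abs(new_priors - priors).mean() < epsilon:
            return adjusted, new_priors, s
        priors = new_priors
    return adjusted, priors, max_iter
```

Counting, across many runs, the cases in which the returned iteration count equals max_iter is how figures analogous to the "%" column of Table 11 could be compiled.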

WHAT KIND OF DISTRIBUTION SHIFT DO WE SIMULATE IN OUR EXPERIMENTS?
In Section 3.2.1 we have described the sampling strategy by means of which we simulate distribution shift in our experiments; we now ask what type of shift this strategy actually generates. To distinguish different types of dataset shift (of which distribution shift is a type), Moreno-Torres et al. [24] distinguish (along with Fawcett and Flach [13]) between "X → Y problems" and "Y → X problems." Problems of type X → Y are ones in which it is the values of the features in x that stochastically determine the class y = t(x) to which x belongs. An example of an X → Y learning problem is weather forecasting, since it is a number of climatic conditions (for instance, pressure, temperature, humidity, and so on, that can be represented in a feature vector x) that determine whether it is going to snow or not (a fact that can be represented by a binary dependent variable y). In these cases, it is useful to write the joint distribution p(x, y) as
p(x, y) = p(y|x) p(x)    (16)
Equation (16) suggests that there are two phenomena (or, of course, a combination of both) that can cause p(y) to vary across L and U, i.e.,
(1) Covariate shift, defined as the case in which p_L(y|x) = p_U(y|x) and p_L(x) ≠ p_U(x);
(2) Concept shift, defined as the case in which p_L(y|x) ≠ p_U(y|x) and p_L(x) = p_U(x).
For instance, in the example above, if the distribution of climatic conditions changes, the probability that it is going to snow changes, too; this is a case of covariate shift. If, instead, the causal relationship between climatic conditions and snowing were to change (an admittedly unlikely case), this would be a case of concept shift.
Problems of type Y → X are instead ones in which the class y = t(x) to which document x belongs stochastically determines the values of the features in vector x. An example of a Y → X learning problem is authorship attribution, i.e., the task of determining the author (from a set of |Y| candidate authors) of a text of unknown or disputed paternity [20], a task that is usually carried out by using as features a number of "stylistic" traits that tend to characterize an author's writing style. Authorship attribution is a Y → X problem, since it is the fact that a certain text is, say, Shakespeare's, that causes it to have certain stylistic characteristics, and not the other way around. In these cases, the joint distribution p(x, y) can be usefully written as
p(x, y) = p(x|y) p(y)    (17)
Here, p(y) can vary for independent reasons (since y is a cause and not an effect), a phenomenon that is usually called prior probability shift. For instance, in Stratford-upon-Avon's municipal library there might proportionally be more books by Shakespeare than in any other municipal library. (Note that p(x|y) may vary, too, but this is not our concern, since it would not cause distribution shift anyway.) 23 So, what kind of distribution shift are we simulating with the sampling strategy of Section 3.2.1, exactly?
If our dataset is from an X → Y problem, we are certainly simulating covariate shift but not concept shift; in fact, we are selectively removing documents (which means that p(x) changes), but we are not altering the causal relationship between X and Y (which means that p(y|x) does not change), since the documents that are not removed retain their class labels. Conversely, if our dataset is from a Y → X problem, we are simulating prior probability shift, because by selectively removing documents we are making p(y) change.
So, what we are simulating with the sampling strategy of Section 3.2.1 is covariate shift and/or prior probability shift, but not concept shift.
There are two reasons for this:
• While a strategy that also simulates concept shift might have been better, since it would have allowed us to test the SLD algorithm in a broader set of challenging situations, it is not clear how concept shift should be simulated, since this would involve changing the class labels of documents that are included in a sample, and it is unclear whether there are sensible policies for doing so.
• SLD was conceived for handling not concept shift but distribution shift; it would thus probably make no sense to simulate situations for which SLD is intentionally unequipped. 24
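As a concrete companion to this discussion, here is a minimal, purely illustrative sketch of the kind of sub-sampling involved, assuming the labelled pool is given as NumPy arrays X (documents) and y (labels); the function name and signature are ours and do not reproduce the exact protocol of Section 3.2.1.

```python
import numpy as np

def sample_with_target_priors(X, y, target_priors, size, seed=None):
    # Illustrative sketch: draw a sample of (approximately) `size` documents
    # whose class priors match `target_priors` (a dict mapping class label ->
    # desired prevalence), by selectively sub-sampling the documents of each class.
    rng = np.random.default_rng(seed)
    chosen = []
    for cls, prevalence in target_priors.items():
        idx = np.flatnonzero(y == cls)                  # documents of class cls
        n = min(int(round(prevalence * size)), len(idx))
        chosen.extend(rng.choice(idx, size=n, replace=False))
    chosen = np.asarray(chosen)
    return X[chosen], y[chosen]
```

Because documents are only removed, never relabelled, p(y|x) is left untouched: for a Y → X problem this kind of protocol induces prior probability shift, for an X → Y problem covariate shift, and in neither case concept shift.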

RELATED WORK
Despite having been proposed more than 15 years ago, SLD remains an algorithm unique of its kind, since it simultaneously updates both the posterior probabilities and the class prior probability estimates returned by the classifier. As discussed in the previous sections (and, especially, in Appendix A), SLD bears strong relations to probability calibration. While several calibration methods have been proposed in the past 20 years (e.g., References [1, 3, 6, 27]), none of them actually deals with calibrating the posterior probabilities of the unlabelled set in the presence of distribution shift.
23 Note also that it is not always easy to characterize with certainty a given problem as being of type X → Y or of type Y → X; sometimes this question looks a bit akin to asking which came first, the chicken or the egg. As a result, different types of dataset shift (covariate shift, concept shift, prior probability shift) that concur in causing distribution shift may be at play at the same time.
24 Saerens et al. [34] explicitly assume "that the generation of the observations within the classes, and thus the within-class densities p(x|y), does not change from the training set to the new data set (only the relative proportion of measurements observed from each class has changed). This is a natural requirement; it supposes that we [...]"
As already mentioned throughout the article, dataset shift (the words "shift" and "drift" sometimes being used interchangeably) is a multifaceted phenomenon and a largely unexplored territory, and only in the past 10 years or so has the machine learning community started to address it systematically [33]. The task of estimating class prior probabilities in the presence of distribution shift has, since about 2005, evolved into a task of its own, called quantification [19], and many algorithms alternative to SLD have been proposed (see References [11, 14, 32, 36] for a few recent examples). However, while these algorithms are interesting alternatives to SLD as far as estimating class prior probabilities goes, there is currently no practical alternative to SLD when it comes to adjusting the posterior probabilities in the presence of distribution shift. To the best of our knowledge, the only such alternative that has ever been proposed is the algorithm of Reference [38], which relies on binning the unlabelled documents according to an invariance property of ROC curves. However, this algorithm assumes that the true class priors of the unlabelled set are known, an assumption that does not hold in practice (since, in the presence of distribution shift, these class priors differ from those of the training set); this means that the algorithm cannot be used in practice. 25

CONCLUSIONS
We present a thorough reassessment of SLD, a well-known algorithm that, given a machine learned single-label classifier and a set of unlabelled documents characterized by distribution shift with respect to the labelled documents the classifier has been trained on, adjusts the posterior probabilities and the class prior probability estimates returned by the classifier, in an iterative, mutually recursive way, with the goal of making both more accurate. Since its publication more than 15 years ago, SLD has become the standard algorithm for improving the quality of the posterior probabilities, and it is still considered a contender when it comes to estimating the class prior probabilities of unlabelled sets. However, its real effectiveness at improving the quality of the posterior probabilities has been questioned. Studying SLD is thus not a mere academic exercise, and is still important, since no other algorithm is known for adjusting the posterior probabilities returned by a classifier in the presence of distribution shift, and since the quality of the posterior probabilities is of key importance for a number of document management tasks, including document ranking and cost-sensitive text classification.
We here present the results of a large-scale experimentation that uses multiple learners and a very large, publicly available dataset for text classification, on which multiple amounts of distribution shift (i.e., difference in the distribution of the prior probabilities between the training documents and the unlabelled documents) have been simulated. In total, the experimentation consists of 129,500 train-and-test runs for the binary case and 14,000 such runs for the multiclass case. In these experiments, we are especially interested in SLD's ability to improve the quality of the posterior probabilities, something that Saerens et al. [34] evaluated only indirectly, i.e., in terms of the accuracy of the (cost-insensitive) classification that results from using the posteriors SLD generates.
Our study allows three main conclusions. The first conclusion is that SLD is ineffective, and often detrimental, when the classifier has not been previously calibrated; in this latter case, an additional disadvantage is that the speed of convergence of SLD is slower, and the probability that the computation does not even converge is higher. The second conclusion is that, in any situation, the improvements that SLD brings about are higher (or the deterioration it brings about is lower) when distribution shift is higher. The third conclusion is that the improvements that SLD brings about are higher (or the deterioration it brings about is lower) when the number of classes in the codeframe is small; binary classification is thus the most apt context for the use of SLD, which should instead be used with prudence in multiclass classification with small numbers of classes and completely avoided in multiclass classification with high numbers of classes. An additional disadvantage of working with a high number of classes is that, as for non-calibrated classifiers, the speed of convergence of SLD is much slower, and the probability that the computation does not even converge is much higher. 26
Our results also show that, concerning the improvements in the quality of the posteriors that have been found in the binary case (and, to a lesser extent, in the multiclass case when the codeframe is small), these are due to a reduction of the calibration error and not to a reduction of the refinement error. This shows that SLD is, in essence, a re-calibration algorithm, i.e., an algorithm for re-calibrating the posterior probabilities of documents belonging to an unlabelled set U, where these posteriors have been returned by a classifier already calibrated on a training set L and where the re-calibration is made necessary by the presence of prior probability shift. For this kind of use, and when the number of classes |Y| is small and the classifiers have been calibrated beforehand, the use of SLD is still recommended.
26 Note that our results do not contradict the original results of Saerens et al. [34], since these authors, while presenting SLD as a general-purpose multiclass algorithm, only run (900) binary classification experiments.

APPENDIX A SLD'S GOAL IS TO ENFORCE THE MUTUAL CONSISTENCY OF THE POSTERIORS AND THE PRIORS OF U
We here show that SLD may be viewed as an attempt to enforce a necessary condition for the posteriors Pr(y_j|x_i) of the documents x_i ∈ U to be calibrated. To show this, let us define
• U_a to be the set of documents x_i ∈ U such that Pr(y_j|x_i) = a;
• U_j to be the set of documents x_i ∈ U such that x_i ∈ y_j;
• U_a^j to be the set of documents x_i ∈ U_a ∩ U_j.
Recall from Section 1 that the posteriors Pr(y_j|x_i), with x_i ∈ U, are perfectly calibrated when, for all a ∈ [0, 1], it holds that |U_a^j| / |U_a| = a. If so, then it holds that
|U_a^j| = a · |U_a| = Σ_{x_i ∈ U_a} Pr(y_j|x_i)    (18)
since Pr(y_j|x_i) = a for all the documents x_i ∈ U_a.
Since U is finite, there is a finite set A of values that the posteriors of the documents in U take. From Equation (18) it thus follows that
Σ_{a ∈ A} |U_a^j| = Σ_{a ∈ A} Σ_{x_i ∈ U_a} Pr(y_j|x_i)    (19)
which can be rewritten as
|U_j| = Σ_{x_i ∈ U} Pr(y_j|x_i)    (20)
By multiplying both sides by 1/|U|, we obtain
|U_j| / |U| = (1/|U|) Σ_{x_i ∈ U} Pr(y_j|x_i)    (21)
which is exactly the condition on the "mutual consistency" of the priors and the posteriors of U that SLD tries to enforce (see Equation (5) and Step 11 of Algorithm 1) and that holds after SLD has converged. In sum, for the posteriors Pr(y_j|x_i) of the documents x_i ∈ U to be calibrated, Equation (21) must hold. While SLD is not a full-fledged attempt to calibrate the posteriors in U (which would be impossible, since we do not know the label of any document in U), it may nevertheless be seen as a step in that direction.
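As a small illustration of Equation (21), the following sketch (with names of our own choosing) computes, for each class, the gap between a vector of prior estimates and the average of the corresponding posteriors over U; after SLD has converged this gap is numerically zero, whereas for arbitrary posteriors it need not be.

```python
import numpy as np

def mutual_consistency_gap(posteriors, priors):
    # Per-class absolute difference between the prior estimates (the left-hand
    # side of Equation (21)) and the average posteriors (1/|U|) * sum_i Pr(y_j|x_i)
    # (its right-hand side); `posteriors` has shape (|U|, |Y|), `priors` shape (|Y|,).
    return np.abs(priors - posteriors.mean(axis=0))
```

For example, feeding this function the priors and posteriors returned by the sld() sketch given earlier would yield a vector of (near-)zero gaps, in accordance with the condition that SLD enforces at convergence.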