A distribution-free soft-clustering method for preference rankings

Typically, ranking data consist of a set of individuals, or judges, who have ordered a set of items, or objects, according to their overall preference or some pre-specified criterion. When each judge has expressed his or her preferences according to his or her own best judgment, such data are characterized by systematic individual differences. In the literature, several approaches have been proposed to decompose heterogeneous populations of judges into a defined number of homogeneous groups. Often, these approaches assume that the ranking process is governed by some distance-based probability model. We use the flexible class of methods proposed by Ben-Israel and Iyigun, which consists of a probabilistic distance clustering approach, and define the disparity between a ranking and the center of a cluster as the Kemeny distance. This class of methods allows for probabilistic allocation of cases to classes, thus being a form of soft, or fuzzy, clustering. The allocation probability is unequivocally related to the chosen distance measure.


Introduction
Preference rankings are data expressing the preferences of several individuals over a set of available alternatives. Statistical methods and models for the analysis of preference rankings can be distinguished into methods based on badness-of-fit functions, which aim to describe the structure of rank data, and methods based on probabilistic models, which aim to model either the ranking process or the population of judges (Marden 1995). Examples of the former category include the least squares methods proposed by Van Buuren and Heiser (1989), De Soete and Heiser (1993), and Vera et al. (2009, 2013). Within the latter category, either homogeneity or heterogeneity among the judges can be assumed, and in the latter case, the goal is generally the identification of homogeneous sub-populations.
When, in addition to rank data, some information about the judges is also known, i.e., a set of covariates is available, a variety of models and methods have been introduced, such as generalized linear-like models for rank data (Dittrich et al. 1998, 2000; Böckenholt 2001; Gormley and Murphy 2008a) and recursive partitioning methods (D'Ambrosio 2008; Lee and Yu 2010; Strobl et al. 2011; D'Ambrosio and Heiser 2016; Plaia and Sciandra 2017).
Most clustering methods for rank data assume mixtures of Bradley-Terry-Luce models or mixtures of distance-based models (Croon 1989; Murphy and Martin 2003; Gormley and Murphy 2008b).
Other approaches work under (Bayesian) Plackett-Luce models (Mollica and Tardella 2017) or under some recently introduced models for rank data, such as the Insertion Sorting Rank model (Jacques and Biernacki 2014). In all these cases, either the choice of the distance measure or the assumption of the right model plays a key role in obtaining the solution.
In this paper, we propose a clustering approach for rank data that follows the probabilistic distance clustering class of methods defined by Ben-Israel and Iyigun (2008), which offers a flexible way to find homogeneous sub-groups without any assumption on the model that generates the preferences in the population. Following this approach, according to which the probability that each judge belongs to a given cluster is unequivocally related to the distance between his or her ranking and the cluster center, we use the Kemeny distance (Kemeny and Snell 1962). This choice is motivated by the fact that the Kemeny distance is defined in the space of both full and tied rankings and is the unique distance measure defined on the extended permutation polytope, which is generally accepted as the geometrical space of preference rankings (Heiser and D'Ambrosio 2013; D'Ambrosio et al. 2017).
The remainder of this paper is organized as follows: Sect. 2 briefly introduces rank data. In Sect. 3, our proposal is described in detail. Section 4 is devoted to both a simulation study and an analysis of a real data set, and Sect. 5 concludes the paper with some discussion.

Ranking data: an overview
There are two ways to represent ranking data. An ordering O is a list of n items ordered by an individual according to his or her preference, from the most preferred to the least preferred. A rank vector (or ranking) R of dimension n is, in the case of full rankings, a permutation of the first n positive integers, where the ith entry indicates the position (the rank) of the ith item in the ordering. When more than one item is assigned the same integer, the resulting ranking is called a tied ranking, or sometimes a bucket order.
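As an illustration, the two representations can be converted into each other with a few lines of code. This is a minimal sketch; the 0-based item labels and the function names are our own conventions, not taken from the paper.

```python
import numpy as np

def ordering_to_ranking(ordering):
    """Convert an ordering (items listed from most to least preferred,
    0-based item labels) into a rank vector: entry i holds the rank of item i."""
    ordering = np.asarray(ordering)
    ranking = np.empty(len(ordering), dtype=int)
    ranking[ordering] = np.arange(1, len(ordering) + 1)
    return ranking

def ranking_to_ordering(ranking):
    """Inverse conversion: sort the item labels by their rank."""
    return np.argsort(np.asarray(ranking))

# Example: items (0, 1, 2) with item 2 most preferred and item 1 least preferred.
print(ordering_to_ranking([2, 0, 1]))  # [2 3 1]
```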
Behaviormetrika (2019) 46:333-351

It is generally accepted that the geometrical space of preference rankings is the permutation polytope (Thompson 1993; Marden 1995; Heiser and D'Ambrosio 2013), which is the convex hull of all possible rank vectors R_s, regarded as points in the n-dimensional real space, with s = 1, … , S, where S denotes the number of rankings in the universe of all rankings of n items.
Accepting this particular geometry has consequences for the choice of the metric to be used in a clustering framework. Among the distances that have been defined for preference rankings, the only measures that are completely defined on the permutation polytope and, at the same time, preserve the geodesic nature of this geometrical space are the Kendall distance (Marden 1995) and the Kemeny distance (Kemeny and Snell 1962). The difference between the two distances lies in the way they deal with tied rankings: while the Kemeny distance maintains all the properties of a proper metric, the Kendall distance fails the triangle inequality (Emond and Mason 2002) in this case. This difference is of no secondary importance, because nowadays dealing with tied rankings is the rule rather than the exception. Sometimes the number of objects is really large (rankings of the European universities given by several agencies, rankings of Netflix series given by users), and sometimes it is not so large [presidential political elections (Gormley and Murphy 2008a, b), wine-tasting experiments, and the famous APA election data (Diaconis 1988)]. The common element in these data is that either many items are tied at several positions or there are incomplete rankings, which also involve a tie block among the objects in the last position.
By following the matrix representation of a ranking already defined by Kendall (1938), any ranking R can be represented by a square n × n matrix with elements r_ij = 1 if the ith item is preferred to the jth item, r_ij = −1 if the jth item is preferred to the ith item, and r_ij = 0 if the two items are tied. Kemeny and Snell (1962) defined the distance between two rankings R and R* as

d(R, R*) = (1/2) Σ_i Σ_j |r_ij − r*_ij|.   (1)

Recently, Emond and Mason (2002) defined a new rank correlation coefficient, named τ_X (tau extension), and proved that it is related to the Kemeny distance by

τ_X(R, R*) = 1 − 2 d(R, R*) / (n(n − 1)),

where n(n − 1) is the maximum Kemeny distance.
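A minimal sketch of Eq. 1 and of the τ_X relation, assuming rank vectors as input (the function names are ours; the score-matrix construction follows Kendall's definition above and also handles tied rankings):

```python
import numpy as np

def score_matrix(r):
    """Kendall score matrix of a rank vector r (lower rank = more preferred):
    entry (i, j) is 1 if item i is preferred to item j, -1 for the reverse,
    and 0 if the two items are tied."""
    r = np.asarray(r)
    return np.sign(r[None, :] - r[:, None])

def kemeny_distance(r1, r2):
    """Kemeny distance: half the sum of absolute differences of score matrices (Eq. 1)."""
    return np.abs(score_matrix(r1) - score_matrix(r2)).sum() / 2.0

def tau_x(r1, r2):
    """Emond-Mason tau extension: 1 - 2 d / (n (n - 1))."""
    n = len(r1)
    return 1.0 - 2.0 * kemeny_distance(r1, r2) / (n * (n - 1))

print(kemeny_distance([1, 2, 3], [3, 2, 1]))  # 6.0, the maximum n(n - 1)
print(tau_x([1, 2, 3], [3, 2, 1]))            # -1.0
```

Because ties simply appear as zeros in the score matrix, the same code covers full and tied rankings, which is the flexibility the paper attributes to the Kemeny distance.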

K-median cluster component analysis
By following the probabilistic clustering framework introduced by Ben-Israel and Iyigun (2008), according to which distances and probabilities are inversely related, we propose a soft-clustering algorithm for rank data in which each ranking is assigned to all K clusters through membership probabilities. Given a set of m individuals, the loss function of our K-median cluster component analysis (CCA) is

Σ_{s=1}^{m} Σ_{k=1}^{K} p_k(R_s)² d(R_s, C_k),   (2)

where p_k(R_s) is the probability of allocating ranking R_s to cluster component k, whose cluster center is denoted by C_k, and d(R_s, C_k) is the Kemeny distance defined in Eq. 1. Here, we assume that the cluster center is the so-called consensus ranking, namely, the ranking that best represents the consensus opinion of a sample of judges. Finding the consensus ranking is a separate topic and a very old (and NP-hard) problem (de Borda 1781; Condorcet 1785; Black et al. 1958; Emond and Mason 2002; Ali and Meila 2012; Aledo et al. 2013; Amodio et al. 2016b; D'Ambrosio et al. 2017). Here, we take as consensus ranking the ordering that Kemeny and Snell (1962) defined as the median ranking, namely, the ranking that minimizes the sum of the Kemeny distances between itself and the rankings expressed by a set of judges.
As reported in Ben-Israel and Iyigun (2008), to which we refer for a thorough discussion, the stationary equation of the Lagrangian of the CCA optimization problem (2), which includes the constraints that (a) the membership probabilities are nonnegative and (b) they sum to one for each individual, is equivalent to the principle of probability being inversely related to distance:

p_k(R_s) d(R_s, C_k) = u(R_s), k = 1, … , K,   (3)

where u(R_s) is a constant that depends on R_s but not on the cluster. The problem of finding the consensus ranking in (2) then leads, for each cluster k, to

C_k = arg min_C Σ_{s=1}^{m} p_k(R_s)² d(R_s, C),   (4)

which can be interpreted as the median (or consensus) ranking weighted by the squared values of the kth column of the allocation matrix containing the p_k(R_s) values for s = 1, … , m. By looking at Eqs. 2 and 4, it is clear that our optimization problem is equivalent to finding K median rankings, each weighted by the corresponding squared probabilities. It is well known that finding the median ranking is an NP-hard problem (Bartholdi et al. 1989; Emond and Mason 2002; Amodio et al. 2016b), being a combinatorial optimization problem defined on a discrete space. For this reason, special attention must be given to the accuracy of the solution of the median ranking problem.
The output of the CCA consists of the allocation membership matrix, whose entries for the sth ranking are (see Ben-Israel and Iyigun 2008, Eq. 7)

p_k(R_s) = ∏_{j≠k} d(R_s, C_j) / Σ_{t=1}^{K} ∏_{j≠t} d(R_s, C_j),   (5)

and of the median ranking of each cluster component, as defined in Eq. 4. The CCA algorithm is summarized as follows.
1. Initialize K randomly chosen cluster centers.
2. Compute the membership probabilities as in Eq. 5.
3. Update the cluster centers as in Eq. 4 and compute the value of the loss function defined in Eq. 2.
4. Repeat steps 2 and 3 until the difference in loss between the current iteration and the previous iteration is less than a given threshold.
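The steps above can be sketched as follows. This is an illustrative simplification, not the paper's implementation: the weighted median ranking is found by brute force over all n! full permutations (feasible only for small n, whereas the paper uses branch-and-bound or heuristic algorithms and also searches over tied rankings), and the initial centers are drawn from the observed rankings, a common variant of step 1 that we adopt here.

```python
import itertools
import numpy as np

def kemeny_distance(r1, r2):
    """Kemeny distance between two rank vectors, via Kendall score matrices."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    s1 = np.sign(r1[None, :] - r1[:, None])
    s2 = np.sign(r2[None, :] - r2[:, None])
    return np.abs(s1 - s2).sum() / 2.0

def cca(rankings, K, max_iter=100, tol=1e-6, seed=0):
    """Sketch of K-median cluster component analysis (full rankings only)."""
    rng = np.random.default_rng(seed)
    rankings = np.asarray(rankings)
    m, n = rankings.shape
    universe = np.array(list(itertools.permutations(range(1, n + 1))))
    # Distances from every candidate center to every observed ranking,
    # precomputed once (|universe| x m).
    d_all = np.array([[kemeny_distance(c, r) for r in rankings] for c in universe])
    uniq = np.unique(rankings, axis=0)
    centers = uniq[rng.choice(len(uniq), size=K, replace=False)]
    prev_loss = np.inf
    for _ in range(max_iter):
        # Step 2: membership probabilities (Eq. 5). Each numerator is the
        # product of the distances to the *other* centers, so a ranking that
        # coincides with a center gets probability one for that cluster.
        D = np.array([[kemeny_distance(r, c) for c in centers] for r in rankings])
        prod = np.array([[np.prod(np.delete(row, k)) for k in range(K)] for row in D])
        P = prod / prod.sum(axis=1, keepdims=True)
        # Step 4: stop when the loss (Eq. 2) no longer decreases appreciably.
        loss = np.sum(P ** 2 * D)
        if prev_loss - loss < tol:
            break
        prev_loss = loss
        # Step 3: update each center as the weighted median ranking (Eq. 4).
        for k in range(K):
            centers[k] = universe[int(np.argmin(d_all @ (P[:, k] ** 2)))]
    return centers, P
```

Restricting the center search to full permutations keeps the sketch short; extending the search space to tied rankings, as CCA does, would only enlarge `universe`.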
As a measure of homogeneity within each cluster, the weighted average τ_X rank correlation coefficient ĥ_k is computed: the τ_X coefficient between each ranking and the cluster center C_k, averaged with weights given by the membership probabilities (Eq. 6). Regarding the stopping rule, we set the threshold equal to 1e−6. In general, the algorithm should continue until the cluster centers no longer change. In practice, since the median ranking problem may admit multiple solutions [all with the same goodness-of-fit measure (Kemeny and Snell 1962; Emond and Mason 2002)], this convenient threshold allows the procedure to stop when there are no further changes in the internal homogeneity of the clusters. As a measure of global homogeneity, the global weighted τ_X rank correlation coefficient is computed as

H_CCA = Σ_{k=1}^{K} p̄_k ĥ_k,   (7)

where p̄_k = (1/m) Σ_{s=1}^{m} p_k(R_s) is the estimated proportion of cases belonging to cluster component k.
Ben-Israel and Iyigun (2008, Eq. 10) also provided the joint distance function as a measure of the classifiability of a given point in the clustering process: it is equal to zero if and only if a ranking coincides with one of the cluster centers, and it increases as the distances increase, reflecting uncertainty in assigning a data point to a cluster. For a ranking R_s, the joint distance function is

D_s = ∏_{k=1}^{K} d(R_s, C_k) / Σ_{k=1}^{K} ∏_{j≠k} d(R_s, C_j),

and a useful normalized classifiability measure can be defined as

U_s = K² D_s / Σ_{k=1}^{K} d(R_s, C_k).   (8)

By interpreting the value 0/0 as zero, Eq. 8 returns a better measure of classifiability than the joint distance function, because its range is [0, 1], it is equal to zero if and only if some distance is zero (i.e., R_s coincides with some C_k), and it is equal to one when the allocation probability of R_s is equal to 1/K for all k = 1, … , K. As a global measure of normalized classifiability, we define the quantity

U = (1/m) Σ_{s=1}^{m} U_s,   (9)

which gives an idea of the global performance of the algorithm. Moreover, Eq. 9 decreases monotonically with the number of clusters. This decrease is steep until the appropriate number of cluster centers is reached. We can therefore use the values of U as in a scree plot, looking for an elbow in the graph showing the number of cluster centers on the horizontal axis and the corresponding values on the vertical axis.
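A sketch of the two classifiability measures, with the normalization of Eq. 8 written so as to satisfy the three properties listed above; the function names are ours, and the normalization constant K² reflects our reading of the text (it can be checked against the boundary cases: all distances equal gives 1, any zero distance gives 0).

```python
import numpy as np

def joint_distance(d):
    """Joint distance function (Ben-Israel and Iyigun 2008, Eq. 10), computed
    from the vector d of distances of one ranking to the K cluster centers."""
    d = np.asarray(d, dtype=float)
    K = len(d)
    num = np.prod(d)
    den = sum(np.prod(np.delete(d, k)) for k in range(K))
    return num / den if den > 0 else 0.0  # 0/0 read as 0

def normalized_classifiability(d):
    """U_s in [0, 1]: 0 iff some distance is 0 (the ranking coincides with a
    center), 1 iff all distances are equal (all allocation probabilities 1/K)."""
    d = np.asarray(d, dtype=float)
    K = len(d)
    total = d.sum()
    return K ** 2 * joint_distance(d) / total if total > 0 else 0.0

print(normalized_classifiability([3.0, 3.0]))  # 1.0 -> maximal uncertainty
print(normalized_classifiability([0.0, 4.0]))  # 0.0 -> coincides with a center
```

The global measure of Eq. 9 is then simply the mean of `normalized_classifiability` over all m rankings.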

Experimental evaluation
To evaluate the behavior of the CCA algorithm, we both generated artificial data sets with known cluster structure and performed an application on a real data set.

Simulation study
We designed a factorial experiment that included three factors (number of items, number of clusters, and level of noise) with two or three levels each. As ranking generator processes, we used the Mallows model (Mallows 1957), the Mallows-Bradley-Terry (MBT) model (Bradley and Terry 1952; Mallows 1957), and the Insertion Sorting Rank (ISR) model (Biernacki and Jacques 2013). The choice of generating preference rankings according to several distribution models is due to the fact that we are proposing a non-parametric method, designed to work without fitting any particular parametric distribution model governing the ranking generator process. Of course, the choice of the distance measure plays a key role in the CCA performance; we chose the Kemeny distance because of its flexibility in dealing with any kind of ranking. On the other hand, we are able to compare our results with parametric approaches such as mixtures of distance-based models for rank data (MDB) (Murphy and Martin 2003), which assume that rankings are generated by a Mallows model, and the ISR model-based clustering (Jacques and Biernacki 2014). With the MBT model, we generate rankings that are 'neutral' with respect to all the methods in terms of the ranking generator process.
According to the Mallows model, given a central ranking C and a precision parameter θ, the likelihood of a ranking R is defined as

P(R | C, θ) = exp(−θ d(R, C)) / ψ(θ),   (10)

where ψ(θ) is a normalizing constant. When d(⋅) is the Kendall distance, model 10 is known as the Mallows φ model. The precision parameter governs the noise level around the central ranking C. When θ = 0, a uniform distribution is obtained and all the rankings in the universe have the same probability of being sampled. The larger θ, the more representative the central ranking C is of the sample.
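For small n, the Mallows model can be sampled exactly by enumerating the universe of permutations. The sketch below is our own illustration and uses the Kemeny distance in the exponent (equivalent to the Kendall distance up to a factor of 2 for full rankings, so it only rescales θ):

```python
import itertools
import numpy as np

def kemeny_distance(r1, r2):
    """Kemeny distance between two rank vectors, via Kendall score matrices."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    s1 = np.sign(r1[None, :] - r1[:, None])
    s2 = np.sign(r2[None, :] - r2[:, None])
    return np.abs(s1 - s2).sum() / 2.0

def mallows_sample(center, theta, size, seed=0):
    """Draw rankings from P(R | C, theta) prop. to exp(-theta d(R, C))
    by enumerating all n! permutations (small n only)."""
    rng = np.random.default_rng(seed)
    n = len(center)
    universe = np.array(list(itertools.permutations(range(1, n + 1))))
    d = np.array([kemeny_distance(r, center) for r in universe])
    w = np.exp(-theta * d)
    pmf = w / w.sum()  # the normalizing constant psi(theta) is w.sum()
    idx = rng.choice(len(universe), size=size, p=pmf)
    return universe[idx]

# theta = 0 gives the uniform distribution over rankings; larger theta
# concentrates the sample around the central ranking.
sample = mallows_sample([1, 2, 3, 4], theta=0.9, size=5)
```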
The Mallows-Bradley-Terry model is a probability model on rankings with nonnegative parameters v_i, i = 1, … , n, where n is the number of items. For each pair of items i < j, let p_ij = v_i / (v_i + v_j) be the probability that item i is preferred to item j in their paired comparison. To any ranking R, the MBT model associates a probability proportional to the product, over all pairs of items, of the probabilities of the paired comparisons implied by R. The v parameters can be manipulated to control the noise level: the closer the values of v are to each other, the more the distribution of rankings in the universe tends to the uniform distribution, producing high noise. Conversely, the more distant the v parameters are from each other, the more the distribution of rankings is peaked around the reference permutation.
The insertion sorting rank (ISR) model assumes that a rank datum is the result of a sorting algorithm based on paired comparisons (Biernacki and Jacques 2013). Given a ranking R, a central ranking C, and a parameter π ∈ [0.5, 1], the ISR model is defined as

P(R | C, π) = (1/n!) Σ_{R*} π^{G(R, R*, C)} (1 − π)^{A(R, R*) − G(R, R*, C)},

where π is the probability of a good paired comparison according to C, and the sum over R* runs over all possible initial presentation orders of the objects to rank. The term G(R, R*, C) is the number of good paired comparisons during the sorting process leading to R when the presentation order is R*, and A(R, R*) is the total number of paired comparisons. When π = 0.5, the distribution of rankings in the universe is uniform. The more π approaches one, the more the distribution of rankings is concentrated around the central ranking C.
Table 1 summarizes the factors and the levels of the factorial design.
For each combination of conditions, ten data sets were generated. The parameter θ of the Mallows model was set equal to 0.1, 0.5, and 0.9 for the cases of high, moderate, and low noise, respectively.
For the Mallows-Bradley-Terry model, we created the three levels of noise by referring to the following scheme.
For the insertion sorting rank model, the parameter π was set to 0.6, 0.75, and 0.9 for the cases of high, moderate, and low noise, respectively.
Regarding the clustering structure, one center was always set as a random permutation. The other centers are random permutations subject to the following rules.
- Two-centers case: the second center is the reverse ranking of the first center, plus a random noise consisting of a random interchange between adjacent items.
- Three-centers case: one center is the reverse ranking of the first center, plus a random noise. The other center is randomly selected among rankings whose distance from the first center is between 3∕4 n(n − 1) and n(n − 1), plus a random noise.
- Four-centers case: one center is the reverse ranking of the first center, plus a random noise. Another center is randomly selected among rankings whose distance from the first center is between 2∕4 n(n − 1) and 3∕4 n(n − 1), plus a random noise; the last center is the reverse ranking of the third center, plus a random noise.
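The two basic building blocks of these rules, reversing a ranking and perturbing it with one random adjacent interchange, can be sketched as follows (helper names are ours; full rankings are assumed):

```python
import numpy as np

def reverse_ranking(r):
    """Reverse a rank vector of n items: rank k becomes n + 1 - k."""
    r = np.asarray(r)
    return len(r) + 1 - r

def adjacent_noise(r, rng):
    """Random noise: interchange the two items holding adjacent ranks k, k + 1."""
    r = np.asarray(r).copy()
    k = rng.integers(1, len(r))                      # pick ranks k and k + 1
    i = np.where(r == k)[0][0]                       # item currently ranked k
    j = np.where(r == k + 1)[0][0]                   # item currently ranked k + 1
    r[i], r[j] = k + 1, k
    return r

# Two-centers case: a random first center and its noisy reverse.
rng = np.random.default_rng(0)
first = rng.permutation(np.arange(1, 6))
second = adjacent_noise(reverse_ranking(first), rng)
```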
The sample size was always set equal to N = 150. The proportions of cases were randomly chosen in such a way that the clusters are not extremely unbalanced. More precisely, in the case of two centers, a random number π₁ between 0.35 and 0.65 was generated from a uniform distribution. Then, the sample size N₁ for cluster 1 was set as N × π₁ (rounded up), and that of cluster 2 as N₂ = N − N₁. The same scheme was adopted for the cases of three and four clusters. In the case of three centers, we generated two random numbers π₁ and π₂ between 0.22 and 0.37. In the case of four clusters, the proportions π₁, π₂, and π₃ were randomly generated from a uniform distribution between 0.2 and 0.3. For each condition and each theoretical ranking-generating model, ten data sets were generated, for a total of 540 data sets. To avoid local optima, each algorithm was run with ten random replications; in our experience, ten replications are enough to overcome the local optima problem. Furthermore, the rank aggregation problem can become extremely computationally expensive when the number of objects to be ranked increases (Ali and Meila 2012; Aledo et al. 2013; Amodio et al. 2016b; D'Ambrosio et al. 2017). For each trial, the number of clusters was varied from 1 to 7. The number of clusters was chosen with the BIC criterion for the mixtures of distance-based models and for the insertion sorting rank clustering model, and with the scree-plot approach for the CCA.
As the clustering composition is known for all the data sets, the algorithms were externally evaluated through the Normalized Degree of Concordance (NDC) (Hüllermeier et al. 2012), which is the equivalent of the Rand Index for fuzzy partitions, and the Adjusted Concordance Index (ACI) (Amodio et al. 2016a), which is the equivalent of the Adjusted Rand Index for fuzzy partitions.
As the algorithms are quite different, we decided not to show measures of internal cluster validation. In any case, we can report that all the algorithms returned quite good measures of internal validation, such as the averaged entropy or the averaged correlation coefficient of the final solution, especially in the cases of moderate and low noise.
Both the simulations involving the ISR model and the ISR model-based clustering were performed using the Rankcluster R package (Grimonprez and Jacques 2016). Both the CCA and the MDB algorithms were written in both the MatLab and R environments. Although CCA is designed to work with both full and tied rankings, the reference universe for the sampling scheme is limited to the population of full rankings. We made this choice because the MDB algorithm works only with full rankings, and when some items are tied, the Rankcluster package treats the corresponding ranking positions as missing (Jacques et al. 2014). Nevertheless, the search space of the cluster centers for the CCA algorithm is not limited to the universe of full rankings. In finding the median ranking, the CCA algorithm uses the branch-and-bound algorithm designed by Emond and Mason (2002) if the number of items is less than 15, the FAST heuristic algorithm (Amodio et al. 2016b) if the number of items is at most 50, and the Differential Evolution algorithm for consensus ranking detection (D'Ambrosio et al. 2017) if the number of items is larger than 50.

Simulation results
Figures 1, 2 and 3 show the interaction box plots for the ISR, Mallows, and MBT sampling schemes, respectively. Each figure shows the NDC index in the first column and the ACI index in the second column. The rows refer to the simulations with two, three, and four centers. As expected, in the case of high noise, there are no remarkable differences among the algorithms for any of the sampling schemes.
When the ranking generator process is governed by the ISR model, the ISR clustering algorithm outperforms both CCA and MDB, especially in the case of low noise (Fig. 1). The performances in terms of ACI are not as good in the case of moderate noise, but this has to be evaluated bearing in mind that fuzzy and crisp partitions are being compared. Regarding the Mallows sampling scheme, MDB slightly outperforms the competitors, especially in the case of low noise, as can be seen in Fig. 2. The behavior of the CCA is similar to that of the MDB. Indeed, the Kendall and Kemeny distances are equivalent when dealing with full rankings, in the sense that the Kemeny distance is twice the Kendall distance (Heiser and D'Ambrosio 2013).
In the case of the Mallows-Bradley-Terry sampling scheme, CCA gives the best performance. Here, the differences can be better appreciated by looking at the case of moderate noise (Fig. 3), while when the noise is low, the algorithms behave similarly, with some differences in the variability of the results in terms of NDC.
For all the experimental situations, the values in terms of ACI are larger in the cases with two centers. This result is expected, given the way the theoretical cluster centers were generated (one random permutation and its reverse, plus a noise). In the other cases, large values of ACI are recorded only in situations with a low level of noise. It is worth noting that, even though the ACI index is the fuzzy equivalent of the Adjusted Rand Index (ARI), its values are slightly lower than those returned by the ARI, especially when comparing a fuzzy with a crisp partition.
Summarizing the results of this simulation study, we can conclude that under the ISR sampling scheme, both CCA and MDB show quite similar behavior. All the algorithms tend to underestimate the number of clusters, in fact failing to recover the clustering structure, when there is a high level of noise. It should be noted that in such a case, the data have been sampled from a 'quasi'-uniform distribution, indicating a severe lack of consensus within clusters. Both CCA and MDB often fail to recover the clustering structure also with moderate noise under the ISR sampling scheme.
We decided not to report the execution times of the single algorithms, because the codes were not written in the same programming language and have not been optimized for speed. Moreover, for both MDB and CCA, we used the MatLab version, while ISR was used through the Rankcluster R package. All analyses were run on an Intel Core i7-5500u 2.40 GHz computer with 8 GB of RAM.

US ranking
The US rank data set is a random subset of the rankings collected by O'Leary Morgan and Morgan (2010) on the 50 American States. The number of items (the American States) is equal to 50, and the number of rankings is equal to 104. These data concern rankings of the 50 American States on three particular aspects: socio-demographic characteristics, health care expenditures, and crime statistics. There are 41 full rankings and 63 tied rankings, summarized as shown in the following table. The row named buckets shows the number of different ranks assigned to the States: for example, there is one ranking of the 50 States with 35 different rank values, there are ten rankings with 45 different rank values, and so on. The data are freely available in the internal repository of the R package ConsRank (D'Ambrosio et al. 2017). In terms of the rank aggregation problem, a sample of rankings with n = 50 items is considered a very hard problem to solve (Amodio et al. 2016b; D'Ambrosio et al. 2017). In general, the complexity of the rank aggregation problem is governed by the number of objects, not by the number of individuals. When the number of objects is larger than 15 or 20, only accurate heuristic algorithms can be used to detect the median ranking. Moreover, there are several tied rankings not governed by any fixed pre-specified pattern of ties (Marden 1995). This makes it impossible to tackle the median ranking detection with classical (and modern) distance-based models.
Figure 4 shows that the U function (Eq. 9) indicates the solution with three clusters.
The within-cluster homogeneity measures ĥ_k (Eq. 6) are equal to 0.679, 0.379, and 0.388 for k = 1, 2, and 3, respectively. The estimated proportions of cases in the clusters are equal to 0.398, 0.306, and 0.296, giving a global averaged homogeneity H_CCA (Eq. 7) of the CCA solution equal to 0.502. Figure 5 shows the geometrical representation of the allocation probabilities through a correspondence analysis, for which there is a 100% fit.
The figure confirms the highest value of internal homogeneity for cluster 1, in which the rankings are more concentrated around the cluster center. By looking at the rankings with the highest probability of belonging to a particular cluster, the first group synthesizes the ordering of the States mainly characterized by a high level of Medicare program payments, per capita local government expenditure for justice activity and corrections, and a mixed-race population. The second cluster is mainly formed by States with high murder, robbery, rape, and reported arrest rates, and so on. Finally, the third cluster is formed by States with a high level of Gross Domestic Product per capita, high personal income, and a high percentage of health insurance coverage.
Table 2 shows the median rankings within each cluster; tied rankings are enclosed in brackets. The computing time of the analysis, conducted with 30 random repetitions for k = 2, … , 5, was about 12 min.

Concluding remarks
In this paper, we introduced a distribution-free soft-clustering method for preference rankings, named K-median cluster component analysis (CCA). Our approach follows the probabilistic distance clustering approach, according to which the probability that each data point belongs to a given cluster is inversely related to the distance between the point and the cluster center in the sample space. As rankings are a particular kind of data with a specific discrete sample space, the key point of our proposal lies in the search for a weighted median ranking, which is a discrete optimization problem, by working directly with a loss function defined in terms of the Kemeny distance. In this way, no problems arise if a ranking coincides with a cluster center, which is common especially when the number of items is not large.
We performed a simulation study in which rank data were generated by three different approaches: the Mallows model, the Insertion Sorting Rank model, and the Mallows-Bradley-Terry model. This choice reflects the fact that our proposal is distribution free, and we wanted to check its behavior on data coming from several different probability functions. The Kemeny distance was chosen as the core of our proposal because it is very flexible and works with both full and tied rankings. We compared the performance of the CCA with a mixture of distance-based models for rankings (MDB), which assumes that the population of judges is modeled by a (mixture of) Mallows models, and with the Insertion Sorting Rank model-based clustering (ISR), which assumes that the rankings in the population are generated by an ISR model. A factorial experiment, evaluated through both the normalized degree of concordance and the adjusted concordance index, has shown that CCA works well when the ranking generator process is either the Mallows or the ISR model, even if it is slightly outperformed by the respective competitors. The CCA outperformed both MDB and ISR when the ranking generator process is the Mallows-Bradley-Terry model. In any case, our proposal works without the need to assume any model for preference rankings, and it handles any kind of preference ranking: full, tied, and incomplete (or partial). Regarding incomplete rankings, we currently deal with partial rankings by assuming, as usual (D'Ambrosio and Heiser 2016; Plaia and Sciandra 2017), that either unranked or missing items are tied at the last position. A way to deal directly with incomplete rankings, removing this assumption, is under study for future work.
An application on a real data set shows the flexibility of our approach. The USA rank data set consists of a collection of both full and tied rankings. As the number of items is equal to 50 (the US States), it is an example of an analysis on a data set that is very hard in terms of finding the cluster centers.
A known problem of the original PD clustering is that it works well when the cluster sizes are approximately equal. In the literature, an adjustment of PD clustering for cluster size has been proposed (Iyigun and Ben-Israel 2008), which slightly modifies the loss function to take possible unbalanced clusters into account. In our case, the universe of rankings is discrete and finite. When there are few objects to be ranked, the same ranking can be repeatedly observed in the data set, allowing a 'natural' adjustment that takes into account the relative frequencies of each single ranking. Nevertheless, we experimented with adjustments for cluster size on simulated cases with extremely unbalanced clusters and a large number of items, and these adjustments did not work well. This topic, which is outside the scope of this study, will be addressed in future work.

Fig. 1 Insertion Sorting Rank model-based sampling scheme. Boxes refer to the normalized degree of concordance (first column) and to the adjusted concordance index (second column)

Fig. 4 Normalized classifiability measure over the cluster components

Fig. 5 USA rank data: geometrical representation of allocation probabilities

Table 1 Summary of independent factors and levels of the simulation study