The effective use of the DSmT for multi-class classification

The extension of the Dezert-Smarandache theory (DSmT) to the multi-class framework has a feasible computational complexity for various applications when the number of classes is small, typically two classes. In contrast, when the number of classes is large, the DSmT generates a high computational complexity. This paper investigates the effective use of the DSmT for multi-class classification in conjunction with Support Vector Machines using the One-Against-All (OAA) implementation, which offers two advantages: firstly, it allows modeling the partial ignorance by including the complementary classes in the set of focal elements during the combination process and, secondly, it allows drastically reducing the number of focal elements using a supervised model by introducing exclusive constraints when classes are naturally mutually exclusive. To illustrate the effective use of the DSmT for multi-class classification, two SVM-OAA implementations are combined in three steps: transformation of the SVM classifier outputs into posterior probabilities using Platt's sigmoid technique, estimation of masses directly through the proposed model, and combination of masses through the Proportional Conflict Redistribution (PCR6) rule. To demonstrate the effectiveness of the proposed framework, a case study is conducted on handwritten digit recognition. Experimental results show that it is possible to efficiently reduce both the number of focal elements and the classification error rate.


Introduction
Nowadays, a large number of classifiers and feature generation methods are being developed in various application areas of pattern recognition [1,2]. Nevertheless, no method has shown an incontestable superiority over the others in either the feature generation step or the classification step. Rather than trying to optimize a single classifier by choosing the best features for a given problem, researchers have found it more interesting to combine recognition methods [2,3]. Indeed, the combination of classifiers allows exploiting the redundant and complementary nature of the responses issued from different classifiers.
Researchers have proposed increasingly numerous and varied approaches for combining classifiers, which has led to the development of several schemes that treat data in different ways [2,3]. Generally, three approaches for combining classifiers can be considered: the parallel approach, the sequential approach and the hybrid approach [2]. Furthermore, the combination can be performed at the class level, at the rank level, or at the measure level [4][5][6][7].
In many applications, various constraints do not allow an efficient joint use of classifiers and feature generation methods, which leads to inaccurate performance. Thus, an appropriate operating method using mathematical approaches is needed, which takes into account two notions: uncertainty and imprecision of the responses of classifiers. In general, most theoretical advances devoted to the theory of probabilities are able to represent uncertain knowledge but are unable to easily model information that is imprecise, incomplete, or not totally reliable. Moreover, they often lead to confusing the concepts of uncertainty and imprecision with the probability measure. Therefore, new theories dealing with uncertain and imprecise information have been introduced, such as fuzzy set theory [8], evidence theory [9,10], possibility theory [11] and, more recently, the theory of plausible and paradoxical reasoning [12][13][14].
The evidence theory initiated by Dempster and Shafer, termed the Dempster-Shafer theory (DST) [9,10], is generally recognized as a convenient and flexible alternative to the Bayesian theory of subjective probability [15]. The DST is a powerful theoretical tool which has been applied in many kinds of applications [16] for the representation of incomplete knowledge, belief updating and the combination of evidence [17,18] through the Dempster-Shafer combination rule.
Indeed, it offers a simple and direct representation of ignorance and has a low computational complexity [19] for most practical applications.
Nevertheless, this theory presents some weaknesses and limitations, mainly when the combined evidence sources become highly conflicting. Furthermore, Shafer's model itself does not necessarily hold in some fusion problems involving paradoxical information. To overcome these limitations, a more recent theory of plausible and paradoxical reasoning, known in the literature as the Dezert-Smarandache theory (DSmT), was elaborated by Jean Dezert and Florentin Smarandache for dealing with imprecise, uncertain and paradoxical sources of information. Thus, the main objective of the DSmT was to introduce combination rules that allow correctly combining evidence issued from different information sources, even in the presence of conflicts between sources or of constraints corresponding to an appropriate model (free or hybrid DSm models [12]). The DSmT has proved its efficiency in many current pattern recognition application areas such as remote sensing [20][21][22][23], identification and tracking [24][25][26][27][28][29], biometrics [30][31][32][33], computer vision [34][35][36], robotics [37][38][39][40][41][42] and, more recently, handwriting recognition applications [7,43,44], as well as many others [12][13][14].
The use of the DSmT for multi-class classification has a feasible computational complexity for various applications when the number of classes is small, typically two classes [43]. In contrast, when the number of classes is large, the DSmT generates a high computational complexity closely related to the number of elements to be processed. Indeed, an analytical expression defined by Tombak et al. [45] shows that the number of elements to be processed follows the sequence of Dedekind's numbers [46,47]: 1, 2, 5, 19, 167, 7580, 7828353, ... For instance, if the number of classes belonging to the discernment space is 8, then the number of elements to be dealt with in the DSmT framework is approximately $5.6 \times 10^{22}$. Hence, it is not easy to consider the set of all subsets of the original classes (built under the union and intersection operators) since it becomes intractable for more than 6 elements in the discernment space [48]. Thus, Dezert and Smarandache [49] proposed a first work for ordering all elements generated using the free DSm model for matrix calculus, as done in the DST framework [50,51]. However, this proposition has limitations since in practical applications it is more appropriate to manipulate only the focal elements [7][52][53][54].
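The growth described above can be checked by brute force for very small frames. The following is a minimal sketch, not taken from the cited works: the function name `hyper_power_set_size` and the Venn-region bitmask encoding are assumptions chosen for illustration, and the enumeration is only practical up to about four classes.

```python
def hyper_power_set_size(n):
    """Count the elements of D^Theta (free DSm model, empty set included)."""
    if n == 0:
        return 1  # only the empty set
    # Each Venn-diagram region is a non-empty subset of class indices (bitmask r).
    regions = range(1, 2 ** n)
    # Encode theta_i as the set of regions it covers (bitmask over region positions).
    generators = set()
    for i in range(n):
        mask = 0
        for pos, r in enumerate(regions):
            if (r >> i) & 1:
                mask |= 1 << pos
        generators.add(mask)
    # Close the generators under union (|) and intersection (&).
    elements = set(generators)
    frontier = set(generators)
    while frontier:
        new = set()
        for a in frontier:
            for b in elements:
                for c in (a | b, a & b):
                    if c not in elements:
                        new.add(c)
        elements |= new
        frontier = new
    return len(elements) + 1  # add the empty set


if __name__ == "__main__":
    print([hyper_power_set_size(n) for n in range(5)])
```

For n = 0 to 4 this reproduces the first terms of the sequence quoted in the text; larger frames quickly become too expensive for this naive closure, which is the point the paragraph makes.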
Hence, a few works have already focused on the computational complexity of the combination algorithms formulated in the DSmT framework. Djiknavorian and Grenier [53] showed that the high complexity of the DSm hybrid (DSmH) combination algorithm can be avoided by designing code that performs a complete DSmH combination in a very short period of time. However, even if they obtained an efficient process for evaluating the DSmH algorithm, some parts of their code are not optimized and it has been developed only for dynamic fusion. Martin [55] further proposed a practical codification of the focal elements which assigns a single integer to each part of the Venn diagram representing the discernment space. Contrary to Smarandache's codification [48] used in [56] and the codes proposed in [53], the author argues that the constraints given by the application must be integrated directly into the codification of the focal elements in order to obtain a reduced discernment space. Therefore, this codification can drastically reduce the number of possible focal elements and hence the complexity of both the DST and DSmT frameworks. A disadvantage of this codification is that the complexity increases drastically with the number of combined sources, especially when dealing with a multi-class problem. To address this issue, Li et al. [57] proposed a criterion called the evidence supporting measure of similarity (ESMS), which consists of selecting, among all available sources, only a subset of sources of evidence in order to reduce the complexity of the combination process. However, this criterion has been justified only for a two-class problem.
Nowadays, reducing both the number of combined sources and the size of the discernment space remains a research challenge that still needs to be addressed.
In many pattern recognition applications, the classes belonging to the discernment space are naturally mutually exclusive, such as in biometrics [30][31][32][33] and handwriting recognition applications [7,43,44]. Hence, several classification methods have been proposed, such as template matching techniques [58][59][60], minimum distance classifiers [61,62], support vector machines (SVMs) [63], hidden Markov models (HMMs) [63][64][65] and neural networks [66,67]. In various pattern recognition applications, SVMs have proved their performance since the mid-1990s compared with other classifiers [2]. The SVM is based on an optimization approach that separates two classes by a hyperplane. In the context of multi-class classification, this optimization approach is possible [68] but requires a very costly computation time. Hence, two preferable multi-class implementations of SVMs have been proposed for combining several binary SVMs, namely One Against All (OAA) and One Against One (OAO) [69][70][71]. The former is the most commonly used implementation in the context of multi-class classification using binary SVMs; it constructs $n$ SVMs to solve an $n$-class problem [72]. Each SVM is designed to separate a simple class $\theta_i$ from all the others, i.e., from the corresponding complementary class $\bar{\theta}_i$. Hence, various decision functions can be used, such as the Decision Directed Acyclic Graph (DDAG) [73], which has the advantage of eliminating all possible unclassifiable data.
Generally, the combination of binary classifiers is performed through very simple approaches such as a voting rule or a maximization of the decision function coming from the classifiers. In this context, many combination operators can be used, especially in the DST framework [74]. In the same vein, some works have explored the combination of binary SVM classifiers in the DST framework [75,76]. For instance, the pairwise approach has been revisited by Quost et al. [76][77][78][79] in the framework of the DST of belief functions for solving a multi-class problem. In [80], a combination method based on the DST was used by Hu et al. for combining multiple multi-class probability SVM classifiers in order to deal with the distributed multi-source multi-class problem. Martin and Quidu proposed an original approach based on the DST [81] for combining binary SVM classifiers using the OAO or OAA strategies, which provides decision support helping experts with seabed characterization from sonar images. Burger et al. [82] proposed applying a belief-based method for SVM fusion to hand shape recognition. Optimizing the fusion of the sub-classifications and dealing with undetermined cases due to uncertainty and doubt have been investigated in other works [83] through a simple method which combines the fusion methods of belief theories with SVMs. Recently, a regression-based approach [84] has been proposed to predict membership or belief functions, which are able to correctly model the uncertainty and imprecision of data.
In this work, we propose to investigate the effective use of the DSmT for multi-class classification in conjunction with the SVM-OAA implementation, which offers two advantages: firstly, it allows modeling the partial ignorance by including the complementary classes in the set of focal elements, and hence in the combination process, contrary to the OAO implementation which takes into account only the singletons; secondly, it allows drastically reducing the number of focal elements. The reduction is performed through a supervised model using exclusive constraints.
Combining the outputs of SVMs within the DSmT framework requires that the outputs of the SVMs be transformed into membership degrees. Several methods for estimating mass functions have been proposed in both the DST and DSmT frameworks; the estimation can be performed either directly, through special functions, or indirectly, through transfer models [9][85][86][87][88]. In our case, we propose a direct estimation method based on the sigmoid transformation of Platt [89]. This allows us to satisfy the OAA implementation constraint.
The paper is organized as follows. Section 2 reviews the Proportional Conflict Redistribution (PCR6) rule based on the DSmT. Section 3 describes the combination methodology for multi-class classification using the SVM-OAA implementation. Experiments conducted on a dataset of isolated handwritten digits are presented in Section 4. The last section gives a summary of the proposed combination framework and outlines future research directions.

Review of PCR6 combination rule
In pattern recognition, the multi-class classification problem is generally formulated as an $n$-class problem where classes are associated with pattern classes, namely $\theta_0, \theta_1, \ldots, \theta_n$. Hence, the parallel combination of two classifiers, namely information sources $S_1$ and $S_2$, is performed through the PCR6 combination rule based on the DSmT. For an $n$-class problem, a reference domain, also called the discernment space, should be defined for performing the combination; it is composed of a finite set of exhaustive and mutually exclusive hypotheses.
In the context of probability theory, the discernment space, namely $\Theta$, is composed of $n$ elements as $\Theta = \{\theta_0, \theta_1, \ldots, \theta_n\}$, and a mapping function $m(\cdot) \in [0, 1]$ is associated with each class, which defines the corresponding mass. In the Bayesian framework, combining two sources of information by means of weighted-mean and consensus-based rules seems effective for non-conflicting responses [90][91][92][93]. In the opposite case, an alternative approach has been developed in the DSmT framework to deal with (highly) conflicting, imprecise and uncertain sources of information [14]. An example of such approaches is the PCR6 rule.
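The main concept of the DSmT is to distribute the unitary mass of certainty over all the composite propositions built from the elements of $\Theta$ with the $\cup$ (union) and $\cap$ (intersection) operators, instead of making this distribution over the elementary hypotheses only. Therefore, the hyper-powerset $D^\Theta$ is defined as:
1. $\emptyset, \theta_0, \theta_1, \ldots, \theta_n \in D^\Theta$.
2. If $A, B \in D^\Theta$, then $A \cap B \in D^\Theta$ and $A \cup B \in D^\Theta$.
3. No other elements belong to $D^\Theta$, except those obtained by using rules 1 or 2.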
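The DSmT uses the generalized basic belief mass, also known as the generalized basic belief assignment (gbba), computed on the hyper-powerset of $\Theta$ and defined by a map $m(\cdot): D^\Theta \rightarrow [0, 1]$ such that $m(\emptyset) = 0$ and $\sum_{A \in D^\Theta} m(A) = 1$. For two sources, the combined masses $m_{PCR6}$ obtained from $m_1(\cdot)$ and $m_2(\cdot)$ by means of the PCR6 rule [13,14] are defined as:

$$m_{PCR6}(A) = \begin{cases} 0 & \text{if } A \in \Phi, \\ m_{12}(A) + \sum\limits_{\substack{X \in D^\Theta \setminus \{A\} \\ A \cap X = \emptyset}} \left[ \dfrac{m_1(A)^2\, m_2(X)}{m_1(A) + m_2(X)} + \dfrac{m_2(A)^2\, m_1(X)}{m_2(A) + m_1(X)} \right] & \text{otherwise,} \end{cases}$$

where $\Phi = \{\Phi_M, \emptyset\}$ is the set of all relatively and absolutely empty elements, $\Phi_M$ is the set of all elements of $D^\Theta$ which have been forced to be empty in the hybrid model $M$ defined by the exhaustive and exclusive constraints, $\emptyset$ is the empty set, and the denominators are different from zero. The term $m_{12}(A)$ represents a conjunctive consensus, also called the DSm Classic (DSmC) combination rule [13,14], which is defined as:

$$m_{12}(A) = \begin{cases} 0 & \text{if } A \in \Phi, \\ \sum\limits_{\substack{X, Y \in D^\Theta \\ X \cap Y = A}} m_1(X)\, m_2(Y) & \text{otherwise.} \end{cases} \quad (4)$$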

Methodology
The proposed combination methodology shown in Fig. 1 is composed of two individual systems using SVM classifiers. Each one is trained using its own source of information, providing two kinds of complementary features, and their outputs are combined through the PCR6 rule. In the following, we give a description of each module composing our system.
The classification based on SVMs has been widely used in many pattern recognition applications such as handwritten digit recognition [2]. The SVM is a learning method introduced by Vapnik et al. [94], which tries to find an optimal hyperplane for separating two classes. Its concept is based on the maximization of the margin, i.e., the distance separating the closest points of the two classes. Therefore, the misclassification error of data both in the training set and test set is minimized.
Basically, SVMs have been defined for linearly separating two classes. When data are not linearly separable, a kernel function $K$ is used. Thus, all mathematical functions which satisfy Mercer's conditions are eligible to be an SVM kernel [94]. Examples of such kernels are the sigmoid kernel, the polynomial kernel, and the Radial Basis Function (RBF) kernel. Then, the decision function $f(x)$ is expressed in terms of a kernel expansion as:

$$f(x) = \sum_{k=1}^{Sv} \alpha_k\, y_k\, K(x, x_k) + b \quad (5)$$

where $\alpha_k$ are the Lagrange multipliers, $y_k \in \{-1, +1\}$ is the label of $x_k$, $Sv$ is the number of support vectors $x_k$, which are training data such that $0 < \alpha_k \leq C$, $C$ is a user-defined parameter that controls the tradeoff between the machine complexity and the number of nonseparable points [73], and the bias $b$ is a scalar computed by using any support vector.
Finally, for a two-class problem, test data are classified according to:

$$x \in \begin{cases} \text{class } (+1) & \text{if } f(x) > 0, \\ \text{class } (-1) & \text{otherwise.} \end{cases}$$

The extension of the SVM to multi-class classification is performed according to the One-Against-All (OAA) strategy [95]. Let a set of $N$ training samples be separable into $n$ classes. The principle consists of separating each class from the other classes. Consequently, $n$ SVMs are required for solving an $n$-class problem.
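The kernel expansion and the sign-based decision described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names (`rbf_kernel`, `decision_function`, `classify`), the width parameter `sigma` and the toy support vectors are hypothetical, and the multipliers are set by hand rather than obtained from training.

```python
import math


def rbf_kernel(x, z, sigma=1.0):
    """RBF kernel K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def decision_function(x, support_vectors, alphas, labels, bias, kernel=rbf_kernel):
    """Kernel expansion f(x) = sum_k alpha_k * y_k * K(x, x_k) + b."""
    return sum(
        a * y * kernel(x, sv) for a, y, sv in zip(alphas, labels, support_vectors)
    ) + bias


def classify(x, support_vectors, alphas, labels, bias, kernel=rbf_kernel):
    """Two-class decision: +1 if f(x) > 0, -1 otherwise."""
    f = decision_function(x, support_vectors, alphas, labels, bias, kernel)
    return 1 if f > 0 else -1


if __name__ == "__main__":
    # Hand-made toy model (not trained): one support vector per class.
    svs = [(0.0, 0.0), (2.0, 2.0)]
    print(classify((0.1, 0.0), svs, [1.0, 1.0], [1, -1], 0.0))
    print(classify((2.0, 1.9), svs, [1.0, 1.0], [1, -1], 0.0))
```

In an OAA setting, one such decision function would be evaluated per class, each trained to separate that class from all the others.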

Classification Based On DSmT
The proposed classification based on the DSmT is presented in Fig. 2 and is conducted in three steps: i) estimation of masses, ii) combination of masses through the PCR6 combination rule and iii) decision rule.

Estimation of Masses
The difficulty of estimating masses increases if one assigns weights to the compound classes [96]. Therefore, transfer models of the mass function have been proposed, whose aim is to distribute the initial masses over the simple and compound classes associated with each source. Thus, the estimation of masses is performed in two steps: i) assignment of membership degrees to each simple class through the sigmoid transformation proposed by Platt [89], ii) estimation of the masses of the simple classes and their complementary classes using a supervised model.
• Calibration of the SVM outputs: Although the standard SVM is a very discriminative classifier, its output values are not calibrated for appropriately combining two sources of information. Hence, an interesting alternative is proposed in [89] to transform the SVM outputs into posterior probabilities. Given a training set of instance-label pairs $\{(x_k, y_k)\}_{k=1}^{N}$, where $y_k \in \{-1, +1\}$, the unthresholded output of an SVM is a distance measure between a test pattern and the decision boundary as given in (5). Furthermore, there is no clear relationship with the posterior class probability $P(y = +1 \mid x)$. A possible estimation of this probability can be obtained by modeling the class-conditional distributions of the SVM output $f(x)$ using Gaussian distributions of equal variance and then computing the probability of the class given the output by using Bayes' rule. This yields a sigmoid allowing the estimation of probabilities:

$$P(y = +1 \mid x) = \frac{1}{1 + \exp\left(A f(x) + B\right)}$$

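Parameters $A$ and $B$ are tuned by minimizing the negative log-likelihood of the training data:

$$-\sum_{k} \left[ t_k \log(p_k) + (1 - t_k) \log(1 - p_k) \right]$$

where $p_k = P(y = +1 \mid x_k)$ is the sigmoid output for the training pattern $x_k$ and $t_k$ denotes the probability target.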
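The calibration step can be illustrated with a small sketch of the sigmoid and its negative log-likelihood. It is a simplified stand-in rather than the procedure of [89]: plain gradient descent replaces the original optimizer, the targets are assumed to be the simple 0/1 values $(y_k + 1)/2$ (Platt also describes smoothed targets), and the names `platt_probability`, `negative_log_likelihood` and `fit_platt`, as well as the learning rate and step count, are hypothetical.

```python
import math


def platt_probability(f, A, B):
    """Sigmoid mapping of an SVM output f to P(y = +1 | f)."""
    return 1.0 / (1.0 + math.exp(A * f + B))


def negative_log_likelihood(outputs, targets, A, B):
    """Negative log-likelihood of the targets under the sigmoid model."""
    tiny = 1e-12  # guards the logarithms against p = 0 or p = 1
    total = 0.0
    for f, t in zip(outputs, targets):
        p = min(max(platt_probability(f, A, B), tiny), 1.0 - tiny)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total


def fit_platt(outputs, labels, lr=0.1, steps=500):
    """Fit A and B by gradient descent (assumption: targets t = (y + 1) / 2)."""
    targets = [(y + 1) / 2 for y in labels]
    A, B = 0.0, 0.0
    for _ in range(steps):
        grad_A = grad_B = 0.0
        for f, t in zip(outputs, targets):
            p = platt_probability(f, A, B)
            # d(NLL)/dA = sum (t - p) * f and d(NLL)/dB = sum (t - p)
            grad_A += (t - p) * f
            grad_B += t - p
        A -= lr * grad_A
        B -= lr * grad_B
    return A, B


if __name__ == "__main__":
    outs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
    labs = [-1, -1, 1, -1, 1, 1]
    print(fit_platt(outs, labs))
```

With outputs that are positively correlated with the labels, the fitted `A` is negative, so larger SVM outputs map to larger posterior probabilities.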
• Supervised model: Let $m_1(\cdot)$ and $m_2(\cdot)$ be the gbba provided by two distinct information sources $S_1$ (first descriptor) and $S_2$ (second descriptor), and let $F$ be the set of focal elements for each source, such that $F = \{\theta_0, \theta_1, \ldots, \theta_n, \bar{\theta}_0, \bar{\theta}_1, \ldots, \bar{\theta}_n\}$, where the classes $\theta_i$ are separable (each one relative to its complementary class $\bar{\theta}_i$) using the SVM-OAA multi-class implementation corresponding to the different singletons of the patterns assumed to be known. Therefore, each compound element $A_i \notin F$ has a mass equal to zero, whereas the mass of the complementary element $\bar{\theta}_i$ represents the mass of the partial ignorance.
2. Classify a pattern $x$ through the SVM-OAA implementation.
3. Transform each SVM output into a posterior probability using Eq. (12).
4. Compute the masses associated with each class and its complementary class using Eq. (9) and Eq. (10), respectively.
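To make the data flow of these steps concrete, the sketch below builds a mass assignment over the simple classes and their complements from the OAA posterior probabilities. The exact expressions of Eq. (9) and Eq. (10) are not reproduced here; the uniform normalization by the number of classes is a hypothetical stand-in chosen only so that the masses sum to one, and the function name `estimate_masses` is likewise an assumption.

```python
def estimate_masses(posteriors):
    """Map OAA posteriors P(theta_i | x) to masses on theta_i and its complement.

    Assumption: m(theta_i) = P_i / n and m(not theta_i) = (1 - P_i) / n.
    This is an illustrative normalization, not the paper's Eq. (9)-(10).
    Focal elements are encoded as frozensets of class indices.
    """
    n = len(posteriors)
    all_classes = frozenset(range(n))
    masses = {}
    for i, p in enumerate(posteriors):
        simple = frozenset([i])
        complement = all_classes - simple
        masses[simple] = masses.get(simple, 0.0) + p / n
        masses[complement] = masses.get(complement, 0.0) + (1.0 - p) / n
    return masses


if __name__ == "__main__":
    for element, mass in estimate_masses([0.7, 0.2, 0.4]).items():
        print(sorted(element), round(mass, 4))
```

For three or more classes this yields 2n focal elements, one per simple class and one per complementary class, which is the reduced set the supervised model works with.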

Combination of masses
In order to manage the conflict generated from the two information sources $S_1$ and $S_2$ (i.e. both SVM classifications), the combined masses are computed as follows:

$$m_c = m_1 \oplus m_2$$

where $\oplus$ defines the PCR6 combination rule as given in (1). Hence, in the context of some pattern recognition applications, such as handwritten digit recognition, we take as constraints the propositions $\theta_i \cap \theta_j = \emptyset$, such that $i \neq j$, which allow separating each pair of classes belonging to $\Theta$. Therefore, the hyper-powerset $D^\Theta$ is reduced to the set $F = \{\theta_0, \theta_1, \ldots, \theta_n, \bar{\theta}_0, \bar{\theta}_1, \ldots, \bar{\theta}_n\}$, which defines a particular case of Shafer's model. Thus, the conflict measured between the two sources is defined as:

$$K_c = \sum_{\substack{A, B \in F \\ A \cap B = \emptyset}} m_1(A)\, m_2(B)$$
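The two-source combination and the conflict measure can be sketched as follows, assuming Shafer's model with focal elements encoded as frozensets of class indices. The function name `pcr6_two_sources` is hypothetical, and the sketch keeps every non-empty intersection produced by the conjunctive step, which may fall outside the set $F$.

```python
def pcr6_two_sources(m1, m2):
    """Combine two mass assignments with the two-source PCR6 rule.

    m1, m2: dicts mapping frozenset(class indices) -> mass (each sums to 1).
    Returns (combined masses, total conflicting mass).
    """
    combined = {}
    conflict = 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            product = wa * wb
            if product == 0.0:
                continue
            inter = a & b
            if inter:
                # Conjunctive consensus: mass goes to the intersection.
                combined[inter] = combined.get(inter, 0.0) + product
            else:
                # Partial conflict: give it back to a and b proportionally
                # to the masses that produced it.
                conflict += product
                combined[a] = combined.get(a, 0.0) + wa * product / (wa + wb)
                combined[b] = combined.get(b, 0.0) + wb * product / (wa + wb)
    return combined, conflict


if __name__ == "__main__":
    A, B = frozenset([0]), frozenset([1])
    masses, k = pcr6_two_sources({A: 0.6, B: 0.4}, {A: 0.2, B: 0.8})
    print(masses, k)
```

Because each partial conflict is redistributed only to the two elements involved in it, the combined masses still sum to one and no mass is left on the empty set.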

Decision rule
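The decision on the membership of a pattern to one of the simple classes of $\Theta$ is made using a statistical classification technique. First, the combined beliefs are converted into a probability measure using a probabilistic transformation, called the Dezert-Smarandache probability (DSmP), which maps a belief measure to a subjective probability measure [14], defined as:

$$DSmP_\epsilon(\theta_i) = \sum_{A_j \in 2^\Theta} \frac{\sum\limits_{\substack{A_k \subseteq \theta_i \cap A_j \\ C_M(A_k) = 1}} m_c(A_k) + \epsilon\, C_M(\theta_i \cap A_j)}{\sum\limits_{\substack{A_k \subseteq A_j \\ C_M(A_k) = 1}} m_c(A_k) + \epsilon\, C_M(A_j)}\; m_c(A_j)$$

where $m_c$ is the combined mass, $M$ is Shafer's model for $\Theta$, and $C_M(A_k)$ denotes the DSm cardinal of $A_k$ [12]. Then, the maximum likelihood (ML) test is used for decision making as follows:

$$x \in \theta_i \quad \text{if} \quad DSmP_\epsilon(\theta_i) = \max_{0 \leq j \leq n} DSmP_\epsilon(\theta_j)$$

where $x$ is the test pattern characterized by both descriptors used during the feature generation step, and $\epsilon$ is fixed to 0.001 in the decision measure given by (15).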
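A minimal sketch of this decision step is given below, assuming Shafer's model so that the DSm cardinal of an element is simply the number of singletons it contains. The names `dsmp` and `decide` and the frozenset encoding of the elements are assumptions made for illustration.

```python
def dsmp(masses, n_classes, eps=0.001):
    """DSmP of each singleton under Shafer's model.

    masses: dict mapping frozenset(class indices) -> combined mass.
    The DSm cardinal of an element is taken as its number of singletons.
    """
    probs = []
    for i in range(n_classes):
        total = 0.0
        for element, mass in masses.items():
            if i not in element:
                continue  # empty intersection with theta_i: the term is zero
            numerator = masses.get(frozenset([i]), 0.0) + eps
            denominator = sum(
                masses.get(frozenset([j]), 0.0) for j in element
            ) + eps * len(element)
            total += mass * numerator / denominator
        probs.append(total)
    return probs


def decide(masses, n_classes, eps=0.001):
    """Return the index of the class with the largest DSmP value."""
    probs = dsmp(masses, n_classes, eps)
    return max(range(n_classes), key=probs.__getitem__)


if __name__ == "__main__":
    m = {frozenset([0]): 0.5, frozenset([1]): 0.2, frozenset([0, 1]): 0.3}
    print(dsmp(m, 2), decide(m, 2))
```

The mass held by a compound element is shared among its singletons in proportion to their own masses, with $\epsilon$ keeping the ratio defined when those singleton masses are zero.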

Database description and performance evaluation
To evaluate the effective use of the DSmT for multi-class classification, we consider a case study conducted on the handwritten digit recognition application. For this, we select the well-known US Postal Service (USPS) database, which contains normalized grey-level handwritten digit images of 10 numeral classes extracted from US postal envelopes. All images are segmented and normalized to a size of $16 \times 16$ pixels. There are 7291 training samples and 2007 test samples, some of which are corrupted and difficult to classify correctly (Fig. 3). The partition of the database for each class into training and testing sets is reported in Table 1. To evaluate the performance of the handwritten digit classification, two popular error measures are considered: the Error Rate per Class (ERC) and the Mean Error Rate (MER) over all classes. Both errors are expressed in %.
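The two error measures can be computed as in the following sketch. The text does not spell out how the MER aggregates the per-class errors, so the unweighted mean of the ERC values is an assumption here, as is the function name `error_rates`.

```python
def error_rates(true_labels, predicted_labels, n_classes=10):
    """Return (ERC list in %, MER in %).

    Assumption: MER is the unweighted mean of the per-class error rates.
    A class without samples contributes an error rate of 0.
    """
    errors = [0] * n_classes
    counts = [0] * n_classes
    for truth, predicted in zip(true_labels, predicted_labels):
        counts[truth] += 1
        if predicted != truth:
            errors[truth] += 1
    erc = [100.0 * e / c if c else 0.0 for e, c in zip(errors, counts)]
    mer = sum(erc) / n_classes
    return erc, mer


if __name__ == "__main__":
    print(error_rates([0, 0, 1, 1], [0, 1, 1, 1], n_classes=2))
```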

Pre-processing
The acquired image of an isolated digit should be processed to facilitate feature generation. In our case, the pre-processing module includes a binarization step using Otsu's method [97], which eliminates the homogeneous background of the isolated digit and keeps the foreground information. Thus, we use the processed digit for the recognition process without unifying the image size.
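For reference, Otsu's thresholding can be sketched in pure Python on a flat list of grey levels. This is a generic textbook version rather than the pre-processing code used in the experiments; the function names and the convention that pixels brighter than the threshold are foreground are assumptions.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level that maximizes the between-class variance."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    weight_low = 0
    sum_low = 0.0
    best_threshold, best_variance = 0, -1.0
    for t in range(levels):
        weight_low += hist[t]
        sum_low += t * hist[t]
        if weight_low == 0:
            continue  # no pixel at or below t yet
        weight_high = total - weight_low
        if weight_high == 0:
            break  # every pixel is at or below t
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, t
    return best_threshold


def binarize(pixels, levels=256):
    """Mark as foreground (True) the pixels brighter than the Otsu threshold."""
    threshold = otsu_threshold(pixels, levels)
    return [p > threshold for p in pixels]


if __name__ == "__main__":
    image = [10] * 50 + [200] * 50
    print(otsu_threshold(image), sum(binarize(image)))
```

If the digits are darker than the background, the comparison in `binarize` would simply be inverted.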

Feature Generation
The objective of the feature generation step is to highlight the relevant information that initially exists in the raw data. Thus, an appropriate choice of the descriptor significantly improves the accuracy of the recognition system. In this study, we use a collection of popular feature generation methods, which can be categorized into background features [98,99], foreground features [98,99], geometric features [2], and uniform grid features [100,101].

Validation of SVM Models
An SVM model is produced for each class according to the descriptor used. The training dataset is partitioned into two equal subsets of samples, which are used for training and validating each binary SVM, respectively. Thus, the validation phase allows finding the optimal hyperparameters for the ten SVM models. In our case, the RBF kernel is selected for the experiments. Furthermore, both the regularization parameter $C$ and the RBF kernel parameter of each SVM are tuned experimentally during the training phase in such a way that the misclassification error on the training subset is zero and the validation phase gives a minimal error for each SVM separating a simple class from its complementary class.
Table 2 shows an example of the optimal parameters obtained during both the training and validation phases using the UG-SVMs classifier. The parameters $n$ and $m$ define the number of lines (vertical regions) and columns (horizontal regions) of the grid, respectively, which have been optimized during the validation phase for each SVM model. All these parameters are then used during the testing phase. ERCs and ERCc are the Error Rates per Class for the simple and complementary classes, respectively. As we can see, the size of the uniform grid and the hyperparameters of each SVM should be tuned carefully in order to reduce the error. The testing phase is performed using all samples from the test dataset. First, the performance of handwritten digit recognition is evaluated for an appropriate choice of descriptors using the SVM classifiers; then, we evaluate the combination of the SVM classifiers within the DSmT framework.

Comparative analysis of features
The choice of complementary features is an important step to ensure an efficient combination. Indeed, the DSmT-based combination offers an accurate performance when the selected features are complementary. Hence, in this section we evaluate the performance of the features in order to select the best ones for combination through the DSmT. For this, we evaluate each SVM-OAA implementation using Foreground Features (FF), Background Features (BF), Geometric Features (GF), Uniform Grid Features (UGF), and the descriptors obtained by concatenating at least two simple descriptors, such as (BF,FF), (BF,FF,GF) and (UGF,BF,FF,GF). The experiments have shown that an appropriate choice of both the descriptors and the concatenation order used to represent each digit class in the feature generation step provides an interesting error reduction. In Table 3, FF and UGF-based descriptors using SVM classifiers are evaluated. When concatenating the background and foreground (BF,FF) features, we observe a significant reduction of the MER. Indeed, an error rate reduction of 6.71% is obtained when concatenating BF and FF. Furthermore, an error rate reduction of 1.5% is obtained when concatenating BF, FF and GF. This shows that BF, FF and GF are complementary and are suitable for concatenation. In contrast, when concatenating UGF with BF, FF and GF, the MER is increased to 2.73% compared with UGF. This shows that concatenation does not always improve the classification performance. Thus, we expect the UGF and (BF,FF,GF) descriptors to be more suitable for combination through the DSmT.

Performance evaluation of the proposed combination framework
In these experiments, we evaluate handwritten digit recognition based on a combination of SVM classifiers through the DSmT. The proposed combination framework allows exploiting the redundant and complementary nature of the (BF,FF,GF) and UGF-based descriptors and managing the conflict arising from the outputs of the SVM classifiers.
Decision making is done only on the simple classes belonging to the frame of discernment. Hence, in both the combination process and the calculation of the decision measures, we consider the masses associated with all complementary classes $\bar{\theta}_i$, which represent the partial ignorance. In order to appreciate the advantage of combining two sources of information through the DSmT-based algorithm, Figure 4 shows the distribution of the conflict measured for each test sample between both SVM-OAA implementations using the (BF,FF,GF) and UGF-based descriptors for the 10 digit classes $\theta_i$, $i = 0, 1, \ldots, 9$. For an objective evaluation, Table 5 shows the ERC and MER produced by the SVM-OAA implementations using UGF, (BF,FF,GF), and the descriptor resulting from the concatenation of both UGF and (BF,FF,GF) (i.e. combination at the feature level), and finally by the PCR6 combination rule (i.e. combination at the measure level) performed on the (BF,FF,GF) and UGF-based descriptors. Overall, the proposed framework using the PCR6 combination rule is more suitable than the individual SVM-OAA implementations since it provides a MER of 5.43%, compared with the concatenation, which provides a MER of 9.71%. However, when carefully inspecting each class, we note that the PCR6 combination rule keeps or reduces the ERC in most cases, except for the samples belonging to classes $\theta_2$ and $\theta_6$. This poor performance is due to the wrong characterization by both the UGF and (BF,FF,GF)-based descriptors. In other words, the PCR6 combination is not reliable when the complementary information provided by both descriptors is not correctly preserved.
Thus, the PCR6 combination rule correctly manages the conflict generated by the SVM-OAA implementations, even when they provide very small values of the conflict (see Table 4), specifically in the case of samples belonging to $\theta_8$. Thus, the DSmT is appropriate for solving the handwritten digit recognition problem. Indeed, the PCR6 combination rule allows an efficient redistribution of the partial conflicting mass only to the elements involved in the partial conflict. After redistribution, the combined mass is transformed into the DSm probability and the maximum likelihood (ML) test is used for decision making. Finally, the proposed algorithm in the DSmT framework is the most stable across all experiments, whereas the recognition accuracies of both individual SVM classifiers vary significantly.

Conclusion and future work
In this paper, we proposed an effective use of the DSmT for multi-class classification using jointly the SVM-OAA implementation and a supervised model. Exclusive constraints are introduced through a direct estimation technique to compute the belief assignments and reduce the number of focal elements. Therefore, the proposed framework drastically reduces the computational complexity of the combination process for multi-class classification. A case study conducted on handwritten digit recognition shows that the proposed supervised model with the PCR6 rule yields the best performance compared with the individual SVM multi-class classifiers, even when they provide uncalibrated outputs. As a continuation of the present work, the next objective is to use one-class classifiers instead of the OAA implementation of SVMs in order to obtain a fixed number of focal elements within the DSmT combination process. This will allow a feasible computational complexity independent of the number of combined sources and the size of the discernment space.
given source of evidence which can support paradoxical information, as follows:

Fig. 1. Structure of the combination scheme using SVM and DSmT

Fig. 2. DSmT-based parallel combination for multi-class classification

factors introduced in the axiomatic approach in order to respect the mass definition, and $P_b$ are the posterior probabilities issued from the first source ($b = 1$) and the second source ($b = 2$), respectively. They are given for a test pattern $x$ as follows:

$A$ and $B$ are the parameters of the sigmoid function tuned by minimizing the negative log-likelihood during training for each class of patterns $\theta_i$.

1. Define a frame of discernment

assignments provided by two information sources $S_1$ and $S_2$, respectively.

Fig. 3. Some samples with their alleged classes from the USPS database.

Fig. 4. Conflict between both SVM classifiers using (BF,FF,GF) and UGF-based descriptors for the ten digit classes $\theta_i$, $i = 0, 1, \ldots, 9$

which represents the mass of the partial ignorance. The same reasoning is applied to the classes issued

Table 1. Partitioning of the USPS dataset

Table 2. Optimal parameters of the UG-SVMs classifier

Table 3. Mean error rates of the SVM classifiers using different methods of feature generation

Table 4 reports the minimal and maximal values of the conflict, which represent the mass assigned to the empty set after the combination process. As we can see, the conflict is maximal for the digit 4 while it is minimal for the digit 9.

Table 4. Ranges of conflict variations measured between both SVM-OAA implementations using (BF,FF,GF) and UGF-based descriptors

Table 5. Error rates of the proposed framework with PCR6 combination rule using (BF,FF,GF) and UGF descriptors