Breast Cancer Diagnosis Improvement Using Feature Selection

The objective of this research is to improve breast cancer diagnosis performance by applying feature selection methods to several classification algorithms. This study uses the Wisconsin Breast Cancer Dataset (WBCD). Feature selection methods based on Rough Set (RS) and F-score (FS) are applied to six classification algorithms: SMO (Sequential Minimal Optimization), Multi-Layer Perceptron (MLP), Naive-Bayes, C4.5, Instance-Based Learning (IBK) and PART. This study uses 10-fold cross validation as the evaluation method. The results show that feature selection methods can improve diagnosis performance with a smaller number of features.


Introduction
Breast cancer is the most common cancer causing death in women. The frequency of this disease is relatively high in both developed and less developed countries. It is estimated that more than 508,000 women died worldwide in 2011 due to breast cancer [1]. Based on age-standardized world breast cancer statistics, Indonesia had a mortality rate of 18.6 per 100,000 population in 2008 [2].
In line with advances in information technology, particularly in the field of artificial intelligence, machine learning techniques have been introduced to help improve automated detection capability. Machine learning techniques can therefore be used to assist medical experts, and medical data can be analysed in a shorter time and in more detail [1].
Many researchers have used statistical and artificial intelligence techniques to predict breast cancer. The aim of these techniques is to classify patients as benign or malignant [2].
The high dimensionality of medical data is one of the problems in applying machine learning techniques, because it negatively affects the analysis process. To deal with high-dimensional medical data, reducing the number of features becomes very important. The advantages of feature reduction are as follows: (1) it avoids over-fitting, (2) it reduces the complexity of data analysis and (3) it improves data analysis performance [3].
One solution for reducing high-dimensional data is feature selection. Feature selection is part of the pre-processing phase of the classification process and greatly affects classification accuracy. Feature selection has been proposed in many studies to improve the accuracy of breast cancer diagnosis [4][5][6][7].
Chen et al. [8] used Rough Set (RS) for feature selection. RS feature selection identifies significant features and eliminates irrelevant ones to produce a good learning model, thereby reducing the dimensionality of the data. Akay et al. [1] used the F-score (FS) for feature selection. FS is a simple feature selection technique that measures the discrimination between two sets of real numbers: a feature with a low FS value is considered to have low discriminatory capability, while a feature with a high FS value has high discriminative ability.
In the aforementioned works, Akay et al. [1] and Chen et al. [8] tried only one classification algorithm, namely the Support Vector Machine (SVM). Both achieved good accuracy after applying feature selection with SVM, but this is not representative of the general performance of FS and RS feature selection with other classification algorithms. Therefore, in this paper a technique based on FS and RS feature selection is applied to six classification algorithms: SMO (Sequential Minimal Optimization), Multi-Layer Perceptron (MLP), Naive-Bayes, C4.5, Instance-Based Learning (IBK) and PART. The results of the classification algorithms are compared for performance analysis.

Feature selection
The basic concepts of the feature selection methods used in this research are described in the following sub-sections.

F-score
FS is a simple technique that measures the discrimination between two sets of real numbers. Given training vectors $x_k$, $k = 1, \ldots, m$, and the numbers of positive and negative instances $n_+$ and $n_-$, respectively, the FS of the $i$-th feature is defined in (1):

$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2} \qquad (1)$$

where $\bar{x}_i$, $\bar{x}_i^{(+)}$ and $\bar{x}_i^{(-)}$ are the averages of the $i$-th feature over the whole, positive and negative data sets, respectively; $x_{k,i}^{(+)}$ is the $i$-th feature of the $k$-th positive instance and $x_{k,i}^{(-)}$ is the $i$-th feature of the $k$-th negative instance. The numerator indicates the discrimination between the positive and negative sets, and the denominator indicates the discrimination within each of the two sets. A feature with a larger F-score is more discriminative; therefore, this score is used as the feature selection criterion [9].
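As a concrete illustration, the F-score of a single feature can be computed in a few lines of Python. This is a sketch of the standard formula above, not the authors' code, and the sample values below are invented.

```python
def f_score(pos, neg):
    """F-score of one feature, given its values in the positive
    and negative instances (Eq. (1))."""
    n_pos, n_neg = len(pos), len(neg)
    mean_pos = sum(pos) / n_pos
    mean_neg = sum(neg) / n_neg
    mean_all = (sum(pos) + sum(neg)) / (n_pos + n_neg)
    # Numerator: discrimination between the positive and negative sets.
    num = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    # Denominator: variability within each of the two sets.
    den = (sum((x - mean_pos) ** 2 for x in pos) / (n_pos - 1)
           + sum((x - mean_neg) ** 2 for x in neg) / (n_neg - 1))
    return num / den
```

On the WBCD this would be evaluated once per feature, and the features ranked from the highest to the lowest score.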

Rough set
RS theory is an intelligent mathematical tool proposed by Pawlak [12] to deal with uncertainty and incompleteness. It is based on the concepts of the upper and lower approximations of a set, the approximation space and models of sets. The main advantage of RS theory is that it does not need any preliminary or additional information about the data, such as a probability distribution in statistics, a basic probability assignment in Dempster-Shafer theory or a membership grade in fuzzy set theory. One of the major applications of RS theory is attribute reduction, that is, the elimination of attributes. The reduction of attributes is achieved by comparing the equivalence relations generated by sets of attributes. Using the dependency degree as a measure, attributes are removed so that the reduced set provides the same dependency degree as the original [8].
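The dependency degree mentioned above can be sketched in pure Python: the equivalence classes induced by a set of condition attributes are computed, and the fraction of objects whose class is consistent with a single decision value is returned. This is a toy illustration of the RS idea, not the reduct search used in the paper, and the attribute names are invented.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Equivalence classes (as index lists) induced by the attribute subset."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())

def dependency_degree(rows, cond_attrs, dec_attr):
    """Fraction of objects whose condition class lies entirely inside
    one decision class (the positive region)."""
    dec_of = [row[dec_attr] for row in rows]
    pos = 0
    for block in partition(rows, cond_attrs):
        if len({dec_of[i] for i in block}) == 1:  # consistent block
            pos += len(block)
    return pos / len(rows)
```

An attribute is dispensable when removing it leaves the dependency degree unchanged; a reduct is a minimal subset preserving the degree of the full attribute set.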

SMO
SMO is an algorithm that breaks the large quadratic programming (QP) optimization problem of SVM training into a series of small QP problems. Because SMO solves these small QP problems analytically, it avoids using a time-consuming numerical QP optimization as an inner loop [13].
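For readers outside Weka, scikit-learn's SVC is trained with an SMO-type decomposition solver (libsvm), so it can serve as a rough stand-in for the SMO classifier. In this sketch, sklearn's bundled breast-cancer data substitutes for the WBCD, and the scaling and hold-out split are choices of the sketch, not the paper's setup.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Bundled breast-cancer data as a stand-in for the WBCD.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# RBF-kernel SVM; libsvm trains it with an SMO-style inner loop.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(round(acc, 3))
```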

MLP
The MLP algorithm adopts the workings of human neural networks. It is well known because its learning process can be directed (supervised). It consists of an input layer, an output layer and hidden layers. Learning is done using backpropagation. Determining the optimal weights leads to proper classification results [14].
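A hedged scikit-learn stand-in for the MLP classifier is shown below. It echoes the hidden-layer size (15 neurons) and 500 training cycles given later in "Parameter setting", while the bundled dataset, scaling and split are this sketch's own choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the WBCD
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One hidden layer of 15 neurons, trained by backpropagation for up to
# 500 cycles; feature scaling helps the gradient-based training converge.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(15,), max_iter=500, random_state=0),
)
mlp.fit(X_tr, y_tr)
acc = mlp.score(X_te, y_te)
print(round(acc, 3))
```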

Naive Bayes
The Naive-Bayes classifier provides a simple approach, with clear semantics, to representing, using and learning probabilistic knowledge. The method is designed for supervised induction tasks, in which the training data include class information and the performance goal is accuracy in predicting the class of a test sample [15].
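A minimal Gaussian Naive-Bayes sketch follows (scikit-learn's GaussianNB as a stand-in for Weka's Naive-Bayes; the one-dimensional toy data are invented for illustration).

```python
from sklearn.naive_bayes import GaussianNB

# Two well-separated one-dimensional classes (values invented).
X = [[1.0], [1.2], [0.9], [5.0], [5.3], [4.8]]
y = [0, 0, 0, 1, 1, 1]

# Fits a per-class Gaussian to each feature and predicts by Bayes' rule.
nb = GaussianNB().fit(X, y)
pred = nb.predict([[1.1], [5.1]])
print(pred)
```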

C4.5
C4.5 is an algorithm for constructing decision trees and is among the best known and most widely used of all machine learning methods. It uses the concept of information entropy: decisions are made by splitting the data on each attribute into smaller subsets, examining the entropy differences, and choosing the attribute with the highest normalized information gain. The splitting stops when all instances in a subset belong to the same class, at which point a leaf node is created. Otherwise, J48 (the Weka implementation of C4.5) creates a decision node higher up the tree based on the expected class value [16].
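scikit-learn has no C4.5/J48 implementation, but its decision tree with criterion="entropy" selects splits by the same information-gain idea, so it works as a hedged stand-in. The two-feature toy data below are invented.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data (invented), perfectly separable on the first feature.
X = [[2, 1], [3, 1], [8, 0], [9, 0]]
y = ["benign", "benign", "malignant", "malignant"]

# criterion="entropy" chooses splits by information gain, as C4.5 does.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
pred = tree.predict([[2, 1], [9, 0]])
print(pred)
```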

IBK
IBK classifies instances based on their similarity. It is one of the most popular algorithms for pattern recognition. It is a kind of lazy learning, where the function is only approximated locally and all computation is deferred until the classification process. An object is classified by a majority vote of its neighbours, where k is a positive integer. A set of objects for which the correct classification is known is selected as the neighbours [17].
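IBK is Weka's k-nearest-neighbour classifier, so scikit-learn's KNeighborsClassifier can stand in for it; this sketch mirrors the k = 1 and linear-search settings given later, on invented toy data.

```python
from sklearn.neighbors import KNeighborsClassifier

# Two toy clusters (invented data).
X = [[1, 1], [1, 2], [6, 6], [7, 7]]
y = [0, 0, 1, 1]

# k = 1 with brute-force (linear) neighbour search, echoing the IBK settings.
knn = KNeighborsClassifier(n_neighbors=1, algorithm="brute").fit(X, y)
pred = knn.predict([[0, 0], [8, 8]])  # each query takes its nearest neighbour's class
print(pred)
```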

PART
PART is a simple method for rule induction. It adopts the separate-and-conquer strategy: it builds a rule, removes the instances the rule covers, and continues creating rules recursively for the remaining instances until none are left [18].
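The separate-and-conquer loop that PART is built on can be sketched with a deliberately simplified rule learner: each rule is a single (feature, value) test chosen for precision, covered instances are removed, and the loop repeats. This is a toy illustration of the strategy, not Frank and Witten's actual algorithm.

```python
from collections import Counter

def learn_rules(rows, labels):
    """Greedy separate-and-conquer: pick the most precise single-condition
    rule, remove the rows it covers, repeat until no rows remain."""
    rules = []
    rows, labels = list(rows), list(labels)
    while rows:
        best = None  # (precision, coverage, feature, value, label)
        for f in range(len(rows[0])):
            for v in {r[f] for r in rows}:
                covered = [l for r, l in zip(rows, labels) if r[f] == v]
                label, hits = Counter(covered).most_common(1)[0]
                cand = (hits / len(covered), len(covered), f, v, label)
                if best is None or cand[:2] > best[:2]:
                    best = cand
        _, _, f, v, label = best
        rules.append((f, v, label))  # "conquer" this region ...
        keep = [(r, l) for r, l in zip(rows, labels) if r[f] != v]
        rows = [r for r, _ in keep]  # ... then "separate" it from the data
        labels = [l for _, l in keep]
    return rules
```

PART itself derives each rule from a partial C4.5 decision tree rather than a single test, but the covering loop is the same.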

Performance evaluation criteria
A confusion matrix is a visualisation tool commonly used to present the accuracy of classifiers. It shows the relationships between actual and predicted classes. Table 1 shows this confusion matrix.
The WBCD is commonly used by researchers who apply machine learning to breast cancer classification. The dataset contains 699 samples taken from needle aspirates of human breast tissue, of which 16 instances have a missing value. Because the missing values occur in a very small number of instances compared to the overall data, those 16 instances were removed, so that 683 instances are used.
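Returning to the evaluation criteria, the accuracy reported later can be read directly off a confusion matrix like the one in Table 1. The sketch below uses scikit-learn with invented outcomes for eight hypothetical patients.

```python
from sklearn.metrics import confusion_matrix

# Invented outcomes for eight hypothetical patients (1 = malignant).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred)  # rows = actual class, columns = predicted
tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / cm.sum()  # correct predictions over all predictions
print(cm)
print(accuracy)
```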
The dataset, shown in Table 2, consists of nine features, each represented as an integer between 1 and 10. For RS feature selection, attribute reduction is performed using a genetic algorithm (GA). From the attribute subsets obtained after reduction, the optimal subset is then selected as the one containing both the attributes with the strongest relevancy and those with the weakest relevancy to the decision attribute [21]. The selected optimum subset is shown in Table 3. For FS feature selection, feature reduction is done by calculating the FS value of each attribute; the attributes are then sorted from the highest to the lowest value, and nine models with different numbers of attributes are composed. The nine attribute subsets are shown in Table 4.

Parameter setting
Several classification algorithms require the setting of specific parameters. The SMO algorithm applied in this research uses the RBF (Radial Basis Function) kernel as the kernel function, so two parameters, C and gamma, must be determined. This study applies a grid search technique to find the optimum values of C and gamma, with the grid space log2 C ∈ {5, 6, 7, ..., 20} and log2 gamma ∈ {-10, -9, ..., 5}.
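The grid search over C and gamma can be sketched with scikit-learn as follows. The exponents-of-two grid is in the spirit of the paper's search space but deliberately narrower so the sketch runs quickly, and the bundled dataset and feature scaling are choices of this sketch, not the paper's setup.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the WBCD

# Powers-of-two grid for C and gamma (narrower than the paper's range).
grid = {
    "svc__C": [2.0**e for e in range(5, 9)],
    "svc__gamma": [2.0**e for e in range(-10, -6)],
}
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Each (C, gamma) pair is scored by 10-fold cross-validation.
search = GridSearchCV(pipe, grid, cv=10).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```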
The multi-layer perceptron applied in this research uses three layers: an input layer (28 neurons), one hidden layer (15 neurons) and an output layer (two neurons). Weight adjustment is done over 500 cycles.
The C4.5 and PART algorithms applied in this research use the standard confidence factor (25%).
The IBK algorithm applied in this research uses 1 as the number of neighbours, and linear search as the nearest-neighbour search algorithm.

Classification
The Weka data mining tool was used to evaluate the performance of SMO, MLP, Naive Bayes, C4.5, IBK and PART on each feature subset generated by the two proposed feature selection methods. The 10-fold cross-validation method was selected for performance evaluation of each classification algorithm. To make sure that the results were not biased, the experiment was run 100 independent times and the average classification accuracies were computed. A scheme of this research can be seen in Figure 1.
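The evaluation protocol above can be sketched with scikit-learn: 10-fold cross-validation is repeated with reshuffled folds and the accuracies are averaged. The paper averages 100 runs; 5 runs, a bundled dataset and a single classifier are used here to keep the sketch fast.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the WBCD

# Repeat 10-fold cross-validation with differently shuffled folds
# and average the per-run mean accuracies.
accs = []
for run in range(5):
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=run)
    accs.append(cross_val_score(GaussianNB(), X, y, cv=cv).mean())
mean_acc = sum(accs) / len(accs)
print(round(mean_acc, 3))
```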

Experiment
Table 7 shows the comparison of the classification performance before and after implementing feature selection. Overall, classification with feature selection has higher accuracy than classification without it, which shows that feature reduction can improve classification performance. The bold values are the highest accuracies. From Table 7 it can be seen that, among the SMO-based methods, RS + SMO gives the best accuracy; among the MLP-based methods, RS + MLP; among the Naive Bayes-based methods, FS + Naive Bayes; among the C4.5-based methods, RS + C4.5; among the IBK-based methods, RS + IBK; and among the PART-based methods, RS + PART.

Conclusion
This research applied two feature selection methods, RS and FS, to several classification algorithms (SMO, MLP, Naive Bayes, C4.5, IBK and PART). The experimental results show that the feature selection process can generally improve the accuracy of all of the classification algorithms. The exception is the IBK-based method: as can be seen from Table 7, that algorithm still gives its best accuracy without the FS feature selection process.