A study of data randomization on a computer based feature selection for diagnosing coronary artery disease

The objective of this research is to investigate the effect of data randomization on computer-based feature selection for diagnosing coronary artery disease. Randomization of the Cleveland dataset was studied because the performance value differs for each experiment. Assuming the performance values follow a Gaussian probability distribution handles the varying performance values produced by randomizing the dataset; the final performance is taken as the mean of all performance values. In this research, computer-based feature selection (CFS), medical-expert-based feature selection (MFS) and the combination of both (MFS+CFS) were also conducted to improve the performance of the classification algorithms. This research also found a characteristic of the Cleveland dataset that differs from previous work; this difference clearly affects the feature selection result and the final performance. In summary, randomizing the dataset and computing the final performance as a mean can generally represent the performance of a classification algorithm.


1. Introduction
Coronary Artery Disease (CAD), sometimes called Coronary Heart Disease (CHD), is one of the most common heart diseases. CAD occurs when the blood flow to the heart muscle through the coronary arteries is blocked by atherosclerosis (fatty deposits) [1]. It has a very high mortality rate; for example, in 2008 an estimated 7.3 million deaths worldwide were caused by CAD [2]. The initial diagnosis usually relies on medical history and physical examination, after which further testing can be done. For further testing, coronary angiography provides the "gold standard" diagnosis of disease in the coronary arteries [3]. The coronary angiography test is preferred by cardiologists to diagnose the presence of CAD with high accuracy, even though it is invasive, risky and expensive [4].
Given these shortcomings, it is necessary to develop a method capable of diagnosing CAD before a coronary angiography test, with the goal of sparing the patient an invasive, risky and expensive diagnostic procedure. This motivates the development of a computer-based method to diagnose the presence of CAD. A computer-based method can provide diagnostic procedures to patients in a way that is non-invasive, safe and less expensive.
Various computer-based methods have been developed to identify heart-related diseases. Neural network [5], fuzzy [6] and data mining [7] methods have been proposed to diagnose CAD. Neural network based methods have advantages in nonlinear prediction, strength in parallel processing and the ability to tolerate faults, but they have weaknesses in the need for large training data, over-fitting, slow convergence and local optima [8]. Fuzzy logic offers reasoning at a higher level by using linguistic information obtained from domain experts, but fuzzy systems lack the ability to learn and cannot adjust to a new environment [9]. Data mining, the process of extracting hidden knowledge from data, offers other advantages: it can reveal patterns and relationships among large amounts of data, within a single dataset or across datasets [10].
In medical diagnosis, data reduction is an important issue. Medical data often contain a large number of features that are irrelevant or redundant, and a relatively small number of cases, which can affect the quality of disease diagnosis [11]. Therefore, a feature selection process can be used to select relevant features in medical data. Feature selection has been proposed in many studies [11][12][13][14][15] to improve accuracy in the diagnosis of CAD.
Nahar et al. [14] performed a computer-based feature selection process, termed computer feature selection (CFS). CFS selects features without regard to medical knowledge, so there is a possibility of discarding medically significant factors. To avoid the loss of medically significant factors, the feature selection process needs to be carried out by medical experts (termed MFS). These significant factors are age, chest pain type, resting blood pressure, cholesterol, fasting blood sugar, resting heart rate, maximum heart rate and exercise-induced angina. For CFS, Nahar et al. used CfsSubsetEval as the attribute selection method (with the BestFirst search strategy) provided by Weka.
There is a difference in the characteristics of the Cleveland dataset between the current research and Nahar et al. [14]: the total number of positive-class instances differs. This difference clearly can affect the final performance. However, there is an important issue that Nahar et al. did not consider: the effect of the randomization process on the data, which can affect the performance of computer-based diagnosis. In this research, the randomization of medical data (the Cleveland dataset) is studied.

2.1. Cleveland dataset
In this research, the feature selection process is applied to the Cleveland dataset, using 14 of its 76 attributes (303 instances in total). Table 1 describes the attributes and their data types [14][16]. The dataset is converted from a multi-class problem into binary-class problems, yielding five datasets with the characteristics given in table 2. From table 2, it can be seen that the total number of positive-class instances equals the total number of instances in the Cleveland dataset (303). This must hold, because the five datasets (H-0, Sick-1, Sick-2, Sick-3 and Sick-4) contain the same instances as the Cleveland dataset; only the class label considered positive differs. From table 3, by contrast, the total of the positive classes is 308 instances, which differs from the total number of instances in the Cleveland dataset.
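The conversion above can be sketched as a one-vs-rest relabelling of the multi-class Cleveland label (0 = healthy, 1-4 = disease levels). The function below is an illustrative sketch, not the tooling used in this research; the tiny label vector is made up for demonstration.

```python
# Sketch: deriving the five binary-class datasets (H-0, Sick-1..Sick-4) from
# the multi-class Cleveland label (0 = healthy, 1-4 = disease levels).
# Each derived dataset keeps all instances; only the class label considered
# "positive" changes (one-vs-rest relabelling).

def to_binary_datasets(labels):
    """labels: list of ints in 0..4. Returns dataset name -> list of 0/1 labels."""
    names = ["H-0", "Sick-1", "Sick-2", "Sick-3", "Sick-4"]
    return {name: [1 if y == positive else 0 for y in labels]
            for positive, name in enumerate(names)}

# Tiny illustrative label vector (not the real 303-instance dataset).
example = [0, 0, 1, 2, 3, 4, 0]
binary = to_binary_datasets(example)
print(binary["H-0"])     # → [1, 1, 0, 0, 0, 0, 1]
print(binary["Sick-2"])  # → [0, 0, 0, 1, 0, 0, 0]
```

Because every instance is positive in exactly one of the five datasets, the positives summed over all five datasets equal the instance count, which is exactly the property table 2 reports (303 positives for 303 instances).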

2.2. Motivated feature selection (MFS)
Motivated feature selection is the process of feature selection performed by medical experts.
Eight medically significant factors are considered in the MFS process: age, chest pain type, resting blood pressure, cholesterol, fasting blood sugar, resting heart rate, maximum heart rate and exercise-induced angina [14].
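Operationally, MFS amounts to a fixed projection onto the eight expert-chosen columns. The sketch below is illustrative; the short attribute names (`cp`, `trestbps`, etc.) follow common Cleveland-dataset naming and are assumptions, not identifiers from this research.

```python
# Sketch: MFS as a fixed column projection. The eight medically significant
# factors listed above are kept; all other attributes are dropped.
# Attribute names follow common Cleveland naming and are illustrative.

MFS_FEATURES = ["age", "cp", "trestbps", "chol", "fbs",
                "restecg", "thalach", "exang"]

def apply_mfs(record):
    """record: dict of attribute name -> value. Returns the MFS projection."""
    return {k: record[k] for k in MFS_FEATURES if k in record}

patient = {"age": 63, "cp": 1, "trestbps": 145, "chol": 233, "fbs": 1,
           "restecg": 2, "thalach": 150, "exang": 0, "oldpeak": 2.3,
           "thal": 6}
print(sorted(apply_mfs(patient)))  # only the eight expert-chosen attributes
```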

2.3. Computer feature selection (CFS)
For CFS, Nahar et al. used CfsSubsetEval as the attribute selection method (with the BestFirst search strategy) provided by Weka. CFS selects features without regard to medical knowledge, so there is a possibility of discarding medically significant factors [14].

2.4. Classifier algorithm
In this research, six well-known classifiers (Naïve Bayes, SMO, IBK, AdaBoostM1, J48 and PART) were used. This is why the Cleveland dataset had to be converted to binary-class form: these algorithms are used as binary classifiers.

2.4.1. Naïve Bayes
Naïve Bayes is a probabilistic classification algorithm based on applying Bayes' theorem with a strong independence assumption. Eqn (1) gives the probability that a data record X has label Cj:

P(Cj | X) = P(X | Cj) P(Cj) / P(X)    (1)

The class label Cj with the largest conditional probability value determines the category of the data record [7].
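A minimal categorical naive Bayes sketch of eqn (1) follows: pick the class Cj maximising P(Cj)·∏ P(x_i | Cj), estimating each probability from counts with Laplace smoothing. This is an illustration of the principle, not Weka's NaiveBayes implementation, and the toy feature/label data are invented.

```python
# Minimal categorical naive Bayes sketch: pick the class Cj maximising
# P(Cj) * prod_i P(x_i | Cj), with Laplace smoothing to avoid zero
# probabilities. Illustrative only, not Weka's implementation.
from collections import Counter, defaultdict

def train_nb(rows, labels):
    classes = Counter(labels)
    counts = defaultdict(int)        # (feature_index, value, class) -> freq
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            counts[(i, v, c)] += 1
    return classes, counts

def predict_nb(model, row):
    classes, counts = model
    n = sum(classes.values())
    best, best_p = None, -1.0
    for c, nc in classes.items():
        p = nc / n                   # prior P(Cj)
        for i, v in enumerate(row):  # likelihoods P(x_i | Cj), binary features
            p *= (counts[(i, v, c)] + 1) / (nc + 2)
        if p > best_p:
            best, best_p = c, p
    return best

# Toy binary features: (chest_pain, exercise_angina) -> CAD label.
X = [(1, 1), (1, 0), (0, 0), (0, 0), (1, 1), (0, 1)]
y = [1, 1, 0, 0, 1, 0]
model = train_nb(X, y)
print(predict_nb(model, (1, 1)))  # → 1
print(predict_nb(model, (0, 0)))  # → 0
```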

2.4.2. SMO
The SMO algorithm has two components: an analytical method to solve for two Lagrange multipliers, and a heuristic for choosing which multipliers to optimize. The algorithm was introduced by John Platt in 1998 at Microsoft Research [17].

2.4.3. IBK
The algorithm finds the group of k objects in the training set that are closest to the test object, and assigns a label based on which class predominates in that group. This addresses the main issue that, in many datasets, one object rarely matches another exactly, and that conflicting information about the class of an object can be obtained from its nearest objects [18].
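The nearest-neighbour vote described above can be sketched as follows. This is an illustrative k-NN with Euclidean distance, not Weka's IBk, and the 2-D toy data are invented.

```python
# Sketch of the IBk idea: classify a test object by majority vote among its
# k nearest training objects (Euclidean distance). Illustrative only.
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    # Squared Euclidean distances to every training object, paired with labels.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), y)
        for row, y in zip(train_X, train_y)
    )
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D data: class 0 clustered near (1, 1), class 1 near (5, 5).
train_X = [(1, 1), (1, 2), (2, 1), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, (5.5, 5.5)))  # → 1
print(knn_predict(train_X, train_y, (0.5, 1.0)))  # → 0
```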

2.4.4. AdaBoostM1
"Boosting" is a general method for improving the performance of any learning algorithm. The boosting can be used to significantly reduce the error of any "weak" learning algorithm that consistently generates classifiers which need only be a little bit better than random guessing [19].

2.4.5. J48
J48 is a classification algorithm that implements the C4.5 algorithm [10]. C4.5 is intended for supervised learning: it learns a mapping from attribute values to a class, which can then be applied to classify new (unseen) instances [18].
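C4.5 chooses the attribute to split on by gain ratio (information gain divided by split information), which penalises attributes with many values. The worked computation below is a sketch under that standard criterion; the chest-pain toy sample is invented.

```python
# Sketch of the attribute-selection criterion behind C4.5 (hence J48):
# gain ratio = information gain / split information, computed for one
# categorical attribute against a binary class. Illustrative values only.
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    n = len(labels)
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - cond
    split_info = entropy(values)  # entropy of the attribute's own partition
    return gain / split_info if split_info else 0.0

# Chest-pain type (4 values) vs. a binary CAD class on a toy sample.
cp  = [1, 1, 2, 2, 3, 3, 4, 4]
cad = [0, 0, 0, 1, 1, 1, 1, 1]
print(round(gain_ratio(cp, cad), 3))  # → 0.352
```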

2.4.6. PART
The PART algorithm builds a tree using C4.5's heuristics with the same user-specified parameters as J48. The classification rules are derived from partial decision trees, where a partial decision tree is a decision tree that contains branches to undefined sub-trees [10].

2.5. Research design
The CAD diagnosis process is performed by classifying the Cleveland dataset with the six well-known algorithms. To compare the classification algorithms, four performance metrics were used: accuracy, true positive rate, f-measure and training time.
The process of randomizing the instances in the Cleveland dataset can affect the performance of computer-based diagnosis: every randomization yields a different performance value. Assume the performance values come from a population with mean µ and finite variance σ², and let X̄ be the mean of a random sample of size n taken from that population; then the limiting form of the distribution of Z = (X̄ − µ)/(σ/√n) as n → ∞ is the standard normal distribution [20].
Therefore, assuming the performance values have a Gaussian probability distribution is a way to handle the different performance values produced by randomizing the dataset. In this research, the minimum sample size is 100: randomizing the instances 100 times gives 100 different performance values, and the final performance is taken as the mean of those values. For each randomization, the performance of an algorithm is obtained in two ways. First, by applying 10-fold cross-validation to the Cleveland dataset. Second, by applying a train-test split to the dataset and then using 10-fold cross-validation to choose the best parameters during training. In the train-test split, each dataset was subjected to stratified sampling to select two-thirds of the data for training and the rest for prediction. One of the tools provided by Weka (CVParameter) is used in the train-test split. Figures 1 and 2 describe these processes.
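The evaluation protocol above can be sketched as follows: randomize the dataset with a fresh seed, evaluate once per seed, and summarise the 100 resulting values by their mean (with a normal-approximation interval, as justified by the central limit theorem). `evaluate_once` is a hypothetical stand-in for one full 10-fold cross-validation run; the simulated accuracies are invented, not results from this research.

```python
# Sketch of the evaluation protocol: 100 randomizations -> 100 performance
# values -> final performance = mean, with a normal-based 95% interval.
import random
import statistics

def evaluate_once(data, seed):
    """Placeholder for one full CV evaluation on a freshly shuffled dataset.
    Here it just simulates an accuracy that varies with the shuffle."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)                   # the randomization step under study
    return 0.80 + rng.uniform(-0.03, 0.03)  # simulated accuracy

data = list(range(303))                     # stand-in for the 303 instances
scores = [evaluate_once(data, seed) for seed in range(100)]

mean = statistics.mean(scores)
sd = statistics.stdev(scores)
half = 1.96 * sd / len(scores) ** 0.5       # 95% interval for the mean, n = 100
print(f"final accuracy = {mean:.3f} +/- {half:.3f}")
```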

3.1. Computer feature selection
The feature selection results obtained from CFS can be seen in table 4; the features selected by CFS differ for each dataset.

Table 4: Features selected by CFS for each dataset.
H-0: chest pain, resting ECG, maximum heart rate, exercise induced angina, oldpeak, number of vessels coloured, thal
Sick-1: sex, chest pain, fasting blood sugar, resting ECG, exercise induced angina, thal
Sick-2: chest pain, fasting blood sugar, maximum heart rate, exercise induced angina, oldpeak, number of vessels coloured, thal
Sick-3: maximum heart rate, exercise induced angina, oldpeak, number of vessels coloured, thal
Sick-4: resting ECG, oldpeak, number of vessels coloured

As table 4 shows, CFS does not always select the features considered medically significant by MFS. Therefore, to avoid leaving out medically significant factors, it is necessary to combine MFS and CFS. Table 5 shows the final performance of the classification algorithms on the five datasets; bold values indicate the best algorithm for each dataset. Applying 10-fold cross-validation and CVP 10-fold, SMO is the best algorithm (in terms of accuracy) on datasets Sick-1, Sick-2, Sick-3 and Sick-4, whereas Naïve Bayes is the best on dataset H-0. In terms of true positive rate and f-measure, Naïve Bayes is the best algorithm on datasets Sick-2, Sick-3 and Sick-4. Table 6 compares the final performance of the classification algorithms (accuracy, true positive rate and f-measure) before and after feature selection; bold values indicate the best algorithm for each dataset. With CFS applied, the accuracy of CFS is better than MFS for all algorithms on dataset H-0. On Sick-1, CFS is better than MFS in three cases (SMO, J48 and PART). On Sick-2 and Sick-3, CFS is better than MFS in one case (PART). On Sick-4, CFS is better than MFS in two cases (Naïve Bayes and PART).
Table 6 also shows the performance results for CVP 10-fold without feature selection; highlighted values indicate where the accuracy of MFS or CFS is better than CVP 10-fold. For H-0, CFS is better than CVP 10-fold in four cases (Naïve Bayes, SMO, AdaBoostM1 and J48). For Sick-1, CFS is better in five cases (Naïve Bayes, IBK, AdaBoostM1, J48 and PART) and MFS in four cases (Naïve Bayes, IBK, AdaBoostM1 and J48). For Sick-2, both CFS and MFS are better in three cases (Naïve Bayes, AdaBoostM1 and PART). For Sick-3, CFS is better in three cases (Naïve Bayes, AdaBoostM1 and PART) and MFS in three cases (Naïve Bayes, AdaBoostM1 and J48). For Sick-4, both CFS and MFS are better in three cases (Naïve Bayes, J48 and PART).

4. Conclusions and future work
The difference in the characteristics of the Cleveland dataset between the current research and previous work clearly affects the feature selection result and the final performance: the total number of positive-class instances must equal the total number of instances in the Cleveland dataset (303). Assuming a Gaussian probability distribution handles the different performance values produced by randomizing the dataset; randomizing the dataset and then computing the final performance as a mean can generally represent the performance of a classification algorithm. Table 5 shows that CVP 10-fold improves accuracy over 10-fold cross-validation on datasets H-0 (Naïve Bayes and SMO), Sick-1 (Naïve Bayes and IBK), Sick-2 (Naïve Bayes, SMO, J48 and PART), Sick-3 (Naïve Bayes, SMO, IBK, J48 and PART) and Sick-4 (Naïve Bayes, AdaBoostM1 and PART). The analysis of the final performance results shows that the feature selection processes (CFS and MFS) improve accuracy in some cases over applying CVP 10-fold alone (without feature selection). To improve computer-based feature selection further, the combination of MFS and CFS can be proposed: from table 7, the combined MFS+CFS method improves accuracy in some cases on datasets H-0, Sick-1, Sick-2 and Sick-3 over applying MFS alone. For CFS, this research used only one attribute selection method (CfsSubsetEval), so the results do not generally represent the CFS process. In future work, modifying the CFS method with other attribute selection methods is recommended to improve the performance of diagnosing coronary artery disease. The modified CFS can also be combined with MFS to give medical experts confidence in the diagnosis result.