Improve The Performance of K-means by using Genetic Algorithm for Classification Heart Attack

ABSTRACT


INTRODUCTION
The tremendous progress of computer science, and its success in a wide range of applications, has made the computer more than a calculating machine. This has strongly motivated scientists to develop technologies that exploit the computer's capabilities to perform useful functions, solve problems, and make many aspects of human life easier. Many techniques have emerged as a result, including expert systems, networks and classification algorithms of various types [3].
Disease classification is a prominent goal of artificial intelligence research, which has sought to support the medical field by providing doctors, centers and hospitals with diagnostic systems. Such systems help improve the accuracy of decisions about a case and reduce diagnostic errors caused by lack of experience or by stress, and they provide detailed medical data about a test in record time [6][7][8].
Heart attack is one of the dangerous diseases that threaten human life: the World Health Organization (WHO) reports that 12 million people die each year from heart disease [1]. Because of the severity of the disease, computer specialists have over many years presented a large body of research aimed at supporting medical institutions and their staff with systems to diagnose it, and research in the field is still ongoing [5]. Researchers rely on a widely used database known as Statlog, which is used in research on heart attack classification to measure the strength of a proposed method. It can be obtained from the UCI data repository. Each row in the database corresponds to one patient. The database contains 270 cases (patients), and 13 properties are stored for each person: age, sex, chest pain type, blood pressure, cholesterol, blood sugar, electrocardiographic results, maximum heart rate and other properties. Property 14 represents the final diagnosis: its value is 1 for an infected person and 0 for a healthy person. Table 1 summarizes the most important and most recent research that classified this disease using the Statlog database; the entries in the table are sorted by year of publication.

PROPOSED METHOD
Automated disease classification is one of the most important applications of computers in health institutions. This study uses the k-means method to classify heart attack and then proposes improving its performance by using a genetic algorithm to reduce the number of properties, deleting the insignificant ones.

Classify Database using (K-Means)
K-means was first selected to classify the database according to the following steps:
Algorithm (k-means) to classify heart attack
Input: the Statlog database.
Output: accuracy of classification.
Steps:
1. Set the number of clusters to 2.
2. Choose two of the 270 rows at random as the initial centers of the two clusters, provided that one of the cases is classified as (0) and the other as (1).
3. Allocate each case to the appropriate cluster by calculating the Euclidean distance between the case and the centers.
4. Increment the counter of correctly classified cases (z) if the k-means classification of the case is identical to its original category in the database.
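The steps above can be sketched in Python as follows. This is only an illustrative sketch, not the authors' Matlab program: the center-update loop is the standard k-means update and is assumed here, since the listed steps do not spell it out, and the accuracy formula (z divided by the number of cases, as a percentage) is an assumed reading of Equation (1). The function names and the toy data are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans_accuracy(rows, labels, seed=0, max_iter=100):
    """Two-cluster k-means seeded with one row per class; returns accuracy in percent."""
    rng = random.Random(seed)
    # Step 2: one random initial center per class, so cluster index == class label.
    centers = [
        list(rows[rng.choice([i for i, y in enumerate(labels) if y == c])])
        for c in (0, 1)
    ]
    assigned = None
    for _ in range(max_iter):
        # Step 3: allocate every case to the nearest center.
        new_assigned = [
            0 if euclidean(r, centers[0]) <= euclidean(r, centers[1]) else 1
            for r in rows
        ]
        if new_assigned == assigned:
            break  # assignments are stable
        assigned = new_assigned
        # Assumed standard update: move each center to the mean of its members.
        for c in (0, 1):
            members = [r for r, a in zip(rows, assigned) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    # Step 4: z counts the cases whose cluster matches the stored diagnosis.
    z = sum(1 for a, y in zip(assigned, labels) if a == y)
    return 100.0 * z / len(rows)


# Hypothetical toy data: two well-separated groups with two features each.
toy_rows = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
toy_labels = [0, 0, 0, 1, 1, 1]
print(kmeans_accuracy(toy_rows, toy_labels))  # prints 100.0
```

Seeding each cluster from a row of the matching class is what lets the cluster index be compared directly with the stored diagnosis in step 4.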

Improved Performance of K-Means by the Genetic Algorithm
The accuracy of a classification system depends strongly on the properties it uses. Some of these properties are unnecessary and may degrade the system, so it is best to delete them. Because it is difficult to determine which properties negatively affect the performance of the system, this task was assigned to the genetic algorithm.
The genetic algorithm suggests the best properties for k-means to rely on during classification, using genetic operations to create generations of chromosomes. The proposed properties are derived from a chromosome, which is evaluated by running k-means and calculating the accuracy of the system. After producing several generations, the algorithm ends by choosing the chromosome whose properties raise the accuracy of the system to the highest level found. The details of the proposed method are illustrated in the following steps:
Step one: "Constructing a genetic foundation"
This phase includes three sub-steps:
1) "Specify genetic algorithm coefficients"
The database is stored in an Excel file. The file is converted into a two-dimensional matrix containing 270 rows and 14 columns, for ease of use and to prevent any errors in or changes to these values. This step specifies some of the parameters that the genetic algorithm needs, as follows:
1. Length of chromosome = number of features in the database = 13 (each gene of the chromosome is assigned to one property in the database; feature No. 14 is excluded because it is the ideal output against which the system outputs are compared).
It is worth mentioning that the values of all the above parameters are left to the designer of the algorithm to set through experiment, except the chromosome length, which is constant because it depends on the number of properties in the database.
2) "Generate primary society"
The initial society is generated randomly according to the parameters specified in the previous step. The output of this step is a generation containing 50 chromosomes. The genes of a chromosome take binary values (0, 1). If the value of a gene is 0, the corresponding feature is neglected and considered unnecessary. If the value of a gene is 1, the feature is important and is taken into account as one of the features on which k-means bases the classification. For example, assume the genetic algorithm generated the next chromosome:
In order to measure the quality of the features proposed by the genetic algorithm, the k-means algorithm described in paragraph (3-1) is applied as if the database contained only those features, and the other (unimportant) features are disregarded. In other words, for each chromosome in the generation a k-means function is called for its evaluation, so the fitness value of the chromosome is the classification accuracy calculated by k-means, as given by Equation (1).
Step two: "Create generations through genetic operations"
The genetic algorithm does not stop at the primary generation but continues to generate further generations, simulating the way humans produce successive generations to sustain life. The process of creating a family in human societies begins with the choice of two individuals. This choice is often made randomly, and children are then born after marriage. These children may carry genetic mutations, which add diversity to society. This is exactly what the genetic algorithm does when generating further generations: selection, crossover, and a probability check for mutation. The methods used to carry out the genetic operations in this research are:
a. Selection is performed using the binary set method.
b. Crossover is performed using the uniform mating method.
c. The mutation is implemented in (2m).
As with the individuals of the primary society, the chromosomes of new generations are evaluated by calling the classifier (k-means) and calculating the classification accuracy, as described in step 2. The genetic algorithm continues to generate societies until the number of generations reaches 60, which is the stopping condition adopted in this research.
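As a small illustration of the 270 x 14 matrix layout described above, the following Python sketch separates the 13 feature columns from the diagnosis in column 14. The function name and the two sample rows are hypothetical; the values are made up for illustration and are not taken from the Statlog database.

```python
def split_features_and_label(matrix):
    """Split each 14-column row into its 13 features and the diagnosis in column 14."""
    features = [row[:13] for row in matrix]
    labels = [int(row[13]) for row in matrix]
    return features, labels


# Two hypothetical rows: 13 feature values followed by the diagnosis (0 or 1).
sample_matrix = [
    [float(v) for v in range(13)] + [1.0],
    [float(v) for v in range(13, 26)] + [0.0],
]
features, labels = split_features_and_label(sample_matrix)
print(len(features[0]), labels)  # prints 13 [1, 0]
```

Keeping the label out of the feature matrix matches the chromosome length of 13, since property 14 is only used to check the classifier's output.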
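The way a chromosome selects features and receives its fitness can be sketched as below. This is a minimal sketch under stated assumptions: the function names are hypothetical, the classifier is passed in as a callable standing in for the k-means routine, and scoring an all-zero chromosome as 0 is an assumption the text does not address.

```python
import random

N_FEATURES = 13  # chromosome length, from the text
POP_SIZE = 50    # size of the initial generation, from the text


def random_population(rng, size=POP_SIZE, length=N_FEATURES):
    """Generate `size` random binary chromosomes with one gene per feature."""
    return [[rng.randint(0, 1) for _ in range(length)] for _ in range(size)]


def apply_mask(rows, chromosome):
    """Keep only the columns whose gene is 1; columns with gene 0 are dropped."""
    return [[v for v, g in zip(row, chromosome) if g == 1] for row in rows]


def fitness(chromosome, rows, labels, classifier_accuracy):
    """Fitness is the accuracy the classifier reaches on the selected features."""
    if not any(chromosome):
        return 0.0  # assumption: a chromosome selecting no feature scores 0
    return classifier_accuracy(apply_mask(rows, chromosome), labels)


population = random_population(random.Random(1))
print(len(population), len(population[0]))          # prints 50 13
print(apply_mask([[10, 20, 30]], [1, 0, 1]))        # prints [[10, 30]]
```

Passing the classifier as an argument keeps the genetic part independent of how k-means itself is implemented.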
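The three genetic operations can be sketched as follows. Several points here are assumptions rather than details confirmed by the text: the selection method is read as a binary tournament (two random candidates, the fitter one wins), the mating method is read as standard uniform crossover, and because the mutation setting "(2m)" is not explained, mutation is shown as a per-gene bit flip with a hypothetical `rate` parameter.

```python
import random


def binary_tournament(population, scores, rng):
    """Pick two distinct chromosomes at random and return the fitter one."""
    i, j = rng.sample(range(len(population)), 2)
    return population[i] if scores[i] >= scores[j] else population[j]


def uniform_crossover(parent_a, parent_b, rng):
    """Swap each gene position between the two children with probability 0.5."""
    child_a, child_b = list(parent_a), list(parent_b)
    for k in range(len(child_a)):
        if rng.random() < 0.5:
            child_a[k], child_b[k] = child_b[k], child_a[k]
    return child_a, child_b


def mutate(chromosome, rate, rng):
    """Flip each gene independently with probability `rate` (hypothetical parameter)."""
    return [1 - g if rng.random() < rate else g for g in chromosome]


rng = random.Random(0)
child_a, child_b = uniform_crossover([0] * 13, [1] * 13, rng)
print(child_a, child_b)
```

With complementary parents, as in the usage lines, every gene position of the two children is also complementary, which makes the effect of uniform crossover easy to see.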
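The overall generational loop with its stopping condition can be sketched as below. This is an illustrative sketch, not the authors' program: the fitness function is passed in as a callable (in the paper it would be the k-means accuracy), the toy fitness and its target mask are hypothetical, the mutation rate of 0.05 is a hypothetical value, the selection and crossover variants are assumed readings of the methods named in the text, and counting the initial society as the first of the 60 generations is also an assumption.

```python
import random


def evolve(fitness, length=13, pop_size=50, generations=60,
           mutation_rate=0.05, seed=0):
    """Run the generational loop and return the best chromosome and its score."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best, best_score = None, float("-inf")
    for generation in range(generations):
        scores = [fitness(c) for c in population]
        # Remember the best chromosome seen in any generation.
        for c, s in zip(population, scores):
            if s > best_score:
                best, best_score = list(c), s
        if generation == generations - 1:
            break  # stopping condition: the generation limit is reached

        def pick():
            # Assumed binary tournament: the fitter of two random candidates.
            i, j = rng.sample(range(pop_size), 2)
            return population[i] if scores[i] >= scores[j] else population[j]

        next_population = []
        while len(next_population) < pop_size:
            a, b = pick(), pick()
            # Uniform crossover producing one child, then bit-flip mutation.
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_population.append(child)
        population = next_population
    return best, best_score


# Hypothetical toy fitness: number of genes that match a made-up target mask.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1]


def toy_fitness(chromosome):
    return sum(1 for g, t in zip(chromosome, TARGET) if g == t)


best, score = evolve(toy_fitness, pop_size=20, generations=10)
print(best, score)
```

Tracking the best chromosome across all generations, rather than only in the last one, reflects the text's statement that the algorithm ends by choosing the chromosome with the highest accuracy.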

Results of the Proposed Method and System Performance Analysis
The proposed method was programmed using Matlab version (R2011a). Figure 1 shows the system interface. The interface is designed to compare the performance of k-means alone with that of the proposed method, which improves the k-means classifier by adding the genetic algorithm to select important and useful properties, through the following points:
a. It displays the accuracy of both systems (k-means and the improved method), calculated by applying Equation (1).
b. It displays the number of healthy cases (not patients) that the two systems classified correctly.
c. It displays the number of infected cases (patients) that the two systems classified correctly.
d. It displays the final values of the cluster centers.
e. It displays outputs that are unique to each method, such as the primary centers (the row numbers selected for the k-means method) and the important properties discovered by the proposed improved method.
It is clear that the proposed method, which improves k-means using the genetic algorithm, gave better results by removing the properties that are insignificant for classification; the presence of such properties reduces the accuracy of the system. The improved method therefore relied on only six properties, as shown in Figure 1. As a result, the system's ability to distinguish healthy cases from cases of the disease increased, which in turn increased the accuracy of the classification. Table 2 summarizes the results of the system. Figure 2 shows the clear difference in accuracy between the normal and hybrid methods. Comparing the proposed system with the research listed in Table 1, we find that it obtained good and acceptable results, as shown in Table 3. Table 4 shows the good performance of the proposed method when compared with research results using the same method. It is noted that the results of the program outperform the results of the research referred to in Table 4.
It is worth mentioning that the k-means method used in this research surpassed the one used by the researcher Shadi Abu Delafah in research published in 2013: his method classified with 62% accuracy, while k-means in this study classified the disease with up to 68% accuracy.
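The quantities the interface reports for each system (overall accuracy, correctly classified healthy cases and correctly classified patient cases) can be computed as in the following sketch. The function name and the example predictions are hypothetical, and the accuracy formula is an assumed reading of Equation (1) as the percentage of correctly classified cases.

```python
def summarize(predicted, actual):
    """Count correct healthy (0) and patient (1) cases and the overall accuracy."""
    healthy_correct = sum(1 for p, a in zip(predicted, actual) if p == a == 0)
    patient_correct = sum(1 for p, a in zip(predicted, actual) if p == a == 1)
    accuracy = 100.0 * (healthy_correct + patient_correct) / len(actual)
    return {
        "healthy_correct": healthy_correct,
        "patient_correct": patient_correct,
        "accuracy": accuracy,
    }


# Hypothetical predictions for five cases (not results from the paper).
report = summarize(predicted=[0, 0, 1, 1, 0], actual=[0, 1, 1, 1, 0])
print(report)  # prints {'healthy_correct': 2, 'patient_correct': 2, 'accuracy': 80.0}
```

Reporting the two classes separately shows whether a gain in accuracy comes from recognizing healthy cases, patient cases, or both.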

CONCLUSIONS
This research discusses the classification of the internationally known database (Statlog), which relates to heart attack, using the (K-means) method. The accuracy of classification based on this method was 68%. The genetic algorithm was then added to strengthen the performance of k-means by reducing the properties used during classification, and it proved instrumental in raising the accuracy of the system from 68% to 84%. The system is designed to classify database cases automatically, making intelligent use of the computer's capabilities without resorting to specialized expertise, and the results of this work were compared with the results of the previous works listed in Table 1.