Acute Coronary Syndrome Prediction Using Data Mining Techniques-An Application

Abstract: In this paper we use data mining techniques to investigate factors that contribute significantly to the risk of acute coronary syndrome. We take diagnosis as the dependent variable, with dichotomous values indicating the presence or absence of disease. We have applied binary logistic regression to the factors affecting the dependent variable. The data set has been taken from two different cardiac hospitals of Karachi, Pakistan. There are sixteen variables in total, of which one is taken as dependent and the other 15 are independent variables. To improve the performance of the regression model in predicting acute coronary syndrome, data reduction techniques such as principal component analysis are applied. Based on the results of data reduction, we have considered only 14 of the sixteen factors.


I. INTRODUCTION
One of the major causes of death worldwide is cardiovascular disease (CVD). Acute coronary syndrome (ACS) is considered one of the most common forms of heart syndrome. The term ACS is used to cover any group of clinical symptoms compatible with acute myocardial ischemia. Chest pain that occurs due to insufficient blood supply to the heart muscle is called 'Acute Myocardial Infarction', which results from coronary artery disease (also called coronary heart disease) [1]. The rupture of an atherosclerotic plaque is the cause of acute coronary syndrome [2].
Acute coronary syndrome (ACS) includes three acute manifestations of ischemic heart disease [3]:
1) Unstable angina (UA)
2) Non-ST elevation MI (NSTEMI)
3) Sudden cardiac death
ECG changes include ST segment depression or T wave flattening. In unstable angina cardiac enzymes are not elevated, while in non-ST elevation MI (NSTEMI) cardiac enzymes become elevated. In cardiac hospitals, most patients are diagnosed with ACS, which may be a sign of left ventricular dysfunction during pain [3].
Fig. 1 Classification of Acute Coronary Syndrome [4]
In Pakistan, the prevalence of ACS is increasing rapidly. For example, 414 patients were admitted to the National Institute of Cardiovascular Diseases in September 2000, of whom 71.25% were male. Around 72.92% of the patients were in the fifth decade of life. The most common presentation was acute coronary syndrome (ACS), present in 39.8% of the patients. Similarly, a total of 446 patients were admitted in September 2005, of whom 63% were male. Of these, 71.29% were in the fifth, sixth, and seventh decades of life. Patients admitted with acute coronary syndromes (ACS) accounted for around 43.04% [5]. Thus, there is a need to explore the factors responsible for increasing the risk of ACS in order to reduce the prevalence of this syndrome.
Data for this research were collected from two different cardiac hospitals in Karachi in the year 2008. There were 319 observations in the data set. The data set comprises one dependent variable (Diagnosis) and sixteen independent variables, as given in Table I.
In this research paper we have used a logistic regression model to investigate factors that contribute significantly to the risk of ACS. To analyze this problem, we observe whether a person has or does not have ACS. The paper is organized as follows. In Section II we give a brief introduction to data mining techniques. Section III discusses the models and methods involved in logistic regression. In Section IV, we present experimental results of logistic regression. Section V concludes the paper with future studies. In the following, we review some of the literature on applications of data mining and intelligent systems in acute coronary syndrome (ACS).
Lavesson et al. [6] applied several data mining techniques to predict the severity of an ACS based on electrocardiograms. Only two classes, unstable angina (UA) and myocardial infarction (MI), were assumed as values of the dependent variable. Based on 28 features, they evaluated different types of feature selection techniques and applied a supervised neural network as the prediction model. McCullough et al. [7] used neural networks to examine 13 features with a single target output indicating symptoms of ACS, with separate models for males and females. They also compared the results with receiver operating characteristic (ROC) curves. Kostakis et al. [8] investigated patterns in cardiovascular risk factors among patients and their matched controls. They discussed the application of OLAP-specific procedures to explore hidden pathways associated with risk factors among patients and controls. Rao et al. [9] proposed a probabilistic framework, Reliable Extraction and Meaningful Inference from Non-structured Data (REMIND), that integrates the structured and unstructured clinical data in patient records to automatically create high-quality structured clinical data. REMIND also performs inference with data from multiple sources and enforces consistency between different medical conclusions drawn from the data via a probabilistic reasoning framework. Scott et al. [10] discussed the measurement and quality checking of care for patients, especially those with acute coronary syndromes. Massad et al. [11] reviewed the current state of the art of logic applications in medical diagnosis. Tamil et al. [12] reviewed feature extraction and classification methods for bio-signal processing, concentrating on electrocardiogram (ECG) signal processing. They discussed in depth the discrete wavelet transform for feature extraction and neuro-fuzzy logic for classification. Quteishat and Lim [13] discussed the application of intelligent data mining techniques such as min-max neural networks to medical diagnosis. They collected real medical records from suspected ACS patients and used them for experimentation.

II. INTRODUCTION TO DATA MINING
Finding hidden information and useful patterns in a database is often referred to as data mining. The terms knowledge discovery, information retrieval, deductive learning and exploratory data analysis can be used in place of data mining. Many different algorithms are used in data mining to accomplish different tasks. Data mining is usually partitioned into predictive and descriptive areas, with application-specific changes according to the requirements of the problem. A predictive model makes predictions about data values by using known results from other data, whereas a descriptive model identifies patterns in the data [14].
a. Principal Component Analysis: The dimension of a large data set can be reduced by using principal component analysis, which is considered one of the most popular and useful statistical methods. This method transforms the original data into new dimensions. The new variables are formed by taking linear combinations of the original variables. $Y_1$ is the first principal component, having the largest variance. Because direct computation of the loading matrix B is not possible, the first step in feature transformation is to determine the covariance matrix U [15]. The next step is to calculate the eigenvalues of the covariance matrix U. Finally, a linear transformation from an m-dimensional space to an n-dimensional space (n < m) is defined by the n eigenvectors that correspond to the n largest eigenvalues. The eigenvectors are also called principal axes. In most cases, the first few principal components contain most of the information. The proportion of variance explained by each component indicates how many principal components should be retained from the dataset [15].
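The steps described above (covariance matrix, eigen decomposition, retention by proportion of variance) can be sketched as follows. This is a minimal illustration assuming NumPy; the helper names `pca` and `n_components_for`, the 98% threshold used in the call, and the synthetic data are all hypothetical and are not taken from the study.

```python
import numpy as np

def pca(data):
    """Return eigenvalues (descending), eigenvectors (as columns) and variance proportions."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)        # covariance matrix U
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # reorder so the largest variance comes first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    proportions = eigvals / eigvals.sum()
    return eigvals, eigvecs, proportions

def n_components_for(proportions, threshold):
    """Smallest number of leading components whose cumulative share reaches the threshold."""
    return int(np.searchsorted(np.cumsum(proportions), threshold) + 1)

rng = np.random.default_rng(0)
# Hypothetical data: 200 observations of 4 variables, two of which are nearly redundant.
base = rng.normal(size=(200, 2))
data = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 0] + 0.01 * rng.normal(size=200),
    base[:, 1] + 0.01 * rng.normal(size=200),
])

eigvals, eigvecs, proportions = pca(data)
k = n_components_for(proportions, 0.98)
# Project the centered data onto the k retained principal axes.
scores = (data - data.mean(axis=0)) @ eigvecs[:, :k]
print(k, np.round(proportions, 4))
```

Because two of the four synthetic columns are near copies of the other two, two components are enough to pass the threshold, which mirrors the idea of dropping variables with small eigenvalues.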

III. REGRESSION MODEL
Regression allows future values to be forecast on the basis of past values. The strength of the relationship between two variables can be evaluated by bivariate regression [14]. The general form of the linear regression model is given by: $y = a_0 + a_1 x_1 + a_2 x_2 + \dots + a_m x_m + \epsilon$. Here $\epsilon$ represents the random error, $x_1, \dots, x_m$ are the m input variables, called regressors, and $a_0, a_1, a_2, \dots, a_m$ are constants chosen to match the input samples. Because there is more than one predictor, the model is sometimes referred to as 'multiple linear regression', that is, a regression model in hyper-dimensional space [14]. Data values that are exceptions to the expected data are called outliers. Analysis of outliers and interventions is usually included in the preprocessing step of data mining model building.
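As an illustration of choosing the constants to match the input samples, the following sketch fits a multiple linear regression by ordinary least squares. It assumes NumPy; the function name `fit_linear` and the noise-free toy data are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares estimates [a_0, a_1, ..., a_m] for y = a_0 + sum_j a_j * x_j."""
    design = np.column_stack([np.ones(len(X)), X])   # leading column of ones for the intercept a_0
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Hypothetical samples with two regressors, generated without noise.
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0], [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1]

coef = fit_linear(X, y)
print(np.round(coef, 6))
```

With noise-free data the fit recovers the generating constants; with real data the same call returns the coefficients that minimize the squared error.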

A. Logistic regression model
Modeling the probability that an event occurs as a function of a linear combination of predictor variables is referred to as the logistic regression model [15]. The logistic regression model can be described as: $\pi(x) = \frac{e^{T}}{1 + e^{T}}$, where $\pi(x)$ represents the expected value of the response variable, e is the base of natural logarithms, and T is: $T = \rho_0 + \sum_{j=1}^{h} \rho_j X_j$, where $\rho_j$ and $X_j$ are the coefficients and predictors, respectively, for the h predictors $j = 1, 2, \dots, h$, and $\rho_0$ is the intercept.
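A minimal sketch of how the logistic model turns a linear predictor into a probability is given below. The function names, the intercept argument and the coefficient values are hypothetical and are not the fitted values of the study.

```python
import math

def linear_predictor(intercept, coefs, x):
    """T = intercept + sum_j coef_j * x_j."""
    return intercept + sum(c * v for c, v in zip(coefs, x))

def logistic_probability(intercept, coefs, x):
    """pi(x) = e^T / (1 + e^T), evaluated in a numerically stable way."""
    t = linear_predictor(intercept, coefs, x)
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

# With all coefficients zero the model cannot discriminate, so the probability is 0.5.
p_zero = logistic_probability(0.0, [0.0, 0.0], [1.0, 2.0])

# Hypothetical coefficients for two predictors.
p = logistic_probability(-1.0, [0.8, 0.05], [1.0, 20.0])
print(p_zero, round(p, 4))
```

A positive coefficient raises T, and therefore the predicted probability, as the predictor increases; a negative coefficient lowers it.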

B. Testing hypothesis about the coefficients
To determine whether a specific predictor is significant, a hypothesis test called the Wald test is performed [16]. The Wald statistic is computed from the estimated coefficient and SE, where SE refers to the standard error of the coefficient as estimated from the data.
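The sketch below shows one common convention for the Wald test, assuming the squared form $W = (\text{coefficient} / SE)^2$ compared against a chi-square distribution with one degree of freedom. That convention, the function name `wald_test` and the numbers used are assumptions for illustration, not values from Table II.

```python
import math

def wald_test(coef, se):
    """Return the Wald statistic (coef / se)^2 and its chi-square(1 df) p-value."""
    w = (coef / se) ** 2
    # For 1 degree of freedom, P(chi2 > w) = P(|Z| > sqrt(w)) = erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(w / 2.0))
    return w, p_value

# Hypothetical coefficient and standard error.
w, p_value = wald_test(0.9, 0.3)
print(round(w, 3), round(p_value, 5))
```

A small p-value leads to rejecting the null hypothesis that the coefficient is zero.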

C. Partial correlation
Partial correlation between each of the independent variables and the dependent variable can be obtained, with values ranging from -1 to +1. The sample partial correlation coefficient estimates the linear relationship between any two variables after removing the effects of the remaining variables [17]. Partial correlation can be defined by the given equation:
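As an illustration, the following sketch computes a partial correlation for the simplest case of a single controlled variable, assuming the standard first-order formula $r_{xy \cdot z} = (r_{xy} - r_{xz} r_{yz}) / \sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}$. The function names and the toy data are hypothetical.

```python
import math

def pearson(a, b):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical data: x and y are both driven by z plus unrelated disturbances.
z = [1, 1, 1, 1, -1, -1, -1, -1]
x = [3, 3, 1, 1, -1, -1, -3, -3]
y = [3, 1, 3, 1, -1, -3, -1, -3]

r_xy = pearson(x, y)
r_xy_given_z = partial_corr(x, y, z)
print(round(r_xy, 3), round(r_xy_given_z, 3))
```

Here x and y look strongly correlated on their own, but the association vanishes once the common driver z is controlled for, which is the situation partial correlation is meant to expose.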

D. Assessing the goodness of fit of the model
In a statistical model, goodness of fit describes how well the model fits a set of observations [16]. Most tests for the goodness of fit of a model are carried out by analyzing the residuals; however, this approach is not suitable for a binary (0-1) outcome variable [17]. The likelihood function $L$ expresses the probability of the observed data [16]. The log-likelihood function can be written as: $\ln L = \sum_{i} \left[ z_i \ln \pi(x_i) + (1 - z_i) \ln\left(1 - \pi(x_i)\right) \right]$, where $z_i$ and $\pi(x_i)$ are the actual outcome and the predicted probability of the event occurring, respectively.
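The log-likelihood can be evaluated directly from the outcomes and the predicted probabilities. The sketch below also compares a fitted model with an intercept-only base model, whose predicted probability for every case is the observed proportion of events. The outcomes and fitted probabilities are hypothetical, and the comparison is shown only as an illustration of how a better fit yields a higher log-likelihood.

```python
import math

def log_likelihood(outcomes, probabilities):
    """Sum over cases of z*ln(p) + (1 - z)*ln(1 - p) for binary outcomes z."""
    return sum(
        z * math.log(p) + (1 - z) * math.log(1 - p)
        for z, p in zip(outcomes, probabilities)
    )

# Hypothetical outcomes and fitted probabilities for four cases.
outcomes = [1, 0, 1, 1]
fitted = [0.8, 0.3, 0.6, 0.9]

# An intercept-only model predicts the observed event rate for every case.
base_rate = sum(outcomes) / len(outcomes)
ll_model = log_likelihood(outcomes, fitted)
ll_base = log_likelihood(outcomes, [base_rate] * len(outcomes))

# Improvement in -2 * log-likelihood of the fitted model over the base model.
improvement = -2.0 * (ll_base - ll_model)
print(round(ll_model, 4), round(ll_base, 4), round(improvement, 4))
```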

IV. RESULTS
As the data suffer from the curse of dimensionality, we first applied a data reduction technique to reduce the dimensions before proceeding to model fitting. After applying principal component analysis to the ten independent numeric variables, we found that the first eight principal components cover more than 98% of the total variability of the continuous data space. Respiratory rate and hemoglobin have small eigenvalues, and thus their influence on the information content of the data set is minimal. We observed that data mining with data reduction resulted in better values of performance indicators such as mean square error and coefficient of determination. After data reduction, the fourteen independent variables are age, gender, smoking, hypertension, family history, diabetes mellitus, fasting blood sugar, random blood sugar, cholesterol, streptokinase, blood pressure (systolic), blood pressure (diastolic), heart rate and pulse rate.
Table II presents the estimates of the logistic regression model. This table gives the coefficients, the standard errors of the coefficients, the Wald statistics, and the significance values of the Wald statistics.

A. Test hypothesis about the coefficients
Table II presents the calculated Wald statistics and their corresponding significance levels, used to test the null hypothesis for possible rejection. The significance value for smoking is 0 (to the reported precision), which indicates that smoking contributes significantly to the risk of ACS. The positive coefficients of BPs, HR and RBS reveal that the risk of ACS increases with increasing values of these factors. Similarly, the negative coefficients of BPd and PR indicate that the risk of disease increases with decreasing values of these factors.

B. Classification of cases
Table III presents the classification of the predicted cases. The results show that 43 individuals not having ACS were correctly predicted by the model, which means that 37.1% of the individuals without ACS were correctly classified. In the same way, 203 individuals were correctly predicted to have ACS, i.e., 88.7% of the individuals with ACS were classified correctly. The off-diagonal entries show the number of individuals that were incorrectly classified: 73 individuals not having ACS were incorrectly classified as having ACS, which can be regarded as a type-I error. In the same way, 23 individuals having ACS were incorrectly classified as not having ACS. Of 319 cases, 69.9% were correctly classified. Based on this analysis, the false positive cases, that is, those who have no ACS but are predicted as having ACS, are not very serious. The most significant issue arises from the false negative cases, that is, those who are diseased but predicted as non-diseased. Based on this discussion, if we focus on the false negative cases, the error rate in the study reduces to 7.21%, with a prediction accuracy of 92.79%.
Due to many internal and external reasons, such as the war against terrorism and internal financial crises, Karachi has become a hub for people migrating from other parts of the country. Therefore, the diversity of patient history, physical and mental health, social status, etc. are variables with a high impact on the analysis of acute coronary syndrome. Thus, for such a noisy and volatile dataset, a model accuracy of 92.79% may be appreciated.
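The quantities discussed above can all be derived from the four cells of a classification table. The following sketch uses hypothetical counts, not the counts of Table III, and the function and key names are likewise illustrative.

```python
def classification_summary(tn, fp, fn, tp):
    """Summary rates from a 2x2 classification table.

    tn: without disease, predicted without   fp: without disease, predicted with
    fn: with disease, predicted without      tp: with disease, predicted with
    """
    total = tn + fp + fn + tp
    return {
        "specificity": tn / (tn + fp),          # share of non-diseased classified correctly
        "sensitivity": tp / (tp + fn),          # share of diseased classified correctly
        "accuracy": (tn + tp) / total,          # share of all cases classified correctly
        "false_negative_share": fn / total,     # missed diseased cases as a share of all cases
    }

# Hypothetical counts for illustration only.
summary = classification_summary(tn=40, fp=60, fn=20, tp=180)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Reporting the false negative share separately makes explicit that it counts only one kind of error, so it is not directly comparable with the overall accuracy.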

V. CONCLUSION AND FUTURE STUDIES
In this paper we have investigated the factors that contribute most to the risk of acute coronary syndrome. We observed that, in comparison with other factors, smoking is the most significant factor. In future, we will extend this work to obtain further improved results using outlier analysis and link analysis (association rule mining). We aim to investigate the effects of diet and of environmental and social fluctuations on acute coronary syndrome. We will also apply fuzzy learning models for further improved prediction of acute coronary syndrome.
World Academy of Science, Engineering and Technology 35 2009
In matrix form, we can write Z = B·Y, where $b_{11}, b_{12}, \dots, b_{pp}$ are called the loading parameters. The new axes are adjusted such that they are orthogonal to each other with maximum information gain.

model with no variable or a base model which contains the intercept only.
In the data set, there were 104 patients without ACS and 215 patients with ACS. The data set is highly volatile and noisy due to the diversity of patients' histories and of their physical, mental, social and economic classes.