A Discriminative Feature Space for Detecting and Recognizing Pathologies of the Vertebral Column

Each year it has become more and more difficult for healthcare providers to determine if a patient has a pathology related to the vertebral column. There is great potential to become more efficient and effective in terms of quality of care provided to patients through the use of automated systems. However, in many cases automated systems misclassify patients and force providers to review more cases than necessary. In this study, we analyzed methods to increase the true positives and lower the false positives while comparing them against state-of-the-art techniques in the biomedical community. We found that by applying the studied techniques of a data-driven model, the benefits to healthcare providers are significant and align with the methodologies and techniques utilized in the current research community.


Introduction
Over the years there has been an increase in machine learning (ML) techniques, such as Random Forest (RF), Boosting (ADA), Logistic Regression (GLM), Decision Trees (RPART), Support Vector Machines (SVM), and Artificial Neural Networks (ANN), applied to many medical fields. A significant reason this has become the case is the limited capacity of human beings to act as diagnostic tools over time. Stress, fatigue, inefficiencies, and lack of knowledge all become barriers to high-quality outcomes.
There have been studies regarding applications of data mining in different fields, namely: biochemistry, genetics, oncology, neurology and EEG analysis. However, literature suggests that there are few comparisons of machine learning algorithms and techniques in medical and biological areas. Of these ML algorithms, the most common approach to develop nonparametric and nonlinear classifications is based on ANNs.
In general, the numerous methods of machine learning that have been applied can be grouped into two sets: knowledge-driven models and data-driven models. The parameters of the knowledge-driven models are estimated based on expert knowledge of detecting and recognizing pathologies of the vertebral column. On the other hand, the parameters of data-driven models are estimated based on quantitative measures of associations between evidential features within the data. The classification models used for pathologies of the vertebral column have been based on SVM.
Studies have shown that ML algorithms are more accurate than statistical techniques, especially when the feature space is more complex or the input datasets are expected to have different statistical distributions [1]. These algorithms have the potential to identify and model the complex non-linear relationships between the features of the biomedical data set collected by Dr. da Mota, namely: pelvic incidence (PI), pelvic tilt (PT), lumbar lordosis angle (LLA), sacral slope (SS), pelvic radius (PR), and grade of spondylolisthesis (GOS).
These methods can handle a large number of evidential features that may be important in detecting abnormalities in the vertebral column. However, increasing the number of input evidential features may lead to increased complexity and larger numbers of model parameters, and in turn the model becomes susceptible to overfitting due to the curse of dimensionality.
This work aims to present medical decision support for those healthcare providers who are working to diagnose pathologies of the vertebral column. This framework is comprised of three subsystems: feature engineering, feature selection, and model selection.

Pathologies of the vertebral column
Vertebrae, intervertebral discs, nerves, muscles, medulla, and joints make up the vertebral column. The essential functions of the vertebral column are as follows: (i) supporting the human body; (ii) protecting the nervous roots and spinal medulla; and (iii) making the body's movement possible [2].
The structure of the intervertebral disc can be injured by one or several small traumas in the column. Various pathologies can cause intense pain, such as disc hernias and spondylolisthesis. Backaches can be the result of complications that arise within this complex system. We briefly characterize the biomechanical attributes that represent each patient in the data set.
Patient characteristics: Dr. Henrique da Mota collected data on 310 patients from sagittal panoramic radiographies of the spine while at the Centre Medico-Chirurgical de Readaptation des Massues, located in Lyon, France [3]. One hundred patients were volunteers who had no pathology in their spines (labeled as 'Normal'). The remaining patients had disc hernia (60 patients) or spondylolisthesis (150 patients).
Decision support for orthopedists can be automated by applying ML algorithms and techniques to real clinical cases that are described by the above biomechanical attributes. In the following, we compare the ML models evaluated in this study. Predictive modeling uses samples of data for which the class is known to generate a model for classifying new observations into classes. We are only interested in two possible outcomes: 'Normal' and 'Abnormal'. Complex datasets make it difficult to avoid misclassifying some observations. However, our goal was to minimize those errors using the receiver operating characteristic (ROC) curve.
Literature suggests using an ordinal data approach for detecting reject regions in combination with SVM. In addition, the misclassification costs are selected as follows: a low cost $C_{low}$ is assigned when classifying an observation as reject, and a high cost $C_{high}$ is assigned when misclassifying. Therefore, $w_r = C_{low}/C_{high}$ is the cost of rejecting (normalized by the cost of erring). The method accounts for both the rejection rate and the misclassification rate [2].

Description of the data
It is useful to understand the basic features of the data in our study. Simple summaries about the sample and the measures, together with graphical analysis, form a solid basis for our quantitative analysis of the vertebral column dataset. We conducted univariate analysis which identifies the distribution, central tendency, and dispersion of the data.
The distribution table includes the 1st and 3rd quartiles, indicating the values below which and above which, respectively, 25% of the observations fall (Table 1, Figure 1).
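The univariate summary described above can be sketched in a few lines. This is a minimal illustration rather than the procedure used in the study: the function name `summarize`, the inclusive quartile method, and the sample readings are all assumptions made for the example.

```python
import statistics

def summarize(values):
    """Return distribution, central tendency, and dispersion for one feature."""
    # Three cut points split the data into quartiles (assumed: inclusive method).
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

# Hypothetical pelvic tilt readings, for illustration only (not study data).
pelvic_tilt = [10.0, 12.5, 15.0, 17.5, 20.0]
summary = summarize(pelvic_tilt)
print(summary)
```

Other quartile conventions exist, so values computed this way may differ slightly from those reported by a given statistics package.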

Correlation:
A correlation analysis provides insights into the independence of the numeric input variables. Modeling often assumes independence, and better models will result when using independent input variables. Below is a table of the correlations between each of the variables (Table 2).
We made use of a hierarchical dendrogram to provide visual clues to the degree of closeness between variables [4]. The hierarchical correlation dendrogram produced here presents a view of the variables of the dataset showing their relationships. The purpose is to efficiently locate groupings of variables that are highly correlated. The length of the lines in the dendrogram provides a visual indication of the degree of correlation. For example, shorter lines indicate more tightly correlated variables (Figure 2).
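To make the link between correlation and dendrogram line length concrete, the following sketch computes Pearson correlations and converts them to a distance of 1 - |r|, so that the most tightly correlated pair is the one a hierarchical clustering would merge first. The distance definition, the helper names, and the toy values are assumptions for illustration; the study does not state which distance or linkage was used.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def closest_pair(features):
    """Return (name_a, name_b, r) for the pair with the smallest 1 - |r|."""
    best = None
    for a, b in combinations(features, 2):
        r = pearson(features[a], features[b])
        distance = 1 - abs(r)  # assumed distance; short distance = short line
        if best is None or distance < best[0]:
            best = (distance, a, b, r)
    return best[1], best[2], best[3]

# Hypothetical values for three attributes (not study data).
toy = {
    "PI": [40, 50, 60, 70, 80],
    "SS": [30, 38, 44, 52, 60],
    "PR": [120, 115, 125, 118, 122],
}
name_a, name_b, r = closest_pair(toy)
print(name_a, name_b, round(r, 3))
```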

The feature engineering and data replication method
We developed a method which we termed Feature Bayes. This method makes use of a probabilistic model together with synthetic data creation. Additionally, the data has been feature engineered and further refined through automated feature selection. In order to maximize prediction accuracy we generated 54 additional features. We define a row vector as $A = [a_1\ a_2\ \dots\ a_6]$ using the original six features from the vertebral column dataset. N is defined as the number of terms.
The features were constructed as follows:
'Trim mean 80%' calculates the mean taken by excluding a percentage of data points from the top and bottom tails of a vector.
'Entropy', from information theory, is the expected value of the information contained in each message received [5] and is generally constructed as $\sum_{n=1}^{6} a_n \log_2 a_n$ (2)
'Range' is known as the area of variation between upper and lower limits and is generally defined as $\max(A) - \min(A)$.
'Standard Deviation of A' is a quantity calculated to indicate the extent of deviation for a group as a whole.
'Cosine of A' was generated to capture the trigonometric function that is equal to the ratio of the side adjacent to an acute angle to the hypotenuse, $\cos A$.
'Sine of A' was generated to capture the trigonometric function that is equal to the ratio of the side opposite a given angle to the hypotenuse, $\sin A$.
'25th Percentile of A' is the value of vector A such that 25% of the relevant population is below that value.
'75th Percentile of A' is the value of vector A such that 75% of the relevant population is below that value.
'80th Percentile of A' is the value of vector A such that 80% of the relevant population is below that value.
'Pelvic Incidence Squared' was used to change the pelvic incidence from a single dimension into an area. Many physical quantities are integrals of some other quantity.
'Sum of pelvic incidence and pelvic tilt', $\sum_{n=1}^{2} a_n$ (15)
'Cubed': for each element of the row vector A we created a cubed value, $a_{i,j}^{3}$ (16)
'Difference of pelvic incidence and pelvic tilt', $a_1 - a_2$ (17)
'Product of pelvic incidence and pelvic tilt'.
'Sum of lumbar lordosis angle and sacral slope'.
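A few of the engineered features listed above can be sketched directly from a single patient's row vector. The feature ordering [PI, PT, LLA, SS, PR, GOS], the dictionary keys, the percentile interpolation method, and the use of the sample standard deviation are assumptions made for this example; only a subset of the 54 features is shown.

```python
import statistics

def engineer_features(a):
    """Derive example features from a = [PI, PT, LLA, SS, PR, GOS]."""
    # 99 cut points; index 24 is the 25th percentile, and so on.
    pct = statistics.quantiles(a, n=100, method="inclusive")
    features = {
        "range": max(a) - min(a),
        "stdev": statistics.stdev(a),
        "p25": pct[24],
        "p75": pct[74],
        "p80": pct[79],
        "pi_squared": a[0] ** 2,
        "pi_plus_pt": a[0] + a[1],
        "pi_minus_pt": a[0] - a[1],
        "pi_times_pt": a[0] * a[1],
        "lla_plus_ss": a[2] + a[3],
    }
    # One cubed value per element of the row vector.
    for j, value in enumerate(a, start=1):
        features[f"a{j}_cubed"] = value ** 3
    return features

# Hypothetical patient row (not study data).
row = [60.0, 20.0, 40.0, 40.0, 120.0, 5.0]
engineered = engineer_features(row)
print(engineered)
```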

Patient data generated with oversampling
The category 'Normal' was significantly underrepresented in the dataset. We employed the synthetic minority oversampling technique (SMOTE) [6]. We applied it to the class value 'Normal', using five nearest neighbors to construct an additional 100 instances.
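The core idea of SMOTE is to create each synthetic instance on the line segment between a minority sample and one of its nearest minority neighbors. The sketch below is a simplified illustration under that assumption: it picks base samples at random rather than cycling through every minority sample, and the function name `smote` and the toy points are hypothetical.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples by interpolating between neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbors of the base sample within the minority class.
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Hypothetical two-feature 'Normal' samples (not study data).
normal = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [3.0, 3.0],
          [2.5, 2.0], [1.0, 3.0], [2.0, 2.5], [3.0, 1.0]]
extra = smote(normal, n_new=100, k=5)
print(len(extra))
```

Because every synthetic point is an interpolation, it stays inside the region already occupied by the minority class, which is why oversampling of this kind does not introduce values outside the observed range.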

Variance captured while increasing feature space
In an effort to reduce the dimensionality further, we opted to use principal components analysis (PCA), choosing enough eigenvectors to account for 0.95 of the variance of the sub-selected attributes [7]. We decided to standardize the data rather than only center it, which means PCA is computed from the correlation matrix rather than the covariance matrix. The maximum number of attributes to include through this transformation was 10. This allowed us to retain enough principal components to account for the appropriate proportion of variance. At the completion of this process we retained 288 components.
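Two steps of this procedure lend themselves to a short sketch: standardizing a feature column, and choosing the number of components whose eigenvalues cover a target share of the variance. The eigen-decomposition itself is omitted, and the function names and eigenvalues below are hypothetical values chosen for illustration.

```python
import statistics

def standardize(column):
    """Scale a feature to zero mean and unit (sample) standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(value - mean) / sd for value in column]

def components_for_variance(eigenvalues, threshold=0.95):
    """Smallest number of leading components covering the variance threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return count
    return len(ordered)

# Hypothetical eigenvalues of a correlation matrix (not study data).
eigenvalues = [4.0, 3.0, 2.0, 0.6, 0.4]
kept = components_for_variance(eigenvalues, threshold=0.95)
print(kept)
```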

Automated feature selection methods
We utilized a supervised method to select features, a correlation-based feature subset selection evaluator [7]. This method evaluates the worth of a subset of features by analyzing the individual predictive ability of each feature along with the degree of redundancy between them. The preference is for subsets of features that are highly correlated with the class while having low inter-correlation. Furthermore, we required that the algorithm iteratively add the features most highly correlated with the class, provided there was not an existing feature in the subset that had a higher correlation with the feature being analyzed. We searched the space of feature subsets using greedy hill climbing augmented with backtracking. The level of backtracking was governed by the number of consecutive non-improving nodes allowed. We set the direction of the search by starting with the empty set of attributes and searching forward, and we specified five as the number of consecutive non-improving nodes to allow before terminating the search. This method selected 19 attributes from the 60 features. Of those 19 features, only PT and GOS are original data inputs, representing approximately 11%; the other 89% are feature engineered (Table 3).
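The trade-off between class correlation and inter-correlation can be illustrated with the merit score commonly associated with correlation-based feature selection, merit = k r_cf / sqrt(k + k(k - 1) r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation. The sketch below assumes that score and uses a plain forward greedy search without the backtracking described above; the correlation values and names are hypothetical.

```python
import math
from itertools import combinations

def cfs_merit(subset, class_corr, feature_corr):
    """Merit of a subset: high class correlation, low inter-correlation."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(abs(class_corr[f]) for f in subset) / k
    pairs = list(combinations(subset, 2))
    r_ff = (sum(abs(feature_corr[frozenset(p)]) for p in pairs) / len(pairs)
            if pairs else 0.0)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def forward_select(features, class_corr, feature_corr):
    """Greedy forward search (simplified: no backtracking)."""
    selected, best = [], 0.0
    while True:
        candidate = None
        for f in features:
            if f in selected:
                continue
            merit = cfs_merit(selected + [f], class_corr, feature_corr)
            if merit > best:
                best, candidate = merit, f
        if candidate is None:
            return selected, best
        selected.append(candidate)

# Hypothetical correlations: A and B are redundant, C is complementary.
class_corr = {"A": 0.8, "B": 0.75, "C": 0.6}
feature_corr = {frozenset(("A", "B")): 0.95,
                frozenset(("A", "C")): 0.1,
                frozenset(("B", "C")): 0.1}
chosen, merit = forward_select(["A", "B", "C"], class_corr, feature_corr)
print(chosen, round(merit, 3))
```

In this toy case feature B is left out even though it correlates well with the class, because it is nearly redundant with A.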

Evaluation and classifier
We used receiver operating characteristic (ROC) curves, which compare the false positive rate to the true positive rate. We can assess the trade-off between the number of observations that are incorrectly classified as positives and the number of observations that are correctly classified as positives.
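As a minimal sketch of how such a curve is traced, the code below sweeps a decision threshold over classifier scores and records the false positive rate and true positive rate at each step. It also computes the area under the curve as the probability that a randomly chosen positive case is scored above a randomly chosen negative one. The scores, labels, and function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(scores, labels):
    """Share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores for four cases; label 1 marks the positive class.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
curve = roc_points(scores, labels)
area = auc(scores, labels)
print(curve, area)
```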
The 'Area Under the Curve' (AUC) summarizes this trade-off in a single value. Accuracy is the proportion of the total number of predictions that were correct,

Accuracy = (True Positive + True Negative) / (True Positive + False Negative + False Positive + True Negative)
The misclassification rate or the error rate is defined as: Error rate = 1 - Accuracy. We use other metrics in conjunction with the error rate to help guide the evaluation process, namely Recall, Precision, False Positive Rate, True Negative Rate, False Negative Rate, and F-Measure [8].
Recall is the Sensitivity or True Positive Rate and demonstrates the ratio of cases that are positive and correctly identified,

Recall = True Positive / (True Positive + False Negative)
The False Positive Rate is defined as the ratio of cases that were negative and incorrectly classified as positive,

False Positive Rate = False Positive / (False Positive + True Negative)
The True Negative Rate or Specificity is defined as the ratio of cases that were negative and classified correctly,

True Negative Rate = True Negative / (False Positive + True Negative)
The False Negative Rate is the proportion of positive cases that were incorrectly classified as negative,

False Negative Rate = False Negative / (True Positive + False Negative)
Precision is the ratio of the positive cases that were predicted and classified correctly,

Precision = True Positive / (True Positive + False Positive)
F-Measure is the harmonic mean of the information retrieval metrics precision and recall. The higher the F-Measure value, the higher the classification quality,

F-Measure = 2 × (Precision × Recall) / (Precision + Recall)
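The metrics defined above all follow from the four cells of a confusion matrix, which the following sketch makes explicit. The function name and the counts in the usage example are hypothetical and are not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the evaluation metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "recall": recall,  # sensitivity, true positive rate
        "false_positive_rate": fp / (fp + tn),
        "true_negative_rate": tn / (fp + tn),  # specificity
        "false_negative_rate": fn / (tp + fn),
        "precision": precision,
        "f_measure": 2 * (precision * recall) / (precision + recall),
    }

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=90, fp=10, tn=80, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```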
We simplified the classification task by using a Naïve Bayes classifier, which assumes attributes have independent distributions and thereby estimates $p(d \mid c_j) = p(d_1 \mid c_j) \times p(d_2 \mid c_j) \times \dots \times p(d_n \mid c_j)$, where $d_i$ is the value of the i-th attribute of instance $d$. Essentially this is determining the probability of generating instance $d$ given class $c_j$. The Naïve Bayes classifier is often represented as a graph which states that each class causes certain features with a certain probability [9] (Figure 3).
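The independence assumption means the class score is a prior multiplied by one likelihood term per attribute. The sketch below illustrates this with Gaussian per-attribute likelihoods, computed in log space for numerical stability. The Gaussian form is an assumption made for the example, since the text does not state how each per-attribute probability was estimated, and the training rows are hypothetical.

```python
import math
import statistics
from collections import defaultdict

def fit(rows, labels):
    """Estimate a prior and per-attribute (mean, stdev) for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        prior = len(members) / len(rows)
        stats = [(statistics.mean(col), statistics.stdev(col))
                 for col in zip(*members)]
        model[label] = (prior, stats)
    return model

def log_gaussian(x, mean, sd):
    """Log density of a normal distribution (assumed likelihood model)."""
    return -0.5 * math.log(2 * math.pi * sd ** 2) - (x - mean) ** 2 / (2 * sd ** 2)

def predict(model, row):
    """Pick the class maximizing log prior plus summed log likelihoods."""
    def score(label):
        prior, stats = model[label]
        return math.log(prior) + sum(
            log_gaussian(x, mean, sd) for x, (mean, sd) in zip(row, stats))
    return max(model, key=score)

# Hypothetical two-attribute training data (not study data).
rows = [[5.0, 1.0], [6.0, 2.0], [7.0, 1.5],
        [40.0, 10.0], [50.0, 12.0], [60.0, 11.0]]
labels = ["Normal", "Normal", "Normal", "Abnormal", "Abnormal", "Abnormal"]
model = fit(rows, labels)
print(predict(model, [6.0, 1.5]), predict(model, [55.0, 11.0]))
```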
In order to emphasize the benefits of the incorporation of feature engineering, feature selection, and PCA, we referenced prior research using two standard learning models and the rejoSVM classifier [2]. All training and testing was uniformly applied as before.
Furthermore, we abandoned SVM as a base and instead chose to show the value of incorporating our methods within a simple Naïve Bayes algorithm [10][11][12][13]. Moreover, methods such as Feature Bayes may be used as a decision support tool for healthcare providers, particularly for those providers that have minimal resources or limited access to an ongoing professional peer network [14][15][16] (Tables 4 and 5).
Methods that produce high true positives and low false positives are ideal for medical settings. These allow healthcare providers to have a higher degree of confidence in the diagnoses provided to patients [17,18]. Given a small dataset, which is typical of biomedical datasets, Feature Bayes helps to maximize the predictive accuracy, which could benefit the medical expert in future patient evaluations [19,20] (Table 6).

Conclusion
The analysis of the vertebral column data allowed us to incorporate feature engineering, feature selection, and model evaluation techniques. Given these new methods, we were able to provide a more accurate way of classifying pathologies. The Feature Bayes method proved to be valuable by obtaining higher true positives and lower false positives than traditional or more current methods such as rejoSVM. This makes it a useful method as a biomedical screening tool to aid healthcare providers with their medical decisions. Further studies should be developed surrounding the analysis of the Feature Bayes method. Moreover, a comparison of ensemble learning techniques using Feature Bayes could prove beneficial.