Feature Selection Mammogram based on Breast Cancer Mining

ABSTRACT


INTRODUCTION
Breast cancer and cervical cancer are the cancers that cause the highest mortality among women in Indonesia. According to [1], there are 330,000 cancer patients in Indonesia, and the highest cancer prevalence, 4.1%, is found in the Special Region of Yogyakarta. Breast cancer has established risk factors (e.g. family history, obesity, dense breasts) and emerging risk factors (e.g. low vitamin D levels, an unhealthy lifestyle); early detection can therefore reduce breast cancer mortality. When cancer is found at an early stage, the cure rate is high. However, most breast cancer cases in Indonesia are found at an advanced stage because of low awareness. Mammography is one of the imaging technologies that can be used both for screening and for diagnosis of breast cancer. According to the 2013 BI-RADS lexicon for mammography, a hyperdense mass with an irregular shape and spiculated margin is associated with malignancy. Other suspicious morphologies are amorphous, coarse heterogeneous, fine pleomorphic, and fine linear or fine-linear branching calcifications [2].
Computer-aided systems that use mammogram images have been developed for several purposes: to determine the level of breast cancer risk [3]; to detect locations considered abnormal in the mammogram, commonly called a CADe system [4]; and to diagnose the type of breast cancer in a region of interest (RoI) on the mammogram, commonly called a CADx system. A CADx system serves as a second opinion when diagnosing breast cancer from a mammogram reading.
ISSN: 2088-8708. Feature Selection Mammogram based on Breast Cancer Mining (Shofwatul 'Uyun)
In general, developing a computer-aided diagnosis (CADx) system involves several stages: image acquisition, preprocessing, feature extraction, feature selection, classification and testing. Each stage requires the right choice of algorithm for the system to deliver an accurate diagnosis. In principle, a CADx system works like a pattern recognition system. One important factor that determines the success or failure of a pattern recognition system is the use of the right features. According to [5], feature selection is a critical stage because the right features make a pattern recognition system capable of distinguishing one object from another according to the objects' characteristics; one example is feature selection based on improved document frequency for text classification [6]. Therefore, feature selection is needed on mammograms to find features that can distinguish benign from malignant lesions.
Researchers developing computer-aided systems for risk assessment, detection and diagnosis of breast cancer have used several kinds of features found on the mammogram, including color [7], texture [8], [9], shape [10] and a combination of the three [11]. The choice of features greatly affects the performance of a pattern recognition system. Computationally, it is desirable to use as few features as possible while still being able to distinguish one class from another. An algorithm is therefore needed to choose the best features from a large set. Previous studies have applied several algorithms to feature selection, among them the branch and bound algorithm [12], the hill climbing algorithm [13] and the multi-structure co-occurrence descriptor [14]. However, these approaches have not yet been applied specifically to selecting mammogram features for a breast cancer CADx system. This research proposes using several data mining methods as feature selection algorithms for mammogram images. The algorithms used are the decision tree and rule induction; the features selected by the two algorithms are then classified with several classification algorithms to measure performance. In addition, this research uses primary data in which the lesion types (benign and malignant) were classified by radiologists not only by visual assessment but also verified against laboratory test results and assessment with another imaging technology, namely ultrasound.

RESEARCH METHOD
This research uses a six-stage process for developing a computer-based system for the diagnosis of breast cancer. The stages are described in the following subsections.

Mammography Image Acquisition
This research uses primary data in the form of mammogram images produced by digital mammography at the Kotabaru Oncology Clinic, Yogyakarta. A total of 117 mammogram lesions were obtained from the subjects, from two views: CC (craniocaudal) and MLO (mediolateral oblique). The radiologists, who act as researchers in this study, then analyzed the mammograms visually. In assessing the mammograms, the radiologists did not rely on the mammogram alone; they also matched their interpretation against images from another modality, in this case ultrasound, and against pathology test results. This cross-check against other test results is needed to provide valid annotations of the parts considered abnormal or cancerous, hereinafter referred to as the RoI (Region of Interest). Besides annotating the RoI on each mammogram, the radiologists classified it as either a benign or a malignant lesion. The 117 mammograms comprise 79 benign lesions and 38 malignant lesions. All full mammography images have the same size, 2424x3296 pixels, but the cropped images that follow the radiologists' annotations vary widely in size because they depend on the extent of the RoI.

Preprocessing
Interpreting a mammogram is difficult because the images produced by mammography have very low quality. In particular, contrast is very low, which makes it hard to distinguish the RoI from fatty tissue. Therefore, before feature extraction, the quality of the mammogram image needs to be improved in a preprocessing stage. The processes performed at this stage are: normalizing the image size to 256x256 pixels with bilinear interpolation; removing the background with a rolling ball radius of 50 pixels; removing noise by median filtering with a radius of 2 pixels; improving contrast using the CLAHE (Contrast-Limited Adaptive Histogram Equalization) method with a block size of 127, 256 histogram bins and a maximum slope of 3; and, in addition to CLAHE, improving contrast using histogram equalization with 0.4% saturated pixels. The results of each preprocessing stage are shown in Figure 1 and Figure 2.
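As an illustration of the noise-removal step only, the following is a minimal sketch of a median filter with a radius of 2 pixels. It assumes a square (2 x radius + 1) window and a gray image stored as a list of rows; the source does not state the window shape or the software used, so the function name `median_filter` and these choices are illustrative assumptions rather than the authors' implementation.

```python
from statistics import median_low


def median_filter(image, radius=2):
    """Median-filter a 2D list of gray values with a square window.

    Assumption: a square window of side 2 * radius + 1. Border pixels use
    only the neighbours that fall inside the image.
    """
    height, width = len(image), len(image[0])
    result = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                image[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            # median_low always returns a value that occurs in the window.
            result[y][x] = median_low(window)
    return result


# Toy example: a flat 5x5 patch with a single bright noise pixel.
noisy = [[10] * 5 for _ in range(5)]
noisy[2][2] = 255
cleaned = median_filter(noisy, radius=2)
print(cleaned[2][2])  # the isolated bright pixel is replaced by the local median
```

The other steps (bilinear resizing, rolling-ball background removal, CLAHE, histogram equalization) are normally taken from an image-processing package rather than written by hand.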

Feature Extraction
One key to the success of a pattern recognition system, in this case a CADx system, is the use of features that can distinguish benign from malignant lesions. It is therefore necessary to study algorithms that can select and evaluate features precisely. This research extracts features from the mammogram image in two feature domains, the shape domain (14 descriptors) and the texture domain (24 descriptors), using the equations shown in Table 1 and Table 2. The texture features consist of first-order statistics and second-order statistics, the latter commonly called GLCM (gray level co-occurrence matrix) features. The GLCM features are computed in four directions (0°, 45°, 90° and 135°), together with the average of each feature over the four directions. The descriptors are defined as follows:
Area: area of the selection in square pixels
Center of mass: the brightness-weighted average of the x and y coordinates of all pixels in the image or selection
Modal gray value: the highest peak in the histogram
Centroid: the average of the x and y coordinates of all of the pixels in the image or selection
Perimeter: the length of the outside boundary of the selection
Integrated density: the sum of the values of the pixels in the image or selection
Median: the median value of the pixels in the image or selection
Area fraction: for thresholded images, the percentage of pixels in the image or selection that have been highlighted; for non-thresholded images, the percentage of non-zero pixels
Stack position: the position (slice, channel and frame) in the stack or hyperstack of the selection
Circularity: $4\pi \times \text{area} / \text{perimeter}^2$
Aspect ratio (AR): the aspect ratio of the particle's fitted ellipse
Roundness: the inverse of the aspect ratio
Solidity: $\text{area} / \text{convex area}$
After preprocessing, the shape and texture features are extracted from the mammogram image.
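The ratio-type shape descriptors can be illustrated with a short sketch. It assumes the commonly used definitions circularity = 4π x area / perimeter², roundness = 1 / aspect ratio and solidity = area / convex hull area; the function names and the sample measurements are hypothetical and are not taken from the paper's data.

```python
import math


def circularity(area, perimeter):
    """4 * pi * area / perimeter^2; close to 1.0 for a perfect circle."""
    return 4.0 * math.pi * area / perimeter ** 2


def aspect_ratio(major_axis, minor_axis):
    """Major axis over minor axis of the fitted ellipse."""
    return major_axis / minor_axis


def roundness(major_axis, minor_axis):
    """Inverse of the aspect ratio, as stated in the descriptor list."""
    return minor_axis / major_axis


def solidity(area, convex_area):
    """Area of the selection divided by the area of its convex hull."""
    return area / convex_area


# Hypothetical measurements: a circle of radius 10 and a 10x10 square.
circle = circularity(math.pi * 10 ** 2, 2 * math.pi * 10)
square = circularity(10 * 10, 4 * 10)
print(round(circle, 3), round(square, 3))
```

A compact, regular outline gives a circularity near 1, while an irregular or spiculated outline with a long perimeter gives a much smaller value, which is why such ratios are candidate features for separating lesion types.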
Three experimental configurations, referred to below as nodes, are used: first, all 38 shape and texture descriptors together; second, the 14 shape descriptors; and third, the 24 texture descriptors. The feature scheme for both domains is shown in Table 3.
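For the GLCM texture descriptors used in this section, the following is a minimal sketch of how a co-occurrence matrix and its contrast can be computed per direction and then averaged. It assumes a pixel distance of 1, a symmetric and normalized matrix, the usual directions 0°, 45°, 90° and 135°, and an image already quantized to a small number of gray levels; none of these settings are stated in the source, and the 4x4 image is a toy example.

```python
# Assumed (row, column) offsets for a distance of 1 pixel in each direction.
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}


def glcm(image, angle, levels):
    """Return a symmetric, normalized gray level co-occurrence matrix."""
    dy, dx = OFFSETS[angle]
    height, width = len(image), len(image[0])
    counts = [[0] * levels for _ in range(levels)]
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                i, j = image[y][x], image[ny][nx]
                counts[i][j] += 1
                counts[j][i] += 1  # count the pair in both orders
    total = sum(map(sum, counts))
    return [[count / total for count in row] for row in counts]


def contrast(p):
    """GLCM contrast: sum of p(i, j) * (i - j)^2 over all gray-level pairs."""
    size = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(size) for j in range(size))


# Toy 4x4 image with 4 gray levels (0..3).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
contrast_by_angle = {angle: contrast(glcm(image, angle, 4)) for angle in OFFSETS}
mean_contrast = sum(contrast_by_angle.values()) / len(contrast_by_angle)
print(contrast_by_angle, mean_contrast)
```

The same pattern extends to other GLCM statistics (for example energy or homogeneity) by replacing the `contrast` function.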

Feature Selection
Proper feature selection can yield optimal classification results and also reduces the computational burden of processing unimportant data. Therefore, this research applies data mining to the feature extraction results of the three nodes listed in Table 3. Two algorithms are used for feature selection on the mammogram features: decision tree and rule induction. The decision tree is a powerful and popular algorithm for classification and prediction. Another advantage is that it represents rules that are easily understood by humans, and the resulting knowledge can be stored as data in a database [16]. Rule induction is a machine learning algorithm that formulates rules extracted from a collection of observations. The extracted rules form a model that represents patterns in the data [17]. An example of the decision tree and rule induction results for the first node, with 38 descriptors, is shown in Figure 3 and Table 4.
Mining the 38 descriptors of the mammogram images with decision tree and rule induction yields a small set of important features. The important features produced by the decision tree algorithm (see Table 5, scenario I) are kurtosis, area fraction and mean, while those produced by the rule induction algorithm (see Table 5, scenario II) are slice, mean, area fraction and contrast at an angle of 135°. The same procedure is applied to nodes 2 and 3 with both algorithms; the mining results are shown in Table 5 (scenarios III and IV) for node 2 and Table 5 (scenarios V and VI) for node 3. Scenario VII uses the important features produced for the first node by both the decision tree and the rule induction algorithm. Likewise, scenarios VIII and IX use the best features from the results for nodes 2 and 3. The detailed results are given in Table 5.
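The idea of using a decision tree as a feature selector can be sketched as follows: grow a small tree and keep only the features that appear in its split nodes. This is a simplified entropy-based tree written for illustration; the source does not specify the tree implementation or its parameters, and the feature names and eight toy samples are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_split(rows, labels):
    """Return (gain, feature_index, threshold) of the best binary split, or None."""
    base = entropy(labels)
    best = None
    for f in range(len(rows[0])):
        values = sorted(set(row[f] for row in rows))
        for low, high in zip(values, values[1:]):
            t = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            gain = base - weighted
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best


def used_features(rows, labels, depth=3, min_gain=1e-9):
    """Grow a small tree and collect the indices of the features it splits on."""
    if depth == 0 or len(set(labels)) < 2:
        return set()
    split = best_split(rows, labels)
    if split is None or split[0] <= min_gain:
        return set()
    _, f, t = split
    selected = {f}
    for keep in (lambda v: v <= t, lambda v: v > t):
        subset = [(row, y) for row, y in zip(rows, labels) if keep(row[f])]
        selected |= used_features(
            [row for row, _ in subset], [y for _, y in subset], depth - 1, min_gain
        )
    return selected


# Hypothetical toy data: columns are (area_fraction, noise, mean).
names = ["area_fraction", "noise", "mean"]
rows = [
    (0.9, 1, 50), (0.8, 2, 60), (0.7, 1, 40), (0.2, 2, 150),
    (0.3, 1, 30), (0.1, 2, 20), (0.4, 1, 45), (0.2, 2, 55),
]
labels = ["malignant"] * 4 + ["benign"] * 4
selected = used_features(rows, labels)
print(sorted(names[i] for i in selected))
```

In this toy data the uninformative `noise` column never yields a positive information gain, so it is never chosen for a split and drops out of the selected set.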

Classification
Once the features have been selected for each scenario by the decision tree and rule induction algorithms, the mammogram images are classified into two classes, benign lesions and malignant lesions. Three classification algorithms are used at this stage: k-nearest neighbors (KNN), decision tree (DT) and Naive Bayes (NB), which are discussed further in the results. Classification is carried out for each of the ten predefined scenarios to measure performance.
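As an illustration of one of the three classifiers, the following is a minimal k-nearest neighbors sketch. It assumes Euclidean distance, majority voting and features already scaled to comparable ranges; the value of k, the function name `knn_predict` and the toy feature vectors are assumptions, since the source does not give the classifier settings.

```python
import math
from collections import Counter


def knn_predict(train_rows, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest rows."""
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training set (values already scaled to 0..1).
train_rows = [
    (0.10, 0.20), (0.20, 0.10), (0.15, 0.15),  # benign examples
    (0.80, 0.90), (0.90, 0.80), (0.85, 0.85),  # malignant examples
]
train_labels = ["benign"] * 3 + ["malignant"] * 3

print(knn_predict(train_rows, train_labels, (0.2, 0.2)))
print(knn_predict(train_rows, train_labels, (0.7, 0.8)))
```

Because KNN relies on distances, descriptors with large numeric ranges (such as integrated density) would dominate unscaled features, which is one reason a reduced, well-chosen feature set matters for this classifier.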

Evaluation
To evaluate the classification results for the selected features in each scenario, the data is divided automatically using k-fold cross validation (k = 10) with stratified sampling. This research also uses five statistical parameters that are commonly used to test medical diagnostic results: accuracy, sensitivity, specificity, false positive rate (FPR) and true positive rate (TPR). These five parameters show how reliable and consistent a system is in diagnosing breast cancer. Accuracy is the proportion of data predicted correctly by the classification system, whether negative or positive; sensitivity measures how well the system identifies positive data correctly; and specificity measures how well it identifies negative data correctly. FPR is the proportion of negative cases that are incorrectly identified as positive, and TPR is the proportion of positive cases that are correctly identified as positive. The relationship between FPR and TPR can be represented graphically as the ROC curve. The ROC curve helps in choosing the best model for the diagnosis of breast cancer. The calculation of the five parameters is shown in Table 6.
A general description of each stage is shown in Figure 4.
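The evaluation stage can be sketched in two parts: the five parameters computed from confusion-matrix counts using their standard definitions, and a simple stratified assignment of samples to 10 folds. The confusion-matrix counts below are hypothetical values chosen for illustration on a 38 malignant / 79 benign split and are not taken from the paper's tables; the round-robin fold assignment (without shuffling) is likewise an assumption, not the tool the authors used.

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Standard definitions, with malignant as the positive class."""
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": tp / (tp + fn),  # same quantity as TPR
        "specificity": tn / (tn + fp),  # equals 1 - FPR
        "fpr": fp / (fp + tn),
        "tpr": tp / (tp + fn),
    }


def stratified_folds(labels, k=10):
    """Deal sample indices into k folds class by class (round robin)."""
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds


# Hypothetical counts for illustration only.
metrics = diagnostic_metrics(tp=35, fn=3, fp=5, tn=74)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")

# Class sizes follow the dataset description: 79 benign and 38 malignant.
labels = ["benign"] * 79 + ["malignant"] * 38
folds = stratified_folds(labels, k=10)
print([len(fold) for fold in folds])
```

Each fold serves once as the test set while the remaining nine folds are used for training; keeping the class proportions similar in every fold avoids test folds that contain almost no malignant cases.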

RESULTS AND ANALYSIS
The purpose of this research is to find the best features for developing a CADx system for breast cancer on mammogram images. To that end, several experiments were conducted with ten scenarios, each consisting of the six research stages shown in Figure 4. Taking the first scenario as an example: after image acquisition with mammography, the images went through the several preprocessing steps described in detail in section 2.b. The output of this stage is a set of mammogram images of better quality, so that the radiologists can visually differentiate between fatty tissue and fat, two areas that were previously very difficult to distinguish because the tissue is very thin and its intensity differs little. The next stage is feature extraction of 38 descriptors (a combination of shape and texture features); the extracted features are then selected using a decision tree (scenario I). The mining process shows that not all features contribute to determining the class of breast cancer (benign or malignant); only three descriptors contribute, as shown in Table 5. The mammogram lesions are then classified into the two classes using the k-nearest neighbor (KNN), decision tree (DT) and Naive Bayes (NB) algorithms. For each scenario, classification is performed on the selected features with all three algorithms and evaluated using 10-fold cross validation. The complete evaluation results are shown in Table 7.
The highest accuracy of the CADx system in classifying benign and malignant lesions, 93.18%, is obtained in scenarios IV and VIII (using the five descriptors shown in Table 5) with the decision tree classification algorithm. With these five descriptors, the FPR, TPR, precision and recall are 6%, 92%, 88% and 92%, respectively, and the rules that can be used to classify both types of lesion are shown in Table 8. The three classification algorithms (KNN, DT and NB) are applied to the CADx system with the selected features (five descriptors), tested with 10-fold cross validation and visualized using ROC curves, as shown in Figure 5. The DT algorithm performs best of the three.

CONCLUSION
Based on the experiments and tests, it can be concluded that decision tree and rule induction, both data mining algorithms, can be used as an alternative method for selecting features from mammogram images. Feature selection across the ten scenarios yields five descriptors that contribute to classifying mammogram images into two classes, benign lesions and malignant lesions. The best classification results based on the five features (slice, integrated density, area fraction, modal gray value, center of mass) are produced by the decision tree algorithm, with accuracy, sensitivity, specificity, FPR and TPR of 93.18%, 87.5%, 3.89%, 6.33% and 92.11%, respectively.