Fusion of Random Projection, Multi-resolution Features and Distance Weighted K Nearest Neighbor for Masses Detection in Mammographic Images

ABSTRACT


INTRODUCTION
Breast cancer is the most common cancer in women worldwide, with nearly 1.7 million new cases diagnosed in 2012 [1]. Screening for abnormal tissue with X-ray mammography is currently the most effective method for early detection of the disease [2][3]. The introduction of digital mammography has increased the number of commercial Computer Aided Detection (CAD) systems, which have significantly enhanced radiologists' ability to detect and diagnose cancer and to take early action [4]. One problem with CAD systems is the large number of false positive (FP) marks produced when high sensitivity is required [5]. Too many false positives may confuse the radiologist. Great effort has been devoted in recent years to developing CAD systems that use many features to reduce false positives [6]. However, many of these features are not key features of masses, and they increase the dimensionality of the classification problem.
In this paper, we introduce a novel method using moment and basic characteristics of the masses. Block Difference Inverse Probability (BDIP) and basic features are calculated at multiple resolutions. Once the features are extracted, random projection [7] and k nearest neighbor (kNN) [8] with distance weighting are used to classify the suspicious areas as real mass or normal parenchyma.

PROPOSED METHOD

Database
In this study, we use the Mini-MIAS mammogram database [9] to test the presented method. MIAS is the public database of the Mammographic Image Analysis Society, an organization of United Kingdom research groups. The database includes 322 mammograms from 161 patients. Films taken from the United Kingdom National Breast Screening Program were digitized to a 50-micron pixel edge, with each pixel represented by an 8-bit word. Every image in the database comes with ground-truth information from the radiologists, as shown in Figure 1: the character of the background tissue, the type of abnormality present, the severity of the abnormality, and the coordinates of the center and approximate radius (in pixels) of a circle enclosing the abnormality. The Mini-MIAS database is a reduced version of the original MIAS database: the images, originally digitized at a 50-micron pixel edge, have been reduced to a 200-micron pixel edge and clipped/padded so that every image is 1024 x 1024 pixels.

Preprocessing
The aim of this step is to remove unnecessary information from the mammograms, such as labels, the pectoral muscle and other noise. To separate the breast region from the image label, we threshold the image and keep the largest thresholded region. The pectoral muscle appears in a mammographic image as a predominantly dense region and can negatively affect the result of the detection method [10]. For this reason, the region representing the pectoral muscle should be eliminated. The mammogram also contains some small bright spots whose gray level is close to that of a circumscribed mass. Median filtering with a 3x3 window is applied to eliminate these spots, as illustrated in Figure 2.
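The thresholding and median-filtering steps can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes an image stored as a list of rows of gray levels, 4-connectivity for regions, and a hypothetical threshold value of 50; the function names are illustrative, and pectoral muscle removal is not covered.

```python
from collections import deque
from statistics import median_low


def largest_bright_region(image, threshold):
    """Boolean mask of the largest 4-connected region with intensity > threshold."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    best = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or image[r][c] <= threshold:
                continue
            # Breadth-first flood fill of one connected region.
            region, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and image[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) > len(best):
                best = region
    mask = [[False] * cols for _ in range(rows)]
    for y, x in best:
        mask[y][x] = True
    return mask


def median_filter_3x3(image):
    """3x3 median filter; border pixels use only the neighbours that exist."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [image[y][x]
                      for y in range(max(0, r - 1), min(rows, r + 2))
                      for x in range(max(0, c - 1), min(cols, c + 2))]
            out[r][c] = median_low(window)  # keeps an actual gray level
    return out


# Toy image: an isolated "label" pixel (200) and a larger region with a bright spot (250).
image = [
    [0, 200, 0, 0, 0],
    [0, 0, 0, 90, 95],
    [0, 0, 0, 92, 250],
    [0, 0, 0, 91, 93],
]
mask = largest_bright_region(image, threshold=50)
breast = [[v if m else 0 for v, m in zip(row, mrow)] for row, mrow in zip(image, mask)]
smoothed = median_filter_3x3(breast)
```

In this toy example the isolated label pixel is discarded because it does not belong to the largest region, and the bright spot is replaced by a gray level typical of its neighbourhood.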

Mass detection
In this stage, suspicious regions are extracted from the preprocessed mammogram. Radiologists should focus their attention on these extracted regions. The steps of this procedure are fully described in [11]. The detected regions of interest (ROIs) are marked as true positive ROIs (TP-ROIs) or false positive ROIs (FP-ROIs) based on the provided ground truth, as illustrated in Figure 3.

Feature extraction
In human vision, edges and valleys [12] in an image are very important features; valleys in particular are fundamental to the visual perception of object shape [13][14]. Block Difference Inverse Probability (BDIP) is a texture feature that measures the variation in intensities within an image block. It effectively extracts edges and valleys. The larger the variation in intensity or the size of the block, the higher the value of BDIP [15]. The BDIP of a block of size WxW is defined as:

$$\mathrm{BDIP} = W^2 - \frac{\sum_{(i,j) \in B} I(i,j)}{\max_{(i,j) \in B} I(i,j)}$$

where I(i,j) denotes the intensity of pixel (i,j) in the block B.
Because a detected ROI is not of size WxW, we substitute the term W^2 in the above equation with the size of the ROI (its number of pixels) to calculate the BDIP feature at the first resolution, which is then simply called BDIP. Multi-resolution basic features are calculated in the same manner as the multi-resolution BDIP feature.
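As a small illustration of the ROI variant described above, the following sketch computes BDIP from a flat list of ROI pixel intensities, using the pixel count in place of W^2. The function name `bdip` and the handling of an all-zero region are assumptions made for this example, not details given in the text.

```python
def bdip(pixels):
    """BDIP of a region given as a flat list of pixel intensities.

    The region's pixel count N replaces W^2, as described for non-square
    ROIs: BDIP = N - sum(I) / max(I).
    """
    peak = max(pixels)
    if peak == 0:
        # Assumption: a completely dark region is treated as having no variation.
        return 0.0
    return len(pixels) - sum(pixels) / peak


flat_region = [10, 10, 10, 10]    # uniform intensities, no variation
spiky_region = [0, 0, 0, 100]     # one bright pixel, strong variation
print(bdip(flat_region), bdip(spiky_region))
```

A uniform region gives a BDIP of 0, while a region dominated by a single bright pixel gives a value close to its pixel count, which matches the statement that larger intensity variation yields a higher BDIP.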

Random Projection
In mathematics and statistics, random projection is a technique for reducing the dimensionality of a set of points that lie in Euclidean space. Random projection methods are known for their simplicity and for introducing less error than other dimensionality reduction methods. Experimental results indicate that random projection preserves distances well, although such empirical results are sparse [15]. In random projection, the original D-dimensional data is projected onto an L-dimensional subspace (L << D).

Indonesian J Elec Eng & Comp Sci
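The projection is computed as

$$X^{RP}_{L \times N} = R_{L \times D} \, X_{D \times N}$$

where $X^{RP}_{L \times N}$ and $X_{D \times N}$ denote the output and input matrices and $R_{L \times D}$ is a random projection matrix.
The random matrix R can be generated using a Gaussian distribution. Achlioptas [15] has shown that the Gaussian distribution can be replaced by a much simpler distribution such as:

$$r_{ij} = \sqrt{3} \times \begin{cases} +1 & \text{with probability } 1/6 \\ 0 & \text{with probability } 2/3 \\ -1 & \text{with probability } 1/6 \end{cases}$$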
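The sparse distribution above can be sketched in a few lines. This is an illustrative example rather than the authors' code: the function names are hypothetical, and the 1/sqrt(L) scaling in `project` is an assumption not stated in the text, added so that squared lengths are preserved in expectation. The sizes 2400 and 200 are example values only.

```python
import math
import random


def achlioptas_matrix(n_out, n_in, rng):
    """L x D matrix with entries sqrt(3) * {+1, 0, -1}, probabilities 1/6, 2/3, 1/6."""
    scale = math.sqrt(3.0)
    values = (scale, 0.0, -scale)
    # Weights 1:4:1 correspond to probabilities 1/6, 2/3, 1/6.
    return [rng.choices(values, weights=(1, 4, 1), k=n_in) for _ in range(n_out)]


def project(matrix, vector):
    """Project one D-dimensional feature vector to L dimensions."""
    # Assumption: scale by 1/sqrt(L) so squared norms are preserved on average.
    scale = 1.0 / math.sqrt(len(matrix))
    return [scale * sum(r * v for r, v in zip(row, vector)) for row in matrix]


rng = random.Random(0)                       # fixed seed for repeatability
R = achlioptas_matrix(200, 2400, rng)        # example sizes: L = 200, D = 2400
features = [rng.random() for _ in range(2400)]
reduced = project(R, features)
print(len(reduced))
```

Because two thirds of the entries are zero on average, most multiplications can be skipped, which is the practical appeal of this distribution over a dense Gaussian matrix.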

K Nearest Neighbor
Let T = {(x_i, y_i): i = 1, ..., N} denote the training set, where x_i is a training vector in the m-dimensional feature space and y_i is the corresponding class label. Given an unknown query x', its class y' is assigned in two steps:
a. First, a set of k labelled nearest neighbours of x' is identified and sorted in ascending order of Euclidean distance to x'.
b. Second, the class label y' is predicted by majority voting among its nearest neighbours.
A weighted voting scheme for kNN, called the distance-weighted k nearest neighbor (wkNN) rule, is proposed in [16]. In wkNN, closer neighbors are weighted more heavily than farther ones by a distance-weighting function, and the query is classified by weighted majority voting. The nearest neighbor gets a weight of 1, the furthest neighbor a weight of 0, and the other weights are scaled linearly to the interval in between.
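The weighting rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `wknn_predict` and the toy training points are hypothetical, and giving every neighbour a weight of 1 when all k distances are equal is a choice made here to avoid division by zero.

```python
import math
from collections import defaultdict


def wknn_predict(train, query, k):
    """Distance-weighted kNN. `train` is a list of (vector, label) pairs."""
    # Step a: the k nearest neighbours, sorted by ascending Euclidean distance.
    neighbours = sorted(
        ((math.dist(x, query), label) for x, label in train),
        key=lambda pair: pair[0],
    )[:k]
    d_near, d_far = neighbours[0][0], neighbours[-1][0]

    # Step b: weighted vote. Nearest gets weight 1, furthest 0, linear in between.
    votes = defaultdict(float)
    for d, label in neighbours:
        if d_far == d_near:
            weight = 1.0  # assumption: equal distances all receive weight 1
        else:
            weight = (d_far - d) / (d_far - d_near)
        votes[label] += weight
    return max(votes, key=votes.get)


# Toy data: two close "mass" points and three more distant "normal" points.
train = [
    ((1.0, 0.0), "mass"),
    ((0.0, 1.0), "mass"),
    ((3.0, 0.0), "normal"),
    ((0.0, 3.0), "normal"),
    ((4.0, 0.0), "normal"),
]
print(wknn_predict(train, (0.0, 0.0), k=5))
```

With k = 5, an unweighted majority vote would favour "normal" (three votes to two), whereas the weighted vote favours "mass" because the two closest neighbours carry the largest weights.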

RESULTS
The number of detected ROIs is 1000 [11]. For each ROI, BDIP and basic features are calculated at n levels. The maximal value of n is the minimal radius of a circle enclosing the abnormality provided in the Mini-MIAS database. In total we have 2400 features. Different values of K are tested, and the value of K that gives the highest sensitivity is selected. Figure 4 shows the performance for different values of K. The selected value of K is 21, with a sensitivity of 90%. Table 1 compares our method with different approaches; our method provides higher sensitivity at a lower number of false positives per image. We also compare sensitivity, false positives per image, random projection time and running time for different sizes of the random projection matrix. The results are given in Table 2. They show that random projection helps to reduce the running time. This tool should be effective with big data and many features, but on small data it can affect other performance measures.
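For reference, the two reported measures and the selection of K can be expressed as a short sketch. It uses the standard definitions of sensitivity and false positives per image, which are assumed here since the text does not spell them out. The function names, the tie-breaking rule and all numbers in the usage lines are hypothetical placeholders, not results from this study.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of real masses that are detected: TP / (TP + FN)."""
    return true_positives / (true_positives + false_negatives)


def false_positives_per_image(false_positives, n_images):
    """Average number of false positive marks per mammogram."""
    return false_positives / n_images


def select_k(sensitivity_by_k):
    """Return the K with the highest sensitivity.

    Assumption: ties are broken in favour of the smaller K.
    """
    return min(sensitivity_by_k, key=lambda k: (-sensitivity_by_k[k], k))


# Placeholder values for illustration only.
print(sensitivity(9, 1))                      # 0.9
print(false_positives_per_image(30, 20))      # 1.5
print(select_k({1: 0.7, 3: 0.8, 5: 0.8}))     # 3
```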

CONCLUSIONS
This study proposes a new method for detecting masses in mammographic images based on a combination of multi-resolution features and the distance-weighted K nearest neighbor algorithm. The highest sensitivity is observed with a small number of false positives per image. Comparisons with related works indicate that our method is effective and has potential for further investigation. Random projection is expected to be effective with big data. In the future, we will evaluate the method on a larger set of mammograms and use different features.