Indian Monuments Classification using Support Vector Machine

ABSTRACT


INTRODUCTION
Content Based Image Retrieval (CBIR) has become a significant research topic, as large volumes of image data are generated in areas such as medicine, fashion design, art galleries, entertainment, education and manufacturing. The QBIC system of IBM [1], Chabot of U.C. Berkeley, Photobook of Massachusetts Institute of Technology (MIT) [2], VisualSEEK [3] and MARS [4] are popular examples of CBIR software systems. Text based retrieval requires manual annotation of images, which is laborious and, most importantly, depends on human perception of the image: different people perceive the same image differently [5][6][7].
For many years, a tremendous amount of multimedia data in the form of images, audio and video has been generated, owing to the availability of cost-effective electronic devices such as cameras, mobile phones and handycams. These data are shared, uploaded or emailed to relatives and friends living far away so that they do not feel they have missed the precious moments. Millions of such photographs are uploaded, and it is almost impossible to classify them manually according to the monuments people visited. In January 2013, India was at 3rd position with 62. 6 Bollywood and the television industry are also considered major sources of multimedia data (movie stills, posters, etc.), as they have released more than 2000 movies and many music videos, produced serials and organized events for 100 years [9]. It can be observed that pilot scenes and song sequences often feature monuments in the background; movies thus promote tourism and increase the revenue of the country as a whole. Figure 1 depicts the presence of monuments in scenes of popular movies.
The travel and tourism industry incorporates heritage, medical, business and sports tourism. The main objectives of this sector are to develop and promote tourism, to maintain the competitiveness of India as a tourist destination, and to improve and expand existing tourism products to ensure employment generation and economic growth [10]. The government also promotes tourism through advertisements and campaigns and takes special interest in preserving the beauty of these monuments. Every year, many foreign delegates and tourists visit India as well. As per India Tourism Statistics 2013, 6.97 million foreign tourists arrived, an annual growth rate of 5.9%, and approximately 1145 million domestic tourists were observed, an annual growth rate of 9.6% [11].

[Figure caption fragment: ... Basanti' Movie. [12]]

LITERATURE SURVEY
Bhatt M.S. and Patalia T. P. [12] used Generalized Co-occurrence Matrices (GCMs) obtained from the HSV color space with 64 gray levels and various distance values (3, 6, 9, 12 and 15) as input to a Genetic Programming (GP) system. The GP-evolved spatial descriptor was implemented using 15 GCMs as terminals, with 7 functions and 2 operators. Each GCM, of size 64x64, is considered as input. The evolved descriptor was tested on a manually created Indian monuments database with four classes, namely 'Taj Mahal', 'Qutub Minar', 'Golden Temple' and 'India Gate'. The fitness function used in the GP system is a linear SVM with 10-fold cross-validation. They obtained an accuracy of 92%.
Murala et al. [13] proposed Local Tetra Pattern (LTrP) as a new feature descriptor for Content Based Image Retrieval. LTrP uses four distinct values for encoding of information and also uses direction information. They highlighted advantages of LTrP over Local Binary Pattern (LBP), Local Ternary Pattern (LTP) and Local Derivative Pattern (LDP). Benchmark databases viz., Corel 1000 database, Brodatz texture database and MIT VisTex database were used for performance comparison. Youness et al. [14] proposed content based image retrieval based on 2-D ESPRIT (Estimation of Signal Parameters via Rotational Invariance Techniques) and Gabor Filter. They used Brodatz gray scale image dataset, having 13 texture classes with 16 samples for each class, for the purpose of evaluation. They achieved an average precision of 80.19%.
Nazarloo et al. [15] applied content based image retrieval to gender classification. The face is one of the most important human biometrics and contains a lot of useful information. For gender classification, they merged Gabor filter and Local Binary Pattern features of the face. A Self-Organizing Map was used for classification and achieved an accuracy of 92.5%.
Guo et al. [16] explained the Completed Local Binary Pattern (CLBP), a modification of LBP. CLBP is composed of the Local Difference Sign-Magnitude Transform (LDSMT) and the center gray level value. They tested the proposed approach on the CUReT and Outex texture databases. It is observed that the sign component preserves much more of the local difference information than the magnitude component, and that CLBP outperforms conventional LBP.
Das et al. [17], [18] highlighted the importance of classified query for content based image recognition based on Niblack's thresholding. They applied binarization to the red, green and blue planes separately using Niblack's thresholding. On the binarized images, the upper mean, lower mean, upper standard deviation and lower standard deviation are calculated. Four features are extracted for each plane, giving a feature vector of size 12. They tested the system on the benchmark databases 'Wang Database' and 'OT-Scene'. The highest precision of 0.838 and recall of 0.838 were achieved using an Artificial Neural Network with a Multi Layer Perceptron for the 'Wang Database', while the highest precision of 0.753 and recall of 0.754 were obtained using a Support Vector Machine for the 'OT-Scene' database.
Streater, J. [19] explains the design and implementation of a genetic programming system for constructing feature descriptors for skin lesion images. Six Generalized Co-occurrence Matrices (GCMs) of size 64x64 are constructed using the RGB color space with 5 inter-pixel distances for 100 images. The highest accuracy achieved was 72%. Fisher's Discriminant Ratio and a naïve Bayes classifier are also used for comparison with the results obtained by the SVM.
Extraction of information from movies has been a focus of research for the last 10 years. The Human Motion Database HMDB51 [20], YouTube [21] and the Hollywood datasets [22] are challenging datasets used widely for action recognition. Action recognition from movies is a subset of general human activity recognition.
Laptev et al. [23] highlighted the limitations of human action datasets recorded in controlled environments and described the difficulties faced in recognizing actions in real movies. They identified similarities and dissimilarities between action recognition in movies and object recognition in still images. The first task, automatic annotation of human actions using movie scripts, was accomplished with an accuracy of 60%; the inaccuracy is due to script-video misalignment. Classification of human actions, the main goal of the paper, was achieved with 91.8% accuracy. Experiments were carried out for 8 different actions. HoG, HoF, spatio-temporal Bag of Features (BoF) and combinations of these were used with a non-linear support vector machine to achieve the desired result.
Lei Chen et al. [24] describe a top-down, rule-based approach that uses video editing rules and audio cues to extract dialogue and action scenes. A finite state machine with an audio-based support vector machine (SVM) classifier is applied to detect the scene type. The classifier uses three features, namely variance of zero crossing rate, silence ratio and harmonic ratio. The precision and recall rates achieved are 76.56% and 81.6% respectively.
Doudpota et al. [25] focused on the impact and popularity of Bollywood movies in South Asia, the Middle East, the UK, the USA and other parts of the world. They mined song sequences from Bollywood movies. In the first phase, they used Zero Crossing Rate, Spectrum Flux and Short Time Energy as features in a Support Vector Machine for binary classification of extracted segments into music and non-music. In the second phase, the extracted music segments are further classified into song and non-song sequences using Probabilistic Timed Automata (Song Grammar). An experiment was carried out on 10 Bollywood movies containing 74 songs, of which 69 were successfully extracted. The recall achieved is 93.24%, while the precision is 87.34%.
Vadivel et al. [26] proposed the integrated color and intensity co-occurrence matrix (ICICM) for content based image retrieval. The ICICM is composed of four matrices, namely ICICM_CC, ICICM_CI, ICICM_IC and ICICM_II. ICICM_CC captures the color perception of pixel p and the color perception of the neighbourhood of p, while ICICM_CI captures the color perception of pixel p and the intensity perception of the neighbourhood of p. The other two can be described similarly. The ICICM is updated based on a weight which is a function of saturation and intensity. The development of ICICM is based on the properties of the HSV color space. They tested the system with combinations of various color (C) and gray level (I) perception levels on two different datasets. The first database is constructed from general purpose images obtained from International Microcomputer Software Inc. (IMSI), while the other is constructed using a crawler. A web-based application for CBIR based on ICICM is available in the public domain for performing experiments with images in their database as well as with externally uploaded images.
Deselaers et al. [27] performed a quantitative comparison of image features such as color histograms, invariant feature histograms, Gabor feature histograms, Tamura texture features, local features and region based features. The correlation among these features is analyzed. They focused on two image retrieval tasks: color photographs (WANG dataset) and medical radiographs (IRMA dataset). Low correlation exists among region features, image features, invariant feature histograms and Gabor histograms, so a combination of these produces better image retrieval for color photographs. The invariant feature histogram gives a 15.9% error rate on the WANG dataset and a 29.2% error rate on the IRMA dataset. A similar comparison is reported for all remaining features.
It is clear that the selection of features depends on the task at hand and that combining positively correlated features does not improve the classification result.
Desai et al. [28] highlighted the importance of monument classification to archaeologists in assessing and classifying their findings. Art galleries and museums focus on the visual aspects of objects. A CBIR system based on visual shape and texture features was developed. Morphological operations were carried out for shape extraction and the Gray Level Co-occurrence Matrix was used for texture feature extraction. Five different classes with a total of 500 images were collected, and performance was compared with the Canny and Sobel edge detection approaches.
Information about Indian movies can be obtained via the Indian Movie Database [29]. Figure 2 provides a flow chart of a content based image retrieval system. The image query is the image file given as input to the system. The features of the input are calculated. A query of the extracted features is then generated and compared with the features of all image files present in the database. Based on similarity measures, the system retrieves the required image files from the database and presents them as the result.
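The retrieval flow described above can be sketched in a few lines of Python. This is an illustrative sketch only: the similarity measure is not specified in the text, so Euclidean distance is assumed here, and the function names, file names and three-element feature vectors are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(query_features, database, top_k=3):
    """Rank database images by distance to the query; return the top_k names.

    `database` maps an image name to its precomputed feature vector.
    """
    ranked = sorted(database,
                    key=lambda name: euclidean(query_features, database[name]))
    return ranked[:top_k]

# Toy database with hypothetical feature vectors (not real image features).
database = {
    "taj_01.jpg": [0.9, 0.1, 0.3],
    "taj_02.jpg": [0.8, 0.2, 0.3],
    "fort_01.jpg": [0.1, 0.9, 0.7],
}
results = retrieve([0.88, 0.1, 0.3], database, top_k=2)
```

Here `results` lists the two database images whose feature vectors lie closest to the query vector; any other distance or similarity function could be substituted for `euclidean`.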

Preprocessing
A set of 500 input images is resized to 256x384. The resulting images are converted from the RGB color space into the Hue, Saturation and Value (HSV) color space [30]. The layers of the human retina sense light through rod cells and cone cells [31]. Gray levels are perceived by rod cells at low levels of illumination, while at higher levels of illumination cone cells are also excited. Human color perception corresponds closely to the HSV color space, whereas the RGB color representation does not match human perception. Hue (H) indicates the pure color, Saturation (S) indicates the percentage of white added to the pure color, and Value (V) represents intensity. The HSV color space can be represented as a hexacone [32]. When saturation is zero, we get only shades of gray, from black to white, as the intensity increases. Incident light is composed of many spectral components, but color information is lost when saturation is low, even if the illumination is very high. By changing the saturation from 0 to 1, the perceived color changes from shades of gray to the pure color for the given hue and intensity. It is known that the HSV color space has more discriminating power than the RGB color space. Figure 3(a) displays an RGB image, while the corresponding HSV image is shown in Figure 3(b). Figure 4 shows the pre-processing and feature vector generation. Figure 5 covers several techniques which were merged together to generate the feature vector.
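As a small illustration of the color space conversion, the following Python sketch converts RGB pixels to HSV using the standard library module `colorsys`. It is not the MATLAB code used in the experiments; the helper name `rgb_image_to_hsv` and the 2x2 toy image are assumptions made for the example.

```python
import colorsys

def rgb_image_to_hsv(rgb_image):
    """Convert rows of 8-bit (R, G, B) pixels to (H, S, V) tuples in [0, 1]."""
    return [
        [colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0) for (r, g, b) in row]
        for row in rgb_image
    ]

# Toy 2x2 image: pure red, mid gray, white, black.
image = [
    [(255, 0, 0), (128, 128, 128)],
    [(255, 255, 255), (0, 0, 0)],
]
hsv = rgb_image_to_hsv(image)
```

The gray, white and black pixels all come out with saturation 0 and differ only in V, which matches the observation that zero saturation leaves only shades of gray, while the pure red pixel has saturation 1.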

Generalized Co-Occurrence Matrix
The Generalized Co-Occurrence Matrix is useful for extracting the texture of the image. It is represented as a 4-tuple (i, j, d, θ) [33]. Here, 'i' and 'j' represent gray levels, d is the distance between pixels p1 and p2, whose gray levels are i and j respectively, and θ is the angle between pixels p1 and p2.

Correlation
It indicates how correlated a pixel is to its neighbour over the whole image.

$$\text{Correlation} = \sum_{i,j} \frac{(i - \mu_i)(j - \mu_j)\, p(i,j)}{\sigma_i \sigma_j} \tag{1}$$

Energy
It represents the sum of squared elements in the GLCM. It is also known as uniformity.

$$\text{Energy} = \sum_{i,j} p(i,j)^2 \tag{2}$$

Homogeneity
It measures the closeness of the distribution of elements in the GLCM to the GLCM diagonal.

$$\text{Homogeneity} = \sum_{i,j} \frac{p(i,j)}{1 + |i - j|} \tag{3}$$

Contrast
It measures the intensity contrast between a pixel and its neighbour over the whole image.

$$\text{Contrast} = \sum_{i,j} |i - j|^2\, p(i,j) \tag{4}$$

p(i, j) represents the count at position (i, j) in the GLCM, µ denotes the mean and σ indicates the standard deviation in the above equations. Here, small, medium and large distance values are considered to capture the span of the monument in the horizontal and vertical directions. For example, the 'Red Fort' spans almost the whole image in the horizontal direction (left to right) with Hue close to red, whereas 'Hawa Mahal' spans both the horizontal and vertical directions with Hue close to red. Thus the homogeneity and correlation properties are high for 'Red Fort' in the horizontal direction, while the same properties are high in both the horizontal and vertical directions for 'Hawa Mahal'. Table 3 shows the Generalized Co-Occurrence Matrices used as features.
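To make the co-occurrence features concrete, the sketch below builds a co-occurrence matrix for one pixel offset and computes the four properties named in this section. It is a minimal illustration in Python, not the MATLAB implementation used in the paper. Two assumptions are made: the offset (dx, dy) stands in for the distance d and angle θ (for example dx = d, dy = 0 for the horizontal direction), and the counts are normalised to sum to 1 before the properties are computed, following the common convention. The function names and the 4x4 test plane are hypothetical.

```python
import math

def cooccurrence_matrix(plane, dx, dy, levels):
    """Count pairs (i, j): value i at (x, y) and value j at (x + dx, y + dy)."""
    matrix = [[0] * levels for _ in range(levels)]
    rows, cols = len(plane), len(plane[0])
    for y in range(rows):
        for x in range(cols):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < rows and 0 <= x2 < cols:
                matrix[plane[y][x]][plane[y2][x2]] += 1
    return matrix

def glcm_properties(matrix):
    """Contrast, correlation, energy and homogeneity of a co-occurrence matrix."""
    total = sum(sum(row) for row in matrix)
    n = len(matrix)
    # Assumption: normalise counts so that all entries sum to 1.
    cells = [(i, j, matrix[i][j] / total) for i in range(n) for j in range(n)]
    mu_i = sum(i * v for i, _, v in cells)
    mu_j = sum(j * v for _, j, v in cells)
    sigma_i = math.sqrt(sum((i - mu_i) ** 2 * v for i, _, v in cells))
    sigma_j = math.sqrt(sum((j - mu_j) ** 2 * v for _, j, v in cells))
    if sigma_i * sigma_j == 0:
        correlation = float("nan")  # undefined for a constant image
    else:
        correlation = sum((i - mu_i) * (j - mu_j) * v
                          for i, j, v in cells) / (sigma_i * sigma_j)
    return {
        "contrast": sum((i - j) ** 2 * v for i, j, v in cells),
        "correlation": correlation,
        "energy": sum(v * v for _, _, v in cells),
        "homogeneity": sum(v / (1 + abs(i - j)) for i, j, v in cells),
    }

# Small 4-level test plane; horizontal neighbour at distance 1.
plane = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 2, 2, 2],
         [2, 2, 3, 3]]
glcm = cooccurrence_matrix(plane, dx=1, dy=0, levels=4)
props = glcm_properties(glcm)
```

Repeating the call with several offsets, for instance small, medium and large values of dx and dy, would give one set of four properties per distance and direction.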

Local Binary Pattern and Centre-Symmetric Local Binary Pattern
The Local Binary Pattern effectively captures texture information from the local neighbourhood.

$$\mathrm{LBP} = \sum_{i=0}^{N-1} s(n_i - n_c)\, 2^i, \qquad s(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

Here, n_c indicates the gray level of the centre pixel of the 8-neighbourhood and n_i indicates the i-th pixel of the neighbourhood. The signs of the differences in a neighbourhood are interpreted as an N-bit binary number, resulting in 2^N distinct values for the binary pattern. The LBP features are robust against illumination changes, are very fast to compute, do not require many parameters to be set, and have high discriminative power [36]. In CS-LBP, centre-symmetric pairs of pixels are compared. LBP produces 256 distinct binary patterns, whereas CS-LBP generates 16 distinct binary patterns. Robustness on flat image regions is obtained by thresholding the gray level differences with a small value T. In our proposed system, the histogram of CS-LBP is generated for all 3 planes of the HSV image, resulting in 48 (16*3) features, while the histogram of LBP is obtained for all 3 planes of the RGB image, resulting in 768 (256*3) features.
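The two operators can be illustrated with a short Python sketch for a single image plane. It is an illustrative sketch under stated assumptions, not the implementation used in the experiments: the neighbour ordering (clockwise from the top-left pixel) is an arbitrary choice, border pixels are skipped, and the function names are hypothetical. The CS-LBP comparison uses the strict test "difference greater than T".

```python
# Offsets (dy, dx) of the 8 neighbours, clockwise from the top-left pixel.
# The ordering is an assumption; any fixed ordering gives a valid code.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_code(plane, y, x):
    """8-bit LBP code: bit i is set when neighbour i >= centre pixel."""
    centre = plane[y][x]
    code = 0
    for i, (dy, dx) in enumerate(NEIGHBOURS):
        if plane[y + dy][x + dx] - centre >= 0:
            code |= 1 << i
    return code

def cs_lbp_code(plane, y, x, threshold=0):
    """4-bit CS-LBP code: compares the 4 centre-symmetric neighbour pairs."""
    code = 0
    for i in range(4):
        dy1, dx1 = NEIGHBOURS[i]
        dy2, dx2 = NEIGHBOURS[i + 4]  # pixel opposite to neighbour i
        if plane[y + dy1][x + dx1] - plane[y + dy2][x + dx2] > threshold:
            code |= 1 << i
    return code

def code_histogram(plane, code_fn, bins):
    """Histogram of codes over all interior pixels of one image plane."""
    hist = [0] * bins
    for y in range(1, len(plane) - 1):
        for x in range(1, len(plane[0]) - 1):
            hist[code_fn(plane, y, x)] += 1
    return hist

patch = [[9, 8, 7],
         [7, 6, 1],
         [6, 5, 2]]
lbp = lbp_code(patch, 1, 1)
cs_lbp = cs_lbp_code(patch, 1, 1)
cs_lbp_t2 = cs_lbp_code(patch, 1, 1, threshold=2)
lbp_hist = code_histogram(patch, lbp_code, 256)
cs_hist = code_histogram(patch, cs_lbp_code, 16)
```

Concatenating the 256-bin histograms of three planes gives 768 values and the 16-bin histograms of three planes gives 48 values, which is where the feature counts quoted above come from.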

Edge Histogram
Color information is obtained through histograms, area information is added to the feature vector using generalized co-occurrence matrices at different distances and directions, and texture information is captured using LBP and CS-LBP histograms. To add structural information (behavior at the edge points) to the feature descriptor, the Canny edge detector is used with threshold 0.2 so that the most prominent edges are preserved. The Canny edge detector consists of smoothing, finding gradients, non-maxima suppression, double thresholding and edge tracking by hysteresis [37]. For each detected edge point, a 5x5 neighbourhood is considered and its mean and standard deviation are calculated. The number of unique values obtained from these statistical properties varies for every image because the detected edge points are not fixed. It is observed that the unique values are in the range of 2000-10000. Two histograms with bin size 100 are generated, one for the mean and one for the standard deviation.
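The edge statistics step can be sketched as follows. This Python sketch assumes that a binary edge mask has already been produced by an edge detector such as Canny; the detector itself is not reimplemented here. Further assumptions: "bin size 100" is read as 100 bins over the range 0 to 255, the 5x5 window is clipped at the image border, and all function names are hypothetical.

```python
import math

def local_stats(gray, y, x, radius=2):
    """Mean and standard deviation of the (2*radius+1)^2 window around (y, x)."""
    rows, cols = len(gray), len(gray[0])
    values = [gray[r][c]
              for r in range(max(0, y - radius), min(rows, y + radius + 1))
              for c in range(max(0, x - radius), min(cols, x + radius + 1))]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)

def fixed_histogram(values, bins, lo, hi):
    """Histogram with `bins` equal-width bins covering [lo, hi]."""
    hist = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        index = min(int((v - lo) / width), bins - 1)  # clamp v == hi
        hist[index] += 1
    return hist

def edge_histograms(gray, edge_mask, bins=100, max_value=255.0):
    """Histograms of local mean and local std, taken at edge points only."""
    means, stds = [], []
    for y, row in enumerate(edge_mask):
        for x, is_edge in enumerate(row):
            if is_edge:
                mean, std = local_stats(gray, y, x)
                means.append(mean)
                stds.append(std)
    return (fixed_histogram(means, bins, 0.0, max_value),
            fixed_histogram(stds, bins, 0.0, max_value))

# Toy input: constant 5x5 image with a single edge point in the centre.
gray = [[10] * 5 for _ in range(5)]
mask = [[False] * 5 for _ in range(5)]
mask[2][2] = True
mean_hist, std_hist = edge_histograms(gray, mask)
```

Because the histograms have a fixed number of bins, the resulting feature length does not depend on how many edge points a given image contains.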

Fitness Function
Here, we have adopted the classification accuracy calculated by a linear SVM classifier on the training set as well as the testing set. We adopted tenfold cross-validation, in which the total dataset is divided randomly into 10 equal-sized parts and the SVM is trained ten times on 9/10 of the set and tested on the remaining 1/10 [38]. The overall fitness "Er" is the error rate derived from the average of the tenfold cross-validation accuracy. In our case, the value of n is 10, and accuracy(i) represents the accuracy obtained by the SVM on fold i. The fitness function is defined as follows:

$$E_r = \left(1 - \frac{\sum_{i=1}^{n} \mathrm{SVM}[\mathrm{accuracy}(i)]}{n}\right) \times 100\,\% \tag{6}$$
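The cross-validation procedure and the error rate Er can be sketched in Python as below. This is an illustrative sketch, not the experimental code: the classifier is passed in as a callable, and the `majority_scorer` used in the example is a deliberately trivial stand-in for the linear SVM so that the block stays self-contained. All function names are hypothetical.

```python
import random

def kfold_indices(n_samples, n_folds=10, seed=0):
    """Randomly split sample indices into n_folds nearly equal-sized folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def error_rate(accuracies):
    """Er = (1 - mean accuracy) * 100, with accuracies given in [0, 1]."""
    return (1.0 - sum(accuracies) / len(accuracies)) * 100.0

def cross_validate(samples, labels, train_and_score, n_folds=10, seed=0):
    """Train on n_folds - 1 folds, score on the held-out fold, return Er."""
    accuracies = []
    for test_idx in kfold_indices(len(samples), n_folds, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(samples)) if i not in held_out]
        accuracies.append(train_and_score(
            [samples[i] for i in train_idx], [labels[i] for i in train_idx],
            [samples[i] for i in test_idx], [labels[i] for i in test_idx]))
    return error_rate(accuracies)

def majority_scorer(train_x, train_y, test_x, test_y):
    """Stand-in classifier (NOT an SVM): always predicts the majority label."""
    majority = max(sorted(set(train_y)), key=train_y.count)
    return sum(1 for y in test_y if y == majority) / len(test_y)

# Toy data: 100 samples, 80 of class 0 and 20 of class 1.
samples = [[float(i)] for i in range(100)]
labels = [0] * 80 + [1] * 20
er = cross_validate(samples, labels, majority_scorer)
```

Replacing `majority_scorer` with a function that trains a linear SVM on the training folds and returns its accuracy on the held-out fold would give the fitness value described above.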

RESULTS AND ANALYSIS

Implementation
Our goal is to classify the above-mentioned monuments in large repositories of photographs uploaded to social networking websites. We evaluated the proposed method using 64-bit MATLAB 2013a on a Windows 8.1 machine with 8 GB of RAM and a 5th-generation i7 processor.

Datasets
The WANG database [39] is a subset of 1,000 manually selected images from the Corel stock photo database, forming 10 classes of 100 images each. It is shown in Figure 9. The WANG database can be considered similar to common stock photo retrieval tasks, with several images from each category and a potential user who has an image from a particular category and is looking for similar images. The 10 classes are used for relevance estimation: given a query image, it is assumed that the user is searching for images from the same class; the remaining 99 images from the same class are therefore considered relevant and the images from all other classes irrelevant. The Oliva and Torralba (OT-Scene) dataset (8 categories, 2688 images), shown in Figure 10, was also considered for evaluation [41]. Figure 11 displays example images from the Monuments dataset. The most important task is the collection of data, as no ready-made dataset is available for the task at hand. In our dataset, 10 different classes are considered, namely: ' Taj

Experimental Results
The performance of the system is evaluated based on Error Rate, Precision, Recall, Accuracy and F-Score [40]. Table 5 shows the confusion matrix status for the WANG and Monuments datasets, while Table 6 focuses on the OT-Scene dataset. The confusion matrix for the Indian Monuments dataset is shown in Table 4.
$$\text{Precision} = \frac{tp}{tp + fp} \tag{7}$$

$$\text{Recall} = \frac{tp}{tp + fn} \tag{8}$$

$$\text{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn} \tag{9}$$

$$\text{F-score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{10}$$

Here, tp indicates true positives, fp represents false positives, and fn and tn are false negatives and true negatives respectively. The F-score is also known as the harmonic mean of precision and recall. Figure 12 shows the Receiver Operating Characteristic curve for the Monuments dataset. Similar curves can be easily plotted for the other benchmark databases.
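For reference, these measures are straightforward to compute from the four confusion counts of one class. The short Python sketch below is illustrative only; the function name and the example counts are hypothetical and are not results from the paper, and the denominators are assumed to be non-zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy and F-score from confusion counts.

    Assumes tp + fp, tp + fn and precision + recall are all non-zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f_score = 2 * (precision * recall) / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_score": f_score,
    }

# Hypothetical counts for one class in a 500-image, 10-class problem.
metrics = classification_metrics(tp=45, fp=5, fn=5, tn=445)
```

In a multi-class setting, the counts are taken per class from the confusion matrix (one class against the rest) and the resulting values can be averaged over the classes.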

CONCLUSION
Recently, Content Based Image Classification has found successful applications in industries such as agriculture, pharmaceuticals, surveillance and many more. The tourism industry of any country plays a vital role in the economic growth of the nation. The presence of monuments in Bollywood movies and its impact on the tourism industry are highlighted, as tourists prefer to visit such places. The feature vector is generated using histograms, Local Binary Pattern, Generalized Co-Occurrence Matrix and the Canny edge detector. Ten popular Indian monuments were considered and an image database was constructed. The system achieved an average accuracy of 97% with high precision and recall on the Indian monument database. The proposed system also works well on the other benchmark databases.