Neighbor Weighted K-Nearest Neighbor for Sambat Online Classification

ABSTRACT


INTRODUCTION
Electronic government (e-government) has been an emerging trend for the past two decades. Nowadays, e-government is not limited to developed countries. There are innovative e-government applications in developing countries as well, as ICTs are increasingly used by governments to connect more closely with their citizens. With e-government, two-way communication between citizens and government can be established easily: citizens can convey their aspirations, criticisms, or opinions to the government without difficulty [1]. SAMBAT Online is one implementation of e-government provided by Diskominfo (the Communication and Information Department) of the Malang city government. SAMBAT Online is a complaint system that enables the people of Malang to express their opinions, suggestions, criticisms, questions, or complaints about the performance of public facilities or services provided by the government. Diskominfo verifies and accepts all incoming complaints; it also has to sort and classify the complaints by the intended department manually. With the large number of incoming complaints, this process is obviously expensive and time-consuming. Hence, an automatic complaint classification system is required.
SAMBAT Online classification can be considered a topical text classification problem. Various traditional machine learning methods have been applied to this problem, such as Naïve Bayes [2][3][4][5][6], Support Vector Machines [7][8], K-Nearest Neighbors [9-12], and Neural Networks [13][14]. These methods have been shown to provide excellent performance in text classification. However, the SAMBAT Online dataset is imbalanced, and the performance of these methods suffers a significant drawback when dealing with imbalanced data [15][16]. The imbalanced data issue arises frequently in clustering and classification scenarios when the amount of data in one class is much larger than in the other classes [17]. When applied to such skewed data, traditional machine learning methods tend to be flooded by the major class and to neglect the minor ones [18].
One improved machine learning method devoted to tackling the issue of imbalanced data is Neighbor Weighted K-Nearest Neighbor (NW-KNN). NW-KNN is an improved K-Nearest Neighbor (KNN) method proposed by Tan [19] that adds a weighting stage to handle imbalanced data. It assigns a small weight to neighbors from the majority class and a larger weight to neighbors from minority classes, and it has been shown to achieve significantly improved performance on imbalanced data.
In this study, we implement the NW-KNN method for SAMBAT Online classification. We use cosine similarity for measuring text proximity to determine neighbors in NW-KNN. We also use N-gram features to improve the performance of the classifier, given their promising performance when combined with cosine similarity [20]. By applying the NW-KNN method supported by N-gram feature extraction, the classification system is expected to handle the imbalanced data classification problem well.

RESEARCH METHOD
As depicted in Figure 1, SAMBAT Online classification in this study is composed of three major phases: 1) preprocessing; 2) N-gram feature extraction; and 3) classification using NW-KNN.

Document Preprocessing
Preprocessing aims to prepare raw documents, both training and test documents, before further processing. The document preprocessing stage includes tokenization, filtering, and stemming. In the first step, the document is split into smaller units called tokens or terms [21][22]; all characters are converted to lowercase, and punctuation, numbers, HTML tags, and non-alphabetic characters are removed. The next step is filtering: removing uninformative words (the stoplist) based on the existing stoplist dictionary by Tala [23]. The last step is stemming, in which every word is converted to its root [24][25]. For example, the words "jalan", "dijalankan", and "perjalanan" are all converted to the same root word "jalan".
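The preprocessing pipeline described above can be sketched as follows. The stoplist shown here is a small illustrative subset, not Tala's full dictionary, and the stemmer is left as a pluggable component (e.g. an Indonesian stemmer such as Sastrawi could be passed in):

```python
import re

# Illustrative subset of an Indonesian stoplist; the paper uses Tala's dictionary.
STOPLIST = {"yang", "di", "ke", "dan", "untuk", "pada"}

def preprocess(text, stemmer=None):
    """Tokenize, filter, and (optionally) stem a raw complaint document."""
    # Tokenization: lowercase, drop HTML tags, keep alphabetic tokens only
    # (punctuation and numbers are thereby removed).
    text = re.sub(r"<[^>]+>", " ", text.lower())
    tokens = re.findall(r"[a-z]+", text)
    # Filtering: remove stoplist words.
    tokens = [t for t in tokens if t not in STOPLIST]
    # Stemming: reduce each word to its root; identity if no stemmer given.
    if stemmer is not None:
        tokens = [stemmer.stem(t) for t in tokens]
    return tokens
```

The resulting token list is what the feature extraction stage consumes.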

N-Gram Features Extraction
N-Gram is a slice of n words obtained from a document [26]. The n can vary from 1 (unigram), 2 (bigram), 3 (trigram), 4, and so on. In this work, we use unigrams, bigrams, and the combination of both. For example, if we have a document containing the sentence "we eat rice", then the N-gram features of this document are as presented in Table 1. Furthermore, we represent the features with TF.IDF weighting. TF.IDF is the most widely employed term weighting algorithm in document classification [27]. TF.IDF incorporates term frequency (TF) and inverse document frequency (IDF). The TF.IDF weight of feature $t$ in document $d$ is formulated as follows:

$$w_{t,d} = tf_{t,d} \times \log\frac{N}{df_t}$$

where $tf_{t,d}$ is the frequency of feature $t$ in document $d$, $N$ is the number of documents in the dataset, and $df_t$ is the number of documents in the dataset that contain feature $t$. This feature representation is used in the classification stage.
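The N-gram extraction and TF.IDF weighting described above can be sketched as follows. Function names are illustrative, and `tfidf` uses the natural logarithm, as the paper does not specify the log base:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Extract n-gram features (as space-joined strings) from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs):
    """Weight each document's features with TF.IDF: tf * log(N / df)."""
    n_docs = len(docs)
    df = Counter()                     # document frequency of each feature
    for doc in docs:
        df.update(set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)              # term frequency within this document
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted
```

For the example sentence, `ngrams(["we", "eat", "rice"], 2)` yields the bigrams `["we eat", "eat rice"]`; the combined feature set simply concatenates the unigram and bigram lists.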

Classification using NW-KNN
The last stage is document classification using Neighbor Weighted K-Nearest Neighbor (NW-KNN). Each complaint is classified according to the intended department. NW-KNN is a modification of the KNN algorithm designed to handle imbalanced data. The initial stage is finding the k nearest neighbors by calculating the distance or similarity between the testing and training data; cosine similarity is used for this task in this study. The application of the NW-KNN algorithm is not much different from the traditional KNN algorithm. The only difference between the two lies in the calculation of class weights. In traditional KNN, each class has the same weight. NW-KNN, on the other hand, gives the minority classes a greater weight and the majority class a smaller weight. The weight of each class is calculated as follows:

$$Weight_i = \frac{1}{\left( Num(C_i) \,/\, \min_j\{Num(C_j)\} \right)^{1/p}}$$

where $Weight_i$ is the weight of class $C_i$, $Num(C_i)$ is the number of training data in class $C_i$, $\min_j\{Num(C_j)\}$ is the least number of training data among the classes, and $p$ is a constant exponent whose value is usually more than 1. In this study, we use $p = 2$. This weight, alongside the k nearest neighbors, is used to calculate a score for each class; the class with the highest score becomes the class of the test data. The score of each class is calculated as follows:

$$score(x, C_i) = Weight_i \sum_{d \in KNN(x)} sim(x, d)\, \delta(d, C_i)$$

where $score(x, C_i)$ is the score of class $C_i$ for testing data $x$, $Weight_i$ is the weight of class $C_i$, $KNN(x)$ is the set of training data among the k nearest neighbors of the test data $x$, and $sim(x, d)$ is the similarity between training data $d$ and testing data $x$, for which we employ cosine similarity. Meanwhile, $\delta(d, C_i)$ is a binary weight whose value is 1 if training data $d$ belongs to class $C_i$, and 0 otherwise. With this formula, NW-KNN can handle majority class dominance in imbalanced data because it gives a lower weight to the majority class and a higher weight to the minority ones.
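A minimal sketch of the NW-KNN classifier described above, assuming documents are represented as sparse TF.IDF dictionaries mapping features to weights (names such as `nwknn_classify` are illustrative, not from the paper):

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse TF.IDF vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def nwknn_classify(test_vec, train, k=3, p=2):
    """Classify test_vec with NW-KNN; train is a list of (vector, label) pairs."""
    # Class weight: Weight_i = 1 / (Num(C_i) / min_j Num(C_j)) ** (1 / p),
    # so majority classes receive smaller weights (the paper uses p = 2).
    counts = Counter(label for _, label in train)
    n_min = min(counts.values())
    weight = {c: 1.0 / (n / n_min) ** (1.0 / p) for c, n in counts.items()}
    # Find the k nearest neighbors by cosine similarity.
    neighbors = sorted(train, key=lambda pair: cosine(test_vec, pair[0]),
                       reverse=True)[:k]
    # Weighted score per class; the highest-scoring class wins.
    score = Counter()
    for vec, label in neighbors:
        score[label] += weight[label] * cosine(test_vec, vec)
    return score.most_common(1)[0][0]
```

Only neighbors actually belonging to a class contribute to its score, which plays the role of the binary weight $\delta(d, C_i)$ in the formula.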

RESULTS AND ANALYSIS
The data used in this study are taken from SAMBAT Online. The complaint texts are taken from 3 departments: the Department of Transportation or Dinas Perhubungan (DISHUB), the Department of Sanitation and Parks or Dinas Kebersihan dan Pertamanan (DKP), and the Department of Public Works, Housing and Building Supervision or Dinas Pekerjaan Umum, Perumahan dan Pengawasan Bangunan (DPUPPB). The total data used is 310 documents, divided into 237 training data and 73 test data. The training data consist of 27 documents from the DKP class, 49 from the DPUPPB class, and 161 from the DISHUB class. Meanwhile, the test data consist of 13 documents from the DKP class, 21 from the DPUPPB class, and 39 from the DISHUB class.
Three experiment scenarios are performed in this study. The first experiment focuses on the effect of the k value of NW-KNN and on finding the optimal value of k. The second experiment focuses on the effect of N-grams as features for classification using NW-KNN. In the last one, we compare the performance of NW-KNN and the conventional KNN method. We use precision, recall, and f-measure for evaluation in all of these experiments.
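The evaluation metrics can be computed per class and then averaged. The sketch below assumes macro-averaging over the classes, since the paper does not state the averaging scheme:

```python
def macro_scores(y_true, y_pred, labels):
    """Macro-averaged precision, recall, and f-measure over the given classes."""
    precisions, recalls = [], []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        pred_c = sum(1 for p in y_pred if p == c)   # predicted as class c
        true_c = sum(1 for t in y_true if t == c)   # actually in class c
        precisions.append(tp / pred_c if pred_c else 0.0)
        recalls.append(tp / true_c if true_c else 0.0)
    p = sum(precisions) / len(labels)
    r = sum(recalls) / len(labels)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```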

K Value Variation Experiment
In this experiment, we compared k values of 1, 3, 5, 7, and 15. Unigram (bag of words) features are used. Table 2 shows the result of this experiment. The results show that, in general, the performance of the classification system decreases as the value of k gets higher. This is because the higher the value of k, the higher the probability that neighbors at further distances are also taken into consideration; such far neighbors can be irrelevant for choosing the right class. The value k=3 yields the best performance, with 77.85% precision, 74.18% recall, and a 75.25% f-measure. However, k=1 yields the worst performance, with an f-measure of only 65.51%, because it considers only one neighbor, which can be very biased.

N-Gram Variation Experiment
In this experiment, the N-gram variants used as features were unigrams, bigrams, and the combination of both. This experiment is conducted using k=3; Table 3 shows the result. As seen in Table 3, the unigram feature shows the best performance, with 77.85% precision, 74.18% recall, and a 75.25% f-measure. Meanwhile, the worst performance is obtained when bigrams are employed, with an f-measure of only 48.51%. This is because many bigram terms, each a combination of two words, rarely appear in more than one document; a bigram often occurs only in the document from which it was extracted. This is very different from the unigram feature, which consists of a single word and can therefore occur in many documents.

NW-KNN and KNN Comparison Experiment
A comparison of the KNN and NW-KNN algorithms is performed in this experiment. The unigram feature is used, with k values of 1, 3, 5, 7, and 15; Figure 2 shows the result. The result shows that, in general, the NW-KNN algorithm performs better than the conventional KNN algorithm as the value of k gets bigger. This is because the distribution of training data among the classes is imbalanced. As the neighboring value k grows, the KNN algorithm tends to consider far neighbors that often belong to the class with the largest amount of training data. As a result, with KNN, many testing data are classified into the majority class even though they should not be. Meanwhile, this problem can be avoided by the NW-KNN algorithm because it gives a lower weight to the majority class.