Improving Sentiment Analysis of Short Informal Indonesian Product Reviews using Synonym Based Feature Expansion

Sentiment analysis of short informal texts such as product reviews is challenging. Short texts are sparse, noisy, and lack context information, so traditional text classification methods may not be suitable for analyzing their sentiment. A common approach to overcoming these problems is to enrich the original text with additional semantics so that it resembles a longer document, after which traditional classification methods can be applied. In this study, we developed an automatic sentiment analysis system for short informal Indonesian texts using Naïve Bayes and Synonym Based Feature Expansion. The system consists of three main stages: preprocessing and normalization, feature expansion, and classification. After preprocessing and normalization, we use Kateglo to find synonyms of every word in the original text and append them to it. Finally, the text is classified using Naïve Bayes. The experiment shows that the proposed method can improve the performance of sentiment analysis of short informal Indonesian product reviews. The best sentiment classification performance with the proposed feature expansion is an accuracy of 98%. The experiment also shows that feature expansion gives a larger improvement with a small amount of training data than with a large amount.


Introduction
Sentiment analysis is one of the fundamental problems in natural language processing (NLP). It involves analyzing people's opinions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes from a piece of text [1]. Sentiment analysis has a number of applications, including ranking products and merchants [2][3], predicting election results [4], predicting box-office revenues for movies [5][6][7], predicting the stock market [8], and characterizing social relations [9]. With the rise of social media, we now deal with many more short informal texts every day, such as tweets, status updates, comments, and reviews from various social platforms. Working with this short informal text genre is more challenging than working with traditional text genres because of its many limitations. Thus, there is growing interest in sentiment analysis of this kind of text.
Sentiment classification is a special kind of text classification problem with two classes, positive and negative. Since it is a text classification problem, any existing supervised learning method can be applied. In most cases, statistical or machine learning techniques such as Naïve Bayes, Maximum Entropy, and Support Vector Machines have proven successful in this field [10][11][12][13][14][15][16]. Some previous works also use other supervised learning methods, such as decision trees and K-Nearest Neighbor (KNN), to analyze sentiment in texts [17][18][19]. These studies showed that standard machine learning methods using unigrams (bag of words) as features perform very well in this field.
Short texts are sparse, noisy, and lack context information. Traditional text classification methods, including the machine learning methods above, may not be suitable for analyzing the sentiment of short texts given these difficulties. A common approach to overcoming these problems is to enrich the original text with additional semantics so that it resembles a longer document, after which traditional classification methods can be applied. Some previous works employ search engines to extract more information [20][21][22]. Other works utilize external sources such as Wikipedia and WordNet as background knowledge [23][24][25][26].
In this study, we conducted automatic sentiment analysis of short informal Indonesian product reviews. This is very useful because it allows reviews to be aggregated without manual intervention. Consumers can use this information to research products before buying. Marketers can use it to research public opinion of their products. Organizations can also use it to get critical feedback about problems in their newly released products. The reviews are gathered from a social platform that provides user reviews of a given product. Every review on this platform is a short informal Indonesian text that expresses a positive or negative opinion about the product. Most of the reviews are short texts with informal language, creative spelling and punctuation, misspellings, and slang words. This paper aims to improve the bag-of-words representation of short texts for sentiment analysis. We developed an automatic sentiment analysis system for short informal Indonesian texts using Naïve Bayes and Synonym Based Feature Expansion. In the first step, we conduct preprocessing, including normalization of misspellings and slang words. In the next step, we use the Kateglo API (kateglo.co.id) to find synonyms of each word in the texts and enrich the original texts with them. Finally, we perform classification using a Naïve Bayes classifier with bag of words as features.

Research Method
In general, as seen in Figure 1, the sentiment analysis system in this study consists of three main stages: preprocessing and normalization, feature expansion, and classification. The first stage involves several steps, including tokenization, stopword removal, stemming, and normalization of misspelled words. In this stage, we also perform negation conversion. In the feature expansion stage, we use the Kateglo API to find synonyms of each word in the review texts and add them to the original texts. Finally, in the sentiment classification stage, Naïve Bayes is trained on the training data, and the expanded review texts are classified using Naïve Bayes with bag of words as features.

Preprocessing
Preprocessing involves tokenization, stopword removal, stemming, normalization of misspelled words, and negation conversion. Tokenization is an early step that removes punctuation, numbers, and characters other than letters [27][28][29][30]. Case folding, which changes all capital letters into lowercase, is also performed in this stage. Stopword removal, or filtering, removes uninformative words listed in an existing stopword dictionary. In this case, we use the stoplist by Tala that has been used in [31]. Stemming converts every word to its root by removing affixes such as prefixes, infixes, and suffixes. In this case, we use the Nazief-Adriani stemmer [32]. Normalization of misspelled words changes each word into its formal form. For example, the word "ga" becomes "tidak" and the word "bisaaaa" becomes "bisa". Negation conversion handles the negation words contained in a sentence, which can change the sentiment value of that sentence. The most used negation words in Indonesian are "tidak", "bukan", "tak", "tanpa", "kurang", and "jangan". Negation conversion is done by replacing the negated word with its antonym. For example, the negation conversion of "tidak bagus" is "jelek".
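The preprocessing steps can be illustrated with a minimal Python sketch. It is not the system used in this study: the lookup tables are tiny hypothetical stand-ins for the slang dictionary, the stoplist by Tala, and the antonym source; stemming with the Nazief-Adriani stemmer is omitted; collapsing runs of three or more repeated letters is an assumed heuristic for cases such as "bisaaaa"; and the order of the steps is also an assumption.

```python
import re

# Hypothetical lookup tables for illustration only; the real dictionaries
# (slang list, stoplist, antonym source) are much larger.
SLANG = {"ga": "tidak", "gak": "tidak"}
STOPWORDS = {"yang", "dan", "ini"}
NEGATIONS = {"tidak", "bukan", "tak", "tanpa", "kurang", "jangan"}
ANTONYMS = {"bagus": "jelek", "jelek": "bagus"}


def tokenize(text):
    """Case-fold the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def normalize(token):
    """Collapse runs of 3+ repeated letters, then map slang to formal words."""
    token = re.sub(r"(.)\1{2,}", r"\1", token)
    return SLANG.get(token, token)


def convert_negation(tokens):
    """Replace "<negation> <word>" with the antonym of <word> when one is known."""
    out = []
    i = 0
    while i < len(tokens):
        has_next = i + 1 < len(tokens)
        if tokens[i] in NEGATIONS and has_next and tokens[i + 1] in ANTONYMS:
            out.append(ANTONYMS[tokens[i + 1]])
            i += 2  # skip both the negation word and the negated word
        else:
            out.append(tokens[i])
            i += 1
    return out


def preprocess(text):
    """Tokenize, normalize, convert negations, then drop stopwords."""
    tokens = [normalize(t) for t in tokenize(text)]
    tokens = convert_negation(tokens)
    return [t for t in tokens if t not in STOPWORDS]
```

In this sketch, stopwords are removed last so that a negation word is still present when negation conversion runs.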

Feature expansion
Feature expansion is the process of enriching the original text with additional semantics so that it resembles a longer document [33]. In this study, we use Kateglo to find synonyms of every word in the original text and append them to it. Kateglo is a dictionary website that provides an API for fetching word attributes such as lexical class, root form, synonyms, and antonyms. Our system finds the synonyms of a given word by sending the word as a parameter to the URL http://kateglo.com/api.php?format=json&phrase=[word]. Then, our system parses the JSON data received from the Kateglo server and uses it for feature expansion, as seen in Figure 2.
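The expansion step can be sketched in Python as follows. The request URL follows the pattern quoted above. Everything else is an assumption made for illustration: the function names, the `max_synonyms` limit, and the in-memory `SAMPLE_SYNONYMS` table, which stands in for the Kateglo lookup so that the sketch runs offline. The structure of the JSON response is not described in this paper, so no parsing code is shown.

```python
from urllib.parse import urlencode

KATEGLO_API = "http://kateglo.com/api.php"  # endpoint quoted in the text


def build_query_url(word):
    """Build the request URL for one word, following the pattern in the text."""
    return KATEGLO_API + "?" + urlencode({"format": "json", "phrase": word})


def expand_features(tokens, lookup_synonyms, max_synonyms=3):
    """Return the tokens followed by up to max_synonyms synonyms of each token.

    lookup_synonyms is any callable mapping a word to a list of synonyms;
    max_synonyms is a hypothetical limit, not a value taken from the paper.
    """
    expanded = list(tokens)
    for token in tokens:
        expanded.extend(lookup_synonyms(token)[:max_synonyms])
    return expanded


# Hypothetical offline stand-in for the synonym lookup.
SAMPLE_SYNONYMS = {"bagus": ["baik", "elok"], "cepat": ["lekas", "kencang"]}


def sample_lookup(word):
    """Look a word up in the illustrative table; unknown words have no synonyms."""
    return SAMPLE_SYNONYMS.get(word, [])


expanded_review = expand_features(["aplikasi", "bagus"], sample_lookup)
```

Passing the lookup as a callable keeps the expansion logic independent of where the synonyms come from, so a function that queries the API could be substituted for `sample_lookup`.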

Classification using Naïve Bayes
Naïve Bayes is one of the most effective and efficient inductive learning algorithms for machine learning and data mining. Its classification performance is competitive even though it assumes that attributes are independent. This assumption rarely holds in real data, but even when it is violated, the classification performance of Naïve Bayes remains quite high, as shown by various empirical studies [34][35].
The Naïve Bayes classifier belongs to the family of Bayesian learning algorithms: it is built from training data and estimates the probability of each category given the features of a test document. In general, the classification process using the Naïve Bayes method can be seen in Equation 1.
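As an illustration of this classification step, the following Python sketch implements a multinomial Naïve Bayes classifier over bag-of-words features. It assumes the common formulation with add-one (Laplace) smoothing and log probabilities, which may differ in detail from Equation 1. The class name, the choice to skip words never seen in training, and the toy training reviews are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words features, add-one smoothing."""

    def fit(self, documents, labels):
        """Count documents per class and word occurrences per class."""
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def log_score(self, tokens, label):
        """Return log P(label) + sum of log P(token | label) over known tokens."""
        total_docs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total_docs)
        denominator = sum(self.word_counts[label].values()) + len(self.vocabulary)
        for token in tokens:
            if token in self.vocabulary:  # words unseen in training are skipped
                score += math.log((self.word_counts[label][token] + 1) / denominator)
        return score

    def predict(self, tokens):
        """Return the class with the highest log score."""
        return max(self.class_counts, key=lambda label: self.log_score(tokens, label))


# Toy training data, invented for illustration only.
train_docs = [["bagus", "cepat"], ["baik", "mudah"], ["jelek", "lambat"], ["buruk", "error"]]
train_labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayes().fit(train_docs, train_labels)
prediction = model.predict(["aplikasi", "bagus"])
```

Because words outside the training vocabulary contribute nothing to the score, a review made up entirely of unseen words is decided by the class priors alone. This is the situation that feature expansion is meant to reduce.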

Results and Analysis
The dataset used in the experiment consists of mobile banking app reviews. The test set contains 100 reviews, 50 positive and 50 negative. The training set size varies: 50, 100, 400, and 1000 reviews. This experiment was conducted to determine the effect of feature expansion and of the amount of training data on sentiment classification performance. The evaluation metric used in this experiment is accuracy. The experiment results are shown in Figure 3. As displayed in Figure 3, the best sentiment classification accuracy, 98%, is obtained with 400 training reviews and feature expansion. For every training set size, the results with feature expansion have better classification accuracy than those without it. These results show that feature expansion increases sentiment classification performance. In short-text classification, many words in the test data never appear in the training data, which can damage classification performance. With feature expansion, the system appends new words, in this case the synonyms of each word in the test data, to the test data before the classification stage. The vocabulary of the test data therefore becomes richer, and the probability that the training data and the test data share the same words increases. Thus, the sentiment classification process performs better than it does without feature expansion. Figure 3 also shows that the amount of training data affects sentiment classification performance both with and without feature expansion. The more training data, the higher the accuracy obtained. The largest accuracy difference between classification with and without feature expansion occurs when the training data is minimal. This difference narrows as the amount of training data increases.
This result shows that feature expansion has a bigger influence when the training data is small. With a large amount of training data, a word in the test data that would be expanded has most likely already appeared in the training data. Hence, in this case, feature expansion does not increase sentiment classification performance significantly.
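The vocabulary-overlap argument can be made concrete with a small Python sketch. The training vocabulary, the review, and its synonyms are invented for illustration and are not taken from the dataset.

```python
def shared_words(test_tokens, training_vocabulary):
    """Return the test words that also occur in the training vocabulary."""
    return set(test_tokens) & set(training_vocabulary)


# Hypothetical training vocabulary and test review.
training_vocabulary = {"bagus", "cepat", "jelek", "lambat"}
review = ["elok"]

# Hypothetical synonyms appended by feature expansion.
expanded_review = review + ["bagus", "cantik"]

before = shared_words(review, training_vocabulary)
after = shared_words(expanded_review, training_vocabulary)
```

Here the original review shares no words with the training vocabulary, while the expanded review shares "bagus". With a larger training vocabulary that already contains "elok", the expansion would add little, which is consistent with the smaller gains reported for larger training sets.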

Conclusion and Future Works
The proposed method, Synonym Based Feature Expansion, has been shown to improve the performance of sentiment analysis of short informal Indonesian product reviews. In the experiment, the Naïve Bayes classifier with feature expansion always had better classification accuracy than the one without it. The best sentiment classification performance, an accuracy of 98%, was obtained with 400 training reviews and feature expansion. The amount of training data also affects sentiment classification performance both with and without feature expansion. The more training data, the higher the accuracy obtained. The largest accuracy difference between classification with and without feature expansion occurs when the training data is minimal. This difference decreases as the amount of training data increases. This result shows that feature expansion gives a bigger improvement with a small amount of training data than with a large amount.