Random Forest Approach for Sentiment Analysis in Indonesian Language

ABSTRACT


INTRODUCTION
Nowadays, people tend to write about their experiences, feelings, opinions, and views on events, products, or services on online platforms such as social media, blogs, forums, shopping sites, or review sites. This makes online platforms a source of highly valuable information for both consumers and producers. Customers get second opinions before purchasing a product or service, while producers learn what people think about their products or services and can predict the level of public acceptance. This information can be very useful for improvement and marketing strategies [1]. Sentiment analysis is the task of analyzing people's opinions in a piece of text in order to determine whether the sentiment is positive, negative, or neutral. Sentiment analysis has been gaining popularity over the past years as a result of the rise of social media and online review websites and, thus, the need to analyze their sentiment in an effective and efficient way. Sentiment analysis is currently a major research field with applications in a large number of domains such as election result prediction [2]-[4], stock market prediction [5], [6], product and merchant ranking [7], movie revenue prediction [8]-[10], learning evaluation [11], [12], etc.
We can consider sentiment analysis a text classification problem with sentiments as its categories. Therefore, we can use supervised machine learning approaches to tackle it. This approach is very popular in sentiment analysis and has proven to perform very well in this field. Machine learning approaches that have been used in this field include Naive Bayes [13]-[17], Support Vector Machines [18]-[19], Maximum Entropy [20], Neural Networks [21], [22], decision trees, and K-Nearest Neighbor (KNN) [23]-[26].
In this study, we explore the use of Random Forest for sentiment classification in the Indonesian language. Random Forest is an ensemble learning technique based on the decision tree algorithm [27]. Random Forests have been gaining popularity in recent years since their performance has surpassed SVMs, Naïve Bayes, and other machine learning algorithms on classification tasks in some domains such as bioinformatics and computational biology [28]. We test whether this type of ensemble method remains outstanding on sentiment analysis tasks. In this study, we also explore the use of bag-of-words (BOW) features with several term weighting variations: Binary TF, Raw TF, Logarithmic TF, and TF.IDF.

RESEARCH METHOD
As depicted in Figure 1, the sentiment analysis system in this study consists of three main stages: preprocessing, feature extraction, and classification using Random Forest. The output of the classification is one of two categories, positive or negative.

Preprocessing
The first stage of this system is preprocessing. This stage involves several processes, including tokenization, case folding, and cleaning. Tokenization is the task of splitting review text into smaller units called tokens or terms [29], [30]. Case folding is the task of converting all characters in the review text to lowercase [31], [32]. Meanwhile, cleaning is the task of removing punctuation, numbers, HTML tags, and characters outside of the alphabet. In this study, we do not employ stemming and filtering, since in some previous works on sentiment analysis, stemming and filtering did not improve classification performance.
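As a minimal sketch, the three preprocessing steps can be chained as follows (the regular expressions and the example review are illustrative assumptions, not the paper's exact implementation):

```python
import re

def preprocess(text):
    """Case folding, cleaning, and whitespace tokenization for a review text."""
    text = text.lower()                    # case folding
    text = re.sub(r"<[^>]+>", " ", text)   # cleaning: strip HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)  # cleaning: drop punctuation, numbers, non-alphabetic chars
    return text.split()                    # tokenization

tokens = preprocess("Produk ini <b>BAGUS</b> banget, 10/10!")
# tokens == ["produk", "ini", "bagus", "banget"]
```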

Feature Extraction
Bag-of-words (BOW) features are used in this study. Each document is represented as a vector in a term space, with the unique terms from the preprocessing stage as its features. The feature vector values are determined using a term weighting method. The most popular term weighting methods are Term Frequency (TF), Inverse Document Frequency (IDF), and the combination of the two, Term Frequency Inverse Document Frequency (TF.IDF) [33].
Term Frequency assigns weights by assuming that each term has a contribution proportional to the number of its occurrences in the document [34], [35]. There are some popular variations of TF, such as Binary TF, Raw TF, and Logarithmic TF. Using Binary TF, each document is represented as a binary vector: a term that occurs in a document gets the value 1 in the document vector, while a term that never occurs in the document gets the value 0. This kind of term weighting does not consider the number of term occurrences, only 0/1 values. In contrast to Binary TF, the Raw TF method does consider the number of term occurrences: a term gets a value based on how many times it appears in the document. Logarithmic TF also considers the number of term occurrences, but dampens it logarithmically: $w_{t,d} = 1 + \log(tf_{t,d})$ if $tf_{t,d} > 0$, and $0$ otherwise, where $tf_{t,d}$ is the number of times term $t$ appears in document $d$.
Meanwhile, Inverse Document Frequency is a global term weighting computed from the distribution of a term across the dataset. This term weighting gives a higher value to a rare term, i.e., a term that appears only in certain documents. The weight of term $t$ using IDF is formulated as $idf_t = \log(N_d / df_t)$, where $N_d$ is the number of documents in the dataset and $df_t$ is the number of documents in the dataset in which term $t$ appears.
The most popular term weighting is TF.IDF, a multiplication of TF and IDF. The combined weight of term $t$ in document $d$ is computed as $w_{t,d} = tf_{t,d} \times idf_t$ [36], where $tf_{t,d}$ is the TF value of term $t$ in document $d$ and $idf_t$ is the IDF value of term $t$.
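The four weighting schemes above can be sketched as simple functions (the toy counts `tf`, `N`, and `df` are illustrative; the natural logarithm is assumed here, though other log bases are also common):

```python
import math

def binary_tf(tf):
    """Binary TF: 1 if the term occurs at all, else 0."""
    return 1 if tf > 0 else 0

def raw_tf(tf):
    """Raw TF: the plain occurrence count."""
    return tf

def log_tf(tf):
    """Logarithmic TF: 1 + log(tf) for tf > 0, else 0."""
    return 1 + math.log(tf) if tf > 0 else 0

def idf(n_docs, df):
    """IDF: log(N / df), higher for rarer terms."""
    return math.log(n_docs / df)

# toy example: a term appears 3 times in a document,
# and occurs in 10 of 100 documents in the dataset
tf, N, df = 3, 100, 10
weights = {
    "binary": binary_tf(tf),            # 1
    "raw":    raw_tf(tf),               # 3
    "log":    log_tf(tf),               # 1 + log(3)
    "tfidf":  raw_tf(tf) * idf(N, df),  # 3 * log(10)
}
```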

Sentiment Classification using Random Forest
The last stage is sentiment classification. Each review is classified into the positive or negative category. In this study, we employ Random Forest for the classification task. Random Forest is a supervised classification algorithm; it is an ensemble learning technique based on the decision tree algorithm [27]. This ensemble technique combines the predictions of several base estimators constructed with the decision tree algorithm to enhance robustness over an individual estimator. Random Forest grows many classification trees, collectively called a forest. To classify a new data point, each tree gives its category prediction as one vote, and the forest chooses the category with the majority of votes. In general, the more trees in the random forest, the higher the resulting accuracy.
Random Forests have been gaining popularity in recent years since the performance of this type of algorithm has been outstanding on classification tasks in some domains such as bioinformatics and computational biology. There are also some works on text classification using Random Forest, such as hate speech detection [37] and authorship profiling [38].
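The majority-voting mechanism described above can be illustrated with Scikit-learn's `RandomForestClassifier` (the tiny BOW matrix and sentiment labels below are invented for illustration, not the paper's data):

```python
from sklearn.ensemble import RandomForestClassifier

# hypothetical BOW matrix: rows = reviews, columns = term weights
X = [[1, 0, 2], [0, 3, 0], [2, 1, 0], [0, 0, 4]]
y = ["positive", "negative", "positive", "negative"]

# grow a forest of 100 classification trees
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# each of the 100 trees casts one vote; the majority class is returned
print(forest.predict([[1, 1, 0]]))
```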

RESULTS AND ANALYSIS
The experiment was conducted using 386 reviews taken from FemaleDaily. All of the reviews are in the Indonesian language. Instead of using cross validation, Random Forest uses the out-of-bag (OOB) error estimate to get an unbiased estimate of classification performance. The OOB score ranges from 0 to 1: the higher the OOB score, the better the classification performance, while a lower OOB score indicates worse performance. In the experiment, Random Forest is tested using several term weighting methods, including Binary TF, Raw TF, Logarithmic TF, and TF.IDF. The experiment was conducted using the Scikit-learn library [39]. The result can be seen in Figure 2. Figure 2 shows that sentiment analysis using Random Forest gives good performance, with an average OOB score of 0.829. We can also see from Figure 2 that all four term weighting methods yield competitive results; the OOB scores differ only slightly. The best OOB score, 0.837, is obtained by Raw TF, and the lowest, 0.821, by Logarithmic TF. In second place is Binary TF with an OOB score of 0.829, and in third place TF.IDF with an OOB score of 0.828. This result is actually surprising, because TF.IDF usually outperforms other term weighting methods. However, since the score differences are not significant, we can say that the term weighting variation in this study has no remarkable effect on sentiment analysis using Random Forest.
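A minimal sketch of how the OOB score is obtained with Scikit-learn (random data stands in for the actual 386-review BOW matrix, which is not reproduced here; `oob_score=True` makes the forest evaluate each sample on the trees that did not see it during bootstrapping):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(50, 20)        # stand-in for a BOW feature matrix
y = rng.randint(0, 2, 50)   # stand-in positive/negative labels

forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                bootstrap=True, random_state=0)
forest.fit(X, y)

# fraction of out-of-bag samples classified correctly, in [0, 1]
print(forest.oob_score_)
```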

CONCLUSION
In this study, we explored Random Forest with several term weighting methods for sentiment analysis in the Indonesian language. The system in this study consists of three main stages: preprocessing, feature extraction, and classification using Random Forest. The output of the classification is one of two categories, positive or negative. The experiment results showed that sentiment analysis using Random Forest gives good performance, with an average OOB score of 0.829. The results also showed that all four term weighting methods yield competitive results. Since the score differences are not significant, we can say that the term weighting variation in this study has no remarkable effect on sentiment analysis using Random Forest.