Arabic Sentiment Analysis Using Deep Learning and Ensemble Methods

With the proliferation of social networks, blogs, and forums, classifying subjective text shaped by personal feelings and opinions has become an active research area. Many techniques have been proposed to analyze and classify the sentiments held in such reviews and recommendations. Recently, deep learning models have shown promising results in many fields, including sentiment analysis. Therefore, in this study, we propose a deep learning-based sentiment analysis model to predict the polarity of opinions and sentiments. Two types of recurrent neural networks are leveraged to learn higher-level representations. Then, to mitigate the data dependency problem and increase the model's robustness, three distinct classification algorithms are combined to produce the final output. Experimental results show that our model prevailed on all the selected datasets, with accuracy ranging between 81.11% and 94.32%. Moreover, the model reduced the relative classification error rate by up to 26% compared to state-of-the-art models.


Introduction
With today's many diverse social media platforms, people can freely publish their opinions and feedback about products, individuals, subjects, places, policies, and more. As a result, the volume of text documents generated on the web is enormous and grows rapidly every day. Analyzing and classifying the sentiments and opinions articulated in those documents offers huge opportunities, as published opinions have been shown to play a key role in our behavior and decisions. That fact has urged businesses and organizations to widely adopt sentiment analysis (SA) to aid in decision making, competitive analysis, strategy planning, identifying trends, and grasping consumers' needs. Consequently, researchers have focused their efforts on SA and text processing to uncover the hidden power in those documents [1].
Classifying the sentiments held in user-generated text as favorable or unfavorable (positive/negative) or neutral based on its textual content is one of the fundamental objectives of SA. However, sentiment polarity identification is not a straightforward task; it requires linguistic knowledge, information retrieval, natural language processing, and a deep understanding of the textual context [2]. Moreover, the correct analysis and classification of these opinions is decisive and can have a huge impact on many levels. In the literature, many studies exploit machine learning techniques to tackle the SA task. However, after the recent success of deep learning (DL) in many fields, data scientists and researchers have widely applied DL to SA. Additionally, deep networks have proven their effectiveness and efficiency on large datasets.
The Arabic language is among the most used languages on the web; it ranks among the top four most used languages, with more than 226 million users [3]. However, the proposed DL models for Arabic SA are limited in number, diversity, and generalization across different types of text documents, which leaves room for improvement. Moreover, Arabic text analysis brings additional challenges, as Arabic has a rich and complex morphology along with multiple dialects besides its standard form. Therefore, proper representation of Arabic text is crucial, as model performance heavily depends on it.
Here, we address the SA task for Arabic text by proposing a model, Deep learning for Arabic Sentiment Analysis (DeepASA), based on two types of recurrent networks: the Gated Recurrent Unit (GRU) [4] and Long Short-Term Memory (LSTM) [5]. The model consists of two main parts. In the first part, text documents represented by FastText real-valued vectors are passed as input to GRU and LSTM networks to generate high-level features. Afterward, the output of both networks is fed to a voting-based ensemble, namely majority voting, which consists of three machine learning classifiers that predict the class of each given document. The performance of DeepASA was evaluated on six different datasets, and the final results demonstrate that DeepASA outperformed the state-of-the-art results on all of them. In fact, the classification accuracy achieved by DeepASA reached 94.32%, and it reduced the classification error rate by up to 26%. The paper continues with a brief review of related work, followed by the details of the proposed model architecture. Then, to demonstrate the efficiency of DeepASA, a series of experiments conducted on a collection of different datasets is presented, together with the final results, a comparison against state-of-the-art models, and a brief discussion. Finally, the conclusion is presented.

Related Work
The number of studies on Arabic sentiment analysis is small compared to other languages such as English, and the number of studies that implement a DL approach is even smaller [6]. The following is a brief overview of recent research on Arabic sentiment analysis, with a focus on DL solutions.
One of the early applications of the DL approach to Arabic sentiment analysis is [7]. The authors inspected several models: a Deep Neural Network (DNN), a Deep Belief Network (DBN), a Deep Auto Encoder (DAE), and a Recursive Auto Encoder (RAE). Bag of Words features with the ArSenL lexicon [8] were used to generate the input feature vectors for the DNN, DBN, and DAE, while in the RAE, word indices from the vocabulary are used to form the input vector. In the experiments, the RAE model outperformed the other models, achieving 74% accuracy.
In [9], the RAE model proposed in [7] was improved to address the complexity of processing the Arabic language. The authors proposed morphological tokenization and merged sentiment and semantic embeddings to cope with the rich morphology of Arabic text and generate better representations. The results indicate that the improved RAE model achieved higher classification accuracy, up to 86%, than the baseline model in [7]. The same team participated with their RAE model in the SemEval-2017 workshop, Task 4 (message polarity classification), where the model's prediction accuracy was 41% [10].
In [11], a DNN model was presented to predict the polarity of Arabic tweets. The model consists of eight hidden layers with a final softmax layer to determine the class label of the input text. Testing the DNN model on various datasets showed outstanding performance compared to traditional classifiers, with classification accuracy reaching 94%.
The authors in [12] used a Recursive Neural Tensor Network (RNTN) model. The model was trained on a sentiment treebank, while the input was embedded into vectors using the Word2Vec Continuous Bag of Words (CBOW) model. Experimental results indicate that the RNTN model can achieve a prediction accuracy of up to 80%.
In [13], a combined model of a Convolutional Neural Network (CNN) and LSTM was proposed. Different levels of word embeddings were used to represent the input text, which passes through a CNN layer and then an LSTM layer. The final prediction is generated at the output layer using a sigmoid function. The model was tested on different datasets, achieving classification accuracies of up to 95%.
A model proposed in [14] consists of two parts. The first is a model of three parallel CNN layers followed by a concatenation layer and an output layer. The second is a bidirectional LSTM model (forward and backward LSTMs) with a concatenation layer and a final softmax output layer. The authors reported experimental results for each part separately and for both models together, where the final output is calculated by soft voting; the ensemble outperformed the individual CNN and LSTM models, achieving an accuracy of 65%.

Two Word2Vec architectures, CBOW and Skip-Gram (SG), were examined in [15] using a corpus crawled from web pages. The corpus contains 3.4 billion words originally selected from 10 billion collected Arabic words. The SegPhrase framework [16] was used to generate short phrases in order to obtain better word embeddings. Then, for sentiment classification, a CNN-based model with a sigmoid function at the output layer was trained on top of the previously trained word embeddings. The model's classification accuracy reached 91% when tested on different datasets with balanced and unbalanced samples. A comprehensive overview of research on Arabic sentiment analysis using the DL approach can be found in [6].
Even though the previously mentioned studies use diverse DL architectures, the applied text representation approaches, or word embeddings, are relatively similar. These embeddings either involve a level of complexity, as in [13] and [15], or suffer from poor representation of rare words. Therefore, such embeddings fail to handle the rich morphology of the Arabic language, which can negatively affect the classification performance of those models. Moreover, the generalization of those models is limited, as their performance is bounded to specific datasets.

Proposed DeepASA Model Architecture
This section characterizes the methods we used to solve the task of Arabic SA by proposing a classification model that exploits the integration of DL and ensemble learning. In the following, the details of the proposed model, DeepASA, along with the used word embedding approach are presented.

Input Layer
The input data to DL models must be transformed from words, sentences, or even entire documents into numerical representations. Word embedding is one of the most commonly used language modeling techniques that map words into vectors of real numbers. With word embedding, words or entire documents, such as reviews and tweets, can be represented by low-dimensional real-valued vectors. Such embeddings are efficient and infer relationships between words, which is useful in solving the SA task. Therefore, the input to DeepASA is text representations generated by word embedding models. Since the performance of the model can be affected by the quality of those representations, three different word embedding models, Word2Vec [17], Doc2Vec [18], and FastText [19], were tested and evaluated with DeepASA to ensure the most appropriate representation of the text data. Furthermore, to select the best model along with the best set of hyper-parameters, a random search [20] over a subset of crucial hyper-parameter values, Table 1, was conducted. Experiment details and results are presented in Sect. 4.3.
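The random search over embedding hyper-parameters can be sketched as follows; the search-space values below are illustrative placeholders, not the actual candidates from Table 1, and the configuration keys are our assumption:

```python
import random

# Hypothetical search space; the actual candidate values are those listed in Table 1.
search_space = {
    "architecture": ["cbow", "skip-gram"],
    "window_size": [3, 5, 10],
    "vector_size": [100, 200, 300],
}

def random_search(space, n_trials, seed=0):
    """Sample n_trials random hyper-parameter configurations from the space."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in space.items()}
            for _ in range(n_trials)]

# Each sampled configuration would be used to train one embedding model,
# which is then evaluated by plugging it into DeepASA (Sect. 4.3).
for cfg in random_search(search_space, n_trials=4):
    print(cfg)
```

Random search covers the space more cheaply than an exhaustive grid when only a few hyper-parameters actually matter.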

Deep Learning Layers
The structure of DeepASA, shown in Fig. 1, consists of three main layers: an input layer, hidden layers, and an output layer. The input layer is initialized with weights (vectors) from the pre-trained word embedding model that showed the best results, as described in Sect. 4.3. Then, in the hidden layers, two types of deep recurrent networks are used due to their ability to learn sequential data and long-term dependencies.
Here, LSTM and GRU are utilized to progressively discover and learn high-level representations, which are later used as input features to the last layer of the model. The key difference between them is the number of gates: an LSTM has three gates (forget, input, and output) while a GRU has two (reset and update) to control and regulate the flow of information. Thus, GRUs have fewer parameters, which translates into faster training and convergence; on the other hand, LSTMs may perform better on longer sequences and more complex datasets [21]. Therefore, we incorporate both networks, combining the efficiency of GRU with the performance of LSTM, in order to generalize DeepASA across different datasets.
The LSTM and GRU networks are employed in parallel within DeepASA, and both are fed the same embedded input x_t. Thus, at each time step t, the embedded vector (feature vector) x of one word from the input document is observed by both networks. The number of time steps varies from one dataset to another depending on the maximum document length in each one. That is, within each dataset the maximum length is measured separately, and shorter documents are padded with zeros accordingly. In the LSTM, the feature vector x_t along with the previous hidden state h_{t-1} enters the forget gate f_t, which decides what information to delete from the cell state C_{t-1}. Then, the input gate i_t updates the cell state C_t. Finally, the output gate o_t produces the new hidden state, which is passed with the cell state to the next time step, see Equations (1)-(3). The GRU, on the contrary, passes information between time steps depending only on the hidden state. The input to the GRU gates is also a combination of x_t and h_{t-1}, Equations (4) and (5). It has an update gate z_t, which is equivalent to the input and forget gates in the LSTM, and a reset gate r_t, which determines how much past information is neglected. Consequently, the GRU and LSTM layers each generate a vector of real numbers as output, x_gru and x_lstm, which are the final hidden states of the two networks. Those feature vectors are joined to be used later as input to the output layer: a merge layer combines the features using an element-wise operation. Different types of merge layers were considered and examined independently with DeepASA, Sect. 4.4. The resulting feature vector is then fed to a dense layer, which reduces the feature dimensionality.
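For reference, the standard LSTM and GRU update equations consistent with the gate descriptions above can be written as follows; the grouping into Equations (1)-(5) mirrors the references in the text and is our assumption, with σ the sigmoid function and ⊙ element-wise multiplication:

```latex
% LSTM (Equations (1)-(3)); W are weight matrices, b are bias vectors
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)                                   % (1) forget gate
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \quad
\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C), \quad
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t                          % (2) input gate and cell update
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o), \quad
h_t = o_t \odot \tanh(C_t)                                               % (3) output gate and hidden state

% GRU (Equations (4)-(5))
z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)                                   % (4) update gate
r_t = \sigma(W_r [h_{t-1}, x_t] + b_r), \quad
\tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t]), \quad
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t                    % (5) reset gate and hidden state
```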
At the last layer of the model, an ensemble voting system performs the classification task. Generally, at this layer, functions like softmax and sigmoid are used to determine the polarity of the input text. However, in DeepASA we replaced those traditional functions with a majority voting system to boost the model's prediction performance. Since a binary classification problem is being solved in this study, three machine learning classifiers, C_1, C_2, and C_3, form the voting system. To ensure the selection of the best-performing classifiers, we examined the performance of a set of different classifiers; the best three were selected to form the voting system, which predicts document polarity according to Equation (6), where d denotes the input document. (In the equations, the W terms denote weight matrices and the b terms denote bias vectors.)
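The majority vote over the three classifiers can be sketched as below; with three classifiers and two classes, no tie is possible:

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent class label among the classifiers' predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of C1, C2, and C3 for one document (1 = positive, 0 = negative):
print(majority_vote([1, 0, 1]))  # -> 1
print(majority_vote([0, 0, 1]))  # -> 0
```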

Experiments and Evaluation
This section presents a description of the used datasets and the conducted experiments. Also, the results of DeepASA, along with a comparison against other models that solved the same problem, i.e., binary sentiment classification on Arabic text, are exhibited.

Datasets
Unfortunately, the number of published Arabic datasets intended for sentiment analysis is limited. Furthermore, only a minimal number of those datasets have a reasonable size for use with DL models. DeepASA was evaluated on the following datasets, where only the positive and negative classes are considered.
• The Large Scale Arabic Book Reviews Dataset (LABR) [22] A relatively large dataset with about 63K Arabic book reviews gathered from goodreads.com. Each review carries a 5-star rating that determines its orientation: reviews with 4-5 stars are labeled positive, reviews with 1-2 stars are labeled negative, and reviews with 3 stars are not included.
• Hotel Reviews (HTL) [23] Around 15K Arabic hotel reviews collected from TripAdvisor.com, containing positive, negative, and neutral reviews.
• Restaurant Reviews (RES) [23] Scraped restaurant reviews from Qaym.com and TripAdvisor.com combined to form this dataset. It has around 10.9K reviews distributed among three classes: positive, negative, and neutral.
• Product Reviews (PROD) [23] 15K reviews collected from Souq.com and labeled positive, negative, or neutral.
• Twitter Data Set (ArTwitter) [24] A tweets dataset with a reported 2K tweets, of which only 1.9K were found; each is labeled either positive or negative.
• Arabic Sentiment Tweets Dataset (ASTD) [25] About 10K Arabic tweets with four classes: positive, negative, mixed, and objective.

A basic cleansing operation was performed to prepare the datasets: removing non-Arabic words, numbers, symbols, duplicate characters, and stop words, and normalizing certain Arabic characters. The class proportions in most of the datasets above are highly skewed, mostly in favor of the positive class, which can affect model performance by yielding misleading accuracies. Therefore, the classes in LABR, RES, HTL, and PROD were balanced using undersampling, as done by Dahou in [15], down to the size of the minority class. For ArTwitter and ASTD, the entire datasets are used, as in [13]. Table 2 shows the class distributions of the datasets after preprocessing.
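A minimal sketch of the cleansing steps described above; the exact Unicode range, the elongation-collapsing rule, and the (empty) stop-word list are assumptions rather than the paper's implementation:

```python
import re

def clean_arabic(text, stop_words=frozenset()):
    """Keep Arabic letters only, collapse elongated characters, and drop stop words."""
    text = re.sub(r"[^\u0621-\u064A\s]", " ", text)  # strip non-Arabic words, digits, symbols
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)       # collapse runs of 3+ duplicate characters
    tokens = [t for t in text.split() if t not in stop_words]
    return " ".join(tokens)

print(clean_arabic("الفندق رااااائع 100% wonderful!!"))
```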

Experiments Design
As mentioned earlier, DeepASA was trained and tested on six different datasets after balancing. The same datasets are used to train the word embedding model with the parameters given in Table 1, except that the entire original datasets are used there. To train and test the model, a random 80-20% split was applied: 80% was used for training, with the last 10% of the training portion held out for validation.
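The split described above can be sketched as follows; the shuffling and seed are our assumptions, and a helper like sklearn's `train_test_split` would serve equally well:

```python
import random

def split_indices(n, train_frac=0.8, val_frac_of_train=0.1, seed=42):
    """Shuffle n sample indices into an 80/20 train-test split,
    with the last 10% of the training part used for validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train_frac)
    train, test = idx[:n_train], idx[n_train:]
    n_val = int(len(train) * val_frac_of_train)
    train, val = train[:n_train - n_val], train[n_train - n_val:]
    return train, val, test

train, val, test = split_indices(1000)
print(len(train), len(val), len(test))  # -> 720 80 200
```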
The Keras [26] library with a TensorFlow-GPU backend was used to build and train the networks, and the scikit-learn package [27] was used to implement and train the voting system. The DeepASA experiments were carried out on a GPU to expedite the training process.
The performance of DeepASA is measured by a set of well-known metrics used in most classification problems: Accuracy (Acc), Precision (P), Recall (Rec), and F1 score, computed according to Equations (7)-(10):

Acc = (TP + TN) / (TP + TN + FP + FN)  (7)
P = TP / (TP + FP)  (8)
Rec = TP / (TP + FN)  (9)
F1 = 2 · P · Rec / (P + Rec)  (10)

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. (The TensorFlow library is available at https://www.tensorflow.org. The full code for DeepASA, along with the used datasets and other resources, is available at https://zenodo.org/record/3864879#.XtDBjMBRU2x.)
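The four metrics reduce to simple confusion-matrix arithmetic; a quick check with hypothetical counts:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts (Equations (7)-(10))."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    p = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * p * rec / (p + rec)
    return acc, p, rec, f1

# Hypothetical counts for one test run:
acc, p, rec, f1 = classification_metrics(tp=90, tn=85, fp=15, fn=10)
print(round(acc, 3), round(p, 3), round(rec, 3), round(f1, 3))
```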

Word Embedding Selection and Training
Proper representation of the raw text documents can have a significant impact on the model's prediction performance. Instead of using one of the publicly available pre-trained word embedding models, we trained our own, since most pre-trained models are trained on a single-source corpus (web data, Twitter data, or Wikipedia articles) and consequently generalize poorly to different types of datasets. To ensure the use of the most appropriate embedding with DeepASA, we chose three distinct word embedding models to generate text representations: Word2Vec, Doc2Vec, and FastText. Each of those models is associated with a set of hyper-parameters that can affect the quality of the resulting embedding. We focused on optimizing key hyper-parameters, namely the model architecture, window size, and vector size. Since the search for the best set of hyper-parameters can be time-consuming and computationally expensive, a random search was followed instead of a grid search. The models were trained on a corpus composed of all six selected datasets, Sect. 4.1. The datasets were preprocessed and used in their original sizes, as stated in Table 2, resulting in a corpus with a vocabulary size of around 255k.
The different word embedding models were evaluated by testing them with DeepASA. Moreover, the experiments were conducted on all six selected datasets because an embedding model can perform well on a specific dataset but not as well on others. The experimental results in Table 3 show a strong correlation between proper text representation and DeepASA's prediction performance. FastText and Word2Vec exhibited relatively comparable performance, though higher results were obtained on short documents, ArTwitter and ASTD, when using FastText models. On the other hand, Doc2Vec fell short in improving the model's ability to learn from the provided text documents, which was reflected in the model's performance. FastText has the advantage of representing each word by breaking it into its character n-grams; the sum of the n-gram vectors gives the final word vector. That quality makes FastText representations a perfect fit for morphologically rich languages such as Arabic, whereas Word2Vec and Doc2Vec represent each word by a single distinct vector. As a result, the FastText model with the SG architecture achieved the best results across the different datasets; the optimal hyper-parameters are underlined in Table 3.
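FastText's subword idea can be illustrated with a simplified n-gram extractor (the real model uses n = 3 to 6 by default, hashes the n-grams into buckets, and also keeps the full word as a token; this sketch omits those details):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Extract character n-grams with the < and > boundary markers used by FastText."""
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

# A word's vector is the sum of its n-gram vectors, so unseen or rare
# inflections still share subwords with known morphological variants.
print(char_ngrams("كتاب", n_min=3, n_max=4))
```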

Merge Layer Selection
In the model, a single layer each of LSTM and GRU was used, each with 32 neurons. As a result, two vectors of length 32 are produced as output, x_lstm and x_gru. Those vectors serve as input to the voting layer, so a merge layer is needed to join the features. Three merge layers, add, multiply, and concatenate, were considered and examined independently, Eqs. (11)-(13). Table 4 shows the impact of incorporating different merge layers on the model's accuracy. The results indicate that the add layer produced the best set of joined features, reflecting positively on the model's performance, while the multiply and concatenate layers performed roughly equally but below the add layer. Therefore, the add layer was used with DeepASA to produce a total of 32 features, whose dimensionality is then reduced from 32 to 16 by a dense layer with a hard-sigmoid activation function.
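The three examined merge operations on the two 32-dimensional vectors can be sketched with short toy vectors (the values are purely illustrative):

```python
def merge(a, b, mode):
    """Join two equal-length feature vectors as the candidate merge layers do."""
    if mode == "add":
        return [x + y for x, y in zip(a, b)]
    if mode == "multiply":
        return [x * y for x, y in zip(a, b)]
    if mode == "concatenate":
        return a + b  # note: doubles the feature dimension (32 -> 64)
    raise ValueError(f"unknown merge mode: {mode}")

x_lstm, x_gru = [0.5, -0.5, 1.0], [0.5, 0.5, -1.0]
print(merge(x_lstm, x_gru, "add"))               # -> [1.0, 0.0, 0.0]
print(len(merge(x_lstm, x_gru, "concatenate")))  # -> 6
```

Unlike add and multiply, concatenation changes the output width, which is why the dense layer's input size depends on the merge choice.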

Classifiers Evaluation
The prediction performance of the proposed model depends heavily on a combined set of three different machine learning classifiers, which form an ensemble voting system. The final class label of the input document is predicted as the most frequent class label among those predicted by the classification models. Therefore, we evaluated the performance of six different classifiers: Support Vector Machines (SVM), Logistic Regression (LR), Stochastic Gradient Descent (SGD), Random Forest (RF), K-Nearest Neighbors (KNN), and Gaussian Naive Bayes (GNB). Those classifiers were assessed with DeepASA on the six datasets. The input to the selected classifiers is the output of the dense layer, a vector of size 16. All classifiers use the scikit-learn default settings except for the following: • In RF, the random state is set to 20.
• The number of neighbors is 2 in KNN.
• SGD loss function is set to log.
Although the performance of all six classifiers is roughly comparable, see Table 5, SVM, GNB, and LR produced the highest accuracies across all the tested datasets.

Results
After training and testing DeepASA on the six selected datasets, the results were as follows: the model scored its highest accuracy, 94.32%, when trained and tested on the HTL dataset, and its lowest, 81.11%, on the PROD dataset. Figure 2 illustrates the test results for all the datasets with the selected evaluation metrics. Comparing the final voting results against single classifiers, as shown in Table 5, some classifiers performed slightly better than the voting system; in particular, SVM exceeded the majority voting by less than 1%. We can also see that, on average, the voting system has a slight advantage over simple sigmoid and softmax functions. Nevertheless, we employed majority voting to overcome the problem of data dependency and to be more confident about each data point's classification, providing a more robust and reliable model.
A comparison against state-of-the-art models was conducted to demonstrate the improvement that DeepASA achieves. Tables 6 and 7 show that, based on accuracy, DeepASA performs better on all six selected datasets. The results of [15] were reproduced using the provided implementation but with our balanced datasets, while for [13], [28], and [29] the published results were taken as they are. We then measured the error reduction by finding the error rate of each model, er = 1 - Acc, and calculating the relative error reduction as (er(M1) - er(M2)) / er(M1) [30], where M1 denotes the state-of-the-art model and M2 denotes DeepASA. As shown, DeepASA reduced the classification error rate by up to 26.1% compared with the state-of-the-art results.
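The error-reduction computation can be checked directly; below we use the LABR accuracies from the discussion section (85.10% for the baseline [15] vs. 85.74% for DeepASA):

```python
def relative_error_reduction(acc_baseline, acc_new):
    """Relative reduction in classification error, where the error rate is er = 1 - Acc."""
    er1, er2 = 1 - acc_baseline, 1 - acc_new
    return (er1 - er2) / er1

reduction = relative_error_reduction(0.8510, 0.8574)
print(f"{reduction:.1%}")  # -> 4.3%
```

Note that a small accuracy gap can still be a sizable relative error reduction when the baseline error is already low.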

Discussion
The experimental results in the previous section illustrate DeepASA's ability to perform binomial classification of given documents into either the positive or the negative class. The model succeeded in outperforming other studies and reducing the classification error rate on all the datasets, as shown in Tables 6 and 7. When testing DeepASA on the LABR and HTL datasets, the classification accuracy reached 85.74% and 94.32%, respectively, whereas the Dahou et al. model in [15] achieved 85.10% and 94.17%, indicating that DeepASA outperformed those results by a narrow margin. The word embedding model used in [15] was trained on quality phrases extracted from a large corpus to enhance the quality of the Arabic word embeddings; the granularity of the text data is converted from words to phrases in order to capture multiword expressions such as those meaning under the microscope and unacceptable. Apart from the complexity of the word representation approach used in the Dahou et al. model, its performance on other datasets, especially PROD and ASTD, was drastically lower than that of DeepASA, which signifies that DeepASA exhibits high performance on different types of datasets with a much simpler embedding model, which is preferable. In fact, the accuracy of DeepASA exceeded the Dahou et al. model by 2.35% to 7.52%.
The results of both [13] and [28] are reported with respect to the best performance given, based on the usage of two different embedding models. To clarify, with ArTwitter in [13] a word-level embedding was used, while a ch5gram-level embedding was utilized with ASTD. A Word2Vec embedding was adopted in [28], using the CBOW model with ASTD and the SG model with the ArTwitter dataset. In contrast, the results of DeepASA were obtained using only one embedding model. Nevertheless, DeepASA outperformed both models, by up to 5.73% compared to [13] and by 1.73% to 1.8% against [28].
After careful examination of the misclassified samples, we found that the performance of DeepASA was affected by several factors. First, incorrect labeling of some documents was present in all the selected datasets: DeepASA predicted the classes of those documents correctly, but the actual labels given to them were wrong. Table 8 shows a sample of such misclassified documents along with our own annotation of what we believe is correct, against the actual given label and the DeepASA prediction. The same problem likely exists in the training set, which might confuse the model's learning process. We also found that part of the incorrectly classified documents hold both positive and negative sentiments; this occurs in all the review datasets, as shown in Table 9. The presence of these mixed sentiments confuses the determination of the overall polarity. Beyond mixed sentiments, we also found several bipolar words that can convey positive or negative meanings depending on their context: words typically used in Modern Standard Arabic to express meanings like terrible, dangerous, deadly, insane, and horrible are commonly used in dialects to express positive feelings when someone is extremely impressed.

Conclusion and Future Work
A DL-based model, DeepASA, capable of predicting binary sentiment polarity in Arabic text was proposed. With DeepASA, the effect of employing multiple word embedding models was explored, and the FastText model was found to generate the best text representations, which was reflected in the model's performance. In addition, distinct machine learning classifiers were employed at the last layer to obtain a reliable, robust model. That allowed for better generalization across different datasets and moderately contributed to increasing the model's performance.
The results of testing DeepASA on different datasets, reviews and tweets, were reported. Then, to exhibit the improvement that DeepASA accomplished, a comparison against other models was conducted. DeepASA outperformed all the models on all the selected datasets in terms of classification accuracy, which in turn showed its ability to reduce the classification error rate by up to 26%. In addition, the examination of the misclassified samples revealed problems such as inaccurate labeling, mixed sentiments, and bipolar words, which have certainly affected the model's performance.
A possible direction for future research is to further train the FastText model on larger and more diverse corpora, which could improve the current results. Apart from this, more recently introduced representation models such as BERT [31] or ELMo [32] could be used instead of FastText. Moreover, the class coverage could be broadened to include the neutral class besides the positive and negative classes.