Sentiment analysis by deep learning approaches

We propose a model for deep learning based multimodal sentiment analysis. The MOUD dataset is used for experimentation. We developed two parallel models, one text based and one audio based, and fused the heterogeneous feature maps taken from their intermediate layers to complete the architecture. The performance measures (accuracy, precision, recall and F1-score) are observed to exceed those of the existing models.


INTRODUCTION
Great effort is needed to develop machines that can mimic the natural human ability to understand emotions, analyze situations and grasp the sentiments associated with a context. Sentiment analysis is an effective mechanism for exploring socio-economic or demographic influences in human reciprocation. With the availability of a plethora of opinionated videos on social media, multimodal approaches to sentiment analysis are gaining attention. Opinionated videos are highly unstructured; hence verbal and non-verbal cues are complementary for sentiment analysis. This means that the communication has to be analyzed in the audio and visual modalities along with the text modality to achieve effective solutions. Most existing frameworks for classifying sentiments are based on the analysis of transcriptions [1] and the use of lexicons, and little of the literature mines the vocal and visual cues embedded in the videos. Voiced communication can give more information regarding the human empathetic conditions [2]. This work aims to fuse information from different modalities for sentiment analysis.
The primary benefit of analyzing videos along with texts is that the rich set of behavioral cues present in audio and video recordings can yield enhanced models. The vocal modulations, facial expressions and gestures in the visual data, along with textual data, help to analyze the affective domain of the opinion holder in a better way. Thus, combined text, vocal and visual data help to create a more robust and emotion-specific sentiment analysis model [3]. An array of techniques is available for carrying out sentiment analysis by incorporating machine learning and deep learning paradigms. There are multi-faceted challenges in extracting information from different modalities and fusing it for the analysis. We propose a bimodal approach for predicting sentiments using deep learning based techniques.
The proposed deep sentiment analysis framework includes:
a. A Convolutional Neural Network (CNN) based model with max-pooling and dense layers to process features extracted from sentence level utterances.
b. A model for processing transcriptions, which is trained with CNN layers. The sentence level text is mapped into a vector space using a word representation learned by word embedding.
c. A fusion model containing the features extracted from specific layers of both the audio and transcription models.
Conventionally, sentiment analysis is based on textual information. The analysis is carried out at word level, sentence level or document level. Pre-processing steps include cleaning the text, removing white space, expanding abbreviations, stemming, removing stop words and handling negation, followed by feature selection and finally classification [4]. The classification techniques can be divided into machine learning (ML) based approaches and lexicon based approaches. The ML based supervised learning approaches include probabilistic models such as Naive Bayes classifiers [5] or Bayesian classifiers [6]. Because text data is sparse, Support Vector Machines (SVMs) are used effectively for classifying transcription sentiments in both multi-class and binary problems. Li and Li [7] used an SVM for classifying sentiments in micro blogs. Moraes et al. [8] applied a neural network and an SVM to sentiment analysis and compared them.
The automated lexicon based approaches are split into dictionary based approaches and corpus based approaches [9]. Dictionary based approaches focus on finding opinion seed words, whereas corpus based approaches begin with a seed list of opinion words. The corpus based approach is limited by the difficulty of preparing a huge corpus and normally employs either statistical techniques [10] or semantic techniques [11].
With the increased presence of multimedia tools, especially on social media platforms, sentiment analysis can no longer be restricted to transcription based analysis. This has paved the way for multimodal approaches to sentiment analysis. While unimodal text based analysis focuses on text pre-processing and on selecting suitable methods for analysis, multimodal approaches pose greater challenges. In conventional analysis, rule based methods using lexicons and data driven methods using large annotated databases [12,13] are popular. In multimodal analysis, however, the heterogeneous dimensions of image, text and audio signals have to be combined. Three strategies are popular for multimodal fusion, viz. early fusion, late fusion and intermediate fusion. The work in [14] applies early fusion of low level and mid level features extracted from human faces for group level emotion detection. A major shortcoming of early fusion is the absence of detailed modeling of view-specific dynamics, which in turn affects the modeling of inter-view dynamics and can cause overfitting to the input data. Models based on late fusion are normally good at modeling view-specific dynamics, but they have shortcomings in modeling cross-view dynamics, since these cross-modality dynamics are considered more difficult to capture [15].
The traditional hand-crafted feature extraction methods have given way to deep learning techniques. In addition, Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) networks can take up spatial and temporal information directly from the raw data [16].

RESEARCH METHOD
A bimodal approach with utterances taken in audio and text formats is proposed here for sentiment analysis. The MOUD dataset, which contains opinionated utterances at sentence level [13], is used for the experiments. The architecture developed is shown in Figure 1. Audio and text utterances are the inputs of the framework, and the output is a binary classification of polarity as positive or negative. The architectural pipeline includes two parallel, independent deep learning frameworks for unimodal processing of the audio and text utterances. The deep neural features extracted from these individual modalities are fused together and given as input to the final CNN layers to apply the bimodal fusion.

Unimodal approaches
The proposed system intends to develop individual models for transcriptions and audio signals at the first stage. Later, a bimodal architecture is developed by integrating the independent models. Each stage is described as follows.

Audio features
Analyzing speech as sound helps the system classify the polarity of a sentence as positive or negative while eliminating the language barrier. For the audio utterances, the audio features are extracted from the input audio signal with a third-party acoustic feature extraction tool called OpenEar [17,18]. The features are extracted using the SMILE feature extractor as Low Level Descriptors (LLDs), including 13 Mel-/Bark-Frequency Cepstral Coefficients (MFCC), which typically range from 300 Hz to 5 kHz, prosody, energy, voice probabilities and spectral coefficients, resulting in a feature set of 27 for each utterance. The features are extracted with a frame sample rate of 25 ms, and z-standardization is performed for speaker normalization.
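The speaker normalization step can be illustrated with a minimal sketch. It assumes that z-standardization is applied per feature dimension using the statistics of each speaker's own utterances; the function name, the data layout and the handling of zero variance are illustrative assumptions, not details given in the paper.

```python
from collections import defaultdict
from statistics import mean, pstdev


def z_standardize_by_speaker(features, speakers):
    """Z-standardize each feature dimension with per-speaker statistics.

    features: list of equal-length feature vectors, one per utterance.
    speakers: list of speaker ids aligned with `features`.
    (Hypothetical helper; the per-speaker grouping is an assumption.)
    """
    # Group utterance indices by speaker.
    by_speaker = defaultdict(list)
    for idx, speaker in enumerate(speakers):
        by_speaker[speaker].append(idx)

    normalized = [None] * len(features)
    for indices in by_speaker.values():
        n_dims = len(features[indices[0]])
        # Mean and population standard deviation per dimension for this speaker.
        means = [mean([features[i][d] for i in indices]) for d in range(n_dims)]
        stds = [pstdev([features[i][d] for i in indices]) for d in range(n_dims)]
        for i in indices:
            normalized[i] = [
                # A constant dimension carries no information; map it to 0.
                (features[i][d] - means[d]) / stds[d] if stds[d] > 0 else 0.0
                for d in range(n_dims)
            ]
    return normalized
```

After this step every speaker's features have zero mean and unit variance per dimension, so differences in loudness or pitch range between speakers do not dominate the classifier input.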
This feature set is fed to a deep learning framework that starts with a convolution layer of 256 filters of size three. The convolution layers are interleaved with max-pooling layers, each of which reduces the dimensions of the feature set by a factor of two. The network goes deeper in this fashion with convolutional layers of size 3 having 128 and 64 filters respectively. After the convolutional operations, three consecutive dense layers are added to flatten the network and gradually reduce the output. Dropout is applied just before flattening the layers as a regularization technique; it reduces the number of network connections, which helps to avoid overfitting. The non-linear Rectified Linear Unit (ReLU) is the activation function for the hidden layers, and the final decision on the type of sentiment is based on the output of the softmax function.
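To make the halving effect of the interleaved pooling layers concrete, the following sketch traces how the sequence length changes through the convolution and max-pooling stack. The input length of 27, the padding modes and the assumption that a pooling layer follows every convolution are illustrative choices; the paper does not state them.

```python
def conv1d_output_length(length, kernel_size=3, padding="same"):
    """Output length of a stride-1 1D convolution."""
    if padding == "same":
        return length
    return length - kernel_size + 1  # "valid" padding


def max_pool_output_length(length, pool_size=2):
    """Output length of a non-overlapping max-pooling layer."""
    return length // pool_size


def trace_audio_stack(input_length, filters=(256, 128, 64),
                      kernel_size=3, pool_size=2, padding="same"):
    """Return (layer, length, channels) for each conv and pooling layer.

    The filter counts follow the text; everything else is an assumption.
    """
    shapes = []
    length = input_length
    for n_filters in filters:
        length = conv1d_output_length(length, kernel_size, padding)
        shapes.append(("conv", length, n_filters))
        length = max_pool_output_length(length, pool_size)
        shapes.append(("pool", length, n_filters))
    return shapes


if __name__ == "__main__":
    for layer, length, channels in trace_audio_stack(27):
        print(f"{layer:4s} length={length:2d} channels={channels}")
```

Under these assumptions the feature length shrinks from 27 to 3 before the dense layers, while the channel count falls from 256 to 64.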
The annotated transcribed utterances in the MOUD dataset are combined into a single CSV file. Initially, the database undergoes certain pre-processing steps to avoid outliers. Subsequently, the data is given to a tokenizer to create the vocabulary. Word embeddings are used to obtain the word vectors; the resulting representation is like a concatenation of the words. These feature sets are used to train the second deep neural framework (framework II), which outputs the sentiment classification.

Textual features
Text data must be encoded as vectors before it is applied to the deep learning model. For that, (i) sentences are pre-processed and tokenized to obtain an integer representation. The start and stop words as well as the wild characters are removed during pre-processing, and all words are converted to lowercase. The Keras Tokenization API is used for tokenizing the sentences. (ii) Word embeddings are then applied to convert the positive integers into dense vectors of fixed size. The dense vectors represent the projection of the words into a continuous vector space in which each word has a unique vector representation. As a result, the words lie in a coordinate system where words that are related, based on the corpus relationships, are placed close to each other. The vector values are learned in a way that resembles the learning in a typical neural network [19]. The feature vectors obtained are padded to a standard length of 60. These standardized vectors are given as input to the deep learning model. The input is first given to a convolutional layer of size 3 consisting of 128 filters, followed by a global max-pooling layer. The convolutional layer systematically applies learned filters to the input data to create feature maps that summarize the presence of strong features in the input. The global max-pooling layer downsamples each feature map to a single value, the maximum over that feature map [20]. In this way, the problems due to overfitting of the fully connected layers can be minimized. Subsequently, there are two dense layers. All layers except the final dense layer use the ReLU activation function, whereas the final decision-making layer uses softmax as the activation function. The model has 96,796 parameters to be learned during training.
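The tokenize-and-pad step can be sketched in plain Python. This is a simplified stand-in for the Keras tokenization utilities mentioned above, not their implementation: here indices are assigned in order of first appearance and padding is appended at the end, whereas Keras orders words by frequency and pads at the front by default. The function names and the Spanish sample sentences are illustrative.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())


def build_vocabulary(sentences):
    """Map each word to a positive integer; 0 is reserved for padding."""
    vocab = {}
    for sentence in sentences:
        for word in tokenize(sentence):
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode_and_pad(sentence, vocab, max_len=60):
    """Convert a sentence to integer ids, truncated or zero-padded to max_len."""
    ids = [vocab[word] for word in tokenize(sentence) if word in vocab]
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))


if __name__ == "__main__":
    corpus = ["Me gusta mucho!", "No me gusta."]
    vocab = build_vocabulary(corpus)
    print(vocab)
    print(encode_and_pad("No me gusta.", vocab)[:5])
```

Each padded vector of 60 integers would then be passed to an embedding layer, which replaces every integer with its learned dense vector.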
Typically, if there are n words in a sentence [21], it can be tokenized as an integer vector T of dimension 1 × d, where d denotes the word length. By applying the word embeddings, each token is vectorized into a feature representation of the required transcription. The embeddings are parameterized by $W_e$, the parameters to be tuned. The hidden layer output is computed from the weight and bias parameters, and the final activation layer is the softmax layer [22]. For a given class score $h_i$, the softmax function is represented as:

$$\mathrm{softmax}(h_i) = \frac{e^{h_i}}{\sum_{j \in C} e^{h_j}}$$

where $h_j$ are the values inferred by the net for each class in C.
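As a small illustration of the final activation, the sketch below computes softmax probabilities from a list of class scores. Subtracting the maximum score before exponentiation is a standard numerical-stability step added here; it does not change the result and is not described in the paper.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    # Shifting by the maximum avoids overflow in exp() for large scores.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Two scores, e.g. for the positive and negative classes.
    print(softmax([2.0, 0.0]))
```

For the binary polarity problem the predicted class is simply the index of the larger probability.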

Bimodal Framework
In the proposed model, individual parallel networks are trained first. Later, intermediate layers of both networks are extracted as feature inputs for the bimodal framework. In this way, the complementary information from both modalities can be used effectively. The 3rd layer of the textual model and the 6th layer of the audio model are optimally selected and extracted as features for the final fusion model. The global max-pooling layer in the text modality significantly reduces the size of the feature map, and the same is achieved in the audio modality through downsampling in the dense layer. The features from these two layers are concatenated and applied as input to the third, combined model. The feature sets are applied directly without any pre-processing. This model is also a deep neural network consisting of convolutional layers and max-pooling layers. Its output classifies the utterances as positive or negative in polarity. The decision vector formed by combining the text and audio modalities improves the performance of sentiment analysis considerably compared to the individual modalities alone. The final decision on sentiment classification is based on the softmax activation function.
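The fusion step can be sketched as follows, assuming the text branch contributes a globally max-pooled feature vector and the audio branch contributes the output of a dense layer. The function names and the toy values are hypothetical; the real feature sizes depend on the trained models.

```python
def global_max_pool(feature_maps):
    """Reduce a (time steps x channels) feature map to one value per channel."""
    n_channels = len(feature_maps[0])
    return [max(step[c] for step in feature_maps) for c in range(n_channels)]


def fuse(text_features, audio_features):
    """Concatenate the intermediate features of both modalities."""
    return list(text_features) + list(audio_features)


if __name__ == "__main__":
    # Toy text feature map: 3 time steps, 2 channels.
    text_maps = [[0.1, 0.9], [0.4, 0.2], [0.3, 0.5]]
    # Toy output of a dense layer in the audio model.
    audio_dense = [0.7, 0.0, 0.2]
    print(fuse(global_max_pool(text_maps), audio_dense))
```

The concatenated vector is what the combined network receives as input, so its length is the sum of the two intermediate feature sizes.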
The experiments are conducted on the MOUD dataset on both individual and combined modalities. During the training phase of the proposed model, the weights are adjusted to minimize the loss function. Faster convergence of the model is achieved by selecting a proper learning rate η0 for the gradient descent algorithm. The optimizers compared are stochastic gradient descent (SGD), Root Mean Square Propagation (RMSProp) and ADAptive Moment estimation (ADAM). SGD updates the parameters for each training example in the training set with a prefixed learning rate [23]. In the RMSProp algorithm proposed by Geoffrey Hinton, instead of letting all past gradients accumulate, only the gradients in a recent window are accumulated, through an exponentially decaying average of the squared gradients. The Adam optimizer computes adaptive learning rates for each parameter; it stores exponentially decaying averages of the past gradients and of their squares [24].
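The three update rules can be compared on a single scalar weight. The sketch below uses the textbook forms of SGD with momentum, RMSProp and Adam; the learning rate of 0.001, the value 0.9 and the decay rate 0.999 follow the settings reported in the results section, while the epsilon term, the function names and the assignment of 0.9 to the RMSProp decay are assumptions.

```python
import math


def sgd_momentum_step(w, grad, velocity, lr=0.001, momentum=0.9):
    """One SGD step with classical momentum."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity


def rmsprop_step(w, grad, sq_avg, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSProp step: scale the step by a decaying average of grad^2."""
    sq_avg = rho * sq_avg + (1 - rho) * grad ** 2
    return w - lr * grad / (math.sqrt(sq_avg) + eps), sq_avg


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with bias-corrected first and second moments (t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


if __name__ == "__main__":
    # Minimize f(w) = w^2, whose gradient is 2w, starting from w = 1.
    w, m, v = 1.0, 0.0, 0.0
    for t in range(1, 6):
        w, m, v = adam_step(w, 2 * w, m, v, t)
    print(w)
```

Because Adam divides by the root of the second-moment estimate, its first steps have a magnitude close to the learning rate regardless of the gradient scale, which is one common explanation for its smoother convergence compared with plain SGD.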

The MOUD Dataset
The Multimodal Opinion Utterances Dataset (MOUD) introduced by Perez et al. [13] is an opinionated dataset in the Spanish language. It consists of product reviews and recommendations at utterance level from 80 speakers, collected from YouTube videos. From the available 498 videos we selected 438 recordings for our work, which showed consistency between the speech and text modalities. On average, each video has 6 utterances of 5 seconds duration, with a standard deviation of 1.2 seconds. The contents of each video clip were transcribed by manually processing the verbal statements for their connotations. The dataset was annotated for sentiment analysis using the Elan tool. Both the audio and video modes are annotated using the tool. Two annotators independently annotated the polarity of the utterances as positive, negative or neutral. In our classification problem, only positive and negative sentiments were considered.

RESULTS AND ANALYSIS
The objective is to classify the sentiments in the videos by polarity, as positive or negative, by analyzing the MOUD dataset. A combined audio and text model was developed by implementing deep neural networks. The dataset was divided into a train-test ratio of 80:20 for developing the model and testing it. Categorical cross entropy, which is a combination of the softmax and cross entropy functions, was taken as the loss function for training the model. The unimodal features are applied to the two parallel subnets, and the outputs of intermediate hidden layers are optimally selected. These selected values are fused and act as the input to the final subnet.
Several experiments were conducted before fixing the proposed architecture, and the model was compiled with different hyper-parameters. Mini-batch training can offer a regularization effect; a mini-batch size of 32 was selected for the proposed model. There were significant changes in the performance of the model depending on the optimizer selected. The output of the proposed system is one-hot encoded. The results of the experiments are tabulated in Table 1. The performance of the proposed model was evaluated using different performance metrics, viz. accuracy, precision, recall and F1-score.
The accuracy over the training epochs is shown graphically in Figures 2-10, which highlight the effects of the different optimizers. The ADAM optimizer gives the best results, while SGD fluctuates during convergence. The parameters selected for SGD are learning rate = 0.001 and momentum = 0.9. The same values are used for the RMSProp algorithm. For the Adam optimizer, in addition to the above values, the decay rates are fixed as 0.999.
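The four evaluation measures can be computed from the predicted and true labels as in the sketch below. The one-hot decoding helper, the label convention (1 for positive, 0 for negative) and the toy labels are illustrative assumptions and are not taken from the experiments.

```python
def decode_one_hot(vector):
    """Return the index of the largest entry of a one-hot or softmax vector."""
    return max(range(len(vector)), key=lambda i: vector[i])


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1-score for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
    print(binary_metrics(y_true, y_pred))
```

Reporting precision and recall alongside accuracy matters when the positive and negative classes are not perfectly balanced, since accuracy alone can hide a bias toward one class.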
Further, we compared the performance of our algorithm with some of the existing algorithms, and the superiority of the proposed combined audio and text architecture is quite evident from Table 2. The proposed model was compared with four of the existing state of the art methods. Poria et al. [25] proposed a speaker exclusive technique for analyzing the sentiments embedded in the utterances. Wang et al. [26] proposed an approach that mitigates the problem of generalizability by a large margin. Poria et al. [27] proposed the Convolutional Recurrent Multiple Kernel Learning (CRMKL) model, which uses CNNs exclusively for training; the combined network keeps only the best features, selected by Principal Component Analysis (PCA), and an SVM is used for the decision making.

TELKOMNIKA Telecommun Comput El Control. Sentiment analysis by deep learning approaches (Sreevidya P)
The results of the experiments are tabulated in Table 2, which shows the test results with and without feature selection. Poria [28] presented a deep learning architecture focusing on speaker independent systems; our method performed much better than that work. Tsai et al. [29] proposed a multimodal factorized model (MFM) with multimodal discriminative factors and modality-specific generative factors. Sentiment prediction results on MOUD are depicted in Table 2. The best results are highlighted in bold, and SOTA shows the changes in performance over the previous state of the art (SOTA) results. The improvements are highlighted in bold in Table 3.

CONCLUSION
An analysis of the existing methodologies for sentiment analysis and a comparison with the proposed bimodal sentiment analysis system are carried out here. The proposed framework establishes the superiority of bimodal approaches over unimodal approaches. We incorporate powerful CNN based deep learning techniques for the test case. The intermediate level feature fusion method is adopted here. The sequential and correlated information is collected through word embeddings in the textual data and through audio feature extraction. As further work, to increase the accuracy of the model, non-verbal communication such as gestures and images can be incorporated. The multimodal approach can integrate all the information related to the communication, which in turn can make human-computer interaction more realistic and meaningful.