Fine-grained irony classification through a transfer learning approach

ABSTRACT


INTRODUCTION
Irony has been shown to be ubiquitous in social media, posing a major challenge to the field of sentiment analysis [1]. It is a cognitive phenomenon in which affect-related features play a significant role. In today's data-driven world, data is increasing exponentially day by day [2], and machine learning and deep learning algorithms play a vital role in massive data analysis and knowledge extraction. Creative and metaphorical expressions such as irony and sarcasm are widespread in user-generated content on social media sites like Twitter and Facebook [3]. Irony is the use of language that traditionally means the contrary to express one's meaning, usually for humorous or emphatic effect. Despite significant differences in connotation, the terms sarcasm and irony are frequently used interchangeably. Accurate irony identification is crucial in marketing research: because irony usually causes polarity inversion, failing to recognize it may lead to poor sentiment classification results [4]. Intelligence services must also be able to identify irony in order to separate perceived threats from ironic statements. Irony identification is a complex problem, even relative to most natural language processing (NLP) tasks. On Twitter, irony often manifests itself as polarized sentiment, for example: "I really love this year's summer; weeks and weeks of awful weather". In this example, irony results from a polarity inversion between two evaluations: the literal evaluation "I really love this year's summer" is positive, while the intended one, implied by the context "weeks and weeks of awful weather", is negative.

RELATED WORK
Transformers can learn longer-term dependencies but, in the context of language modelling, are constrained by a fixed-length context. Transformer-XL (extra long) extends the length of learnable dependency without disrupting sequential coherence [5]. Bidirectional encoder representations from transformers (BERT) is designed to pretrain deep bidirectional representations from unlabeled text by conditioning on both left and right context simultaneously in all layers [6]. Dai and Le present two ways of improving sequence learning with recurrent networks that exploit unlabeled text: the first is to predict what comes next in a sequence; the second is to use a sequence autoencoder, which reads the input sequence and then reconstructs it [7]. Howard and Ruder propose an efficient transfer learning approach that can be applied to any NLP task [8]. Reference [9] provides a deep bidirectional language model, pretrained on huge text corpora, that can be placed directly on top of an existing model, considerably improving performance on downstream NLP tasks. Suggestion mining is an emerging and demanding topic in NLP that aims to extract user recommendations from web forums [10]. Reference [11] proposes an architecture of eight encoder and eight decoder layers with an attention mechanism that aids parallel processing and reduces training time; the attention mechanism helps the model attend to each word and its position in the sentence. Xie et al. offer an n-dimensional linkage approach for incorporating aspect relationships into deep neural networks for aspect value estimation [12].

PROPOSED METHODOLOGY
The proposed bidirectional long short-term memory-DistilBERT (BiLSTM-DistilBERT) framework consists of a stack of layers: sentence embedding, transformer, BiLSTM [13], concatenation, pooling, and finally a softmax classification layer [14], [15], as shown in Figure 1. The input sentence, represented as S = {w1, w2, w3, ..., wn}, is embedded by the pre-trained DistilBERT transformer layer and then passed to a BiLSTM recurrent neural network [16]. A pooling mechanism is applied to the concatenated tensor of the DistilBERT and BiLSTM outputs, which is finally routed through a fully connected softmax layer; in other words, the pre-trained model is extended with an additional BiLSTM recurrent neural network [17], [18]. This is effective because the pre-trained model's weights encode a high-level understanding of the English language, so we can build on that general knowledge by adding layers whose weights come to represent a task-specific understanding of what makes a tweet ironic or non-ironic [19], [20].

The Hugging Face Transformers library makes transfer learning very approachable, as our general workflow can be divided into four main stages: input embedding, defining a model architecture, training the classification layer weights, and fine-tuning DistilBERT by training all weights. The Hugging Face application programming interface makes it easy to convert words and sentences into sequences of tokens; these tokens are converted into tensors by the text vectorization class and finally fed into our model. Once we instantiate our tokenizer object, we encode our training, validation, and test sets in batches using the tokenizer's batch_encode_plus() method. Important arguments include max_length, which controls the maximum number of words to tokenize in a given text, and padding or truncation, which adjust the input to max_length. The attention mask helps the model decide which tokens to pay attention to and which to ignore; including the attention mask as an input to our model therefore improves performance.

Because the pre-trained model is extended with BiLSTM recurrent units capable of capturing long-range dependencies among tokens, the proposed model can learn the semantics of each input with respect to the specific task. The outputs of the LSTM units are concatenated and passed through a feedforward network with maximum kernel size, followed by a pooling layer; at the output, the softmax layer uses the softmax function to squash the vector of arbitrary real-valued scores. A minimal sketch of this pipeline is given below.
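To make the workflow concrete, the following sketch shows one way to implement the tokenization step and the BiLSTM-DistilBERT architecture with the Hugging Face Transformers library and the Keras functional API. The checkpoint name, maximum sequence length, and LSTM width are illustrative assumptions, not the exact configuration used in this work.

# Illustrative sketch of the BiLSTM-DistilBERT pipeline: tokenization with
# batch_encode_plus(), then DistilBERT + BiLSTM + concatenation + pooling +
# softmax. Checkpoint name, MAX_LEN, and LSTM width are assumptions.
import tensorflow as tf
from transformers import DistilBertTokenizerFast, TFDistilBertModel

MAX_LEN = 64  # assumed maximum number of tokens per tweet

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

def encode(texts):
    # Convert raw tweets to padded/truncated token ids plus an attention
    # mask that tells the model which positions carry real tokens.
    enc = tokenizer.batch_encode_plus(
        list(texts),
        max_length=MAX_LEN,
        padding="max_length",
        truncation=True,
        return_tensors="tf",
    )
    return enc["input_ids"], enc["attention_mask"]

def build_model(num_classes=2):
    input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32,
                                    name="attention_mask")

    distilbert = TFDistilBertModel.from_pretrained("distilbert-base-uncased")
    hidden = distilbert(input_ids, attention_mask=attention_mask)[0]  # (batch, MAX_LEN, 768)

    # BiLSTM over the contextual embeddings to capture long-range dependencies
    bilstm = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64, return_sequences=True))(hidden)

    # Concatenate the DistilBERT and BiLSTM representations, then pool
    merged = tf.keras.layers.concatenate([hidden, bilstm])
    pooled = tf.keras.layers.GlobalMaxPooling1D()(merged)

    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(pooled)
    return tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)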

RESULT AND DISCUSSION
The proposed model is implemented using Keras [21], an open-source software library that provides a Python interface for artificial neural networks; Keras acts as an interface for the TensorFlow library. In the binary classification challenge, tweets are classified as irony or not irony. For binary classification, we trained the model for 30 epochs with the Adam optimizer and the sparse categorical cross-entropy loss function [22]. A sketch of this training setup is shown below.
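Assuming the encode() helper and build_model() function from the earlier sketch, and hypothetical variable names for the tweet texts and labels, the training configuration described above could look as follows; the learning rate and batch size are assumptions:

# Compile and train with the reported configuration: Adam optimizer, sparse
# categorical cross-entropy, 30 epochs. train_texts/val_texts and the labels
# are assumed to come from the 80:20 split described in the results section.
train_ids, train_mask = encode(train_texts)
val_ids, val_mask = encode(val_texts)

model = build_model(num_classes=2)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),  # assumed rate
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
history = model.fit(
    {"input_ids": train_ids, "attention_mask": train_mask},
    train_labels,
    validation_data=({"input_ids": val_ids, "attention_mask": val_mask},
                     val_labels),
    epochs=30,
    batch_size=32,  # assumed batch size
)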

DataSet
SemEval-2018 task 3: irony detection in English tweets is a shared task on irony detection [23]: given a tweet, automatic NLP systems should determine whether the tweet is ironic (task A) and which type of irony, if any, is expressed (task B). The ironic tweets were collected using irony-related hashtags (i.e. #irony, #sarcasm, #not) and were subsequently manually annotated to minimize the amount of noise in the corpus. For both tasks, a training corpus of 3,834 tweets was provided, as well as a test set containing 784 tweets. Table 1 presents the irony and not irony samples for the binary classification task, and Table 2 shows the dataset splitting ratio for training, validation, and testing in the multi-class classification task.

Experimental results
The SemEval-2018 irony dataset provides only training and testing samples, so we divided the training samples into training and validation sets in a ratio of 80:20 [24]. Table 2 shows the dataset splitting ratio for the training, validation, and testing phases. We achieved a maximum training accuracy of 98% and a validation accuracy of 69%. On the testing samples, we achieved a precision of 81% for the not irony class and 66% for the irony class; recall of 77% for not irony and 72% for irony; and F1 scores of 79% for not irony and 69% for irony. Table 3 shows the precision, recall, and F1 score on the testing samples for binary classification. Our model performs better at classifying not irony tweets than irony tweets. Figure 2 shows the accuracy and loss on the training and validation datasets during training; as shown in Figure 2, the training loss remains higher than the validation loss. Figure 3 shows the confusion matrix for binary classification: a total of 168 not irony samples were classified as irony, and 108 vice versa. Figure 4 shows the area under the curve (AUC) and receiver operating characteristic (ROC) curve for binary irony classification; the AUC-ROC curve shows the performance of the classification model under various threshold settings, and the AUC of our model is 0.72. A sketch of the split and evaluation pipeline is given below.
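For reference, a sketch of the 80:20 split and of computing the reported metrics (precision, recall, F1, confusion matrix, AUC-ROC) with scikit-learn; the variable names carry over from the earlier sketches and are assumptions:

# 80:20 split of the SemEval training samples plus the evaluation metrics
# reported above. all_texts/all_labels and test_texts/test_labels are
# assumed to hold the SemEval-2018 task 3 training and test data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

train_texts, val_texts, train_labels, val_labels = train_test_split(
    all_texts, all_labels, test_size=0.20, random_state=42, stratify=all_labels)

test_ids, test_mask = encode(test_texts)
probs = model.predict({"input_ids": test_ids, "attention_mask": test_mask})
preds = np.argmax(probs, axis=1)

print(classification_report(test_labels, preds,
                            target_names=["not irony", "irony"]))
print(confusion_matrix(test_labels, preds))
print("AUC:", roc_auc_score(test_labels, probs[:, 1]))  # irony-class probability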
In the multi-class classification challenge, tweets are classified into three irony categories, namely clash irony, situational irony, and other irony, in addition to not irony. Since the multi-class dataset is derived from the binary classification challenge by categorizing the irony tweets, it is not balanced. For multi-class classification, we trained the model for 30 epochs with the Adam optimizer and the sparse categorical cross-entropy loss function. Table 4 shows the testing phase results for multi-class classification. As shown in Table 4, the proposed model achieves F1 scores of 84% for not irony, 18% for clash irony, and 12% for situational irony. Figure 5 shows the accuracy and loss on the training and validation datasets during training. We achieved a maximum training accuracy of 87% and a validation accuracy of 66%. On the testing samples, we achieved a precision of 73% for the not irony class, 57% for clash irony, and 80% for situational irony; recall of 99% for not irony, 10% for clash irony, and 6% for situational irony; and F1 scores of 84% for not irony, 18% for clash irony, and 12% for situational irony. Our model performs better at classifying not irony tweets than the various irony classes, and tweets of the other irony class are not properly classified. Figure 6 shows the confusion matrix for multi-class irony classification. The proposed model classified a total of 553 samples as not irony, 111 as clash irony, 56 as situational irony, and 36 as other irony.

CONCLUSION AND FUTURE SCOPE
In this research work, we proposed a BiLSTM-DistilBERT hybrid neural network model to address the fine-grained irony classification task on the SemEval-2018 task 3 dataset. Transformers are used to minimize the data preprocessing and feature extraction effort. Through a transfer learning approach, our proposed BiLSTM-DistilBERT model achieves state-of-the-art results on the SemEval-2018 task 3 dataset. In future work, transformers other than DistilBERT could be used to extract features from tweets, and classification models such as the basic supervised machine learning algorithm support vector machine could also be explored.

Figure 2. Training and validation loss and accuracy for 30 epochs in binary classification
Figure 3. Confusion matrix for binary irony classification

Figure 5. Loss and accuracy for training and validation samples during the training phase of multi-class irony classification
Figure 6. Confusion matrix for multi-class irony classification

Table 1. Binary classification dataset splitting ratio for training, validation and testing

Table 2. Multi-class classification dataset splitting ratio for training, validation and testing

Table 3. Precision, recall and F1 score for testing samples for binary classification

Table 4. Testing phase results for all four classes in multi-class classification