Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text

Sentiment analysis of Dravidian languages has received attention in recent years. However, most social media text is code-mixed and there is no research available on sentiment analysis of code-mixed Dravidian languages. Dravidian-CodeMix-FIRE 2020, a track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text, focused on creating a platform for researchers to come together and investigate the problem. The track covered two languages: (i) Tamil and (ii) Malayalam. The participants were given a dataset of YouTube comments, and the goal of the shared task submissions was to recognise the sentiment of each comment by classifying it as positive, negative, neutral or mixed feelings, or by recognising that the comment is not in the intended language. The performance of the systems was evaluated by weighted F1-score.


INTRODUCTION
Sentiment analysis is the task of identifying subjective opinions or responses about a given topic. It has been an active area of research in the past two decades in both academia and industry. There is an increasing demand for sentiment analysis on social media texts which are largely code-mixed. Code-mixing is a prevalent phenomenon in a multilingual community where the words, morphemes and phrases from two or more languages are mixed in speech or writing. Code-mixed texts are often written in non-native scripts particularly on social media. Systems trained on monolingual data fail on code-mixed data due to the complexity of code-switching at different linguistic levels in the text. This shared task presents a new gold standard corpus for sentiment analysis of code-mixed text in Dravidian languages (Tamil-English and Malayalam-English).
Tamil is one of the Dravidian languages spoken by Tamil people in India, Sri Lanka and by the Tamil diaspora around the world, with official recognition in India, Sri Lanka and Singapore. Malayalam is another Dravidian language spoken in the southern region of India, with official recognition in the Indian state of Kerala and the Union Territories of Lakshadweep and Puducherry. There are nearly 75 million Tamil speakers and 45 million Malayalam speakers in India and other countries. Tamil and Malayalam are highly agglutinative languages. Tamil script evolved from the Tamili script, Vatteluttu alphabet, and Chola-Pallava script. The modern Tamil script descended from the Chola-Pallava script. It has 12 vowels, 18 consonants, and 1 āytam (voiceless velar fricative). Minority languages such as Saurashtra, Badaga, Irula, and Paniya are also written in the Tamil script. The Malayalam script is the Vatteluttu alphabet extended with symbols from the Grantha script. Both Tamil and Malayalam scripts are alpha-syllabic, belonging to the abugida family of writing systems, which are partially alphabetic and partially syllable-based. However, social media users often adopt Roman script for typing because it is easy to input. Hence, the majority of the data available in social media for these under-resourced languages is code-mixed.
The goal of this task is to identify the sentiment polarity of a code-mixed dataset of comments/posts in Dravidian languages (Malayalam-English and Tamil-English) collected from social media. A comment/post may contain more than one sentence, but the average number of sentences per comment/post in the corpora is 1. Each comment/post is annotated with sentiment polarity at the comment/post level. The dataset also has a class imbalance problem, which reflects real-world scenarios. The datasets provided for training and development contain 11,335 and 1,260 sentences for Tamil, and 4,851 and 541 sentences for Malayalam, respectively. More details about the annotation of the dataset can be found in [6] and [5].
Our shared task aimed to encourage research that will reveal how sentiment is expressed in code-mixed scenarios on Dravidian social media text. The participants were provided with development, training and test datasets.

TASK DESCRIPTION
This is a message-level polarity classification task. Given a YouTube comment, the goal of the systems submitted to the shared task was to classify the comment as positive, negative, neutral or mixed feelings, or to recognise that the comment is not in the intended language. The dataset contains all three types of code-mixed sentences: inter-sentential switching, intra-sentential switching and tag switching. All comments were written in Roman script, with either the grammar of the native language and an English lexicon or English grammar and a native lexicon.

EVALUATION
The distribution of the sentiment classes is imbalanced in both datasets. In the Malayalam-English code-mixed dataset, the majority of comments belong to the positive (2,811) and neutral (1,903) classes. Similarly, in the Tamil-English code-mixed dataset, the positive (10,559), negative (2,037) and mixed-feelings (1,801) classes are the largest. This imbalance needs to be addressed. Hence, we chose the weighted average F1-score to rank the system submissions. The weighted average F1-score is the mean of the per-class F1 scores, with each class weighted by its support (i.e. its share of the class distribution). This takes into account the varying degrees of importance of each class in the dataset.
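The support-weighted F1 described above can be illustrated with a minimal sketch in plain Python. This is not the organisers' evaluation script; the function name `weighted_f1` and the toy labels are assumptions made for illustration only.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 scores, weighted by each class's support in y_true."""
    support = Counter(y_true)  # number of gold instances per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives: gold and predicted label both equal this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Each class contributes in proportion to its share of the gold labels.
        score += (n_true / total) * f1
    return score


# Toy, hypothetical labels: 4 positive and 2 negative gold comments.
gold = ["positive"] * 4 + ["negative"] * 2
pred = ["positive", "positive", "positive", "negative", "negative", "positive"]
print(round(weighted_f1(gold, pred), 4))
```

In this toy case the positive class has F1 = 0.75 and the negative class has F1 = 0.5, so the weighted score is (4/6)(0.75) + (2/6)(0.5) ≈ 0.667. Because weights follow class frequency, performance on the majority class dominates the score.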

RESULTS
Overall, 119 participants registered for this track. 32 teams submitted final results for Tamil and 28 teams submitted results for Malayalam. Table 1 and Table 2 show the rank lists for the Tamil and Malayalam tasks respectively. The runs are sorted in decreasing order of weighted F1-score. The best performing runs achieved weighted F1-scores of 0.65 and 0.74 for Tamil and Malayalam respectively. These scores are relatively low compared to monolingual sentiment analysis results in high-resourced languages such as English. This reflects the complexity of code-mixing and the class imbalance problem observed in the real-world setting. The top team, "SRJ", proposed a new model combining XLM-Roberta and a CNN to extract semantic information.

CONCLUSION
This paper gives an overview of the first shared task on sentiment analysis in code-mixed Dravidian text from social media, which aims at classifying YouTube comments. A hundred and nineteen participants registered for the task, and 32 teams for Tamil and 28 teams for Malayalam submitted results. Systems were trained on the imbalanced dataset. The methods proposed by participants ranged from traditional machine learning models with feature-based approaches to state-of-the-art embedding methods in deep learning models.