Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text

Bharathi Raja Chakravarthi; Vigneshwaran Muralidaran; Ruba Priyadharshini; John Philip McCrae

doi:10.5281/zenodo.3842641

Published May 11, 2020 | Version v1

Conference paper Open

1. National University of Ireland Galway
2. School of English, Communication and Philosophy, Cardiff University
3. Saraswathi Narayanan College

Understanding the sentiment of a comment from a video or an image is an essential task in many applications. Sentiment analysis of a text can be useful for various decision-making processes. One such application is to analyse the popular sentiments of videos on social media based on viewer comments. However, comments from social media do not follow strict rules of grammar, and they contain mixing of more than one language, often written in non-native scripts. Non-availability of annotated code-mixed data for a low-resourced language like Tamil also adds difficulty to this problem. To overcome this, we created a gold standard Tamil-English code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube. In this paper, we describe the process of creating the corpus and assigning polarities. We present inter-annotator agreement and show the results of sentiment analysis trained on this corpus as a benchmark.

Files

chakravarthi2020corpus.pdf (347.0 kB, md5:4146dd6ab97729230b425f5c49b253df)

Additional details

Funding

European Commission: ELEXIS - European Lexicographic Infrastructure (grant 731015)
European Commission: Pret-a-LLOD - Ready-to-use Multilingual Linked Language Data for Knowledge Services across Sectors (grant 825182)

Resource type

Conference paper

Publisher

Zenodo

Conference

1st Joint SLTU (Spoken Language Technologies for Under-resourced languages) and CCURL (Collaboration and Computing for Under-Resourced Languages) Workshop at LREC 2020

Languages

English

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: May 25, 2020
Modified: July 19, 2024
