Dataset Open Access

A Sentiment Analysis Dataset for Code-Mixed Malayalam-English

Chakravarthi, Bharathi Raja; Jose, Navya; Suryawanshi, Shardul; Shely, Elizabeth; McCrae, John Philip

There is an increasing demand for sentiment analysis of social media text, which is mostly code-mixed. Systems trained on monolingual data fail on code-mixed data because of the complexity of mixing at different levels of the text. However, very few resources are available for building models specific to code-mixed data. Although much research in multilingual and cross-lingual sentiment analysis has used semi-supervised or unsupervised methods, supervised methods still perform better. Only a few datasets are available, and only for popular language pairs such as English-Spanish, English-Hindi, and English-Chinese. There are no resources available for Malayalam-English code-mixed data. This paper presents a new gold standard corpus for sentiment analysis of code-mixed Malayalam-English text, annotated by voluntary annotators. The corpus obtained a Krippendorff’s alpha above 0.8. We use this new corpus to provide a benchmark for sentiment analysis of Malayalam-English code-mixed text.

https://www.aclweb.org/anthology/2020.sltu-1.25/
Files (498.0 kB)
Name: Malayalam_first_ready_for_sentiment.tsv
Size: 498.0 kB
md5: c72a274b712ff385e880a2ca525c4b3f
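The md5 value listed for the file can be used to check that a download arrived intact. The following is a minimal sketch, assuming the file has been saved locally under its listed name; the helper names md5_of_file and verify_download are illustrative and not part of the dataset or any library.

```python
import hashlib
from pathlib import Path

# Checksum as listed on the record page for the .tsv file.
EXPECTED_MD5 = "c72a274b712ff385e880a2ca525c4b3f"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the expected hex digest."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumption: the file sits in the current working directory.
    local = Path("Malayalam_first_ready_for_sentiment.tsv")
    if local.exists():
        print("checksum OK" if verify_download(local) else "checksum mismatch")
    else:
        print(f"{local} not found; download it first")
```

A mismatch usually points to an incomplete or altered download rather than a problem with the data itself, so fetching the file again is a reasonable first step.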