Dataset Open Access

DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text

Chakravarthi, Bharathi Raja; Priyadharshini, Ruba; Muralidaran, Vigneshwaran; Jose, Navya; Suryawanshi, Shardul; McCrae, John P.


MARC21 XML Export

<?xml version='1.0' encoding='UTF-8'?>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nmm##2200000uu#4500</leader>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">Tamil, Malayalam, Kannada, Dravidian languages, Sentiment Analysis, Offensive language identification, Code-mixed, corpora</subfield>
  </datafield>
  <controlfield tag="005">20210512134815.0</controlfield>
  <controlfield tag="001">4750858</controlfield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">ULTRA Arts and Science College, Madurai, Tamil Nadu, India</subfield>
    <subfield code="a">Priyadharshini, Ruba</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Cardiff University, United Kingdom</subfield>
    <subfield code="a">Muralidaran, Vigneshwaran</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Indian Institute of Information Technology and Management-Kerala, Kerala, India</subfield>
    <subfield code="a">Jose, Navya</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">National University of Ireland Galway</subfield>
    <subfield code="a">Suryawanshi, Shardul</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">National University of Ireland Galway</subfield>
    <subfield code="a">McCrae, John P.</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">10793728</subfield>
    <subfield code="z">md5:7850be52919a387f5b36c7a09b05ad87</subfield>
    <subfield code="u">https://zenodo.org/record/4750858/files/DravidianCodeMix-2020.zip</subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">open</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2021-05-12</subfield>
  </datafield>
  <datafield tag="909" ind1="C" ind2="O">
    <subfield code="p">openaire_data</subfield>
    <subfield code="o">oai:zenodo.org:4750858</subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="u">National University of Ireland Galway</subfield>
    <subfield code="0">(orcid)0000-0002-4575-7934</subfield>
    <subfield code="a">Chakravarthi, Bharathi Raja</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text</subfield>
  </datafield>
  <datafield tag="540" ind1=" " ind2=" ">
    <subfield code="u">https://creativecommons.org/licenses/by/4.0/legalcode</subfield>
    <subfield code="a">Creative Commons Attribution 4.0 International</subfield>
  </datafield>
  <datafield tag="650" ind1="1" ind2="7">
    <subfield code="a">cc-by</subfield>
    <subfield code="2">opendefinition.org</subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p&gt;This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages generated from social media comments. The dataset was annotated for sentiment analysis and offensive language identification for a total of more than 60,000 YouTube comments. The dataset consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators and has a high inter-annotator agreement in Krippendorff&amp;#39;s alpha. The dataset contains all types of code-mixing phenomena since it comprises user-generated content from a multilingual country.&amp;nbsp; We also present baseline experiments to establish benchmarks on the dataset using machine learning methods.&lt;/p&gt;

&lt;p&gt;If you use the data or code from this research, please cite our paper below:&lt;/p&gt;

&lt;p&gt;@article{chakravarthi-etal-2021-lre,&lt;br&gt;
title = &amp;quot;DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text&amp;quot;,&lt;br&gt;
author = &amp;quot;Chakravarthi, Bharathi Raja&amp;nbsp; and&lt;br&gt;
&amp;nbsp; Priyadharshini, Ruba&amp;nbsp; and&lt;br&gt;
&amp;nbsp; Muralidaran, Vigneshwaran and&lt;br&gt;
&amp;nbsp; Jose, Navya and&lt;br&gt;
&amp;nbsp; Suryawanshi, Shardul and&lt;br&gt;
&amp;nbsp; Sherly, Elizabeth&amp;nbsp; and&lt;br&gt;
&amp;nbsp; McCrae, John P&amp;quot;,&lt;br&gt;
&amp;nbsp; journal={Language Resources and Evaluation},&lt;br&gt;
&amp;nbsp; year={2021},&lt;br&gt;
&amp;nbsp; publisher={Springer}&lt;br&gt;
}&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isVersionOf</subfield>
    <subfield code="a">10.5281/zenodo.4750857</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.5281/zenodo.4750858</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">dataset</subfield>
  </datafield>
</record>
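
The record above carries the dataset DOI (field 024), the download URL, and an MD5 checksum for the archive (field 856). The following is a minimal sketch, assuming only the Python standard library, of how those values could be read from the MARC21 XML and how a downloaded file could be checked against the checksum. The helper names `subfields` and `md5_matches` are hypothetical and not part of any Zenodo tooling, and `SAMPLE_RECORD` is a trimmed copy of the export that keeps only the fields used here.

```python
import hashlib
import xml.etree.ElementTree as ET

# Namespace declared on the <record> element of the export.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Trimmed copy of the record above (assumption: only these fields are needed).
SAMPLE_RECORD = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">10793728</subfield>
    <subfield code="z">md5:7850be52919a387f5b36c7a09b05ad87</subfield>
    <subfield code="u">https://zenodo.org/record/4750858/files/DravidianCodeMix-2020.zip</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.5281/zenodo.4750858</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
</record>"""


def subfields(record, tag, code):
    """Return the text of every subfield `code` inside datafields with `tag`."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [el.text for el in record.findall(path, MARC_NS)]


def md5_matches(path, expected, chunk_size=1 << 20):
    """Compare a file's MD5 digest with a value such as 'md5:7850be...'."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so a large archive is never held in memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    # Drop the optional 'md5:' prefix used in the record before comparing.
    return digest.hexdigest() == expected.split(":", 1)[-1]


if __name__ == "__main__":
    record = ET.fromstring(SAMPLE_RECORD)
    print("DOI:     ", subfields(record, "024", "a")[0])
    print("File URL:", subfields(record, "856", "u")[0])
    print("Checksum:", subfields(record, "856", "z")[0])
```

After downloading the archive, a call such as `md5_matches("DravidianCodeMix-2020.zip", subfields(record, "856", "z")[0])` would return `True` only if the local file matches the checksum published in the record.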