Journal article Open Access

Automatic Detection of Online Abuse and Analysis of Problematic Users in Wikipedia

Rawat, Charu; Sarkar, Arnab; Singh, Sameer; Alvarado, Rafael; Rasberry, Lane


MARC21 XML Export

<?xml version='1.0' encoding='UTF-8'?>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nam##2200000uu#4500</leader>
  <datafield tag="041" ind1=" " ind2=" ">
    <subfield code="a">eng</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">Wikipedia, machine learning, misconduct, harassment, community moderation, Natural Language Processing</subfield>
  </datafield>
  <controlfield tag="005">20191114190947.0</controlfield>
  <controlfield tag="001">3101511</controlfield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">University of Virginia</subfield>
    <subfield code="a">Sarkar, Arnab</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">University of Virginia</subfield>
    <subfield code="a">Singh, Sameer</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">University of Virginia</subfield>
    <subfield code="a">Alvarado, Rafael</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">University of Virginia</subfield>
    <subfield code="0">(orcid)0000-0002-9485-6146</subfield>
    <subfield code="a">Rasberry, Lane</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">416283</subfield>
    <subfield code="z">md5:a17cc1eadcd0183a7673ef93277da7dc</subfield>
    <subfield code="u">https://zenodo.org/record/3101511/files/Automatic Detection of Online Abuse and Analysis of Problematic Users in Wikipedia.pdf</subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">open</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2019-05-21</subfield>
  </datafield>
  <datafield tag="909" ind1="C" ind2="O">
    <subfield code="o">oai:zenodo.org:3101511</subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="u">University of Virginia</subfield>
    <subfield code="a">Rawat, Charu</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Automatic Detection of Online Abuse and Analysis of Problematic Users in Wikipedia</subfield>
  </datafield>
  <datafield tag="540" ind1=" " ind2=" ">
    <subfield code="u">http://creativecommons.org/licenses/by/4.0/legalcode</subfield>
    <subfield code="a">Creative Commons Attribution 4.0 International</subfield>
  </datafield>
  <datafield tag="650" ind1="1" ind2="7">
    <subfield code="a">cc-by</subfield>
    <subfield code="2">opendefinition.org</subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p&gt;Today&amp;rsquo;s digital landscape is characterized by the pervasive presence of online communities. One of the persistent challenges to the ideal of free-flowing discourse in these communities has been online abuse. Wikipedia is a case in point, as it&amp;rsquo;s large community of contributors have experienced the perils of online abuse ranging from hateful speech to personal attacks to spam. Currently, Wikipedia has a human-driven process in place to identify online abuse. In this paper, we propose a framework to understand and detect such abuse in the English Wikipedia community. We analyze the publicly available data sources provided by Wikipedia. We discover that Wikipedia&amp;rsquo;s XML dumps require extensive computing power to be used for temporal textual analysis, and, as an alternative, we propose a web scraping methodology to extract user-level data and perform extensive exploratory data analysis to understand the characteristics of users who have been blocked for abusive behavior in the past. With these data, we develop an abuse detection model that leverages Natural Language Processing techniques, such as character and word n-grams, sentiment analysis and topic modeling, and generates features that are used as inputs in a model based on machine learning algorithms to predict abusive behavior. Our best abuse detection model, using XGBoost Classifier, gives us an AUC of ~84%.&lt;/p&gt;</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isVersionOf</subfield>
    <subfield code="a">10.5281/zenodo.3101510</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.5281/zenodo.3101511</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">publication</subfield>
    <subfield code="b">article</subfield>
  </datafield>
</record>
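
The abstract in field 520 mentions character and word n-grams as inputs to the abuse detection model. The record does not include the authors' implementation, so the following is only a minimal sketch of how such n-gram count features might be extracted using the Python standard library. The function names (`char_ngrams`, `word_ngrams`, `ngram_features`), the whitespace tokenization, and the default n-gram sizes are all assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams, using whitespace tokenization."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, char_n=3, word_n=2):
    """Count character and word n-grams in one sparse feature mapping.

    Keys are prefixed with "c:" or "w:" so the two feature families
    cannot collide. The default sizes are illustrative assumptions.
    """
    feats = Counter()
    feats.update("c:" + g for g in char_ngrams(text, char_n))
    feats.update("w:" + g for g in word_ngrams(text, word_n))
    return feats
```

A mapping like this could be vectorized and passed, alongside other features such as sentiment or topic scores, to a classifier; the record names XGBoost as the best-performing model but gives no further configuration.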