Dataset Open Access

Dataset: Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board

Antonis Papasavva; Savvas Zannettou; Emiliano De Cristofaro; Gianluca Stringhini; Jeremy Blackburn

MARC21 XML Export

<?xml version='1.0' encoding='UTF-8'?>
<record xmlns="">
  <controlfield tag="005">20200527170802.0</controlfield>
  <controlfield tag="001">3606810</controlfield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Max Planck Institute</subfield>
    <subfield code="a">Savvas Zannettou</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">University College London</subfield>
    <subfield code="a">Emiliano De Cristofaro</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Boston University</subfield>
    <subfield code="a">Gianluca Stringhini</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Binghamton University</subfield>
    <subfield code="a">Jeremy Blackburn</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">24039475594</subfield>
    <subfield code="z">md5:3ad65640bf590d77af0f931045aef2e0</subfield>
    <subfield code="u"></subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">231605</subfield>
    <subfield code="z">md5:b3394d4a5a1bd254ef6a6afca2db0270</subfield>
    <subfield code="u"></subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">open</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2020-01-13</subfield>
  </datafield>
  <datafield tag="909" ind1="C" ind2="O">
    <subfield code="p">openaire_data</subfield>
    <subfield code="o"></subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="u">University College London</subfield>
    <subfield code="a">Antonis Papasavva</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Dataset: Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board</subfield>
  </datafield>
  <datafield tag="540" ind1=" " ind2=" ">
    <subfield code="u"></subfield>
    <subfield code="a">Creative Commons Attribution 4.0 International</subfield>
  </datafield>
  <datafield tag="650" ind1="1" ind2="7">
    <subfield code="a">cc-by</subfield>
    <subfield code="2"></subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p&gt;This is the dataset released with the &lt;a href=""&gt;paper&lt;/a&gt; titled:&amp;nbsp;&amp;quot;Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board&amp;quot;.&lt;/p&gt;

&lt;p&gt;The dataset is a single&amp;nbsp;&lt;a href=""&gt;Newline delimited JSON file&lt;/a&gt;. Each line in the file consists of a JSON object which is a full 4chan /pol/ thread.&amp;nbsp;The JSON objects contain&amp;nbsp;all the&amp;nbsp;key/values returned by the &lt;a href=""&gt;4chan API&lt;/a&gt;, along with three additional keys&amp;nbsp;(&lt;em&gt;entities,&amp;nbsp;perspectives&lt;/em&gt;, and &lt;em&gt;extracted_poster_id&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;For each JSON object,&amp;nbsp;we&amp;nbsp;complement the data with the list of named entities we&amp;nbsp;detect for each post, using the &lt;a href=""&gt;spaCy&lt;/a&gt; Python library. In addition, for each post we add scores returned by&amp;nbsp;Google&amp;rsquo;s &lt;a href=""&gt;Perspective API&lt;/a&gt;: seven scores, each in the [0, 1] interval.&lt;/p&gt;

&lt;p&gt;For the detailed description of every &lt;em&gt;key &lt;/em&gt;in the JSON structure, along with the type of the &lt;em&gt;value&lt;/em&gt;, please read the readme.pdf file provided with this dataset.&lt;/p&gt;

&lt;p&gt;If you find our dataset useful, please cite our paper:&lt;/p&gt;

  title={Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board},
  author={Antonis Papasavva, Savvas Zannettou, Emiliano De Cristofaro, Gianluca Stringhini, Jeremy Blackburn},
  journal={14th International AAAI Conference On Web And Social Media (ICWSM), 2020},


&lt;p&gt;&lt;strong&gt;How to extract the data:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Note that the data is compressed. See the instructions below on how to extract the data:&lt;/p&gt;

	&lt;li&gt;&lt;strong&gt;Linux and Mac&lt;/strong&gt;&lt;/li&gt;

&lt;p&gt;Step 1: Open a terminal window and navigate to the path where the file&amp;nbsp;&lt;em&gt;pol_0616-1119_labeled.tar.zst&amp;nbsp;&lt;/em&gt;is located.&lt;/p&gt;

&lt;p&gt;Step 2: Run the following command:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;unzstd pol_0616-1119_labeled.tar.zst&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The above command produces a file named&amp;nbsp;&lt;em&gt;pol_0616-1119_labeled.tar&lt;/em&gt; in the same directory.&lt;/p&gt;

&lt;p&gt;Step 3: Again, from your terminal window, run this command:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;tar -xvf pol_0616-1119_labeled.tar&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;When the above command finishes, you will get (in the same directory) the extracted data - a file named&amp;nbsp;&lt;em&gt;pol_062016-112019_labeled.ndjson.&lt;/em&gt;&lt;/p&gt;


&lt;p&gt;Many applications that can extract this data on Windows are available online. The authors cannot recommend specific applications. Note that the file is compressed twice, so you will need to perform the extraction twice: once on the downloaded file, and once on the file extracted from it.&lt;/p&gt;


&lt;p&gt;Please do not hesitate to contact the author of this study in case you face any problem at: &lt;/p&gt;</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isVersionOf</subfield>
    <subfield code="a">10.5281/zenodo.3606809</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.5281/zenodo.3606810</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">dataset</subfield>
  </datafield>
</record>
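
The record's description says the extracted data is a newline-delimited JSON file in which each line is one full /pol/ thread. A minimal Python sketch of reading such a file one line at a time is given below. It rests on one assumption not stated in the record: that each thread object stores its posts in a "posts" list, as the 4chan API's thread endpoint does. The helper names (iter_threads, iter_posts, count_posts) are hypothetical, and the readme.pdf shipped with the dataset remains the authority on the actual keys.

```python
import json
from typing import Iterator


def iter_threads(path: str) -> Iterator[dict]:
    """Yield one thread (one JSON object) per non-empty line of an NDJSON file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                yield json.loads(line)


def iter_posts(path: str) -> Iterator[dict]:
    """Yield every post of every thread.

    Assumption: each thread keeps its posts in a "posts" list,
    as in the 4chan API's thread endpoint. Threads without that
    key contribute no posts.
    """
    for thread in iter_threads(path):
        yield from thread.get("posts", [])


def count_posts(path: str) -> int:
    """Count posts without holding the whole file in memory."""
    return sum(1 for _ in iter_posts(path))
```

For example, count_posts("pol_062016-112019_labeled.ndjson") would walk the extracted file named in the description. Reading line by line is a deliberate choice here: the record lists a file of roughly 24 GB, so loading everything at once is unlikely to be practical.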
                  All versions   This version
Views             26,738         26,738
Downloads         24,700         24,700
Data volume       126.9 TB       126.9 TB
Unique views      24,273         24,273
Unique downloads  19,880         19,880

