There is a newer version of this record available.

Dataset Closed Access

Dataset for "On the Origins of Memes by Means of Fringe Web Communities"

Savvas Zannettou; Tristan Caulfield; Jeremy Blackburn; Emiliano De Cristofaro; Michael Sirivianos; Gianluca Stringhini; Guillermo Suarez-Tangil

This dataset is obsolete because the Twitter pHashes were incompatible with all the other pHashes in the dataset due to a version mismatch of the ImageHash Python library.

Please download the updated dataset here:


This dataset was collected with research funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant Agreement No 691025.
This dataset was used in the following publication: "On the Origins of Memes by Means of Fringe Web Communities". Savvas Zannettou, Tristan Caulfield, Jeremy Blackburn, Emiliano De Cristofaro, Michael Sirivianos, Gianluca Stringhini, and Guillermo Suarez-Tangil. ACM Internet Measurement Conference (IMC), 2018. DOI:


The dataset consists of all the URLs and pHashes for images from Twitter, Reddit, 4chan's /pol/, and Gab posted between July 2016 and the end of July 2017.
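
Perceptual hashes such as these are typically compared by Hamming distance, that is, the number of bits in which two hashes differ. The following is a minimal sketch of that comparison, assuming the pHashes are stored as hex strings (the form the ImageHash library produces when a hash is converted to a string). The function name `hamming_distance` and the two sample values are hypothetical and are not taken from the dataset.

```python
def hamming_distance(phash_a: str, phash_b: str) -> int:
    """Count the differing bits between two equal-length hex-encoded hashes."""
    if len(phash_a) != len(phash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 in every bit position where the two hashes disagree.
    return bin(int(phash_a, 16) ^ int(phash_b, 16)).count("1")


# Hypothetical 64-bit hash values (16 hex characters), for illustration only.
a = "d1c4b2a39687f0e1"
b = "d1c4b2a39687f0e3"
print(hamming_distance(a, b))
```

A small distance suggests visually similar images. Hashes produced by incompatible library versions cannot be meaningfully compared this way, which is the problem the obsolescence notice above describes.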

The code related to this research can be found here:, or here: 10.5281/zenodo.1463050

Presentation available here:

Closed Access

Files are not publicly accessible.

Statistics (all versions / this version)
Views: 1,142 / 744
Downloads: 211 / 128
Data volume: 1.1 TB / 676.7 GB
Unique views: 949 / 679
Unique downloads: 169 / 94

