Set of obfuscated spam dataset by using LeetSpeak transformations

Iñaki Velez de Mendizabal; Xabier Vidriales; Vitor Basto Fernandes; Enaitz Ezpeleta; José Ramón Méndez; Urko Zurutuza

doi:10.5281/zenodo.6373653

Published March 21, 2022 | Version v1

Dataset Open

1. Mondragon Unibertsitatea
2. Instituto Universitário de Lisboa (ISCTE-IUL)
3. University of Vigo

Spammers often use LeetSpeak and other text-hiding tricks when distributing unsolicited content. To evaluate deobfuscation techniques and their impact on spam content classification, we preprocessed several popular public datasets to partially obfuscate their text. The transformed datasets are:

YouTube Spam Collection [2, 3], available at https://www.dt.fee.unicamp.br/~tiago/youtubespamcollection/
A subset of YouTube Comments [4, 5], available at http://mlg.ucd.ie/yt/
CSDMC2010, available at http://csmining.org/index.php/spam-email-datasets-.html
TREC2007, available at https://plg.uwaterloo.ca/~gvcormac/treccorpus07/
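
To make the idea of partial obfuscation concrete, the following is a minimal sketch of a LeetSpeak transformation and a naive inverse. It is not the procedure used to build this dataset: the substitution table `LEET_MAP`, the `rate` parameter, and both function names are assumptions chosen for illustration only.

```python
import random

# Hypothetical substitution table (an assumption for illustration);
# the mapping actually used to build the dataset is not described here.
LEET_MAP = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}


def obfuscate(text, rate=0.5, seed=None):
    """Replace each mappable character with probability `rate`.

    A rate below 1.0 yields partially obfuscated text; `seed` makes
    the result reproducible.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        sub = LEET_MAP.get(ch.lower())
        if sub is not None and rng.random() < rate:
            out.append(sub)
        else:
            out.append(ch)
    return "".join(out)


def deobfuscate(text):
    """Naive inverse: map every substitute digit back to a lowercase letter.

    This also rewrites digits that were genuine (prices, phone numbers)
    and cannot restore the original letter case.
    """
    reverse = {digit: letter for letter, digit in LEET_MAP.items()}
    return "".join(reverse.get(ch, ch) for ch in text)
```

With `rate=1.0` every mappable character is replaced, so `obfuscate("free offer", rate=1.0)` gives `fr33 0ff3r`. The naive inverse shows why deobfuscation is hard to evaluate without ground truth: it cannot tell a substituted digit from a real one.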

Files

Files (42.5 MB)

Name	MD5 checksum	Size
corpora.zip	e343d92e9cb2deebf2ffc795cfc3c8d0	42.5 MB
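
The MD5 checksum listed above can be used to check a downloaded copy of the archive. The sketch below uses Python's standard `hashlib`; the function names and the chunk size are illustrative choices, and the local file path is whatever location the archive was saved to.

```python
import hashlib

# Checksum published for corpora.zip on this record.
EXPECTED_MD5 = "e343d92e9cb2deebf2ffc795cfc3c8d0"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Compare a file's MD5 digest with the expected value, ignoring case."""
    return md5_of_file(path) == expected.strip().lower()
```

Calling `matches_expected("corpora.zip")` on an intact download should return `True`. MD5 is adequate for detecting a corrupted transfer, but it does not protect against deliberate tampering.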

	All versions	This version
Views	680	677
Downloads	96	96
Data volume	4.8 GB	4.8 GB

Resource type: Dataset
Publisher: Zenodo

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: March 21, 2022
Modified: March 22, 2022
