Published March 21, 2022 | Version v1
Dataset
Open
Set of obfuscated spam dataset by using LeetSpeak transformations
Creators
- 1. Mondragon Unibertsitatea
- 2. Instituto Universitário de Lisboa (ISCTE-IUL)
- 3. University of Vigo
Description
Spammers often use LeetSpeak and other text-hiding tricks when distributing unsolicited content. To evaluate deobfuscation techniques and their impact on spam classification, we preprocessed several popular public datasets so that their text is partially obfuscated. The transformed datasets are:
- YouTube Spam Collection [2, 3], available at https://www.dt.fee.unicamp.br/~tiago/youtubespamcollection/.
- A subset of YouTube Comments [4, 5], available at http://mlg.ucd.ie/yt/.
- CSDMC2010, available at http://csmining.org/index.php/spam-email-datasets-.html.
- TREC2007, available at https://plg.uwaterloo.ca/~gvcormac/treccorpus07/.
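To make the idea of partial obfuscation concrete, the following is a minimal sketch of a LeetSpeak transformation. It is not the procedure used to build this dataset: the substitution table `LEET_MAP`, the `rate` parameter, and the function names are all hypothetical, chosen only for illustration.

```python
import random

# Hypothetical substitution table. The mapping actually used to build
# the dataset is not given in this record.
LEET_MAP = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}


def obfuscate(text, rate=0.5, seed=None):
    """Replace each mappable character with probability `rate`.

    rate=0.0 leaves the text unchanged; rate=1.0 replaces every
    character that has an entry in LEET_MAP.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        sub = LEET_MAP.get(ch.lower())
        if sub is not None and rng.random() < rate:
            out.append(sub)
        else:
            out.append(ch)
    return "".join(out)


def deobfuscate_naive(text):
    """Invert LEET_MAP character by character.

    This baseline is deliberately naive: it loses letter case and also
    rewrites digits that were genuine numbers in the original text.
    """
    reverse = {v: k for k, v in LEET_MAP.items()}
    return "".join(reverse.get(ch, ch) for ch in text)
```

With `rate=1.0`, `obfuscate("free money")` yields `fr33 m0n3y`. A lower rate gives the partially obfuscated text described above, and the shortcomings of `deobfuscate_naive` suggest why dedicated deobfuscation techniques are worth evaluating.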
Files
Name | Size | Checksum |
---|---|---|
corpora.zip | 42.5 MB | md5:e343d92e9cb2deebf2ffc795cfc3c8d0 |
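The MD5 value listed for corpora.zip can be used to check a download for corruption. The sketch below shows one way to do this with Python's standard library. The local file path passed to `verify` is an assumption, and the helper names are hypothetical.

```python
import hashlib

# Checksum published in this record for corpora.zip.
EXPECTED_MD5 = "e343d92e9cb2deebf2ffc795cfc3c8d0"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected
```

Assuming the archive was saved as `corpora.zip` in the working directory, `verify("corpora.zip")` should return `True` for an intact download. MD5 is adequate for detecting accidental corruption but not for guarding against deliberate tampering.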