# Brief description of the directory
Spammers often use LeetSpeak [1] and other text-hiding tricks when distributing unsolicited content. To evaluate deobfuscation techniques and their impact on spam content classification, we preprocessed several popular public datasets to partially obfuscate their text. The transformed datasets are:
- YouTube Spam Collection [2, 3], available at https://www.dt.fee.unicamp.br/~tiago/youtubespamcollection/
- A subset of YouTube Comments [4, 5], available at http://mlg.ucd.ie/yt/
- CSDMC2010, available at http://csmining.org/index.php/spam-email-datasets-.html
- TREC2007, available at https://plg.uwaterloo.ca/~gvcormac/treccorpus07/

The postprocessed datasets are shared publicly with the scientific community in separate folders. Each folder in this directory contains 4 transformations of the corresponding dataset, including:
- baseline: This is a copy of the original dataset
- cleaned: This is the original dataset from which we removed special symbols
- obfuscated: This version contains the original dataset obfuscated by using LeetSpeak transformations
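
As a rough illustration of what the cleaned and obfuscated variants involve, the following is a minimal Python sketch. It is not the code used to build these datasets: the substitution table `LEET_MAP`, the functions `clean` and `obfuscate`, and the `rate` and `seed` parameters are all hypothetical names chosen for this example, and the exact symbols removed or substituted in the shared files may differ.

```python
import random
import re

# Hypothetical substitution table (an assumption for this sketch);
# the mapping used to build the shared datasets is not documented here.
LEET_MAP = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}


def clean(text):
    """Remove special symbols, keeping word characters
    (letters, digits, underscore) and whitespace."""
    return re.sub(r"[^\w\s]", "", text)


def obfuscate(text, rate=0.5, seed=None):
    """Partially obfuscate text with LeetSpeak-style substitutions.

    Each character that has an entry in LEET_MAP is replaced with
    probability `rate`; every other character is left untouched.
    """
    rng = random.Random(seed)  # seeded generator for reproducible output
    out = []
    for ch in text:
        sub = LEET_MAP.get(ch.lower())
        if sub is not None and rng.random() < rate:
            out.append(sub)
        else:
            out.append(ch)
    return "".join(out)
```

With `rate=1.0` every mappable character is replaced, so `obfuscate("Free tickets", rate=1.0)` returns `Fr33 71ck375`; lower rates leave part of the text readable, which matches the idea of partial obfuscation described above.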

Please check the license agreement for each of the datasets in order to avoid legal issues.
The original dataset can be found in the Mondragon Unibertsitatea Repository -- https://gitlab.danz.eus/datasharing/ski4spam

# Bibliography
[1] Flamand, Eveline, and Anne-Marie Simon-Vandenbergen: Deciphering L33t5p34k: Internet Slang On Message Boards. 2008. Available at https://lib.ugent.be/catalog/rug01:001414289

[2] Alberto, T.C., Lochter, J.V., Almeida, T.A.: Filtragem Automática de Spam nos Comentários do YouTube. Anais do XII Encontro Nacional de Inteligência Artificial e Computacional (ENIAC'15), Natal, RN, Brazil, 2015.

[3] Alberto, T.C., Lochter, J.V., Almeida, T.A.: TubeSpam: Comment Spam Filtering on YouTube. Proceedings of the 14th IEEE International Conference on Machine Learning and Applications (ICMLA'15), pp. 138–143, Miami, FL, USA, December 2015.
 
[4] O'Callaghan, D., Harrigan, M., Carthy, J., Cunningham, P.: Network Analysis of Recurring YouTube Spam Campaigns. arXiv:1201.3783, 2012.

[5] O'Callaghan, D., Harrigan, M., Carthy, J., Cunningham, P.: Identifying Discriminating Network Motifs in YouTube Spam. arXiv:1202.5216, 2012.

# Authors
The obfuscated text datasets were conceived and developed by the Data Analysis and Cybersecurity research team of Mondragon Unibertsitatea: https://www.mondragon.edu/en/research-transfer/engineering-technology/research-and-transfer-groups/-/mu-inv-mapping/grupo/analisis-de-datos-y-ciberseguridad
