EMBERSim: A Large-Scale Databank for Boosting Similarity Search in Malware Analysis

doi:10.5281/zenodo.8014709

Published June 7, 2023 | Version 1.0.0

Dataset Open

1. CrowdStrike

In recent years, malware detection has shifted from heuristics-based approaches towards machine learning, which proves more robust in the current, heavily adversarial threat landscape. While we acknowledge that machine learning is better equipped to mine for patterns in the growing volume of similar-looking files, we also note a remarkable scarcity of data available for similarity-targeted research. Moreover, we observe that the few related works focus on quantifying similarity in malware, often overlooking clean data. This one-sided quantification is especially dangerous in the context of detection bypass. We propose to address the deficiencies in similarity research on binary files, starting from EMBER, one of the largest malware classification data sets. We enhance EMBER with similarity information as well as malware class tags to enable further research in the similarity space. Our contribution is threefold: (1) we publish EMBERSim, an augmented version of EMBER that includes similarity-informed tags; (2) we enrich EMBERSim with automatically determined malware class tags, obtained by running the open-source tool AVClass on VirusTotal data; and (3) we describe and share the implementation of our class scoring technique and leaf similarity method.
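To give a rough idea of what a leaf-based similarity can look like, the sketch below scores two samples by the fraction of trees in which they land in the same leaf. This is an assumption made for illustration, not the authors' published implementation: the function names `leaf_similarity` and `most_similar` are hypothetical, and the leaf indices are toy values. With a trained XGBoost model, per-tree leaf indices of this shape can be obtained from `Booster.predict(..., pred_leaf=True)`.

```python
def leaf_similarity(leaves_a, leaves_b):
    """Fraction of trees in which two samples fall into the same leaf.

    Each argument holds one leaf index per tree of the same ensemble.
    This definition is an illustrative assumption, not the method
    shipped with the dataset.
    """
    if len(leaves_a) != len(leaves_b):
        raise ValueError("both samples need one leaf index per tree")
    if not leaves_a:
        raise ValueError("at least one tree is required")
    matches = sum(1 for a, b in zip(leaves_a, leaves_b) if a == b)
    return matches / len(leaves_a)


def most_similar(query_leaves, corpus, top_k=3):
    """Rank corpus entries (name -> leaf indices) by similarity to the query."""
    scored = [
        (name, leaf_similarity(query_leaves, leaves))
        for name, leaves in corpus.items()
    ]
    # Highest similarity first; sort is stable, so ties keep insertion order.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy leaf indices for a hypothetical four-tree ensemble.
    corpus = {
        "sample_a": [3, 7, 1, 4],
        "sample_b": [3, 7, 2, 5],
        "sample_c": [0, 0, 0, 0],
    }
    for name, score in most_similar([3, 7, 1, 4], corpus):
        print(name, score)
```

A score of 1.0 means the two samples follow the same path to a leaf in every tree, while 0.0 means they never share a leaf.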

Files

Files (2.9 GB)

Name	MD5	Size
avclass_tag_co_occurrence.alias	45f30f262024e41284e82d4df06ca5cb	3.3 MB
ember_with_avclass_dataset.csv	587618183c0db169a0c7ba7974d0581e	149.0 MB
README.txt	898ea63596da15ced77dd034652ceba5	519 Bytes
sim_test_vs_train_test.csv	9f59518409fa5c16f2f416e98ce3a7c7	1.4 GB
sim_unlabelled_vs_train.csv	3c8ab6792e07608dc8f8bead2849e580	1.4 GB
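
Because the largest files are over a gigabyte each, it may be worth checking the downloads against the MD5 digests listed above. The following is a minimal sketch using only the Python standard library; the helper names `md5_of` and `verify_downloads` are hypothetical, and the script assumes the files were saved under their original names in a single directory.

```python
import hashlib
from pathlib import Path

# Expected digests copied from the file listing above.
EXPECTED_MD5 = {
    "avclass_tag_co_occurrence.alias": "45f30f262024e41284e82d4df06ca5cb",
    "ember_with_avclass_dataset.csv": "587618183c0db169a0c7ba7974d0581e",
    "README.txt": "898ea63596da15ced77dd034652ceba5",
    "sim_test_vs_train_test.csv": "9f59518409fa5c16f2f416e98ce3a7c7",
    "sim_unlabelled_vs_train.csv": "3c8ab6792e07608dc8f8bead2849e580",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True if it exists and its digest matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

Reading in one-megabyte chunks keeps memory use flat even for the 1.4 GB similarity files.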

Additional details

Anderson, H. S., & Roth, P. (2018). EMBER: An open dataset for training static PE malware machine learning models. arXiv preprint arXiv:1804.04637.
Sebastián, S., & Caballero, J. (2020, December). AVClass2: Massive malware tag extraction from AV labels. In Annual Computer Security Applications Conference (pp. 42-53).
Chen, T., & Guestrin, C. (2016, August). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794).
