Published December 20, 2024 | Version v1
Dataset Open

MalCL: Leveraging GAN-Based Generative Replay to Combat Catastrophic Forgetting in Malware Classification

  • 1. Ewha Womans University
  • 2. The University of Texas at El Paso

Description

These are the two datasets, EMBER Class and AZ-Class, needed to reproduce the results of the paper "MalCL: Leveraging GAN-Based Generative Replay to Combat Catastrophic Forgetting in Malware Classification", accepted for publication at the 39th Annual AAAI Conference on Artificial Intelligence (AAAI 2025).

  • EMBER 2018 dataset
    We use the 2018 EMBER dataset, known for its challenging classification tasks. We focus on a subset of 337,035 malicious Windows PE files labeled with the top 100 malware families, each of which has more than 400 samples. Features include file size, PE and COFF header details, DLL characteristics, imported and exported functions, and properties such as size and entropy, all computed using the feature hashing trick.

  • AZ-Class
    The AZ-Class dataset contains 285,582 samples from 100 Android malware families, each with at least 200 samples. We extracted Drebin features (Arp et al. 2014) from the apps, covering eight categories such as hardware access, permissions, API calls, and network addresses.
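
The feature hashing trick mentioned for the EMBER features can be illustrated with a minimal sketch. This is not the EMBER extraction code: the function name `hash_features`, the bucket count, the choice of MD5 as the hash, and the example tokens are all assumptions made for illustration. The idea is only that a variable-length set of string tokens (for example, imported function names) is mapped to a fixed-length numeric vector.

```python
import hashlib


def hash_features(tokens, n_buckets=16):
    """Map a list of string tokens to a fixed-length vector (hypothetical helper).

    Each token is hashed; the hash picks a bucket index and a sign,
    and the signed count is accumulated in that bucket.
    """
    vec = [0.0] * n_buckets
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "little")
        idx = h % n_buckets                     # bucket chosen by the hash value
        sign = 1.0 if (h >> 63) == 0 else -1.0  # top bit decides the sign
        vec[idx] += sign
    return vec


if __name__ == "__main__":
    # Illustrative tokens only; not taken from the dataset.
    imports = ["kernel32.dll:CreateFileA", "user32.dll:MessageBoxA"]
    print(hash_features(imports, n_buckets=8))
```

The output length depends only on `n_buckets`, not on how many tokens a file has, which is what lets files with different numbers of imports or exports share one feature layout. Distinct tokens can collide in the same bucket; the signed update is a common way to reduce the bias this introduces.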

Files

Files (6.0 GB)

Checksum                              Size
md5:1e4014ff6eb613845a6bcf38a2461001  280.5 MB
md5:644649dfb93f0f0086052a923b657833  2.5 GB
md5:9a844162d5ccca20987ce520d6355f33  321.1 MB
md5:27e39d1cb697434107f92a8084128734  2.9 GB
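
The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify_md5` are hypothetical, and the file path is supplied by the user because the file names are not shown in this record.

```python
import hashlib
import sys


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_md5(path, expected):
    """Compare a file's MD5 with an expected value such as 'md5:1e40...'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python verify.py <downloaded_file> <md5 from the list above>
    ok = verify_md5(sys.argv[1], sys.argv[2])
    print("checksum OK" if ok else "checksum MISMATCH")
```

Reading in chunks keeps memory use small, which matters for the multi-gigabyte files in this record.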

Additional details

Dates

Accepted
2024-12-20

Software

Repository URL
https://github.com/MalwareReplayGAN/MalCL
Programming language
Python