Published October 31, 2023 | Version V1.0
Dataset Open

MaDroid: A Maliciousness-aware Multifeatured Dataset for Detecting Android Malware

  • 1. dguoyun@hnu.edu.cn
  • 2. haopengliu@hnu.edu.cn
  • 3. caiminjie@hnu.edu.cn
  • 4. jhsun@hnu.edu.cn
  • 5. haochen@hnu.edu.cn

Description

MaDroid is a maliciousness-aware, multifeatured dataset of system calls for APK anomaly detection. It contains 50,429 labeled system call sequences (24,789 normal and 25,640 abnormal), totaling 1.1 billion items of system call feature information. Each APK is labeled with the latest VT checksum information, and the sequence data includes 81 groups of system calls, system call parameters, and return values. The whole dataset is 457 GB (19 GB after compression), of which 236 GB is malicious system call sequence data. The APKs from which the system call feature sequences are derived are mobile apps of different types released over the past 14 years (2010-2023), and they come from 10 mainstream app markets, including Google Play, PlayDrone, and Anzhi. The source code and the dataset are stored on two open platforms, GitHub and Zenodo, respectively.

Dataset

Release address: http://doi.org/10.5281/zenodo.7997398

  • The dataset consists of two classes, Normal and Malware, in a total of 21 zip files. The installation files behind each sequence come from 10 application markets, including Google Play, PlayDrone, and Anzhi. The Malware class also contains the runtime system call sequences of some APKs from the Drebin dataset.
  • RF, MLP and GBDT models were used to establish benchmarks for the dataset. How the models are used is shown in the source code.
  • The file `merge_all_csv_count_online_check_replenish.csv` holds the APK information for the dataset. For each APK we provide the APK name (the SHA256 name only), classification, APK size, number of sequences, log size, CVT and OVT values, check time, etc.
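As a first look at the APK information file, the following is a minimal sketch that lists its columns and counts rows per class. It assumes the file is a comma-separated CSV with a header row. The function name `summarize_metadata` and the `label_column` argument are hypothetical, and the actual column names are not documented here, so inspect the returned header before relying on any of them.

```python
import csv
from collections import Counter


def summarize_metadata(csv_path, label_column):
    """Return (column names, row count per label) for a metadata CSV.

    `label_column` is whichever header holds the Normal/Malware class;
    its real name is an assumption to be checked against the file.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        columns = list(reader.fieldnames or [])
        if label_column not in columns:
            raise KeyError(f"{label_column!r} not found in header: {columns}")
        # Count how many APK rows fall under each class label.
        counts = Counter(row[label_column] for row in reader)
    return columns, dict(counts)
```

A call such as `summarize_metadata("merge_all_csv_count_online_check_replenish.csv", "classification")` would return the header and the per-class counts, provided a column with that (assumed) name exists.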

Source Code

Release address: https://github.com/HNUSystemsLab/MaDroid

  • The released source code contains two folders, `Source_Code` and `ml_metadata`. `Source_Code` holds the automated data collection framework, the tool chain, and some notes on the structure of the source files. `ml_metadata` holds the metadata used for machine learning, that is, the partitioned data on which the article builds its benchmark.
  • The automation framework is described in detail in the `Readme.md` document in the `Source_Code` directory. The document has four main parts: environment requirements, program structure, quick start (working steps), and model training and evaluation. It covers the preparation of the environment, the data import method, the function of each file in the source code directory, the working principle of model training and evaluation, and other related topics.
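To give a rough idea of what a benchmark over the three model families named above (RF, MLP, GBDT) can look like, here is a minimal sketch using scikit-learn. It is not the authors' pipeline: the synthetic features from `make_classification` stand in for features extracted from system call sequences, and the hyperparameters, the 80/20 split, and the `benchmark` helper are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier


def benchmark(features, labels, seed=0):
    """Train RF, MLP and GBDT on one split and return their F1 scores."""
    # Stratified split keeps the Normal/Malware ratio equal in both parts.
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, stratify=labels, random_state=seed
    )
    models = {
        "RF": RandomForestClassifier(n_estimators=50, random_state=seed),
        "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                             random_state=seed),
        "GBDT": GradientBoostingClassifier(random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(x_train, y_train)
        scores[name] = f1_score(y_test, model.predict(x_test))
    return scores


# Synthetic binary data (assumption): 1 = malware, 0 = normal.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
print(benchmark(X, y))
```

For the actual training and evaluation procedure, the `Readme.md` in `Source_Code` and the partitioned data in `ml_metadata` remain the reference.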

Tips:

  • MAS is another name for MaDroid. The content shown here is the final version of `Readme.md`.
  • "A Large-scale Multi-feature Dataset for Anomaly Detection of Mobile Applications" was the name of the document during our experiment.

Files

License.md

Files (19.6 GB)

MD5 checksum and size of each file:

  • md5:661810777aac0f934f500c68511d1c2f, 225 Bytes
  • md5:5e60c059f097f740aa26bfc4c14d24e1, 779.7 MB
  • md5:cf64a82594737ef29f7201cbd5dd0633, 990.1 MB
  • md5:d2f49819499e4a87cf7e3e833ece151d, 912.2 MB
  • md5:735b73f2c4243b489b6ac084f0dabafd, 1.3 GB
  • md5:1aa485736bc8453974b6a9956c4e4c4f, 654.6 MB
  • md5:c18d40c7112e4cfcb420b7f87dcf8570, 32.6 MB
  • md5:e241b36c4639f3f6b4744756c3e86c73, 453.6 MB
  • md5:93e7cd3d0d971eec70459d9183c6fc61, 2.5 GB
  • md5:f3d05a1f7959c56154ec2b612ed0f410, 1.5 GB
  • md5:8fc89cf6efe077f9af557518381e755f, 284.8 MB
  • md5:5247fdbb3f2e0cd32a2ad498e4bbac6c, 486.3 MB
  • md5:8ec3146fa9f1c0d34eda03d073cf54aa, 13.1 MB
  • md5:a2cd0efe0611e1aa015d9f701aaf1026, 669.5 MB
  • md5:fd7d6219e194882caf5682679e56445c, 433.7 MB
  • md5:15cf0de8d55ba5f41c66daf540506ee3, 653.2 MB
  • md5:5458bc4d6cca0e3ff02a198df376e7c0, 1.1 GB
  • md5:514aba245a4e74ba5d14b7f196b67261, 627.7 MB
  • md5:2e27862d241e6842e35d016a98487004, 1.0 GB
  • md5:8caee3494a38297fe9fe0e118f0b678b, 2.4 GB
  • md5:7f8aaa57d2afb6bcff10e3ee2770cc25, 695.6 MB
  • md5:033b967eff2576ffe3fc0337e194d0b9, 1.4 GB
  • md5:f4afc8ca7b78b7028e9430ee7b9a95e1, 611.8 MB
  • md5:3745ead940334c4101791d415ef2d85b, 2.9 kB
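
Because the archives are large, it can be worth checking a download against its published MD5 value before unzipping. The following is a minimal sketch using Python's standard `hashlib`. The helper names `md5_of_file` and `matches_published_md5` are hypothetical, and the local file path is whatever name the archive was saved under.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for multi-GB archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, published):
    """Compare a local file with a published value such as 'md5:ab12...'."""
    expected = published.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A mismatch usually points to an incomplete or corrupted download, in which case the file should be fetched again.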

Additional details

Dates

Submitted
2023-10-31