Published October 22, 2022 | Version DongTing 2022
Dataset Open

DongTing: A Large-scale Dataset for Anomaly Detection of the Linux Kernel

  • 1. dguoyun@hnu.edu.cn
  • 2. fuyuanzhi@hnu.edu.cn
  • 3. caiminjie@hnu.edu.cn
  • 4. haochen@hnu.edu.cn
  • 5. jhsun@hnu.edu.cn

Description

DongTing is the first large-scale dataset dedicated to Linux kernel anomaly detection. The dataset covers Linux kernels released in the last five years and includes a total of 18,966 well-labeled normal and attack sequences. The entire dataset is 85 GB in size (after decompression). The attack data covers 26 major kernel releases and contains a total of 12,116 system call sequences collected from running 17,855 bug-triggering programs. The normal data comes from 6,850 normal programs in four kernel regression test suites. We maintain the dataset on Zenodo and the source code on GitHub, and back up both on Baidu Netdisk.

Dataset

The dataset is stored at http://doi.org/10.5281/zenodo.6627050

  • The data includes `abnormal_data`, `normal_data`, `models`, `npz` and baseline data, with a total volume of nearly 87 GB, of which 85 GB is the abnormal and normal data (sizes measured after decompression).
  • The `Abnormal_data` directory contains 12,116 files of system call sequences covering 26 kernel releases, and the `Normal_data` directory contains 6,850 files of system call sequences collected from four regression test suites. All of them are raw sequences.
  • Three machine learning models, CNN/RNN, LSTM, and WaveNet (three sets of hyperparameters per model), together with the ECOD model (which has no hyperparameters), were selected for the evaluation of DongTing (DT). DT_abnormal, DT_normal, ADFA-LD, and PLAID are each used for training. The results of training on DT are stored in the directory `Models-DongTing`, and the results of training on ADFA-LD and PLAID are stored in the directory `Models-Comparison`.
  • The directory `npz` stores the encoded datasets of DongTing, ADFA-LD, and PLAID (sequence length varies from 8 to 4495), encoded according to syscall_64.tbl in Linux kernel 5.17, including the training, validation, and test sets.
  • The file `Baseline.xlsx` contains all the information about the DongTing dataset, which can be used for training machine learning models. For example, the whole dataset is randomly divided into three sets at a ratio of 80%:10%:10% (training:validation:test). The implementation of the dataset division can be found in the source code.
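
The encoding and the 80%:10%:10% division described in the list above can be illustrated with a minimal sketch. It is not the authors' implementation (that lives in the source code repository); the function names, the `-1` placeholder for unknown names, the fixed seed, and the choice to skip x32-only rows are all assumptions made for illustration. The table excerpt contains only a handful of rows in the `syscall_64.tbl` layout (`<number> <abi> <name> <entry point>`).

```python
import random

# Small excerpt in the syscall_64.tbl layout: <number> <abi> <name> <entry point>.
# The real table in the kernel tree has several hundred rows.
SYSCALL_TBL_EXCERPT = """\
# 64-bit system call numbers and entry vectors
0	common	read			sys_read
1	common	write			sys_write
2	common	open			sys_open
3	common	close			sys_close
12	common	brk			sys_brk
59	64	execve			sys_execve
"""


def parse_syscall_table(text):
    """Return a dict mapping syscall name -> number.

    Comment lines and blank lines are ignored. Rows with the "x32" ABI are
    skipped (an assumption) so they cannot overwrite the 64-bit numbers.
    """
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        number, abi, name = int(fields[0]), fields[1], fields[2]
        if abi == "x32":
            continue
        table[name] = number
    return table


def encode_sequence(names, table, unknown=-1):
    """Encode a list of syscall names as integers; unknown names map to `unknown`."""
    return [table.get(name, unknown) for name in names]


def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle `items` and cut them into training, validation, and test lists.

    The first two sizes are rounded down; the test set takes the remainder.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


table = parse_syscall_table(SYSCALL_TBL_EXCERPT)
encoded = encode_sequence(["execve", "brk", "open", "read", "close"], table)

# Split the indices of all 18,966 sequences rather than the sequences themselves.
train, val, test = split_dataset(range(18966))
```

Splitting indices instead of file contents keeps the sketch independent of how the raw sequence files are stored, which is not specified here.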

Source Code


The source code for dataset development is stored at https://github.com/HNUSystemsLab/DongTing and the following is a brief introduction.

  • The source code contains three folders, i.e., `Source Code Files`, `Documents` and `DB`, where `Documents` stores the detailed documents related to development, `DB` stores sample data, and `Source Code Files` stores the source code related to the development of our dataset.
  • A detailed description of the source code can be found in `Documents/Documentation.pdf`. The document consists of four parts: environment requirements, database, program structure and working steps, and model training and evaluation. It details the preparation of the environment, the data import method, the function of each file in the source code directory, how model training and evaluation work, and other related content.

We additionally maintain the dataset and source code on Baidu Netdisk (https://pan.baidu.com/s/1vu1WGZpf2DqMIoyGayNu3w?pwd=dtds) to facilitate access from China.

 

Tips: 

If you find DongTing useful for your research, please cite the article as "DongTing: A large-scale dataset for anomaly detection of the Linux kernel".


@article{DUAN2023111745,
title = {DongTing: A large-scale dataset for anomaly detection of the Linux kernel},
journal = {Journal of Systems and Software},
volume = {203},
pages = {111745},
year = {2023},
issn = {0164-1212},
doi = {https://doi.org/10.1016/j.jss.2023.111745},
url = {https://www.sciencedirect.com/science/article/pii/S0164121223001401},
author = {Guoyun Duan and Yuanzhi Fu and Minjie Cai and Hao Chen and Jianhua Sun}
}

Files

Abnormal_data.zip

Files (4.5 GB)

MD5 checksum    Size
md5:633f97d1b9f67f60b0d00e117ff63df7    2.8 GB
md5:eb39b0f0c2650ec7e276b38ce4ffa172    1.3 MB
md5:6908197a0b6da68807ff402e1f6aaeca    230 Bytes
md5:26b3c27f9ff2ab2e3492e8e770da691d    866.5 MB
md5:c02a6d91e08210d8029a0294c10e8100    845.6 MB
md5:9cbc827df4e14e7c663ffce5494ce740    7.7 MB
md5:83fdd623d1c047e258857ff1e477604e    4.1 MB
md5:dfe043a566c5f534de70f04a7d80922b    3.4 kB
md5:d7322996e50b127bb2f94f2e2dfe7e92    14.8 kB
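
Because each file in the listing carries an MD5 checksum, a download can be checked before it is decompressed. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_listing` are hypothetical, and which checksum belongs to which file has to be read from the Zenodo record itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a checksum written as in the listing,
    with or without the leading "md5:" prefix (case-insensitive)."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

Calling `matches_listing(path, checksum)` with the local path of a downloaded file and its checksum string from the record returns `True` only when the download is intact.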