Dataset Open Access

Kyoushi Log Data Set

Landauer, Max; Frank, Maximilian; Skopik, Florian; Hotwagner, Wolfgang; Wurzenberger, Markus; Rauber, Andreas

This repository contains synthetic log data suitable for the evaluation of intrusion detection systems. The logs were collected from a testbed built at the Austrian Institute of Technology (AIT) following the approaches described in [1], [2], and [3]. Please refer to these papers for more detailed information on the dataset and cite them if the data is used for academic publications. Unlike the related AIT-LDSv1.1, this dataset involves a more complex network structure, uses a different attack scenario, and collects log data from multiple hosts in the network. In brief, the testbed simulates a small enterprise network including a mail server, file share, WordPress server, VPN, firewall, etc. Normal user behavior is simulated to generate background noise. After some days, two attack scenarios are launched against the network.

The archives have the following structure. The gather directory contains the raw log data from each host in the network, as well as the hosts' system configurations. The labels directory contains the ground truth for those log files that are labeled. The processing directory contains configurations for the labeling procedure, and the rules directory contains the labeling rules. Events related to the attacks are labeled with the Kyoushi Labeling Framework.
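As a rough illustration of this layout, the following Python sketch pairs each label file with its raw log file. It assumes an extracted archive and that a label file's path below labels mirrors the path of its log file below gather; the function name labeled_log_pairs and the dataset_root parameter are hypothetical and not part of the dataset or the Kyoushi tooling.

```python
from pathlib import Path


def labeled_log_pairs(dataset_root):
    """Return a list of (log_file, label_file) pairs found in an extracted archive."""
    root = Path(dataset_root)
    gather_dir = root / "gather"
    labels_dir = root / "labels"
    pairs = []
    for label_file in sorted(labels_dir.rglob("*")):
        if not label_file.is_file():
            continue
        # Assumption: the path below labels/ equals the log path below gather/.
        log_file = gather_dir / label_file.relative_to(labels_dir)
        if log_file.is_file():
            pairs.append((log_file, label_file))
    return pairs
```

Label files without a matching log file are skipped silently here; a stricter variant could report them instead.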

Each dataset contains traces of a specific attack scenario:

  • Scenario 1 (see gather/attacker_0/logs/sm.log for detailed attack log):
    • nmap scan
    • WPScan
    • dirb scan
    • webshell upload through wpDiscuz exploit (CVE-2020-24186)
    • privilege escalation
  • Scenario 2 (see gather/attacker_0/logs/dnsteal.log for detailed attack log):
    • DNSteal data exfiltration

The log data collected from the servers includes:

  • Apache access and error logs (labeled)
  • audit logs (labeled)
  • auth logs (labeled)
  • VPN logs (labeled)
  • DNS logs (labeled)
  • syslog
  • suricata logs
  • exim logs
  • horde logs
  • mail logs

Note that only log files from affected servers are labeled. Label files and the directories in which they are located have the same names as their corresponding log files in the gather directory. Labels are in JSON format and comprise the following attributes: line (line number in the corresponding log file), labels (list of labels assigned to that log line), and rules (names of the labeling rules matching that log line). Note that not all attack traces are labeled in all log files; please refer to the labeling rules if any labels are unclear.
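The label attributes described above can be read with a few lines of Python. This is a minimal sketch under two assumptions that should be checked against the actual files: each line of a label file holds one JSON object, and the line attribute counts from 1. The function names load_labels and annotate_log are hypothetical.

```python
import json


def load_labels(label_path):
    """Map line number -> (labels, rules) for one label file.

    Assumption: one JSON object per line with the keys "line", "labels", "rules".
    """
    labels_by_line = {}
    with open(label_path, encoding="utf-8") as handle:
        for raw in handle:
            raw = raw.strip()
            if not raw:
                continue  # skip blank lines
            entry = json.loads(raw)
            labels_by_line[entry["line"]] = (entry["labels"], entry["rules"])
    return labels_by_line


def annotate_log(log_path, labels_by_line):
    """Yield (line_number, text, labels); unlabeled lines get an empty list."""
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        # Assumption: line numbers in the label file start at 1.
        for number, text in enumerate(handle, start=1):
            labels, _rules = labels_by_line.get(number, ([], []))
            yield number, text.rstrip("\n"), labels
```

Lines that do not appear in the label file are treated as unlabeled, which matches the note that only attack-related events receive labels.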

[1] M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner and A. Rauber, "Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed," in IEEE Transactions on Reliability, vol. 70, no. 1, pp. 402-415, March 2021, doi: 10.1109/TR.2020.3031317.

[2] M. Landauer, M. Frank, F. Skopik, W. Hotwagner, M. Wurzenberger and A. Rauber, "A Framework for Automatic Labeling of Log Datasets from Model-driven Testbeds for HIDS Evaluation," under review.

[3] M. Frank, "Quality improvement of labels for model-driven benchmark data generation for intrusion detection systems," Master's Thesis, Vienna University of Technology, 2021.

Additionally funded by the FFG projects INDICAETING (868306) and DECEPT (873980).
Files (1.3 GB): two archives of 783.0 MB and 538.6 MB.