Dataset Open Access

# AIT Log Data Set V2.0

Landauer, Max; Skopik, Florian; Frank, Maximilian; Hotwagner, Wolfgang; Wurzenberger, Markus; Rauber, Andreas

### Citation Style Language JSON Export

{
"publisher": "Zenodo",
"DOI": "10.5281/zenodo.5789064",
"title": "AIT Log Data Set V2.0",
"issued": {
"date-parts": [
[
2022,
2,
24
]
]
},
"abstract": "<p><strong>AIT Log Data Sets</strong></p>\n\n<p>This repository contains synthetic log data suitable for evaluation of intrusion detection systems, federated learning, and alert aggregation. A detailed description of the dataset is available in [1]. The logs were collected from eight testbeds that were built at the Austrian Institute of Technology (AIT) following the approach by [2]. Please cite these papers if the data is used for academic publications.</p>\n\n<p>In brief, each of the datasets corresponds to a testbed representing a small enterprise network including mail server, file share, WordPress server, VPN, firewall, etc. Normal user behavior is simulated to generate background noise over a time span of 4-6 days. At some point, a sequence of attack steps are launched against the network. Log data is collected from all hosts and includes Apache access and error logs, authentication logs, DNS logs, VPN logs, audit logs, Suricata logs, network traffic packet captures, horde logs, exim logs, syslog, and system monitoring logs. Separate ground truth files are used to label events that are related to the attacks. Compared to the <a href=\"https://zenodo.org/record/4264796\">AIT-LDSv1.1</a>, a more complex network and diverse user behavior is simulated, and logs are collected from all hosts in the network. If you are only interested in network traffic analysis, we also provide the <a href=\"https://zenodo.org/record/6610489\">AIT-NDS</a> containing the labeled netflows of the testbed networks.</p>\n\n<p>The datasets in this repository have the following structure:</p>\n\n<ul>\n\t<li>The <em>gather </em>directory contains all logs collected from the testbed. Logs collected from each host are located in <em>gather/&lt;host_name&gt;/logs/</em>.</li>\n\t<li>The <em>labels </em>directory contains the ground truth of the dataset that indicates which events are related to attacks. 
The directory mirrors the structure of the gather directory so that each label file is located at the same path and has the same name as the corresponding log file. Each line in the label files references the log event corresponding to an attack by the line number counted from the beginning of the file (&quot;line&quot;), the labels assigned to the line that state the respective attack step (&quot;labels&quot;), and the labeling rules that assigned the labels (&quot;rules&quot;).</li>\n\t<li>The <em>processing</em> directory contains the source code that was used to generate the labels.</li>\n\t<li>The <em>rules</em> directory contains the labeling rules.</li>\n\t<li>The <em>environment</em> directory contains the source code that was used to deploy the testbed and run the simulation using the <a href=\"https://github.com/ait-aecid/kyoushi-environment\">Kyoushi Testbed Environment</a>.</li>\n\t<li>The <em>dataset.yml</em> file specifies the start and end time of the simulation.</li>\n</ul>\n\n<p>The following table summarizes relevant properties of the datasets:</p>\n\n<table align=\"center\">\n\t<thead>\n\t\t<tr>\n\t\t\t<th scope=\"row\">Dataset</th>\n\t\t\t<th scope=\"col\">Simulation time</th>\n\t\t\t<th scope=\"col\">Attack time</th>\n\t\t\t<th scope=\"col\">Exfiltration visible in DNS logs</th>\n\t\t\t<th scope=\"col\">Scan volume</th>\n\t\t\t<th scope=\"col\">Password cracking</th>\n\t\t\t<th scope=\"col\">Unpacked size</th>\n\t\t</tr>\n\t</thead>\n\t<tbody>\n\t\t<tr>\n\t\t\t<th scope=\"row\">fox</th>\n\t\t\t<td>2022-01-15 00:00 - 2022-01-20 00:00</td>\n\t\t\t<td>2022-01-18 11:59 - 2022-01-18 13:15</td>\n\t\t\t<td>Yes</td>\n\t\t\t<td>High</td>\n\t\t\t<td>Yes</td>\n\t\t\t<td>26 GB</td>\n\t\t</tr>\n\t\t<tr>\n\t\t\t<th scope=\"row\">harrison</th>\n\t\t\t<td>2022-02-04 00:00 - 2022-02-09 00:00</td>\n\t\t\t<td>2022-02-08 07:07 - 2022-02-08 08:38</td>\n\t\t\t<td>Yes</td>\n\t\t\t<td>High</td>\n\t\t\t<td>Yes</td>\n\t\t\t<td>27 GB</td>\n\t\t</tr>\n\t\t<tr>\n\t\t\t<th
scope=\"row\">russellmitchell</th>\n\t\t\t<td>2022-01-21 00:00 - 2022-01-25 00:00</td>\n\t\t\t<td>2022-01-24 03:01 - 2022-01-24 04:39</td>\n\t\t\t<td>Yes</td>\n\t\t\t<td>Low</td>\n\t\t\t<td>Yes</td>\n\t\t\t<td>14 GB</td>\n\t\t</tr>\n\t\t<tr>\n\t\t\t<th scope=\"row\">santos</th>\n\t\t\t<td>2022-01-14 00:00 - 2022-01-18 00:00</td>\n\t\t\t<td>2022-01-17 11:15 - 2022-01-17 11:59</td>\n\t\t\t<td>Yes</td>\n\t\t\t<td>Low</td>\n\t\t\t<td>Yes</td>\n\t\t\t<td>17 GB</td>\n\t\t</tr>\n\t\t<tr>\n\t\t\t<th scope=\"row\">shaw</th>\n\t\t\t<td>2022-01-25 00:00 - 2022-01-31 00:00</td>\n\t\t\t<td>2022-01-29 14:37 - 2022-01-29 15:21</td>\n\t\t\t<td>No</td>\n\t\t\t<td>Low</td>\n\t\t\t<td>Yes</td>\n\t\t\t<td>27 GB</td>\n\t\t</tr>\n\t\t<tr>\n\t\t\t<th scope=\"row\">wardbeck</th>\n\t\t\t<td>2022-01-19 00:00 - 2022-01-24 00:00</td>\n\t\t\t<td>2022-01-23 12:10 - 2022-01-23 12:56</td>\n\t\t\t<td>Yes</td>\n\t\t\t<td>Low</td>\n\t\t\t<td>Yes</td>\n\t\t\t<td>26 GB</td>\n\t\t</tr>\n\t\t<tr>\n\t\t\t<th scope=\"row\">wheeler</th>\n\t\t\t<td>2022-01-26 00:00 - 2022-01-31 00:00</td>\n\t\t\t<td>2022-01-30 07:35 - 2022-01-30 17:53</td>\n\t\t\t<td>Yes</td>\n\t\t\t<td>High</td>\n\t\t\t<td>No</td>\n\t\t\t<td>30 GB</td>\n\t\t</tr>\n\t\t<tr>\n\t\t\t<th scope=\"row\">wilson</th>\n\t\t\t<td>2022-02-03 00:00 - 2022-02-09 00:00</td>\n\t\t\t<td>2022-02-07 10:57 - 2022-02-07 11:49</td>\n\t\t\t<td>Yes</td>\n\t\t\t<td>High</td>\n\t\t\t<td>Yes</td>\n\t\t\t<td>39 GB</td>\n\t\t</tr>\n\t</tbody>\n</table>\n\n<p>The following attacks are launched in the network:</p>\n\n<ul>\n\t<li>Scans (nmap, WPScan, dirb)</li>\n\t<li>Webshell upload (CVE-2020-24186)</li>\n\t<li>Password cracking (John the Ripper)</li>\n\t<li>Privilege escalation</li>\n\t<li>Remote command execution</li>\n\t<li>Data exfiltration (DNSteal)</li>\n</ul>\n\n<p>Note that attack parameters and their execution orders vary in each dataset. 
Labeled log files are trimmed to the simulation time to ensure that their labels (which reference the related event by the line number in the file) are not misleading. Other log files, however, also contain log events generated before or after the simulation time and may therefore be affected by testbed setup or data collection. It is therefore recommended to only consider logs with timestamps within the simulation time for analysis.</p>\n\n<p>Besides the attack labels, a general overview of the exact times when specific attack steps are launched is available in <em>gather/attacker_0/logs/attacks.log</em>. An enumeration of all hosts and their IP addresses is listed in <em>processing/config/servers.yml</em>. Moreover, configurations of each host are provided in <em>gather/&lt;host_name&gt;/configs/</em> and <em>gather/&lt;host_name&gt;/facts.json</em>.</p>\n\n<p>Version history:</p>\n\n<ul>\n\t<li><a href=\"https://doi.org/10.5281/zenodo.3723082\">AIT-LDS-v1.x</a>: Four datasets, logs from single host, fine-grained audit logs, mail/CMS.</li>\n\t<li><a href=\"http://doi.org/10.5281/zenodo.5789064\">AIT-LDS-v2.0</a>: Eight datasets, logs from all hosts, system logs and network traffic, mail/CMS/cloud/web.</li>\n</ul>\n\n<p>Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU projects GUARD (833456) and PANDORA (SI2.835928).</p>\n\n<p>If you use the dataset, please cite the following publications:</p>\n\n<p>[1] M. Landauer, F. Skopik, M. Frank, W. Hotwagner, M. Wurzenberger, and A. Rauber. &quot;Maintainable Log Datasets for Evaluation of Intrusion Detection Systems&quot;. Under Review. <a href=\"https://arxiv.org/abs/2203.08580\">arXiv:2203.08580</a> [<a href=\"https://arxiv.org/pdf/2203.08580.pdf\">PDF</a>]</p>\n\n<p>[2]&nbsp;M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner and A.
Rauber, <a href=\"https://ieeexplore.ieee.org/document/9262078\">&quot;Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed,&quot;</a> in IEEE Transactions on Reliability, vol. 70, no. 1, pp. 402-415, March 2021, doi: 10.1109/TR.2020.3031317. [<a href=\"https://www.skopik.at/ait/2020_trel.pdf\">PDF</a>]</p>",
"author": [
{
"family": "Landauer, Max"
},
{
"family": "Skopik, Florian"
},
{
"family": "Frank, Maximilian"
},
{
"family": "Hotwagner, Wolfgang"
},
{
"family": "Wurzenberger, Markus"
},
{
"family": "Rauber, Andreas"
}
],
"note": "M. Landauer, F. Skopik, M. Frank, W. Hotwagner, M. Wurzenberger, and A. Rauber. \"Maintainable Log Datasets for Evaluation of Intrusion Detection Systems\". arXiv:2203.08580",
"version": "v2_0",
"type": "dataset",
"id": "5789064"
}
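
The abstract above says that each line of a label file references a log event by its line number ("line") and carries the assigned "labels" and "rules". A minimal sketch of joining labels to log lines follows. It assumes that every label line is a standalone JSON object and that line numbers start at 1. Neither detail is confirmed by this record, and the function names and the sample label value are hypothetical.

```python
import json


def parse_labels(label_lines):
    """Map line numbers to label lists.

    Assumption: each non-empty label line is one JSON object with the
    keys "line", "labels" and "rules" named in the dataset description.
    """
    labels = {}
    for raw in label_lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        entry = json.loads(raw)
        labels[entry["line"]] = entry.get("labels", [])
    return labels


def label_events(log_lines, label_lines):
    """Yield (line_number, event, labels) for every log line.

    Assumption: line numbers are 1-based, counted from the beginning of
    the log file. Lines without a label entry get an empty list.
    """
    labels = parse_labels(label_lines)
    for number, event in enumerate(log_lines, start=1):
        yield number, event.rstrip("\n"), labels.get(number, [])


if __name__ == "__main__":
    # Hypothetical in-memory sample; real inputs would be a log file under
    # gather/<host_name>/logs/ and the file at the same path under labels/.
    log = ["event one\n", "event two\n"]
    lab = ['{"line": 2, "labels": ["example_step"], "rules": {}}\n']
    for number, event, tags in label_events(log, lab):
        print(number, tags, event)
```

Both functions accept any iterable of lines, so open file handles can be passed directly, which avoids loading multi-gigabyte logs into memory.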