Network traffic datasets with novel extended IP flow called NetTiSA flow

Josef Koumar; Karel Hynek; Jaroslav Pešek; Tomáš Čejka

doi:10.5281/zenodo.8301043

Published August 30, 2023 | Version v1

Dataset Open

1. Czech Technical University in Prague
2. CESNET, a.l.e.

The datasets were created for the paper "NetTiSA: Extended IP Flow with Time-series Features for Universal Bandwidth-constrained High-speed Network Traffic Classification" by Josef Koumar, Karel Hynek, Jaroslav Pešek, and Tomáš Čejka, published in Computer Networks (The International Journal of Computer and Telecommunications Networking): https://doi.org/10.1016/j.comnet.2023.110147

Please cite the usage of our datasets as:

Josef Koumar, Karel Hynek, Jaroslav Pešek, Tomáš Čejka, "NetTiSA: Extended IP flow with time-series features for universal bandwidth-constrained high-speed network traffic classification", Computer Networks, Volume 240, 2024, 110147, ISSN 1389-1286
@article{KOUMAR2024110147,
title = {NetTiSA: Extended IP flow with time-series features for universal bandwidth-constrained high-speed network traffic classification},
journal = {Computer Networks},
volume = {240},
pages = {110147},
year = {2024},
issn = {1389-1286},
doi = {https://doi.org/10.1016/j.comnet.2023.110147},
url = {https://www.sciencedirect.com/science/article/pii/S1389128623005923},
author = {Josef Koumar and Karel Hynek and Jaroslav Pešek and Tomáš Čejka}
}

This Zenodo repository contains 23 datasets created from 15 well-known published datasets, which are cited in the table below. Each dataset contains the NetTiSA flow feature vector.

NetTiSA flow feature vector

The novel extended IP flow called NetTiSA (Network Time Series Analysed) flow contains a universal bandwidth-constrained feature vector consisting of 20 features. We divide the NetTiSA flow classification features into three groups according to how they are computed. The first group is based on classical bidirectional flow information: the number of transferred bytes and packets. The second group contains statistical and time-based features calculated by time-series analysis of the packet sequences. The third group can be computed from the previous two (i.e., on the flow collector) and improves classification performance without any impact on telemetry bandwidth.

Flow features

The flow features are:

Packets is the number of packets in the direction from the source to the destination IP address.
Packets in reverse order is the number of packets in the direction from the destination to the source IP address.
Bytes is the size of the payload in bytes transferred in the direction from the source to the destination IP address.
Bytes in reverse order is the size of the payload in bytes transferred in the direction from the destination to the source IP address.

Statistical and Time-based features

These features are exported in the extended part of the flow. All of them can be computed, exactly or approximately, in a stream-wise manner, which is necessary for keeping memory requirements low. This second group contains the following features:

Mean is the mean of the payload lengths of all packets in a flow
Min is the minimal value from payload lengths of all packets in a flow
Max is the maximum value from payload lengths of all packets in a flow
Standard deviation is a measure of the variation of payload lengths from the mean payload length
Root mean square is the measure of the magnitude of payload lengths of packets
Average dispersion is the average absolute difference between each payload length of the packet and the mean value
Kurtosis is the measure describing the extent to which the tails of a distribution differ from the tails of a normal distribution
Mean of relative times is the mean of the relative times, a sequence defined as \(st = \{t_1 - t_1, t_2 - t_1, \dots, t_n - t_1\}\)
Mean of time differences is the mean of the time differences, a sequence defined as \(dt = \{ t_j - t_i \mid j = i + 1,\ i \in \{1, 2, \dots, n - 1\} \}\)
Min from time differences is the minimal value from all time differences, i.e., min space between packets.
Max from time differences is the maximum value from all time differences, i.e., max space between packets.
Time distribution describes the deviation of time differences between individual packets within the time series. The feature is computed by the following equation:
\(tdist = \frac{ \frac{1}{n-1} \sum_{i=1}^{n-1} \left| \mu_{\{dt_{n-1}\}} - dt_i \right| }{ \frac{1}{2} \left( \max\left(\{dt_{n-1}\}\right) - \min\left(\{dt_{n-1}\}\right) \right) }\)
Switching ratio represents a value change ratio (switching) between payload lengths. The switching ratio is computed by the following equation:
\(sr = \frac{s_n}{\frac{1}{2} (n - 1)}\)

where \(s_n\) is the number of switches.
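
The definitions above can be sketched in Python as a batch computation over one flow. This is a minimal illustration, not the ipfixprobe implementation: the function name `nettisa_features` and the dictionary keys are hypothetical, the population form of the standard deviation is assumed, a "switch" is assumed to mean that two consecutive payload lengths differ, and a zero range of time differences is assumed to yield a time distribution of 0. Kurtosis is omitted because no formula is given here.

```python
from math import sqrt
from statistics import fmean, pstdev


def nettisa_features(payload_lengths, timestamps):
    """Batch sketch of several statistical and time-based features of one flow."""
    n = len(payload_lengths)
    if n < 2 or n != len(timestamps):
        raise ValueError("need at least two packets with matching timestamps")

    mean = fmean(payload_lengths)
    features = {
        "mean": mean,
        "min": min(payload_lengths),
        "max": max(payload_lengths),
        # Assumption: population standard deviation (the text does not say which).
        "std": pstdev(payload_lengths),
        "rms": sqrt(fmean([x * x for x in payload_lengths])),
        "average_dispersion": fmean([abs(x - mean) for x in payload_lengths]),
    }

    # st: times relative to the first packet; dt: gaps between consecutive packets.
    st = [t - timestamps[0] for t in timestamps]
    dt = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean_dt = fmean(dt)
    features["mean_relative_time"] = fmean(st)
    features["mean_time_diff"] = mean_dt
    features["min_time_diff"] = min(dt)
    features["max_time_diff"] = max(dt)

    # tdist: mean absolute deviation of dt divided by half of the dt range.
    half_range = 0.5 * (max(dt) - min(dt))
    mean_abs_dev = fmean([abs(mean_dt - d) for d in dt])
    # Assumption: report 0.0 when all gaps are equal (the formula is undefined).
    features["time_distribution"] = mean_abs_dev / half_range if half_range > 0 else 0.0

    # sr = s_n / ((n - 1) / 2); assumption: a switch is a change between
    # two consecutive payload lengths.
    switches = sum(1 for a, b in zip(payload_lengths, payload_lengths[1:]) if a != b)
    features["switching_ratio"] = switches / (0.5 * (n - 1))
    return features
```

Under the assumed definition of a switch, the switching ratio ranges from 0 (constant payload length) to 2 (every consecutive pair differs). A flow exporter would compute these values incrementally per packet instead of storing the whole sequence.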

Features computed at the collector
The third set contains features that are computed from the previous two groups prior to classification. Therefore, they do not influence the network telemetry size, and their computation does not put additional load on resource-constrained flow monitoring probes. The NetTiSA flow combined with this feature set is called the Enhanced NetTiSA flow and contains the following features:

Max minus min is the difference between the maximum and minimum payload lengths
Percent deviation is the dispersion of the average absolute difference to the mean value
Variance is the spread measure of the data from its mean
Burstiness is the degree of peakedness in the central part of the distribution
Coefficient of variation is a dimensionless quantity that compares the dispersion of a time series to its mean value and is often used to compare the variability of different time series that have different units of measurement
Directions describes the ratio of packet directions, computed as \(\frac{d_1}{d_1 + d_0}\), where \(d_1\) is the number of packets in the direction from the source to the destination IP address and \(d_0\) is the number of packets in the opposite direction. Both \(d_1\) and \(d_0\) are part of the classical bidirectional flow.
Duration is the duration of the flow
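
Because these features depend only on values already present in the exported record, the collector can derive them with a few arithmetic operations. The following Python sketch illustrates that idea under stated assumptions: the function `enhance` and all dictionary keys (`packets`, `packets_rev`, `time_first`, `time_last`, and so on) are hypothetical names, percent deviation is assumed to be the average dispersion divided by the mean, and duration is assumed to be the difference between the last and first packet timestamps. Burstiness is omitted because no formula is given here.

```python
def enhance(flow):
    """Derive collector-side features from one exported NetTiSA record (a dict)."""
    d1 = flow["packets"]      # packets from source to destination
    d0 = flow["packets_rev"]  # packets from destination to source
    mean = flow["mean"]
    total = d1 + d0
    return {
        "max_minus_min": flow["max"] - flow["min"],
        # Assumption: average dispersion relative to the mean payload length.
        "percent_deviation": flow["average_dispersion"] / mean if mean else 0.0,
        "variance": flow["std"] ** 2,
        "coefficient_of_variation": flow["std"] / mean if mean else 0.0,
        "directions": d1 / total if total else 0.0,
        # Assumption: duration is the last minus the first packet timestamp.
        "duration": flow["time_last"] - flow["time_first"],
    }
```

Flows whose mean payload length is zero are mapped to 0.0 for the ratio features; this guard is a design choice of the sketch, not something stated in the dataset description.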

The NetTiSA flow is implemented in the IP flow exporter ipfixprobe.

Description of dataset files

The following table describes each dataset file:

File name	Detection problem	Citation of the original raw dataset
botnet_binary.csv	Binary detection of botnet	S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014.
botnet_multiclass.csv	Multi-class classification of botnet	S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014.
decrypto_dataset_design.csv	Binary detection of cryptomining; the design part	Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022
decrypto_dataset_evaluation.csv	Binary detection of cryptomining; the evaluation part	Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022
dns_malware.csv	Binary detection of malware DNS	Samaneh Mahdavifar et al. Classifying Malicious Domains using DNS Traffic Analysis. In DASC/PiCom/CBDCom/CyberSciTech 2021, pages 60–67. IEEE, 2021.
doh_cic.csv	Binary detection of DoH	Mohammadreza MontazeriShatoori et al. Detection of doh tunnels using time-series classification of encrypted traffic. In DASC/PiCom/CBDCom/CyberSciTech 2020, pages 63–70. IEEE, 2020
doh_real_world.csv	Binary detection of DoH	Kamil Jeřábek et al. Collection of datasets with DNS over HTTPS traffic. Data in Brief, 42:108310, 2022
dos.csv	Binary detection of DoS	Nickolaos Koroniotis et al. Towards the development of realistic botnet dataset in the Internet of Things for network forensic analytics: Bot-IoT dataset. Future Gener. Comput. Syst., 100:779–796, 2019.
edge_iiot_binary.csv	Binary detection of IoT malware	Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022.
edge_iiot_multiclass.csv	Multi-class classification of IoT malware	Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022.
http_bruteforce.csv	Binary detection of HTTPS Brute Force	Jan Luxemburk et al. HTTPS Brute-force dataset with extended network flows, November 2020
ids_cic_binary.csv	Binary detection of intrusion in IDS	Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSP, 1:108–116, 2018.
ids_cic_multiclass.csv	Multi-class classification of intrusion in IDS	Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSP, 1:108–116, 2018.
unsw_binary.csv	Binary detection of intrusion in IDS	Nour Moustafa and Jill Slay. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In 2015 military communications and information systems conference (MilCIS), pages 1–6. IEEE, 2015.
unsw_multiclass.csv	Multi-class classification of intrusion in IDS	Nour Moustafa and Jill Slay. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In 2015 military communications and information systems conference (MilCIS), pages 1–6. IEEE, 2015.
iot_23.csv	Binary detection of IoT malware	Sebastian Garcia et al. IoT-23: A labeled dataset with malicious and benign IoT network traffic, January 2020. More details here https://www.stratosphereips.org/datasets-iot23
ton_iot_binary.csv	Binary detection of IoT malware	Nour Moustafa. A new distributed architecture for evaluating ai-based security systems at the edge: Network ton iot datasets. Sustainable Cities and Society, 72:102994, 2021
ton_iot_multiclass.csv	Multi-class classification of IoT malware	Nour Moustafa. A new distributed architecture for evaluating ai-based security systems at the edge: Network ton iot datasets. Sustainable Cities and Society, 72:102994, 2021
tor_binary.csv	Binary detection of TOR	Arash Habibi Lashkari et al. Characterization of Tor Traffic using Time based Features. In ICISSP 2017, pages 253–262. SciTePress, 2017.
tor_multiclass.csv	Multi-class classification of TOR	Arash Habibi Lashkari et al. Characterization of Tor Traffic using Time based Features. In ICISSP 2017, pages 253–262. SciTePress, 2017.
vpn_iscx_binary.csv	Binary detection of VPN	Gerard Draper-Gil et al. Characterization of Encrypted and VPN Traffic Using Time-related. In ICISSP, pages 407–414, 2016.
vpn_iscx_multiclass.csv	Multi-class classification of VPN	Gerard Draper-Gil et al. Characterization of Encrypted and VPN Traffic Using Time-related. In ICISSP, pages 407–414, 2016.
vpn_vnat_binary.csv	Binary detection of VPN	Steven Jorgensen et al. Extensible Machine Learning for Encrypted Network Traffic Application Labeling via Uncertainty Quantification. CoRR, abs/2205.05628, 2022
vpn_vnat_multiclass.csv	Multi-class classification of VPN	Steven Jorgensen et al. Extensible Machine Learning for Encrypted Network Traffic Application Labeling via Uncertainty Quantification. CoRR, abs/2205.05628, 2022

Notes

This research was funded by the Ministry of Interior of the Czech Republic, grant No. VJ02010024: Flow-Based Encrypted Traffic Analysis and also by the Grant Agency of the CTU in Prague, grant No. SGS23/207/OHK3/3T/18 funded by the MEYS of the Czech Republic.

Files

Files (12.6 GB)

Name	MD5	Size
botnet_binary.csv	a566319af00b219c32d05c8bcf51dd6f	95.9 MB
botnet_multiclass.csv	ace316a737c46787175318f008eb45a0	49.6 MB
decrypto_dataset_design.csv	b190a69a94b89c7ccdd6273843811995	555.6 MB
decrypto_dataset_evaluation.csv	bd005f5c3566f79f5d86f5578c9277f4	292.3 MB
dns_malware.csv	58120fa96fa784b44bc171639d6b8831	2.4 MB
doh_cic.csv	30492bdbf652b1ab87540c0ec3ad4e79	358.2 MB
doh_real_world.csv	038042ee23d78dd296769888725f0fe3	1.9 GB
dos.csv	3842426b13b5cef8d9ad4f1807ab6457	994.3 MB
edge_iiot_binary.csv	c238ace50cca7d2f190b8367bd4a392a	496.1 MB
edge_iiot_multiclass.csv	687a163e0cbb2dcfd9e704ddc2f15ef0	500.4 MB
http_bruteforce.csv	98a72dde775055e70cf76f509c0182ca	400.7 MB
ids_cic_binary.csv	6fd556eeb5d5fc8061d225c699c322bf	785.7 MB
ids_cic_multiclass.csv	86a6bb77c1901f15976f76b3f782d9b8	790.0 MB
iot_23.csv	f06953dcf2d7ad5a0e7174e5a9c96bf0	1.1 GB
ton_iot_binary.csv	7c6254b648c2c46b1de8507f9654d5f0	1.2 GB
ton_iot_mutliclass.csv	52c4b0e7936c2ccdad571646309723c5	1.2 GB
tor_binary.csv	67833b82e1d5de3f220730d4f399c7fb	13.6 MB
tor_multiclass.csv	c224b8595728edeefaa7d18cc4faf2d7	38.6 MB
unsw_binary.csv	798ed6ec63e8a61680c159c6a3f28c48	620.3 MB
unsw_multiclass.csv	ceef01fe35a8e0875d737afb49803a47	1.1 GB
vpn_iscx_binary.csv	ffad32292e4d86782878def03eaa6dc4	54.8 MB
vpn_iscx_multiclass.csv	397c5a294e9cef574bf5d32da0e61d2b	7.4 MB
vpn_vnat_binary.csv	bef30747b30f9d267b624ca1174e942d	13.3 MB
vpn_vnat_multiclass.csv	c699c8cd8be5e02af9efde6124294c5b	13.4 MB
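
Since several files exceed 1 GB, it may be worth checking a download against the MD5 checksum listed above before using it. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify` are hypothetical, and the file is read in chunks so that it never has to fit into memory at once.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Compare a file's MD5 digest with the checksum from the file listing."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, after downloading `botnet_binary.csv` into the current directory, `verify("botnet_binary.csv", "a566319af00b219c32d05c8bcf51dd6f")` should return `True` if the file arrived intact.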

Additional details

S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014.
Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022
Samaneh Mahdavifar et al. Classifying Malicious Domains using DNS Traffic Analysis. In DASC/PiCom/CBDCom/CyberSciTech 2021, pages 60–67. IEEE, 2021.
Mohammadreza MontazeriShatoori et al. Detection of doh tunnels using time-series classification of encrypted traffic. In DASC/PiCom/CBDCom/CyberSciTech 2020, pages 63–70. IEEE, 2020
Kamil Jeřábek et al. Collection of datasets with DNS over HTTPS traffic. Data in Brief, 42:108310, 2022
Nickolaos Koroniotis et al. Towards the development of realistic botnet dataset in the Internet of Things for network forensic analytics: Bot-IoT dataset. Future Gener. Comput. Syst., 100:779–796, 2019.
Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022.
Jan Luxemburk et al. HTTPS Brute-force dataset with extended network flows, November 2020
Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSP, 1:108–116, 2018.
Nour Moustafa and Jill Slay. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In 2015 military communications and information systems conference (MilCIS), pages 1–6. IEEE, 2015.
Sebastian Garcia et al. IoT-23: A labeled dataset with malicious and benign IoT network traffic, January 2020. More details here https://www.stratosphereips.org/datasets-iot23
Nour Moustafa. A new distributed architecture for evaluating ai-based security systems at the edge: Network ton iot datasets. Sustainable Cities and Society, 72:102994, 2021
Arash Habibi Lashkari et al. Characterization of Tor Traffic using Time based Features. In ICISSP 2017, pages 253–262. SciTePress, 2017.
Gerard Draper-Gil et al. Characterization of Encrypted and VPN Traffic Using Time-related. In ICISSP, pages 407–414, 2016.
Steven Jorgensen et al. Extensible Machine Learning for Encrypted Network Traffic Application Labeling via Uncertainty Quantification. CoRR, abs/2205.05628, 2022

