Network Digital Twin-Generated Dataset for Machine Learning-based Detection of Benign and Malicious Heavy Hitter Flows
Description
Overview
This record provides a dataset created as part of the study presented in the following publication and is made publicly available for research purposes. The associated article provides a comprehensive description of the dataset, its structure, and the methodology used in its creation. If you use this dataset, please cite the following article published in the journal IEEE Communications Magazine:
A. Karamchandani, J. Nunez, L. de-la-Cal, Y. Moreno, A. Mozo, and A. Pastor, “On the Applicability of Network Digital Twins in Generating Synthetic Data for Heavy Hitter Discrimination,” IEEE Communications Magazine, pp. 2–8, 2025, DOI: 10.1109/MCOM.003.2400648.
More specifically, the record contains several synthetic datasets generated to differentiate between benign and malicious heavy hitter (HH) flows within a realistic virtualized network environment. HH flows, which include high-volume data transfers, can significantly impact network performance, leading to congestion and degraded quality of service. Distinguishing legitimate HH activity from malicious Distributed Denial-of-Service (DDoS) traffic is critical for network management and security, yet existing datasets lack the granularity needed to train machine learning models that make this distinction effectively.
To address this, a Network Digital Twin (NDT) approach was used to emulate realistic network conditions and traffic patterns, enabling automated generation of labeled data for both benign and malicious HH flows alongside regular traffic.
Feature Set:
The feature set includes the following flow statistics commonly used in the literature on network traffic classification:
- The protocol used for the connection, identifying whether it is TCP, UDP, ICMP, or OSPF.
- The time (relative to the connection start) of the most recent packet sent from source to destination at the time of each snapshot.
- The time (relative to the connection start) of the most recent packet sent from destination to source at the time of each snapshot.
- The cumulative count of data packets sent from source to destination at the time of each snapshot.
- The cumulative count of data packets sent from destination to source at the time of each snapshot.
- The cumulative bytes sent from source to destination at the time of each snapshot.
- The cumulative bytes sent from destination to source at the time of each snapshot.
- The time difference between the first packet sent from source to destination and the first packet sent from destination to source.
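As a rough illustration of how the statistics listed above relate to raw packets, the following sketch computes them for a single flow at one snapshot time. The `Packet` record, its field names, the output keys, and the sign convention for the first-packet time difference are all assumptions made for this example; they are not taken from the dataset or from the packet aggregators used in the study. The sketch also counts every packet, which may differ from how "data packets" are counted in the actual files.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet record; field names are illustrative only."""
    timestamp: float  # capture time in seconds
    forward: bool     # True: source -> destination, False: destination -> source
    size: int         # packet size in bytes


def _last_relative(pkts: List[Packet], start: float) -> Optional[float]:
    """Time of the most recent packet relative to the flow start, if any."""
    return pkts[-1].timestamp - start if pkts else None


def flow_snapshot(packets: Iterable[Packet], protocol: str, snapshot_time: float) -> dict:
    """Compute cumulative per-direction statistics for one flow at a snapshot."""
    # Only packets observed up to the snapshot contribute to the statistics.
    seen = sorted(
        (p for p in packets if p.timestamp <= snapshot_time),
        key=lambda p: p.timestamp,
    )
    if not seen:
        raise ValueError("no packets observed before the snapshot time")

    start = seen[0].timestamp
    fwd = [p for p in seen if p.forward]
    bwd = [p for p in seen if not p.forward]

    # Assumed sign convention: first reverse packet minus first forward packet.
    first_gap = bwd[0].timestamp - fwd[0].timestamp if fwd and bwd else None

    return {
        "protocol": protocol,
        "last_time_fwd": _last_relative(fwd, start),
        "last_time_bwd": _last_relative(bwd, start),
        "packets_fwd": len(fwd),
        "packets_bwd": len(bwd),
        "bytes_fwd": sum(p.size for p in fwd),
        "bytes_bwd": sum(p.size for p in bwd),
        "first_packet_gap": first_gap,
    }
```

Calling `flow_snapshot` repeatedly with increasing `snapshot_time` values yields one row per snapshot, which mirrors the cumulative, per-snapshot nature of the features described above.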
Dataset Variations:
To accommodate diverse research needs and scenarios, the dataset is provided in the following variations:
- All at Once:
  - Contains a synthetic dataset where all traffic types, including benign, normal, and malicious DDoS heavy hitter (HH) flows, are combined into a single dataset.
  - This version represents a holistic view of the traffic environment, simulating real-world scenarios where all traffic occurs simultaneously.
- Balanced Traffic Generation:
  - Represents a balanced traffic dataset with an equal proportion of benign, normal, and malicious DDoS traffic.
  - Designed for scenarios where a balanced dataset is needed for fair training and evaluation of machine learning models.
- DDoS at Intervals:
  - Contains traffic data where malicious DDoS HH traffic occurs at specific time intervals, mimicking real-world attack patterns.
  - Useful for studying the impact and detection of intermittent malicious activities.
- Only Benign HH Traffic:
  - Includes only benign HH traffic flows.
  - Suitable for training and evaluating models to identify and differentiate benign heavy hitter traffic patterns.
- Only DDoS Traffic:
  - Contains only malicious DDoS HH traffic.
  - Helps in isolating and analyzing attack characteristics for targeted threat detection.
- Only Normal Traffic:
  - Comprises only regular, non-HH traffic flows.
  - Useful for understanding baseline network behavior in the absence of heavy hitters.
- Unbalanced Traffic Generation:
  - Features an unbalanced dataset with varying proportions of benign, normal, and malicious traffic.
  - Simulates real-world scenarios where certain types of traffic dominate, providing insights into model performance in unbalanced conditions.
For each variation, the output of each packet aggregator is provided in its own folder.
Each variation was generated using the NDT approach to demonstrate its flexibility and ensure the reproducibility of our study's experiments, while also contributing to future research on network traffic patterns and the detection and classification of heavy hitter traffic flows. The dataset is designed to support research in network security, machine learning model development, and applications of digital twin technology.
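When training on the unbalanced variation, one common way to compensate for skewed class proportions is to weight each class inversely to its frequency. The sketch below shows that general heuristic (weight = N / (K · n_c), for N samples, K classes, and n_c samples in class c). It is not the method used in the associated article, and the label strings are hypothetical placeholders rather than the labels stored in the dataset.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Inverse-frequency class weights: n_samples / (n_classes * class_count)."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical label names and proportions, for illustration only.
    example = ["normal"] * 6 + ["benign_hh"] * 3 + ["ddos_hh"] * 1
    for cls, weight in balanced_class_weights(example).items():
        print(cls, round(weight, 3))
```

With equal class proportions, as in the balanced variation, every weight evaluates to 1.0, so the same training code can be reused across variations.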
Files
netflow_datasets.zip
Total size: 367.5 MB
MD5 checksum | Size |
---|---|
md5:d4489c82e6e87bfc0cca77092f28dcd7 | 345.5 MB |
md5:143cae2a7649248efb05e745d7bf23d6 | 22.0 MB |
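The MD5 checksums listed for the files can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib` module; the helper names and the local file path in the usage note are placeholders chosen for this example.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Reading in fixed-size chunks avoids loading a large archive into memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_checksum("path/to/downloaded_file", "md5:<value from the table>")` returns `True` only when the local file's digest equals the listed value.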
Additional details
Funding
- European Commission
  - ACROSS - Automated zero-touch cross-layer provisioning framework for 5G and beyond vertical services 101097122
- Ministerio de Asuntos Económicos y Transformación Digital
  - B5GEMINI-AIUC TSI-063000-2021-79
- Ministerio de Asuntos Económicos y Transformación Digital
  - B5GEMINI-INFRA TSI-063000-2021-81
Dates
- Collected: 2024-07-11
Software
- Repository URL: https://github.com/MMB-UPM/ndt_synthetic_data_hh
- Programming language: Python, C, Shell
- Development Status: Active