Network Digital Twin-Generated Dataset for Machine Learning-based Detection of Benign and Malicious Heavy Hitter Flows

Karamchandani Batra, Amit; Nuñez Fuente, Javier; de la Cal García, Luis; Moreno Meneses, Yenny; Mozo Velasco, Alberto; Pastor Perales, Antonio; R. López, Diego

doi:10.5281/zenodo.14841650

Published July 11, 2024 | Version 1.1.0

Dataset Open


1. Universidad Politécnica de Madrid
2. Telefónica Innovación Digital (Spain)

Overview

This record provides a dataset created for the study presented in the publication cited below; it is publicly available for research purposes. The associated article describes the dataset, its structure, and the methodology used to create it. If you use this dataset, please cite the following article published in IEEE Communications Magazine:

A. Karamchandani, J. Nunez, L. de-la-Cal, Y. Moreno, A. Mozo, and A. Pastor, “On the Applicability of Network Digital Twins in Generating Synthetic Data for Heavy Hitter Discrimination,” IEEE Communications Magazine, pp. 2–8, 2025, DOI: 10.1109/MCOM.003.2400648.

More specifically, the record contains several synthetic datasets generated to differentiate between benign and malicious heavy hitter (HH) flows within a realistic virtualized network environment. HH flows, which include high-volume data transfers, can significantly impact network performance, leading to congestion and degraded quality of service. Distinguishing legitimate HH activity from malicious Distributed Denial-of-Service (DDoS) traffic is critical for network management and security, yet existing datasets lack the granularity needed to train machine learning models that make this distinction effectively.

To address this, a Network Digital Twin (NDT) approach was used to emulate realistic network conditions and traffic patterns, enabling automated generation of labeled data for benign heavy hitter flows, malicious heavy hitter flows, and regular traffic.

Feature Set:

The feature set includes the following flow statistics commonly used in the literature on network traffic classification:

The protocol used for the connection, identifying whether it is TCP, UDP, ICMP, or OSPF.
The time (relative to the connection start) of the most recent packet sent from source to destination at the time of each snapshot.
The time (relative to the connection start) of the most recent packet sent from destination to source at the time of each snapshot.
The cumulative count of data packets sent from source to destination at the time of each snapshot.
The cumulative count of data packets sent from destination to source at the time of each snapshot.
The cumulative bytes sent from source to destination at the time of each snapshot.
The cumulative bytes sent from destination to source at the time of each snapshot.
The time difference between the first packet sent from source to destination and the first packet sent from destination to source.
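
The snapshot features listed above can be illustrated with a minimal Python sketch. It is not the generation code used for the dataset: the `Packet` and `Snapshot` classes, their field names, and the `snapshot` function are hypothetical, the connection start is assumed to be the timestamp of the flow's first packet, and every packet is counted, whereas the description above refers to data packets.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Packet:
    ts: float      # absolute capture time in seconds
    size: int      # packet size in bytes
    forward: bool  # True for source -> destination, False for the reverse


@dataclass
class Snapshot:
    protocol: str
    last_fwd_time: Optional[float]      # relative to connection start
    last_bwd_time: Optional[float]      # relative to connection start
    fwd_packets: int
    bwd_packets: int
    fwd_bytes: int
    bwd_bytes: int
    first_reply_delay: Optional[float]  # first reverse packet minus first forward packet


def snapshot(protocol: str, packets: Iterable[Packet], at: float) -> Snapshot:
    """Summarise one flow using only the packets captured up to time `at`."""
    ordered = sorted(packets, key=lambda p: p.ts)
    if not ordered:
        raise ValueError("a flow needs at least one packet")
    start = ordered[0].ts  # assumption: connection start = first packet
    seen = [p for p in ordered if p.ts <= at]
    fwd = [p for p in seen if p.forward]
    bwd = [p for p in seen if not p.forward]
    delay = bwd[0].ts - fwd[0].ts if fwd and bwd else None
    return Snapshot(
        protocol=protocol,
        last_fwd_time=fwd[-1].ts - start if fwd else None,
        last_bwd_time=bwd[-1].ts - start if bwd else None,
        fwd_packets=len(fwd),
        bwd_packets=len(bwd),
        fwd_bytes=sum(p.size for p in fwd),
        bwd_bytes=sum(p.size for p in bwd),
        first_reply_delay=delay,
    )
```

Calling `snapshot` repeatedly with increasing values of `at` yields a series of cumulative records for the same flow, which is how the counters and byte totals grow from one snapshot to the next.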

Dataset Variations:

To accommodate diverse research needs and scenarios, the dataset is provided in the following variations:

All at Once:
1. Contains a synthetic dataset where all traffic types, including benign, normal, and malicious DDoS heavy hitter (HH) flows, are combined into a single dataset.
2. This version represents a holistic view of the traffic environment, simulating real-world scenarios where all traffic occurs simultaneously.
Balanced Traffic Generation:
1. Represents a balanced traffic dataset with an equal proportion of benign, normal, and malicious DDoS traffic.
2. Designed for scenarios where a balanced dataset is needed for fair training and evaluation of machine learning models.
DDoS at Intervals:
1. Contains traffic data where malicious DDoS HH traffic occurs at specific time intervals, mimicking real-world attack patterns.
2. Useful for studying the impact and detection of intermittent malicious activities.
Only Benign HH Traffic:
1. Includes only benign HH traffic flows.
2. Suitable for training and evaluating models to identify and differentiate benign heavy hitter traffic patterns.
Only DDoS Traffic:
1. Contains only malicious DDoS HH traffic.
2. Helps in isolating and analyzing attack characteristics for targeted threat detection.
Only Normal Traffic:
1. Comprises only regular, non-HH traffic flows.
2. Useful for understanding baseline network behavior in the absence of heavy hitters.
Unbalanced Traffic Generation:
1. Features an unbalanced dataset with varying proportions of benign, normal, and malicious traffic.
2. Simulates real-world scenarios where certain types of traffic dominate, providing insights into model performance in unbalanced conditions.

For each variation, the output of each packet aggregator is provided in its own folder.
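
Since the exact folder layout is not spelled out in this description, one way to inspect it after downloading is to count the files under each folder of an archive. The following is a minimal sketch using only the Python standard library; the function name `folder_counts` and the `depth` parameter are illustrative and make no assumption about the folder names inside the archives.

```python
import zipfile
from collections import Counter


def folder_counts(zip_path, depth=1):
    """Count files per folder prefix of the given depth inside a zip archive."""
    counts = Counter()
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip explicit directory entries
            parts = info.filename.split("/")
            # Files shallower than `depth` are grouped under "."
            key = "/".join(parts[:depth]) if len(parts) > depth else "."
            counts[key] += 1
    return dict(counts)
```

For example, `folder_counts("netflow_datasets.zip", depth=2)` would group files by their first two path components, assuming the archive has been downloaded to the working directory.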

Each variation was generated using the NDT approach to demonstrate its flexibility and ensure the reproducibility of our study's experiments, while also contributing to future research on network traffic patterns and the detection and classification of heavy hitter traffic flows. The dataset is designed to support research in network security, machine learning model development, and applications of digital twin technology.

Files

Files (367.5 MB)

Name	MD5	Size
netflow_datasets.zip	d4489c82e6e87bfc0cca77092f28dcd7	345.5 MB
tstat_datasets.zip	143cae2a7649248efb05e745d7bf23d6	22.0 MB
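
The MD5 checksums listed above can be used to confirm that a download completed without corruption. The sketch below relies only on Python's standard `hashlib` module; the helper names `md5_of` and `verify` are illustrative, and the archives are assumed to keep their original file names after download.

```python
import hashlib
from pathlib import Path

# Checksums as published in the file listing of this record.
EXPECTED_MD5 = {
    "netflow_datasets.zip": "d4489c82e6e87bfc0cca77092f28dcd7",
    "tstat_datasets.zip": "143cae2a7649248efb05e745d7bf23d6",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path):
    """Return True if the file's MD5 matches the published value for its name."""
    path = Path(path)
    return md5_of(path) == EXPECTED_MD5[path.name]
```

Reading in chunks keeps memory use low for the larger archive. `verify` raises `KeyError` for a file name that is not in the listing.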

Additional details

Funding

European Commission
ACROSS - Automated zero-touch cross-layer provisioning framework for 5G and beyond vertical services 101097122
Ministerio de Asuntos Económicos y Transformación Digital
B5GEMINI-AIUC TSI-063000-2021-79
Ministerio de Asuntos Económicos y Transformación Digital
B5GEMINI-INFRA TSI-063000-2021-81

Dates

Collected: 2024-07-11

Software

Repository URL: https://github.com/MMB-UPM/ndt_synthetic_data_hh
Programming language: Python, C, Shell
Development Status: Active

	All versions	This version
Views	238	152
Downloads	73	45
Data volume	9.3 GB	8.8 GB
