Dataset: Advanced Similarity Metrics for IP Flow Data Analytics

Petr, Ivo; Hrabáková, Jitka; Friedjungová, Magda; Vašata, Daniel; Hynek, Karel

doi:10.5281/zenodo.14035306

Published April 11, 2024 | Version v2

Dataset Open

1. Czech Technical University in Prague
2. CESNET

Analyzing encrypted traffic in computer networks is difficult because the transmitted content is largely hidden. Machine-learning techniques applied to data representing characteristics of traffic flows provide powerful tools for network monitoring and intrusion detection. Since real-world datasets are scarce, we present a novel traffic classification dataset with TLS traffic. The dataset contains three days (19-08-2022 to 21-08-2022) of anonymized communication on the CESNET3 ISP network, which is used by approximately half a million users daily.

Ethics statement

The privacy of the CESNET network users is a fundamental concern in our work, so we conducted our research with careful consideration. The indisputable advantages of real traffic generated by hundreds of thousands of people come with understandable privacy concerns. Thus, we used only automatic data processing with immediate data anonymization. We declare that we did not analyze or manually process non-anonymized data or perform any procedures that could allow us to track users or reveal their identities.

Data description

The dataset consists of network flows describing encrypted TLS communications. Flows are extended with packet sequences, histograms, and fields extracted from the TLS ClientHello message, which is transmitted in the first packet of the TLS connection handshake. The most important extracted handshake field is the SNI domain, which is used for ground-truth labeling.

Packet sequences

Sequences of packet sizes, directions, and inter-packet times are standard data input for traffic analysis. For packet sizes, we consider the payload size after transport headers (TCP headers for the TLS case). We omit packets with no TCP payload, for example ACKs, because zero-payload packets are related to the transport layer internals rather than to the behavior of services. Packet directions are encoded as ±1, where +1 means a packet sent from client to server, and -1 means a packet sent from server to client. Packet timing depends on the location of the communicating hosts, their distance, and the network conditions on the path. However, it is still possible to extract relevant information that correlates with user interactions and, for example, with the time required for an API/server/database to process the received data and generate a response. Packet sequences have a maximum length of 30, which is the default setting of the flow exporter used. We also derive three fields from each packet sequence: its length, timestamps, and TCP flags.
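The construction described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the flow exporter's implementation: the `Packet` class, the `build_sequences` function, and the choice of 0 as the first inter-packet time are assumptions made for this example.

```python
from dataclasses import dataclass

MAX_SEQ_LEN = 30  # maximum sequence length stated in the description


@dataclass
class Packet:
    """Hypothetical in-memory packet record (not a dataset field layout)."""
    timestamp: float    # seconds
    payload_size: int   # bytes of TCP payload
    from_client: bool   # True for client -> server


def build_sequences(packets):
    """Return (sizes, directions, inter_packet_times) for one flow."""
    # Zero-payload packets (e.g. pure ACKs) are omitted, then the
    # sequence is truncated to the first MAX_SEQ_LEN packets.
    kept = [p for p in packets if p.payload_size > 0][:MAX_SEQ_LEN]
    sizes = [p.payload_size for p in kept]
    # +1 = client to server, -1 = server to client.
    directions = [1 if p.from_client else -1 for p in kept]
    # Gap to the previous kept packet; 0.0 for the first one (assumption).
    ipt = [0.0] + [b.timestamp - a.timestamp for a, b in zip(kept, kept[1:])]
    return sizes, directions, ipt[:len(kept)]
```

For example, a flow with a 517-byte client packet, an empty ACK from the server, and a 1400-byte server packet yields sizes `[517, 1400]` and directions `[1, -1]`.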

Flow statistics

Each data record also includes standard flow statistics, representing aggregated information about the entire bidirectional connection. The fields are the number of transmitted bytes and packets in both directions, the duration of the flow, and packet histograms. The packet histograms include binned counts (not limited to the first 30 packets) of packet sizes and inter-packet times in both directions. There are eight bins on a logarithmic scale; the intervals are 0-15, 16-31, 32-63, 64-127, 128-255, 256-511, 512-1024, >1024. The units are milliseconds for inter-packet times and bytes for packet sizes (see the PHISTS plugin documentation for more information).
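To make the binning concrete, the following minimal sketch assigns values to the eight intervals exactly as they are listed above. It assumes non-negative integer inputs (milliseconds or bytes); the names `UPPER_EDGES`, `bin_index`, and `histogram` are introduced here for illustration and are not part of the PHISTS plugin.

```python
from bisect import bisect_left

# Inclusive upper edges of the first seven bins, as listed in the text.
# Anything above 1024 falls into the eighth bin.
UPPER_EDGES = [15, 31, 63, 127, 255, 511, 1024]


def bin_index(value):
    """Return the 0-based bin index (0..7) for a packet size or inter-packet time."""
    return bisect_left(UPPER_EDGES, value)


def histogram(values):
    """Count how many values fall into each of the eight bins."""
    counts = [0] * 8
    for v in values:
        counts[bin_index(v)] += 1
    return counts


print(histogram([0, 15, 16, 100, 1024, 1500]))
# → [2, 1, 0, 1, 0, 0, 1, 1]
```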

Dataset structure

The dataset is organized by day and hour. The flows are delivered in compressed CSV files, with one flow per row. The following list describes the flow data fields in the CSV files:

TCP_FLAGS: Logical OR of all TCP flags transmitted from client to server
TCP_FLAGS_REV: Logical OR of all TCP flags transmitted from server to client
TLS_SNI: Server Name Indication domain
TIME_FIRST: Timestamp of the first packet in format YYYY-MM-DDTHH-MM-SS.ffffff
TIME_LAST: Timestamp of the last packet in format YYYY-MM-DDTHH-MM-SS.ffffff
DURATION: Duration of the flow in seconds
BYTES: Number of transmitted bytes from client to server
BYTES_REV: Number of transmitted bytes from server to client
PACKETS: Number of packets transmitted from client to server
PACKETS_REV: Number of packets transmitted from server to client
PPI_PKT_DIRECTIONS: Directions of individual packets in PPI sequence
PPI_PKT_FLAGS: TCP flags of individual packets in PPI sequence
PPI_PKT_TIMES: Timestamps of individual packets in PPI sequence
PPI_PKT_LENGTHS: Lengths of individual packets in PPI sequence
S_PHISTS_SIZES: Histogram of packet sizes from client to server
D_PHISTS_SIZES: Histogram of packet sizes from server to client
S_PHISTS_IPT: Histogram of inter-packet times from client to server
D_PHISTS_IPT: Histogram of inter-packet times from server to client
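As an illustration of how the TCP_FLAGS and TCP_FLAGS_REV fields can be interpreted, the sketch below combines per-packet flag bytes with a bitwise OR and decodes the result into flag names. The bit positions follow the standard TCP header layout. That the CSV stores the value as a plain integer is an assumption, and the helper names `combine_flags` and `decode_tcp_flags` are hypothetical.

```python
from functools import reduce
from operator import or_

# Standard TCP header flag bits.
TCP_FLAG_BITS = {
    "FIN": 0x01, "SYN": 0x02, "RST": 0x04, "PSH": 0x08,
    "ACK": 0x10, "URG": 0x20, "ECE": 0x40, "CWR": 0x80,
}


def combine_flags(per_packet_flags):
    """Bitwise OR of the flag bytes of all packets in one direction."""
    return reduce(or_, per_packet_flags, 0)


def decode_tcp_flags(value):
    """List the names of the flags set in an OR-ed flags value."""
    return [name for name, bit in TCP_FLAG_BITS.items() if value & bit]


# SYN, ACK, PSH+ACK, FIN+ACK seen in one direction of a flow:
flags = combine_flags([0x02, 0x10, 0x18, 0x11])
print(flags, decode_tcp_flags(flags))
# → 27 ['FIN', 'SYN', 'PSH', 'ACK']
```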

The dataset also contains a service map in the form of a CSV file. The service map can be used to extract high-level labels from SNI domain names.

The directory tree of the dataset is:

.
├── 20220819
│   ├── flows.202208190000.csv
│   ├── flows.202208190100.csv
│   ├── ...
│   └── flows.202208192300.csv
├── 20220820
│   ├── flows.202208200000.csv
│   ├── flows.202208200100.csv
│   ├── ...
│   └── flows.202208202300.csv
└── 20220821
    ├── flows.202208210000.csv
    ├── flows.202208210100.csv
    ├── ...
    └── flows.202208212300.csv
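The naming scheme in the tree above is regular enough to generate the expected file paths programmatically. The following is a minimal sketch, assuming one file per hour for each of the three days and the `.csv` names exactly as shown in the tree (the files inside the archive may carry an additional compression suffix). The function name `hourly_flow_files` and the `root` argument are hypothetical.

```python
from datetime import datetime, timedelta
from pathlib import Path


def hourly_flow_files(root, first_day="20220819", days=3):
    """Return the expected hourly CSV paths under `root`, in time order."""
    start = datetime.strptime(first_day, "%Y%m%d")
    paths = []
    for hour in range(days * 24):
        ts = start + timedelta(hours=hour)
        day_dir = ts.strftime("%Y%m%d")                   # e.g. 20220819
        file_name = f"flows.{ts.strftime('%Y%m%d%H%M')}.csv"  # e.g. flows.202208190100.csv
        paths.append(Path(root) / day_dir / file_name)
    return paths


files = hourly_flow_files("data")
print(len(files), files[0].as_posix(), files[-1].as_posix())
```

Comparing this list against the extracted archive is a quick way to spot missing hours before loading the data.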

Files (2.8 GB)

Name	MD5	Size
advanced_similarity_flow_dataset.tar.gz	ffd0cd48a7a9d25cb86d79f9bb50c94b	2.8 GB
servicemap.csv	6b60b79fd72048354639a01ffc0f83ee	62.3 kB

