Published April 11, 2024 | Version v2

Dataset: Advanced Similarity Metrics for IP Flow Data Analytics

Description

Analyzing encrypted traffic in computer networks is difficult because visibility into the transmitted content is reduced. Machine-learning techniques applied to data representing characteristics of traffic flows provide powerful tools for network monitoring and intrusion detection. Since real-world datasets are scarce, we present a novel traffic classification dataset of TLS traffic. The dataset contains three days (19-08-2022 to 21-08-2022) of anonymized communication on the CESNET3 ISP network, which is used by approximately half a million users daily.

Ethics statement The privacy of the CESNET network users is a fundamental concern in our work, leading us to conduct our research with careful consideration. The indisputable advantages of real traffic generated by hundreds of thousands of people come with understandable privacy concerns. Thus, we used only automatic data processing with immediate data anonymization. We declare that we did not analyze or manually process non-anonymized data, nor did we perform any procedures that could allow us to track users or reveal their identities.

Data description The dataset consists of network flows describing encrypted TLS communications. Flows are extended with packet sequences, histograms, and fields extracted from the TLS ClientHello message, which is transmitted in the first packet of the TLS connection handshake. The most important extracted handshake field is the SNI domain, which is used for ground-truth labeling. 

Packet Sequences Sequences of packet sizes, directions, and inter-packet times are standard data input for traffic analysis. For packet sizes, we consider the payload size after transport headers (TCP headers for the TLS case). We omit packets with no TCP payload, for example ACKs, because zero-payload packets relate to transport-layer internals rather than to the behavior of services. Packet directions are encoded as ±1, where +1 means a packet sent from client to server and -1 means a packet sent from server to client. Packet timing depends on the location of the communicating hosts, their distance, and the network conditions on the path. However, it is still possible to extract relevant information that correlates with user interactions and, for example, with the time required for an API/server/database to process the received data and generate a response. Packet sequences have a maximum length of 30, which is the default setting of the flow exporter used. We also derive three fields from each packet sequence: its length, timestamps, and TCP flags.
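The encoding described above can be illustrated with a minimal Python sketch. It assumes the sequences have already been parsed into Python lists; the on-disk encoding of the sequence fields is not described here, and the function names `split_by_direction` and `inter_packet_times` are hypothetical helpers introduced only for illustration.

```python
from typing import List, Tuple

MAX_PPI_LEN = 30  # maximum sequence length stated in the dataset description


def split_by_direction(lengths: List[int], directions: List[int]) -> Tuple[List[int], List[int]]:
    """Split payload sizes into client-to-server (+1) and server-to-client (-1) lists."""
    if len(lengths) != len(directions):
        raise ValueError("lengths and directions must have the same number of entries")
    if len(lengths) > MAX_PPI_LEN:
        raise ValueError("sequence is longer than the stated limit of 30 packets")
    to_server = [n for n, d in zip(lengths, directions) if d == 1]
    to_client = [n for n, d in zip(lengths, directions) if d == -1]
    return to_server, to_client


def inter_packet_times(timestamps: List[float]) -> List[float]:
    """Differences between consecutive packet timestamps (same unit as the input)."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
```

For example, `split_by_direction([517, 1400, 1400, 93], [1, -1, -1, 1])` separates the two client payloads from the two server payloads, which is a common first step before computing per-direction features.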

Flow statistics Each data record also includes standard flow statistics, representing aggregated information about the entire bidirectional connection. The fields are the number of transmitted bytes and packets in both directions, the duration of the flow, and packet histograms. The packet histograms include binned counts (not limited to the first 30 packets) of packet sizes and inter-packet times in both directions. There are eight bins with a logarithmic scale; the intervals are 0-15, 16-31, 32-63, 64-127, 128-255, 256-511, 512-1024, >1024 [ms or B]. The units are milliseconds for inter-packet times and bytes for packet sizes. More information is available in the PHISTS plugin documentation.
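The binning scheme can be sketched in a few lines of Python. This is an illustration of the intervals as listed above, not the exporter's actual implementation; the names `UPPER_EDGES`, `bin_index`, and `histogram` are hypothetical. Fractional values between two listed intervals (for example 15.5 ms) are assigned to the higher bin here, which is an assumption.

```python
from bisect import bisect_left
from typing import Iterable, List

# Inclusive upper edges of the first seven bins, as listed in the description.
# The eighth bin collects every value above 1024.
UPPER_EDGES = [15, 31, 63, 127, 255, 511, 1024]


def bin_index(value: float) -> int:
    """Return the 0-based bin (0..7) for a packet size in bytes or a gap in ms."""
    if value < 0:
        raise ValueError("sizes and inter-packet times cannot be negative")
    # bisect_left finds the first upper edge that is >= value.
    return bisect_left(UPPER_EDGES, value)


def histogram(values: Iterable[float]) -> List[int]:
    """Count values per bin, giving an eight-element list."""
    counts = [0] * (len(UPPER_EDGES) + 1)
    for value in values:
        counts[bin_index(value)] += 1
    return counts
```

With these edges, a 1024-byte packet falls into the seventh bin (512-1024) and a 1025-byte packet into the last one, matching the intervals as written.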

Dataset structure The dataset is organized by day and hour. The flows are delivered in compressed CSV files, with one flow per row. The following list describes the flow data fields in the CSV files:

  • TCP_FLAGS: Logical OR of all TCP flags transmitted from client to server
  • TCP_FLAGS_REV: Logical OR of all TCP flags transmitted from server to client
  • TLS_SNI: Server Name Indication domain
  • TIME_FIRST: Timestamp of the first packet in format YYYY-MM-DDTHH-MM-SS.ffffff
  • TIME_LAST: Timestamp of the last packet in format YYYY-MM-DDTHH-MM-SS.ffffff
  • DURATION: Duration of the flow in seconds
  • BYTES: Number of transmitted bytes from client to server
  • BYTES_REV: Number of transmitted bytes from server to client
  • PACKETS: Number of packets transmitted from client to server
  • PACKETS_REV: Number of packets transmitted from server to client
  • PPI_PKT_DIRECTIONS: Directions of individual packets in PPI sequence
  • PPI_PKT_FLAGS: TCP flags of individual packets in PPI sequence
  • PPI_PKT_TIMES: Timestamps of individual packets in PPI sequence
  • PPI_PKT_LENGTHS: Lengths of individual packets in PPI sequence
  • S_PHISTS_SIZES: Histogram of packet sizes from client to server
  • D_PHISTS_SIZES: Histogram of packet sizes from server to client
  • S_PHISTS_IPT: Histogram of inter-packet times from client to server
  • D_PHISTS_IPT: Histogram of inter-packet times from server to client
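As a rough illustration of how two of these fields might be interpreted, the sketch below decodes an OR-ed flag value and computes a duration from the two timestamps. It assumes that TCP_FLAGS and TCP_FLAGS_REV are stored as integers using the standard TCP header bit positions, and that the timestamps follow the format exactly as written in the list; both are assumptions to check against the actual files. The helper names are hypothetical.

```python
from datetime import datetime
from typing import List

# Standard TCP header flag bits, least significant first.
TCP_FLAG_BITS = [
    (0x01, "FIN"), (0x02, "SYN"), (0x04, "RST"), (0x08, "PSH"),
    (0x10, "ACK"), (0x20, "URG"), (0x40, "ECE"), (0x80, "CWR"),
]

# Format string mirroring YYYY-MM-DDTHH-MM-SS.ffffff from the field list.
TIME_FORMAT = "%Y-%m-%dT%H-%M-%S.%f"


def decode_tcp_flags(mask: int) -> List[str]:
    """List the names of all flags set in an OR-ed flag value (assumed integer)."""
    return [name for bit, name in TCP_FLAG_BITS if mask & bit]


def flow_duration_seconds(time_first: str, time_last: str) -> float:
    """Seconds between TIME_FIRST and TIME_LAST, for comparison with DURATION."""
    first = datetime.strptime(time_first, TIME_FORMAT)
    last = datetime.strptime(time_last, TIME_FORMAT)
    return (last - first).total_seconds()
```

A value of 27 (0x1B), for instance, would mean that FIN, SYN, PSH, and ACK each appeared at least once in that direction.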

 

The dataset also contains a service map in the form of a CSV file. The service map can be used to extract high-level labels from SNI domain names. 
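One possible way to apply such a map is sketched below. The column layout of the service map is not described here, so the sketch assumes it has already been loaded into a plain dictionary from domain to label. Matching the SNI against parent domains is also an assumption; the actual map may expect exact matches or another rule. The function name `label_for_sni` and the example entries are hypothetical.

```python
from typing import Dict, Optional


def label_for_sni(sni: str, service_map: Dict[str, str]) -> Optional[str]:
    """Return the label of the most specific map entry that equals the SNI
    or is a parent domain of it, or None when nothing matches."""
    parts = sni.lower().rstrip(".").split(".")
    # Try the full name first, then progressively shorter parent domains.
    for start in range(len(parts)):
        candidate = ".".join(parts[start:])
        if candidate in service_map:
            return service_map[candidate]
    return None


# Hypothetical entries, for illustration only.
example_map = {"example.com": "example-service", "cdn.example.com": "example-cdn"}
print(label_for_sni("img.cdn.example.com", example_map))
```

Trying the longest candidate first means that a more specific entry takes precedence over a broader parent domain.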

 

The directory tree of the dataset is:

.
├── 20220819
│   ├── flows.202208190000.csv
│   ├── flows.202208190100.csv
│   ├── ...
│   └── flows.202208192300.csv
├── 20220820
│   ├── flows.202208200000.csv
│   ├── flows.202208200100.csv
│   ├── ...
│   └── flows.202208202300.csv
└── 20220821
    ├── flows.202208210000.csv
    ├── flows.202208210100.csv
    ├── ...
    └── flows.202208212300.csv
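The naming pattern in the tree can be generated programmatically, for example to iterate over the files in time order. The sketch below assumes one file for every hour of each day, which the elided entries suggest but do not confirm, and it reproduces the names as shown in the tree; compressed files may carry an additional extension. The function name `hourly_flow_paths` is hypothetical.

```python
from datetime import datetime, timedelta
from pathlib import PurePosixPath
from typing import List


def hourly_flow_paths(first_day: str = "20220819", days: int = 3) -> List[PurePosixPath]:
    """Build relative paths of the form YYYYMMDD/flows.YYYYMMDDHH00.csv."""
    start = datetime.strptime(first_day, "%Y%m%d")
    paths = []
    for hour in range(days * 24):
        moment = start + timedelta(hours=hour)
        directory = PurePosixPath(moment.strftime("%Y%m%d"))
        paths.append(directory / moment.strftime("flows.%Y%m%d%H00.csv"))
    return paths
```

Checking each generated path for existence before reading it guards against hours that are missing from the dataset.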

Files

servicemap.csv

Files (2.8 GB)

Size      Checksum
2.8 GB    md5:ffd0cd48a7a9d25cb86d79f9bb50c94b
62.3 kB   md5:6b60b79fd72048354639a01ffc0f83ee