Published February 2, 2024 | Version v1
Dataset Open

CESNET-TLS-Year22: A year-spanning TLS network traffic dataset from backbone lines

Description

We recommend using the CESNET DataZoo Python library, which simplifies working with large network traffic datasets. More information about the DataZoo project can be found in the GitHub repository https://github.com/CESNET/cesnet-datazoo.

The modern approach for network traffic classification (TC), which is an important part of operating and securing networks, is to use machine learning (ML) models that are able to learn intricate relationships between traffic characteristics and communicating applications. A crucial prerequisite is having representative datasets. However, datasets collected from real production networks are not being published in sufficient numbers. Thus, this paper presents a novel dataset, CESNET-TLS-Year22, that captures the evolution of TLS traffic in an ISP network over a year. The dataset contains 180 web service labels and standard TC features, such as packet sequences. The unique year-long time span enables comprehensive evaluation of TC models and assessment of their robustness in the face of the ever-changing environment of production networks.

Data description

The dataset consists of network flows describing encrypted TLS communications. Flows are extended with packet sequences, histograms, and fields extracted from the TLS ClientHello message, which is transmitted in the first packet of the TLS connection handshake. The most important extracted handshake field is the SNI domain, which is used for ground-truth labeling.

Packet sequences

Sequences of packet sizes, directions, and inter-packet times are standard data input for traffic analysis. For packet sizes, we consider the payload size after transport headers (TCP headers for the TLS case). We omit packets with no TCP payload, for example ACKs, because zero-payload packets are related to the transport layer internals rather than services’ behavior. Packet directions are encoded as ±1, where +1 means a packet sent from client to server, and -1 means a packet sent from server to client. Inter-packet times depend on the location of the communicating hosts, their distance, and the network conditions on the path. However, it is still possible to extract relevant information that correlates with user interactions and, for example, with the time required for an API/server/database to process the received data and generate a response. Packet sequences have a maximum length of 30, which is the default setting of the flow exporter used. We also derive three fields from each packet sequence: its length, time duration, and the number of roundtrips. The roundtrips are counted as the number of changes in the communication direction; in other words, each client request and server response pair counts as one roundtrip.
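The three derived fields can be illustrated with a minimal sketch. It is not the exporter's or the authors' implementation: the function name `derive_ppi_fields` is hypothetical, the duration is simply the sum of the inter-packet times in whatever unit the input uses, and a roundtrip is assumed to be one switch from a client packet (+1) to a server packet (-1), which is one reading of the "request and response pair" definition above.

```python
from typing import List, Tuple


def derive_ppi_fields(ipt: List[float], dirs: List[int]) -> Tuple[int, float, int]:
    """Return (length, duration, roundtrips) for one packet sequence.

    ipt  -- inter-packet times, one value per packet
    dirs -- packet directions, +1 client to server, -1 server to client
    """
    if len(ipt) != len(dirs):
        raise ValueError("sequence components must have equal length")

    length = len(dirs)

    # Duration is the sum of the gaps between packets, in the unit of `ipt`.
    duration = sum(ipt)

    # Assumption: one roundtrip per client-to-server packet that is
    # immediately followed by a server-to-client packet.
    roundtrips = sum(
        1 for prev, cur in zip(dirs, dirs[1:]) if prev == 1 and cur == -1
    )
    return length, duration, roundtrips


# Two request/response exchanges, four packets in total.
example = derive_ppi_fields([0, 10, 5, 20], [1, -1, 1, -1])
```

Counting every direction change instead (including server-to-client followed by client-to-server) would give a different number for the same sequence, so the values in the PPI_ROUNDTRIPS column should be treated as the reference.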

Flow statistics

Each data record also includes standard flow statistics, which aggregate information about the entire bidirectional connection. The fields are the number of transmitted bytes and packets in both directions, the duration of the flow, and packet histograms. The packet histograms include binned counts (not limited to the first 30 packets) of packet sizes and inter-packet times in both directions. There are eight bins with a logarithmic scale; the intervals are 0-15, 16-31, 32-63, 64-127, 128-255, 256-511, 512-1024, >1024 [ms or B]. The units are milliseconds for inter-packet times and bytes for packet sizes (more information is available in the PHISTS plugin documentation). Moreover, each flow has an end reason: it either ended with the TCP connection termination (FIN packets), was idle, reached the active timeout, or ended for other reasons. These reasons correspond to the official IANA IPFIX-specified values. The FLOW_ENDREASON_OTHER field covers the forced end and lack of resources reasons.
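The binning described above can be sketched as follows. This is only an illustration of the eight listed intervals, not the PHISTS plugin code; the names `UPPER_BOUNDS`, `bin_index`, and `histogram` are hypothetical, and integer inputs (bytes or milliseconds) are assumed.

```python
from bisect import bisect_left
from typing import Iterable, List

# Inclusive upper bounds of the first seven bins, as listed in the text:
# 0-15, 16-31, 32-63, 64-127, 128-255, 256-511, 512-1024; the eighth bin is >1024.
UPPER_BOUNDS = [15, 31, 63, 127, 255, 511, 1024]


def bin_index(value: int) -> int:
    """Return the 0-based index of the bin that `value` falls into."""
    return bisect_left(UPPER_BOUNDS, value)


def histogram(values: Iterable[int]) -> List[int]:
    """Count how many values fall into each of the eight bins."""
    counts = [0] * (len(UPPER_BOUNDS) + 1)
    for value in values:
        counts[bin_index(value)] += 1
    return counts


# Packet sizes in bytes (or inter-packet times in milliseconds).
sizes_hist = histogram([10, 20, 600, 1500])  # [1, 1, 0, 0, 0, 0, 1, 1]
```

The same function applies to both histogram types, since packet sizes and inter-packet times share the interval boundaries and differ only in unit.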

Dataset structure

The dataset is organized by week and by individual day. The flows are delivered in compressed CSV files. Each CSV file contains one flow per row; the data columns are summarized in the list below. For each flow data file, there is a JSON file with the total number of saved flows and the number of flows per service. There are also files aggregating flow counts for each week (stats-week.json) and for the entire dataset (stats-dataset.json). The following list describes the flow data fields in the CSV files:

  • ID: Unique identifier
  • SRC_IP: Source IP address
  • DST_IP: Destination IP address
  • DST_ASN: Destination Autonomous System number
  • SRC_PORT: Source port
  • DST_PORT: Destination port
  • PROTOCOL: Transport protocol
  • FLAG_CWR: Presence of the CWR flag
  • FLAG_CWR_REV: Presence of the CWR flag in the reverse direction
  • FLAG_ECE: Presence of the ECE flag
  • FLAG_ECE_REV: Presence of the ECE flag in the reverse direction
  • FLAG_URG: Presence of the URG flag
  • FLAG_URG_REV: Presence of the URG flag in the reverse direction
  • FLAG_ACK: Presence of the ACK flag
  • FLAG_ACK_REV: Presence of the ACK flag in the reverse direction
  • FLAG_PSH: Presence of the PSH flag
  • FLAG_PSH_REV: Presence of the PSH flag in the reverse direction
  • FLAG_RST: Presence of the RST flag
  • FLAG_RST_REV: Presence of the RST flag in the reverse direction
  • FLAG_SYN: Presence of the SYN flag
  • FLAG_SYN_REV: Presence of the SYN flag in the reverse direction
  • FLAG_FIN: Presence of the FIN flag
  • FLAG_FIN_REV: Presence of the FIN flag in the reverse direction
  • TLS_SNI: Server Name Indication domain
  • TLS_JA3: JA3 fingerprint of TLS client
  • TIME_FIRST: Timestamp of the first packet in format YYYY-MM-DDTHH-MM-SS.ffffff
  • TIME_LAST: Timestamp of the last packet in format YYYY-MM-DDTHH-MM-SS.ffffff
  • DURATION: Duration of the flow in seconds
  • BYTES: Number of transmitted bytes from client to server
  • BYTES_REV: Number of transmitted bytes from server to client
  • PACKETS: Number of packets transmitted from client to server
  • PACKETS_REV: Number of packets transmitted from server to client
  • PPI: Packet sequence in the format: [[inter-packet times], [packet directions], [packet sizes]]
  • PPI_LEN: Number of packets in the PPI sequence
  • PPI_DURATION: Duration of the PPI sequence in seconds
  • PPI_ROUNDTRIPS: Number of roundtrips in the PPI sequence
  • PHIST_SRC_SIZES: Histogram of packet sizes from client to server
  • PHIST_DST_SIZES: Histogram of packet sizes from server to client
  • PHIST_SRC_IPT: Histogram of inter-packet times from client to server
  • PHIST_DST_IPT: Histogram of inter-packet times from server to client
  • APP: Web service label
  • CATEGORY: Service category
  • FLOW_ENDREASON_IDLE: Flow was terminated because it was idle
  • FLOW_ENDREASON_ACTIVE: Flow was terminated because it reached the active timeout
  • FLOW_ENDREASON_END: Flow ended with the TCP connection termination
  • FLOW_ENDREASON_OTHER: Flow was terminated for other reasons 
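
As a rough illustration of working with these fields without the DataZoo library, the sketch below reads flow rows and unpacks the PPI column into its three parallel lists. Several things here are assumptions rather than facts about the dataset: the PPI cell is assumed to be a JSON-compatible nested list, the sample row and the service label `example-service` are invented, only a few of the columns are included, and the helper names `parse_ppi` and `read_flows` are hypothetical. The sample is read from memory, so decompression of the real files is left out.

```python
import csv
import io
import json
from typing import Dict, Iterator, List, TextIO, Tuple


def parse_ppi(cell: str) -> Tuple[List[float], List[int], List[int]]:
    """Split a PPI cell into (inter-packet times, directions, sizes)."""
    ipt, dirs, sizes = json.loads(cell)
    if not (len(ipt) == len(dirs) == len(sizes)):
        raise ValueError("PPI components must have equal length")
    return ipt, dirs, sizes


def read_flows(handle: TextIO) -> Iterator[Dict[str, object]]:
    """Yield one dictionary per flow row, with the PPI field unpacked."""
    for row in csv.DictReader(handle):
        ipt, dirs, sizes = parse_ppi(row["PPI"])
        yield {
            "id": row["ID"],
            "app": row["APP"],
            "ipt": ipt,
            "dirs": dirs,
            "sizes": sizes,
            "ppi_len": int(row["PPI_LEN"]),
        }


# Invented sample with a subset of the columns; real files have all fields.
SAMPLE_CSV = (
    "ID,APP,PPI,PPI_LEN\n"
    '1,example-service,"[[0, 12, 3], [1, -1, 1], [517, 1400, 64]]",3\n'
)

flows = list(read_flows(io.StringIO(SAMPLE_CSV)))
```

For the actual files, the in-memory handle would be replaced by a text-mode handle on the decompressed CSV; the compression format should be checked against the archive contents.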

Files

CESNET-TLS-Year22.zip

Files (30.5 GB)

Name Size
md5:d0dd7c84e2140bba362f6bd23de5cab7 30.5 GB
md5:7cff762d29224aa4f05f51a588a8e8fa 24.8 kB

Additional details

Funding

Flow-Based Encrypted Traffic Analysis VJ02010024
Ministry of the Interior

Dates

Collected
2022-01-01
Start of Collection
Collected
2022-12-31
End of Collection