There is a newer version of the record available.

Published May 23, 2023 | Version v2
Dataset Open

CESNET-QUIC22: A large one-month QUIC network traffic dataset from backbone lines

  • 1. FIT CTU & CESNET z.s.p.o
  • 2. CESNET z.s.p.o
  • 3. FIT Czech Technical University in Prague

Description

Please refer to the original data article for further data description: Jan Luxemburk et al. CESNET-QUIC22: A large one-month QUIC network traffic dataset from backbone lines, Data in Brief, 2023, 108888, ISSN 2352-3409, https://doi.org/10.1016/j.dib.2023.108888

We recommend using the CESNET DataZoo python library, which facilitates the work with large network traffic datasets. More information about the DataZoo project can be found in the GitHub repository https://github.com/CESNET/cesnet-datazoo.

The QUIC (Quick UDP Internet Connection) protocol has the potential to replace TLS over TCP, which is the standard choice for reliable and secure Internet communication. Due to its design that makes the inspection of QUIC handshakes challenging and its usage in HTTP/3, there is an increasing demand for research in QUIC traffic analysis. This dataset contains one month of QUIC traffic collected in an ISP backbone network, which connects 500 large institutions and serves around half a million people. The data are delivered as enriched flows that can be useful for various network monitoring tasks. The provided server names and packet-level information allow research in the encrypted traffic classification area. Moreover, included QUIC versions and user agents (smartphone, web browser, and operating system identifiers) provide information for large-scale QUIC deployment studies.

Data capture The data was captured in the flow monitoring infrastructure of the CESNET2 network. The capturing was done for four weeks between 31.10.2022 and 27.11.2022. The following list provides per-week flow count, capture period, and uncompressed size:

  • W-2022-44
    • Uncompressed Size: 19 GB
    • Capture Period: 31.10.2022 - 6.11.2022
    • Number of flows: 32.6M
  • W-2022-45
    • Uncompressed Size: 25 GB
    • Capture Period: 7.11.2022 - 13.11.2022
    • Number of flows: 42.6M
  • W-2022-46
    • Uncompressed Size: 20 GB
    • Capture Period: 14.11.2022 - 20.11.2022
    • Number of flows: 33.7M
  • W-2022-47
    • Uncompressed Size: 25 GB
    • Capture Period: 21.11.2022 - 27.11.2022
    • Number of flows: 44.1M
  • CESNET-QUIC22 
    • Uncompressed Size: 89 GB
    • Capture Period: 31.10.2022 - 27.11.2022
    • Number of flows: 153M

 

Data description The dataset consists of network flows describing encrypted QUIC communications. Flows were created using ipfixprobe flow exporter and are extended with packet metadata sequences, packet histograms, and with fields extracted from the QUIC Initial Packet, which is the first packet of the QUIC connection handshake. The extracted handshake fields are the Server Name Indication (SNI) domain, the used version of the QUIC protocol, and the user agent string that is available in a subset of QUIC communications.

Packet Sequences Flows in the dataset are extended with sequences of packet sizes, directions, and inter-packet times. For the packet sizes, we consider payload size after transport headers (UDP headers for the QUIC case). Packet directions are encoded as ±1, +1 meaning a packet sent from client to server, and -1 a packet from server to client. Inter-packet times depend on the location of communicating hosts, their distance, and on the network conditions on the path. However, it is still possible to extract relevant information that correlates with user interactions and, for example, with the time required for an API/server/database to process the received data and generate the response to be sent in the next packet. Packet metadata sequences have a length of 30, which is the default setting of the used flow exporter. We also derive three fields from each packet sequence: its length, time duration, and the number of roundtrips. The roundtrips are counted as the number of changes in the communication direction (from packet directions data); in other words, each client request and server response pair counts as one roundtrip.

Flow statistics Flows also include standard flow statistics, which represent aggregated information about the entire bidirectional flow. The fields are: the number of transmitted bytes and packets in both directions, the duration of flow, and packet histograms. Packet histograms include binned counts of packet sizes and inter-packet times of the entire flow in both directions (more information in the PHISTS plugin documentation There are eight bins with a logarithmic scale; the intervals are 0-15, 16-31, 32-63, 64-127, 128-255, 256-511, 512-1024, >1024 [ms or B]. The units are milliseconds for inter-packet times and bytes for packet sizes. Moreover, each flow has its end reason - either it was idle, reached the active timeout, or ended due to other reasons. This corresponds with the official IANA IPFIX-specified values. The FLOW_ENDREASON_OTHER field represents the forced end and lack of resources reasons. The end of flow detected reason is not considered because it is not relevant for UDP connections.

Dataset structure The dataset flows are delivered in compressed CSV files. CSV files contain one flow per row; data columns are summarized in the provided list below. For each flow data file, there is a JSON file with the number of saved and seen (before sampling) flows per service and total counts of all received (observed on the CESNET2 network), service (belonging to one of the dataset's services), and saved (provided in the dataset) flows. There is also the stats-week.json file aggregating flow counts of a whole week and the stats-dataset.json file aggregating flow counts for the entire dataset. Flow counts before sampling can be used to compute sampling ratios of individual services and to resample the dataset back to the original service distribution. Moreover, various dataset statistics, such as feature distributions and value counts of QUIC versions and user agents, are provided in the dataset-statistics folder. The mapping between services and service providers is provided in the servicemap.csv file, which also includes SNI domains used for ground truth labeling. The following list describes flow data fields in CSV files:

  • ID: Unique identifier
  • SRC_IP: Source IP address
  • DST_IP: Destination IP address
  • DST_ASN: Destination Autonomous System number
  • SRC_PORT: Source port
  • DST_PORT: Destination port
  • PROTOCOL: Transport protocol
  • QUIC_VERSION QUIC: protocol version
  • QUIC_SNI: Server Name Indication domain
  • QUIC_USER_AGENT: User agent string, if available in the QUIC Initial Packet
  • TIME_FIRST: Timestamp of the first packet in format YYYY-MM-DDTHH-MM-SS.ffffff
  • TIME_LAST: Timestamp of the last packet in format YYYY-MM-DDTHH-MM-SS.ffffff
  • DURATION: Duration of the flow in seconds
  • BYTES: Number of transmitted bytes from client to server
  • BYTES_REV: Number of transmitted bytes from server to client
  • PACKETS: Number of packets transmitted from client to server
  • PACKETS_REV: Number of packets transmitted from server to client
  • PPI: Packet metadata sequence in the format: [[inter-packet times], [packet directions], [packet sizes]]
  • PPI_LEN: Number of packets in the PPI sequence
  • PPI_DURATION: Duration of the PPI sequence in seconds
  • PPI_ROUNDTRIPS: Number of roundtrips in the PPI sequence
  • PHIST_SRC_SIZES: Histogram of packet sizes from client to server
  • PHIST_DST_SIZES: Histogram of packet sizes from server to client
  • PHIST_SRC_IPT: Histogram of inter-packet times from client to server
  • PHIST_DST_IPT: Histogram of inter-packet times from server to client
  • APP: Web service label
  • CATEGORY: Service category
  • FLOW_ENDREASON_IDLE: Flow was terminated because it was idle
  • FLOW_ENDREASON_ACTIVE: Flow was terminated because it reached the active timeout
  • FLOW_ENDREASON_OTHER: Flow was terminated for other reasons

 

Link to other CESNET datasets

Please cite the original data article:

@article{CESNETQUIC22, author = {Jan Luxemburk and Karel Hynek and Tomáš Čejka and Andrej Lukačovič and Pavel Šiška}, title = {CESNET-QUIC22: a large one-month QUIC network traffic dataset from backbone lines}, journal = {Data in Brief}, pages = {108888}, year = {2023}, issn = {2352-3409}, doi = {https://doi.org/10.1016/j.dib.2023.108888}, url = {https://www.sciencedirect.com/science/article/pii/S2352340923000069} }

Notes

The creation of the dataset was funded by the Ministry of the Interior of the Czech Republic, grant No. VJ02010024: ''Flow-Based Encrypted Traffic Analysis''.

Files

cesnet-quic22.zip

Files (21.0 GB)

Name Size Download all
md5:2d52ad44ddb22645a3464634fc54bb9c
21.0 GB Preview Download
md5:c99a8eb396b9598d3f5d110a562b118d
9.5 kB Preview Download