Published September 27, 2024 | Version v1
Dataset Open

CESNET-TimeSeries24: Time Series Dataset for Network Traffic Anomaly Detection and Forecasting

  • 1. Czech Education and Scientific Network
  • 2. Czech Technical University in Prague

Description

CESNET-TimeSeries24: The dataset for network traffic forecasting and anomaly detection

The CESNET-TimeSeries24 dataset was collected by long-term monitoring of selected statistical metrics for each IP address on the ISP network CESNET3 (Czech Education and Science Network) over a period of 40 weeks. The dataset encompasses network traffic from more than 275,000 active IP addresses assigned to a wide variety of devices, including office computers, NATs, servers, WiFi routers, honeypots, and video-game consoles found in dormitories. The dataset is also rich in network anomaly types, which supports a comprehensive evaluation of anomaly detection methods.

Finally, the CESNET-TimeSeries24 dataset also provides traffic time series at the institution and IP subnet levels, covering a range of anomaly detection and forecasting scopes. Overall, the time series were created from 66 billion IP flows containing 4 trillion packets, which carry approximately 3.7 petabytes of data. CESNET-TimeSeries24 is a complex real-world dataset intended to support the evaluation of forecasting models in real-world environments.

Please cite the usage of our dataset as:

Koumar, J., Hynek, K., Čejka, T. et al. CESNET-TimeSeries24: Time Series Dataset for Network Traffic Anomaly Detection and Forecasting. Sci Data 12, 338 (2025). https://doi.org/10.1038/s41597-025-04603-x

@Article{cesnettimeseries24,
    author={Koumar, Josef and Hynek, Karel and {\v{C}}ejka, Tom{\'a}{\v{s}} and {\v{S}}i{\v{s}}ka, Pavel},
    title={CESNET-TimeSeries24: Time Series Dataset for Network Traffic Anomaly Detection and Forecasting},
    journal={Scientific Data},
    year={2025},
    month={Feb},
    day={26},
    volume={12},
    number={1},
    pages={338},
    issn={2052-4463},
    doi={10.1038/s41597-025-04603-x},
    url={https://doi.org/10.1038/s41597-025-04603-x}
}

 

Time series

We create evenly spaced time series for each IP address by aggregating IP flow records into time series datapoints. Each datapoint represents the behavior of an IP address within a time window of 10 minutes. The vector of time-series metrics v_{ip, i} describes the IP address ip in the i-th time window, so the IP flows for v_{ip, i} are captured in the time window starting at t_i and ending at t_{i+1}. The time series are built from these datapoints.

Datapoints created by the aggregation of IP flows contain the following time-series metrics:

  • Simple volumetric metrics: the number of IP flows, the number of packets, and the transmitted data size (i.e. number of bytes)
  • Unique volumetric metrics: the number of unique destination IP addresses, the number of unique destination Autonomous System Numbers (ASNs), and the number of unique destination transport layer ports. The aggregation of unique volumetric metrics is memory intensive since all unique values must be stored in an array. We used a server with 41 GB of RAM, which was enough for 10-minute aggregation on the ISP network.
  • Ratio metrics: the ratio of UDP/TCP packets, the ratio of UDP/TCP transmitted data size, the direction ratio of packets, and the direction ratio of transmitted data size
  • Average metrics: the average flow duration, and the average Time To Live (TTL)
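As a rough illustration of the aggregation described above, the following sketch bins flow records into 10-minute windows and computes a few of the listed metrics. The `Flow` record layout, its field names, the choice to key only on the source IP, and the rule of assigning a flow to a window by its start time are all assumptions made for the example; this is not the authors' pipeline.

```python
from collections import defaultdict
from dataclasses import dataclass

WINDOW_SECONDS = 600  # 10-minute aggregation window, as in the dataset


@dataclass
class Flow:
    # Hypothetical flow record; field names are illustrative only.
    src_ip: str
    dst_ip: str
    dst_port: int
    start: float     # flow start, in seconds since the start of capture
    duration: float  # flow duration in seconds
    packets: int
    n_bytes: int


def aggregate(flows, window=WINDOW_SECONDS):
    """Group flows by (source IP, window index) and compute per-window metrics."""
    groups = defaultdict(list)
    for f in flows:
        # Window i covers the half-open interval [i * window, (i + 1) * window).
        groups[(f.src_ip, int(f.start // window))].append(f)

    datapoints = {}
    for key, fs in groups.items():
        datapoints[key] = {
            "n_flows": len(fs),
            "n_packets": sum(f.packets for f in fs),
            "n_bytes": sum(f.n_bytes for f in fs),
            # Unique metrics require holding every distinct value in memory,
            # which is why this step is memory intensive at ISP scale.
            "n_dest_ip": len({f.dst_ip for f in fs}),
            "n_dest_port": len({f.dst_port for f in fs}),
            "avg_duration": sum(f.duration for f in fs) / len(fs),
        }
    return datapoints
```

Each key of the returned dictionary corresponds to one vector v_{ip, i}, with the window index i derived from the flow start time.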

 

Multiple time aggregation: The original datapoints in the dataset are aggregated over 10 minutes of network traffic. The size of the aggregation interval influences anomaly detection procedures, mainly the training speed of the detection model. However, 10-minute intervals can be too short for longitudinal anomaly detection methods. Therefore, we added two more aggregation intervals to the dataset: 1 hour and 1 day.

Time series of institutions: We identify 283 institutions inside the CESNET3 network. These time series, aggregated per institution ID, provide a view of each institution's data.

Time series of institutional subnets: We identify 548 institution subnets inside the CESNET3 network. These time series, aggregated per institution subnet ID, provide a view of each institution subnet's data.

 

Data Records

The file hierarchy is described below:

cesnet-timeseries24/
     |- institution_subnets/
     |     |- agg_10_minutes/<id_institution_subnet>.csv
     |     |- agg_1_hour/<id_institution_subnet>.csv
     |     |- agg_1_day/<id_institution_subnet>.csv
     |     |- identifiers.csv
     |- institutions/
     |     |- agg_10_minutes/<id_institution>.csv
     |     |- agg_1_hour/<id_institution>.csv
     |     |- agg_1_day/<id_institution>.csv
     |     |- identifiers.csv
     |- ip_addresses_full/
     |     |- agg_10_minutes/<id_ip_folder>/<id_ip>.csv
     |     |- agg_1_hour/<id_ip_folder>/<id_ip>.csv
     |     |- agg_1_day/<id_ip_folder>/<id_ip>.csv
     |     |- identifiers.csv
     |- ip_addresses_sample/
     |     |- agg_10_minutes/<id_ip>.csv
     |     |- agg_1_hour/<id_ip>.csv
     |     |- agg_1_day/<id_ip>.csv
     |     |- identifiers.csv
     |- times/
     |     |- times_10_minutes.csv
     |     |- times_1_hour.csv
     |     |- times_1_day.csv
     |- ids_relationship.csv
     |- weekends_and_holidays.csv
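To show how the hierarchy above might be navigated, here is a minimal sketch that builds the path of one time series file and reads it. The helper names `series_path` and `load_series` are hypothetical, and the code assumes comma-separated files with a header row; how IP IDs map to `<id_ip_folder>` is not described here, so the folder must be supplied by the caller.

```python
import csv
from pathlib import Path

AGGREGATIONS = ("agg_10_minutes", "agg_1_hour", "agg_1_day")


def series_path(root, level, aggregation, series_id, ip_folder=None):
    """Build the path of one time series CSV following the file hierarchy."""
    if aggregation not in AGGREGATIONS:
        raise ValueError(f"unknown aggregation: {aggregation}")
    base = Path(root) / level / aggregation
    if level == "ip_addresses_full":
        # The full IP set has an extra <id_ip_folder> directory level.
        if ip_folder is None:
            raise ValueError("ip_folder is required for ip_addresses_full")
        base = base / str(ip_folder)
    return base / f"{series_id}.csv"


def _to_number(value):
    try:
        return float(value)
    except (TypeError, ValueError):
        return value  # leave non-numeric or empty cells untouched


def load_series(path):
    """Read one CSV into a list of dicts (assumes a header row is present)."""
    with open(path, newline="") as fh:
        return [{k: _to_number(v) for k, v in row.items()}
                for row in csv.DictReader(fh)]
```

For example, `series_path("cesnet-timeseries24", "institutions", "agg_1_hour", 7)` points at the hourly series of the institution with ID 7.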

The following list describes time series data fields in CSV files:

  • id_time:  Unique identifier for each aggregation interval within the time series, used to segment the dataset into specific time periods for analysis.
  • n_flows: Total number of flows observed in the aggregation interval, indicating the volume of distinct sessions or connections for the IP address.
  • n_packets: Total number of packets transmitted during the aggregation interval, reflecting the packet-level traffic volume for the IP address.
  • n_bytes: Total number of bytes transmitted during the aggregation interval, representing the data volume for the IP address.
  • n_dest_ip: Number of unique destination IP addresses contacted by the IP address during the aggregation interval, showing the diversity of endpoints reached.
  • n_dest_asn: Number of unique destination Autonomous System Numbers (ASNs) contacted by the IP address during the aggregation interval, indicating the diversity of networks reached.
  • n_dest_port: Number of unique destination transport layer ports contacted by the IP address during the aggregation interval, representing the variety of services accessed.
  • tcp_udp_ratio_packets: Ratio of packets sent using TCP versus UDP by the IP address during the aggregation interval, providing insight into the transport protocol usage pattern. This metric lies in the interval [0, 1], where 1 means all packets are sent over TCP and 0 means all packets are sent over UDP.
  • tcp_udp_ratio_bytes: Ratio of bytes sent using TCP versus UDP by the IP address during the aggregation interval, highlighting the data volume distribution between protocols. This metric lies in the interval [0, 1], with the same rule as tcp_udp_ratio_packets.
  • dir_ratio_packets: Ratio of packet directions (inbound versus outbound) for the IP address during the aggregation interval, indicating the balance of traffic flow directions. This metric lies in the interval [0, 1], where 1 means all packets are sent in the outgoing direction from the monitored IP address and 0 means all packets are sent in the incoming direction to the monitored IP address.
  • dir_ratio_bytes: Ratio of byte directions (inbound versus outbound) for the IP address during the aggregation interval, showing the data volume distribution in traffic flows. This metric lies in the interval [0, 1], with the same rule as dir_ratio_packets.
  • avg_duration: Average duration of IP flows for the IP address during the aggregation interval, measuring the typical session length.
  • avg_ttl: Average Time To Live (TTL) of IP flows for the IP address during the aggregation interval, providing insight into the lifespan of packets.
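The ratio fields can be combined with the volumetric fields to estimate per-protocol or per-direction volumes. The sketch below assumes each ratio is a simple share, for example TCP packets divided by TCP plus UDP packets, which is consistent with the stated endpoints 0 and 1 but is not spelled out in the field descriptions; it also ignores traffic from other protocols. The function name `split_by_ratio` and the sample row are hypothetical.

```python
def split_by_ratio(total, ratio):
    """Split a total into (part attributed to ratio=1, part attributed to ratio=0)."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    first = total * ratio
    return first, total - first


# Illustrative datapoint; the values are made up for the example.
row = {"n_packets": 1000, "tcp_udp_ratio_packets": 0.75, "dir_ratio_packets": 0.25}

# Estimated TCP and UDP packet counts (assuming ratio = TCP share).
tcp_packets, udp_packets = split_by_ratio(row["n_packets"], row["tcp_udp_ratio_packets"])

# Estimated outgoing and incoming packet counts (1 means all outgoing).
out_packets, in_packets = split_by_ratio(row["n_packets"], row["dir_ratio_packets"])
```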

Moreover, the time series created by re-aggregation contain the following time series metrics instead of n_dest_ip, n_dest_asn, and n_dest_port:

  • sum_n_dest_ip: Sum of numbers of unique destination IP addresses.
  • avg_n_dest_ip: The average number of unique destination IP addresses.
  • std_n_dest_ip: Standard deviation of numbers of unique destination IP addresses.
  • sum_n_dest_asn: Sum of numbers of unique destination ASNs.
  • avg_n_dest_asn: The average number of unique destination ASNs.
  • std_n_dest_asn: Standard deviation of numbers of unique destination ASNs.
  • sum_n_dest_port: Sum of numbers of unique destination transport layer ports.
  • avg_n_dest_port:  The average number of unique destination transport layer ports.
  • std_n_dest_port: Standard deviation of numbers of unique destination transport layer ports.
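Unique counts cannot simply be added when windows are merged, because the same destination may be counted in several 10-minute windows; the sum, average, and standard deviation summarize the per-window counts instead. The sketch below shows one way such a re-aggregation could look. The function name `reaggregate` is hypothetical, the use of the population standard deviation is an assumption, and the ratio and average fields are left out because their re-aggregation rule is not described here.

```python
from statistics import fmean, pstdev

SUM_FIELDS = ("n_flows", "n_packets", "n_bytes")
UNIQUE_FIELDS = ("n_dest_ip", "n_dest_asn", "n_dest_port")


def reaggregate(datapoints):
    """Merge consecutive fine-grained datapoints into one coarser datapoint."""
    # Simple volumetric metrics are additive across windows.
    out = {name: sum(dp[name] for dp in datapoints) for name in SUM_FIELDS}
    for name in UNIQUE_FIELDS:
        values = [dp[name] for dp in datapoints]
        out[f"sum_{name}"] = sum(values)
        out[f"avg_{name}"] = fmean(values)
        # Population standard deviation; the dataset's exact choice is an assumption.
        out[f"std_{name}"] = pstdev(values)
    return out
```

Passing six consecutive 10-minute datapoints would yield one 1-hour datapoint under these assumptions.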

 

In addition, the identifiers.csv file in each dataset type contains the IDs of the time series present in that part of the dataset. The ids_relationship.csv file contains the relationships between IP addresses, institutions, and institution subnets. The weekends_and_holidays.csv file contains information about non-working days in the Czech Republic.

Files

Files (41.5 GB)

  • md5:f034e4e7f1844e36fa7d3a4a0cbfc600 (3.8 MB)
  • md5:735b5ff436b6c67556cb67e2888b14ea (774.0 MB)
  • md5:ab3e15fb8dc9b7120ddb2318795b6812 (479.4 MB)
  • md5:7c3d28b7b2d2a02430ee88f84fb823b6 (40.0 GB)
  • md5:08451dab2d1eddb29a79467b03289232 (170.9 MB)
  • md5:a03813763e07646ca38f17ffd53e549e (211.5 kB)
  • md5:f8578ebd8a1bc95a7187fed2e436e5bd (1.8 kB)
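Since the largest file is 40.0 GB, it may be worth checking each download against its listed MD5 checksum. The following is a minimal sketch using Python's standard `hashlib`; the function names `md5_of_file` and `verify_download` are hypothetical, and the file is read in chunks so that it does not need to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected):
    """Return True if the file matches an 'md5:<hex>' or bare hex checksum."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

Call `verify_download` with the local path of a downloaded file and the checksum string shown for it in the file list.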

Additional details

Funding

Ministry of the Interior
Flow-Based Encrypted Traffic Analysis VJ02010024