Dataset Open Access

Host Network Traffic 2019

Tomas Jirsik

Dataset Summary

  • Timespan: 2019-01-01 to 2019-12-31
  • Granularity: 1-hour disjoint time windows
  • # of characteristics observed: 9
  • Hosts observed: 65536
  • Labels: included
  • Unzipped volume: approx. 10 GB

Dataset Origins

The dataset was collected over the whole of 2019. The observation points for IP flow collection were located at the borders of the university campus network. The campus network has a /16 IPv4 CIDR range at its disposal and contains various segments, from those connecting dormitories, through server segments, to a segment containing the workstations of university administrative staff. A host in the dataset is identified by its source IPv4 address.

Variables

The dataset contains the following variables:

  • Aggregations - sums (or, for flow duration, the average) of the individual variables over a one-hour window:
    • # of flows - number of flows for a given source IP
    • # of packets - number of packets for a given source IP
    • # of bytes - number of bytes for a given source IP
    • flow duration - average flow duration in seconds
  • Distinct Counts - count of distinct values for each variable over a one-hour window:
    • # of peers - number of distinct communication peers for a given source IP
    • # of ports - number of distinct destination ports for a given source IP
    • # of protocols - number of distinct communication protocols for a given source IP
    • # of AS numbers - number of distinct destination AS numbers for a given source IP
    • # of countries - number of distinct destination countries for a given source IP
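
The definitions above can be illustrated with a minimal pandas sketch. It is not the author's collection pipeline: the flow records, the column names (start, src_ip, dst_ip, dst_port, packets, bytes, duration) and the helper hourly_features are hypothetical, and only a subset of the nine variables is computed.

```python
import pandas as pd

# Hypothetical flow records; all column names and values are assumptions.
flows = pd.DataFrame({
    "start": pd.to_datetime([
        "2019-01-01 00:05", "2019-01-01 00:40",
        "2019-01-01 01:10", "2019-01-01 00:20",
    ]),
    "src_ip": ["A", "A", "A", "B"],
    "dst_ip": ["X", "Y", "X", "X"],
    "dst_port": [80, 443, 80, 53],
    "packets": [10, 20, 5, 1],
    "bytes": [1000, 3000, 400, 60],
    "duration": [2.0, 4.0, 1.0, 0.5],
})


def hourly_features(flows):
    """Aggregate flows per (one-hour window, source IP)."""
    # Assign each flow to the disjoint one-hour window containing its start.
    window = flows["start"].dt.floor("h")
    grouped = flows.groupby([window, "src_ip"])
    return grouped.agg(
        n_flows=("dst_ip", "size"),          # number of flows
        n_packets=("packets", "sum"),        # number of packets
        n_bytes=("bytes", "sum"),            # number of bytes
        avg_duration=("duration", "mean"),   # average flow duration in seconds
        n_peers=("dst_ip", "nunique"),       # distinct communication peers
        n_ports=("dst_port", "nunique"),     # distinct destination ports
    )


features = hourly_features(flows)

# One matrix per variable: rows are windows, columns are hosts.
# A host with no flows in a window ends up as a missing value.
n_flows_matrix = features["n_flows"].unstack("src_ip")
print(n_flows_matrix)
```

Unstacking one variable gives a window-by-host matrix, which mirrors the layout described under Dataset Structure.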

Dataset Structure

  • Dataset Files - each variable is contained in one comma-separated values (.csv) file
    • Row index - timestamp of the observation window (8760 rows)
    • Column index - anonymized IP addresses (65536 columns)
  • Label File - contains labels of the individual IP addresses from the Dataset Files
    • Row index - anonymized IP addresses (65536 rows)
    • Column index - labels for the IP addresses
      • Subnet - ID of a subnet; hosts belonging to the same subnet have the same ID
      • Subnet_range - CIDR range of a subnet
      • Unit - ID of the administrative unit owning the network range
      • Sub-unit - ID of the administrative sub-unit owning the network range
      • Subnet_label - subnet label
        • Servers - selected subnets containing mostly servers (133.250.178.0/24, 133.250.163.0/24)
        • Workstations - selected subnets containing mostly workstations (133.250.146.0/24, 133.250.157.128/25)
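
Because the Label File is indexed by the same anonymized IP addresses that form the columns of the Dataset Files, labels can be used to select host columns. The sketch below uses tiny synthetic stand-ins for both files; the addresses ip_a, ip_b, ip_c, the values, and the helper hosts_with_label are hypothetical, while the column names Subnet and Subnet_label follow the structure described above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one Dataset File: rows are windows, columns are hosts.
data = pd.DataFrame(
    [[5.0, np.nan, 2.0], [1.0, 3.0, np.nan]],
    index=pd.to_datetime(["2019-01-01 00:00", "2019-01-01 01:00"]),
    columns=["ip_a", "ip_b", "ip_c"],
)

# Synthetic stand-in for the Label File: rows are hosts, columns are labels.
labels = pd.DataFrame(
    {"Subnet": [1, 1, 2], "Subnet_label": ["Servers", "Servers", np.nan]},
    index=["ip_a", "ip_b", "ip_c"],
)


def hosts_with_label(labels, value):
    """Return the IP addresses whose Subnet_label equals the given value."""
    return labels.index[labels["Subnet_label"] == value]


# Keep only the columns of hosts labeled as servers.
servers = data[hosts_with_label(labels, "Servers")]

# Per-window total over the selected hosts (missing values are skipped).
servers_total = servers.sum(axis=1)
print(servers_total)
```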

Further notes

  • N/A values
    • Variables - the host did not communicate in the given observation window
    • Labels - no additional information on this IP address is available
  • Dataset load (Python, with pandas imported as pd)
    • df = pd.read_csv(<filename>, header=[0], index_col=[0])
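
As a minimal sketch of the load call and the N/A convention above, the example below reads a small in-memory CSV instead of a real Dataset File. The CSV content, the addresses ip_a and ip_b, and the timestamp format are assumptions made for illustration; the actual files may differ.

```python
import io

import pandas as pd

# Synthetic CSV standing in for one Dataset File (for example, # of flows).
# Empty cells become N/A, meaning the host did not communicate in that window.
csv_text = """,ip_a,ip_b
2019-01-01 00:00:00,12,
2019-01-01 01:00:00,,
2019-01-01 02:00:00,7,3
"""

# Same call as in the note above, applied to the in-memory text.
df = pd.read_csv(io.StringIO(csv_text), header=[0], index_col=[0])

# True where a host communicated in a window.
active = df.notna()

# Number of communicating hosts in each window.
active_hosts_per_window = active.sum(axis=1)

# Fraction of windows in which each host communicated.
active_share_per_host = active.mean(axis=0)

# For count-like variables, "no communication" can be treated as zero.
flows_filled = df.fillna(0)
print(active_hosts_per_window)
```

Filling with zero suits count-like variables; for the average flow duration it may be preferable to keep N/A, since a zero duration would be indistinguishable from no communication.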

Files (1.6 GB)

  • host-network-traffic-2019.tar.gz (1.6 GB)
    • md5: 0775fd4e5b18da80be448a2673757bc4