Published May 1, 2020 | Version 1.0
Dataset Open

Host Network Traffic 2019

Creators

  • 1. ICS, Masaryk University

Description

Dataset Summary

  • Timespan: 2019-01-01 : 2019-12-31
  • Granularity: 1-hour disjoint time windows
  • # of characteristics observed: 9
  • Hosts observed: 65536
  • Labels: included
  • Unzipped volume: approx. 10 GB

Dataset Origins

The dataset was collected over the whole of 2019. The observation points for IP flow collection were located at the borders of the university campus network. The campus network has a /16 CIDR IPv4 range at its disposal and contains various segments, ranging from segments connecting dormitories, through server segments, to a segment containing the workstations of university administrative staff. A host in the dataset is identified by its source IPv4 address.
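The host and window counts in the summary follow directly from the /16 range and the hourly granularity. The short check below illustrates this using only the Python standard library; the network `10.0.0.0/16` is an arbitrary placeholder, not the actual (anonymized) campus range.

```python
import ipaddress
from datetime import datetime, timedelta

# Placeholder /16 network (an assumption for illustration only).
network = ipaddress.ip_network("10.0.0.0/16")
host_count = network.num_addresses
print(host_count)  # 65536, the number of hosts observed

# Number of disjoint 1-hour windows between 2019-01-01 and 2019-12-31 inclusive.
start = datetime(2019, 1, 1)
end = datetime(2020, 1, 1)
hourly_windows = (end - start) // timedelta(hours=1)
print(hourly_windows)  # 8760
```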

Variables

The dataset contains the following variables:

  • Aggregations - values aggregated for each source IP over a one-hour window (sums, except for flow duration, which is an average):
    • # of flows  - number of flows for a given source IP 
    • # of packets  - number of packets for a given source IP
    • # of bytes  - number of bytes for a given source IP
    • flow duration  - average flow duration in seconds
  • Distinct Counts - count of distinct values for each variable over a one-hour window
    • # of peers  - number of distinct communication peers for a given source IP
    • # of ports  - number of distinct destination ports for a given source IP
    • # of protocols - number of distinct communication protocols for a given source IP
    • # of AS numbers - number of distinct destination AS numbers for a given source IP
    • # of countries  - number of distinct destination countries for a given source IP
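To make the variable definitions concrete, the following is a minimal sketch of how such per-host hourly values could be derived from raw flow records with pandas. It is not the authors' processing pipeline: the flow table and all of its column names (`start`, `src_ip`, `dst_ip`, `dst_port`, `packets`, `bytes`, `duration`) are hypothetical, and only a subset of the nine variables is shown.

```python
import pandas as pd

# Hypothetical flow records; column names and values are assumptions.
flow_records = pd.DataFrame({
    "start": pd.to_datetime([
        "2019-01-01 00:05", "2019-01-01 00:40", "2019-01-01 01:10",
    ]),
    "src_ip": ["10.0.0.1", "10.0.0.1", "10.0.0.1"],
    "dst_ip": ["192.0.2.1", "192.0.2.2", "192.0.2.1"],
    "dst_port": [443, 443, 53],
    "packets": [10, 4, 2],
    "bytes": [1500, 600, 120],
    "duration": [2.0, 4.0, 1.0],
})

# Assign each flow to its disjoint one-hour window.
flow_records["window"] = flow_records["start"].dt.floor("h")

hourly = flow_records.groupby(["src_ip", "window"]).agg(
    flows=("dst_ip", "count"),          # number of flows
    packets=("packets", "sum"),         # number of packets
    bytes=("bytes", "sum"),             # number of bytes
    flow_duration=("duration", "mean"), # average flow duration in seconds
    peers=("dst_ip", "nunique"),        # distinct communication peers
    ports=("dst_port", "nunique"),      # distinct destination ports
)
print(hourly)
```

In this toy input, the host has two flows in the first window (to two peers on one port) and one flow in the second.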

Dataset Structure

  • Dataset Files - each variable is stored in its own comma-separated values (.csv) file
    • Row index - timestamp of the observation window (8760 rows)
    • Column index - anonymized IP addresses (65536 columns)
  • Label File - contains labels of the individual IP addresses from the Dataset Files
    • Row index - anonymized IP addresses (65536 rows)
    • Column index - labels for the IP addresses
      • Subnet - ID of a subnet; hosts belonging to the same subnet have the same ID
      • Subnet_range - CIDR range of a subnet
      • Unit - ID of the administrative unit owning the network range
      • Sub-unit - ID of the administrative sub-unit owning the network range
      • Subnet_label - subnet label
        • Servers - selected subnets containing mostly servers (133.250.178.0/24, 133.250.163.0/24)
        • Workstations - selected subnets containing mostly workstations (133.250.146.0/24, 133.250.157.128/25)
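Because the columns of each variable file and the rows of the label file share the same anonymized IP addresses, the label file can be used to select hosts of interest. The sketch below shows one way to do this with pandas. The in-memory CSV snippets stand in for the real files; their values, the timestamp format, and the `10.0.0.x` addresses are made up for illustration, while the `Subnet` and `Subnet_label` column names follow the description above.

```python
import io
import pandas as pd

# Tiny stand-ins for one variable file and the label file (made-up values).
variable_csv = io.StringIO(
    ",10.0.0.1,10.0.0.2,10.0.0.3\n"
    "2019-01-01 00:00:00,5,,7\n"
    "2019-01-01 01:00:00,3,1,\n"
)
labels_csv = io.StringIO(
    ",Subnet,Subnet_label\n"
    "10.0.0.1,1,Servers\n"
    "10.0.0.2,2,Workstations\n"
    "10.0.0.3,1,Servers\n"
)

values = pd.read_csv(variable_csv, header=[0], index_col=[0], parse_dates=True)
labels = pd.read_csv(labels_csv, header=[0], index_col=[0])

# IP addresses labeled as servers, then the matching columns of the variable file.
server_ips = labels.index[labels["Subnet_label"] == "Servers"]
server_values = values.loc[:, server_ips]
print(server_values)
```

The same pattern applies to the other label columns, for example grouping hosts by `Subnet`.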

Further notes

  • N/A values
    • Variables - the host did not communicate in the given observation window
    • Labels - no additional information on this IP is available
  • Dataset load (Python, with pandas imported as pd)
    • df = pd.read_csv(<filename>, header=[0], index_col=[0])
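Since a missing value in a variable file means that the host was silent in that window, it can be useful to separate "no traffic" from actual measurements after loading. The following is a minimal sketch using a small in-memory CSV in place of a real file; the addresses, values, and timestamp format are assumptions for illustration.

```python
import io
import pandas as pd

# Made-up stand-in for one variable file (e.g. number of flows).
csv_text = io.StringIO(
    ",10.0.0.1,10.0.0.2\n"
    "2019-01-01 00:00:00,5,\n"
    "2019-01-01 01:00:00,,\n"
    "2019-01-01 02:00:00,2,4\n"
)
df = pd.read_csv(csv_text, header=[0], index_col=[0], parse_dates=True)

# True where the host communicated in the window, False where the value is N/A.
active = df.notna()

# Number of windows with traffic per host.
active_hours = active.sum()
print(active_hours)

# For count-type variables, a silent window can be treated as zero.
counts = df.fillna(0)
print(counts)
```

Filling with zero is a reasonable choice for counts such as flows, packets, or bytes, but probably not for the average flow duration, where a missing value is better left as N/A.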

Files

Files (1.6 GB)

md5:0775fd4e5b18da80be448a2673757bc4

Additional details

Funding

SAPPAN – Sharing and Automation for Privacy Preserving Attack Neutralization 833418
European Commission