The dataset was collected over a one-month period in January 2019. The observation points for collecting IP flows were located at the borders of the university campus network. The campus network has a /16 CIDR IPv4 range at its disposal and contains various network segments, ranging from segments connecting dormitories, through server segments, to a segment containing the workstations of university administrative staff. The raw IP flows used to create the dataset exceeded 860 GB in size. A host in the dataset is identified by its source IPv4 address.


The dataset contains the following variables:

  • Aggregations - created from five-minute total volumes aggregated over one-hour disjoint windows using mean/max/min aggregation functions
    • # of flows (FL) - number of flows for a given source IP 
    • # of packets (PKT) - number of packets for a given source IP
    • # of bytes (BYT) - number of bytes for a given source IP
    • flow duration (DUR) - average flow duration in seconds
  • Distinct Counts - counts of distinct values of each variable in each five-minute window, aggregated over one-hour disjoint windows using mean/max/min aggregation functions
    • # of peers (PEER) - number of distinct communication peers for a given source IP
    • # of ports (PORTS) - number of distinct destination ports for a given source IP
    • # of protocols (PROTO) - number of distinct communication protocols for a given source IP
    • # of AS numbers (AS) - number of distinct destination AS numbers for a given source IP
    • # of countries (CTRY) - number of distinct destination countries for a given source IP
  • Labels
    • Range (RNG) - a network range a host belongs to (anonymized)
    • Unit (UNT) - an administrative unit owning the network range
    • Sub-unit (SUB-UNT) - a sub-unit of the unit
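
The two-step aggregation described above (five-minute values first, then mean/min/max over disjoint one-hour windows) can be sketched with pandas as follows. This is only an illustration under stated assumptions, not the pipeline used to build the dataset: the toy flow records and the column names `start`, `bytes` and `dst_port` are hypothetical, empty five-minute windows are assumed to count as zero, and the description does not say how one-hour windows from different days of the month are combined into the 24 hour-of-day columns.

```python
import pandas as pd

# Toy flow records for one hypothetical source IP (column names are illustrative).
flows = pd.DataFrame(
    {
        "start": pd.to_datetime(
            [
                "2019-01-01 00:01", "2019-01-01 00:02", "2019-01-01 00:07",
                "2019-01-01 00:31", "2019-01-01 01:03", "2019-01-01 01:04",
            ]
        ),
        "bytes": [100, 300, 50, 500, 200, 200],
        "dst_port": [80, 443, 80, 22, 53, 53],
    }
).set_index("start")

# Step 1: per five-minute window, a total volume (BYT) and a distinct count (PORTS).
# Assumption: windows without any flow contribute 0 to both variables.
five_min = flows.resample("5min").agg({"bytes": "sum", "dst_port": "nunique"})
five_min.columns = ["BYT", "PORTS"]

# Step 2: aggregate the five-minute values over disjoint one-hour windows.
hourly = five_min.resample("60min").agg(["mean", "min", "max"])
print(hourly)
```

The result has one row per one-hour window and one column per (variable, aggregation function) pair, which mirrors the mean/min/max layout of the dataset columns.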


Dataset format

  • The dataset is in comma-separated values (CSV) format. 
  • Header - multilevel, first 3 lines
    • level 1 - aggregation type {mean|min|max}
    • level 2 - variable {see above}
    • level 3 - hour of a day {00,01,02,03,...,22,23}
  • Labels - last 4 columns
  • Dataset size 
    • rows: 65536 host records + 3 headers
    • columns: 648 variables (9 variables x 3 aggregation functions x 24 hours) + 4 labels
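
A three-line header of this kind can be read into a pandas column MultiIndex by passing `header=[0, 1, 2]` to `read_csv`. The sketch below uses a tiny in-memory stand-in rather than the real file, since the file name and the way label columns are filled in across the three header lines are not stated here; both the sample values and the repeated `RNG` header cells are assumptions made for illustration.

```python
import io

import pandas as pd

# Tiny stand-in for the real file: 3 header lines, 2 variable columns, 1 label column.
# Assumption: label columns repeat their name on every header line.
sample_csv = io.StringIO(
    "mean,max,RNG\n"
    "FL,FL,RNG\n"
    "00,00,RNG\n"
    "1.5,4,net-a\n"
    "0.0,0,net-b\n"
)

# header=[0, 1, 2] turns the first three lines into a three-level column index:
# (aggregation type, variable, hour of day).
df = pd.read_csv(sample_csv, header=[0, 1, 2])

# Select one column by its full (aggregation, variable, hour) key.
mean_flows_midnight = df[("mean", "FL", "00")]
print(mean_flows_midnight.tolist())

# Sanity check on the documented width of the variable block.
n_variable_columns = 9 * 3 * 24  # variables x aggregation functions x hours
print(n_variable_columns)
```

With the full file, the same call should yield 65536 rows, and selecting on the first level only (for example `df["mean"]`) returns every variable and hour for that aggregation type.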


