Published May 22, 2023 | Version v1
Dataset Open

Federated Learning for Distributed Intrusion Detection Systems in Public Networks - Validation Dataset

  • 1. University of Oulu

Description

This dataset was prepared and used as the validation set during the evaluation phase of "Meta IDS" to assess the performance of various machine learning models. It is now made available to users and researchers who need a reliable and diverse dataset for training and testing their own custom models.

The validation dataset comprises labeled entries, each of which indicates whether the packet is "malicious" or "benign." It covers complex design patterns that are commonly encountered in real-world applications. The dataset is designed to be representative, encompassing edge and fog layers that communicate with the cloud layer, which enables thorough testing and evaluation of different models. Each sample is labeled with its ground truth, providing a reference for evaluating model performance.

 

To simplify distribution and storage, the dataset has been split into three batches, each containing a portion of the data. The batches are provided as individual compressed files, so they can be downloaded and managed separately.

 

To extract the data, follow these instructions:

  • Download and install bzip2 (if not already installed) from the official website or your package manager.
  • Place the compressed dataset file in a directory of your choice.
  • Open a terminal or command prompt and navigate to the directory where the compressed dataset file is located.
  • Execute the following command to uncompress the dataset:
    • bzip2 -d filename.bz2
  • Replace "filename.bz2" with the actual name of the compressed dataset file.
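
For readers who prefer to script the extraction, the steps above can be sketched with Python's standard `bz2` module. This is a minimal sketch, not the author's tooling: the function name `decompress_bz2` and its parameters are hypothetical, and "filename.bz2" remains a placeholder for the actual archive name. The data is copied in chunks, so memory use stays small even for very large archives.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src: str, keep_original: bool = True, chunk_size: int = 1 << 20) -> Path:
    """Stream-decompress `src` (e.g. "filename.bz2") next to the original file.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    src_path = Path(src)
    if src_path.suffix != ".bz2":
        raise ValueError(f"expected a .bz2 file, got {src_path.name}")
    dst_path = src_path.with_suffix("")  # "filename.bz2" -> "filename"
    # Copy in fixed-size chunks so memory use does not grow with file size.
    with bz2.open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
    if not keep_original:
        src_path.unlink()  # `bzip2 -d` also removes the archive by default
    return dst_path
```

Note that `bzip2 -d` deletes the compressed file after a successful extraction; `bzip2 -dk filename.bz2` keeps it, which corresponds to `keep_original=True` in the sketch.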

Once uncompressed, the dataset is available in its original format for further exploration, analysis, and model training. Extraction requires approximately 800 GB of storage in total: approximately 302 GB for the first batch, 203 GB for the second batch, and 297 GB for the third batch.
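
Before extracting, it can help to confirm that the target drive has enough free space. The sketch below sums the per-batch figures quoted above and compares them with the free space reported by `shutil.disk_usage`. The names `BATCH_SIZES_GB`, `required_gb`, and `has_enough_space` are hypothetical, and the code assumes decimal gigabytes (10^9 bytes), which the description does not specify.

```python
import shutil
from typing import Optional

# Approximate extracted sizes in GB, copied from the description.
BATCH_SIZES_GB = {"batch 1": 302, "batch 2": 203, "batch 3": 297}


def required_gb(batches: dict = BATCH_SIZES_GB) -> int:
    """Total space needed to extract the given batches."""
    return sum(batches.values())


def has_enough_space(path: str = ".", needed_gb: Optional[float] = None) -> bool:
    """Return True if the filesystem holding `path` has at least `needed_gb` free."""
    needed = required_gb() if needed_gb is None else needed_gb
    free_gb = shutil.disk_usage(path).free / 1e9  # assumption: 1 GB = 10^9 bytes
    return free_gb >= needed
```

The three figures sum to 802 GB, consistent with the stated total of approximately 800 GB. If the compressed archives are kept alongside the extracted files, their size has to be added on top.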

 

The first batch contains 1,049,527,992 entries, the second batch contains 711,043,331 entries, and the third and last batch contains 1,029,303,062 entries. The following table lists the feature names with their descriptions and example values as they appear once the dataset is extracted.

 

| Feature | Description | Example Value |
| --- | --- | --- |
| ip.src | Source IP address in the packet | a05d4ecc38da01406c9635ec694917e969622160e728495e3169f62822444e17 |
| ip.dst | Destination IP address in the packet | a52db0d87623d8a25d0db324d74f0900deb5ca4ec8ad9f346114db134e040ec5 |
| frame.time_epoch | Epoch time of the frame | 1676165569.930869 |
| arp.hw.type | Hardware type | 1 |
| arp.hw.size | Hardware size | 6 |
| arp.proto.size | Protocol size | 4 |
| arp.opcode | Opcode | 2 |
| data.len | Length | 2713 |
| eth.dst.lg | Destination LG bit | 1 |
| eth.dst.ig | Destination IG bit | 1 |
| eth.src.lg | Source LG bit | 1 |
| eth.src.ig | Source IG bit | 1 |
| frame.offset_shift | Time shift for this packet | 0 |
| frame.len | Frame length on the wire | 1208 |
| frame.cap_len | Frame length stored into the capture file | 215 |
| frame.marked | Frame is marked | 0 |
| frame.ignored | Frame is ignored | 0 |
| frame.encap_type | Encapsulation type | 1 |
| gre | Generic Routing Encapsulation | 'Generic Routing Encapsulation (IP)' |
| ip.version | Version | 6 |
| ip.hdr_len | Header length | 24 |
| ip.dsfield.dscp | Differentiated Services Codepoint | 56 |
| ip.dsfield.ecn | Explicit Congestion Notification | 2 |
| ip.len | Total length | 614 |
| ip.flags.rb | Reserved bit | 0 |
| ip.flags.df | Don't fragment | 1 |
| ip.flags.mf | More fragments | 0 |
| ip.frag_offset | Fragment offset | 0 |
| ip.ttl | Time to live | 31 |
| ip.proto | Protocol | 47 |
| ip.checksum.status | Header checksum status | 2 |
| tcp.srcport | TCP source port | 53425 |
| tcp.flags | Flags | 0x00000098 |
| tcp.flags.ns | Nonce | 0 |
| tcp.flags.cwr | Congestion Window Reduced (CWR) | 1 |
| udp.srcport | UDP source port | 64413 |
| udp.dstport | UDP destination port | 54087 |
| udp.stream | Stream index | 1345 |
| udp.length | Length | 225 |
| udp.checksum.status | Checksum status | 3 |
| packet_type | Type of the packet, which is either "benign" or "malicious" | 0 |
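
With several hundred gigabytes per batch, the extracted files are best processed as a stream rather than loaded into memory. The description does not state the on-disk format, so the sketch below rests on a loud assumption: that each extracted file is comma-separated text with a header row containing the feature names listed above. The helper `count_labels` is hypothetical. It tallies the values of the `packet_type` column without assuming which value stands for "benign" and which for "malicious".

```python
import csv
from collections import Counter
from typing import Iterable


def count_labels(lines: Iterable[str], label_column: str = "packet_type") -> Counter:
    """Count label values row by row.

    ASSUMPTION: `lines` yields CSV text whose first row is a header that
    includes `label_column`. The real file format may differ.
    """
    reader = csv.DictReader(lines)
    counts: Counter = Counter()
    for row in reader:
        # Values are kept as strings exactly as they appear in the file.
        counts[row[label_column]] += 1
    return counts
```

A possible call, with a placeholder path, would be `with open("filename", newline="") as f: print(count_labels(f))`. Because rows are read one at a time, memory use does not depend on the size of the batch.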

Furthermore, in compliance with the GDPR and to ensure the privacy of individuals, all IP addresses present in the dataset have been anonymized through hashing. This anonymization process helps protect the identity of individuals while preserving the integrity and utility of the dataset for research and model development purposes.
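
To illustrate what hash-based anonymization of an address looks like, here is a minimal sketch. The description does not name the hash function or say whether a salt was used. The example values in the table are 64 hexadecimal characters long, which is consistent with a 256-bit digest such as SHA-256, but that is an assumption and not a statement about the author's actual procedure. The function `anonymize_ip` and its `salt` parameter are hypothetical.

```python
import hashlib


def anonymize_ip(ip: str, salt: bytes = b"") -> str:
    """Map an IP address string to a fixed-length hexadecimal digest.

    ASSUMPTION: SHA-256 is used here purely for illustration; the dataset's
    real hashing scheme is not documented in the description.
    """
    return hashlib.sha256(salt + ip.encode("utf-8")).hexdigest()
```

Because the mapping is deterministic, the same address always yields the same digest, so packets from one source can still be grouped together even though the original address is not shown.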

 

Please note that while the dataset provides valuable insights and a solid foundation for machine learning tasks, it is not a substitute for extensive real-world data collection. However, it serves as a valuable resource for researchers, practitioners, and enthusiasts in the machine learning community, offering a compliant and anonymized dataset for developing and validating custom models in a specific problem domain.

 

By leveraging the validation dataset for machine learning model evaluation and custom model training, users can accelerate their research and development efforts, building upon the knowledge gained from my thesis while contributing to the advancement of the field.

Files

Files (16.6 GB)

| Checksum | Size |
| --- | --- |
| md5:efa51d98de787c513e51322e3163fbe6 | 6.3 GB |
| md5:732f2798ff5b3218cf73d00483c42d20 | 4.0 GB |
| md5:8be33ac338d293ac837361216373fdf5 | 6.3 GB |
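
The MD5 values listed above can be used to check that a download completed without corruption. The sketch below computes the digest of a file in chunks and compares it with a listed value. It assumes, as is usual for repository listings, that each checksum refers to the compressed file as downloaded. The helper names `md5_of_file` and `matches_listing` are hypothetical.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path: str, listed: str) -> bool:
    """Compare a file's digest with a listed value such as "md5:efa5...".

    The optional "md5:" prefix is stripped before comparing.
    """
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listing("filename.bz2", "md5:efa51d98de787c513e51322e3163fbe6")` should return `True` for an intact copy of the corresponding archive, where "filename.bz2" is a placeholder for the downloaded file.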

Additional details

Funding

European Commission
IDUNN – A Cognitive Detection System for Cybersecure Operational Technologies (grant no. 101021911)