Federated Learning for Distributed Intrusion Detection Systems in Public Networks - Validation Dataset

Bakhshi Zadi Mahmoodi, Alireza; Panos Kostakos

doi:10.5281/zenodo.7956304

Published May 22, 2023 | Version v1

Dataset Open


1. University of Oulu

This dataset was prepared and used as a validation set during the evaluation phase of "Meta IDS" to assess the performance of various machine learning models. It is now made available to users and researchers who seek a reliable and diverse dataset for training and testing their own custom models.

The validation dataset comprises a comprehensive collection of labeled entries, each indicating whether the packet is "malicious" or "benign." It covers complex design patterns that are commonly encountered in real-world applications. The dataset is designed to be representative, encompassing the edge and fog layers that communicate with the cloud layer, thereby enabling thorough testing and evaluation of different models. Each sample is labeled with the corresponding ground truth, providing a reliable reference for model performance evaluation.

To simplify distribution, storage, and downloading, the dataset has been split into three batches, each containing a portion of the data. The three batches are provided as individual compressed files.

To extract the data, follow these steps:

Download and install bzip2 (if not already installed) from the official website or your package manager.
Place the compressed dataset file in a directory of your choice.
Open a terminal or command prompt and navigate to the directory where the compressed dataset file is located.
Execute the following command to uncompress the dataset:
- bzip2 -d filename.bz2
Replace "filename.bz2" with the actual name of the compressed dataset file.
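
Where installing the bzip2 command-line tool is inconvenient, the same extraction can be sketched with Python's standard library. The function name `decompress_bz2` and the chunk size are illustrative choices, not part of the dataset's tooling. Unlike `bzip2 -d`, which removes the compressed file after extraction unless `-k` is given, this sketch leaves the input file in place.

```python
import bz2
import shutil
from pathlib import Path
from typing import Optional


def decompress_bz2(src: str, dst: Optional[str] = None,
                   chunk_size: int = 1024 * 1024) -> Path:
    """Decompress a .bz2 file in fixed-size chunks and keep the original.

    `decompress_bz2` is a hypothetical helper name used for illustration.
    """
    src_path = Path(src)
    # Default output name: drop the trailing ".bz2" suffix.
    dst_path = Path(dst) if dst is not None else src_path.with_suffix("")
    with bz2.open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
        # Stream the data so the whole file never has to fit in memory.
        shutil.copyfileobj(fin, fout, chunk_size)
    return dst_path
```

Calling `decompress_bz2("validation_data-batch1-anonymized.csv.bz2")` would write `validation_data-batch1-anonymized.csv` next to the compressed file.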

Once uncompressed, the dataset is available in its original format for further exploration, analysis, and model training. Extraction requires approximately 800 GB of storage in total: approximately 302 GB for the first batch, 203 GB for the second batch, and 297 GB for the third batch.
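
If 800 GB of free storage is not available, the rows can also be read directly from a compressed file without extracting it first. The following is a minimal sketch, assuming the files are comma-delimited, UTF-8 encoded, and begin with a header row containing the feature names; none of this is confirmed by the record, and `iter_rows` and `head` are hypothetical helper names.

```python
import bz2
import csv
from itertools import islice
from typing import Dict, Iterator, List


def iter_rows(path: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per CSV row, decompressing on the fly.

    Assumes a comma-delimited file with a header row (not confirmed).
    """
    # newline="" is what the csv module expects for text-mode files.
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle)


def head(path: str, n: int = 5) -> List[Dict[str, str]]:
    """Return the first n rows without reading the rest of the file."""
    return list(islice(iter_rows(path), n))
```

Reading this way trades disk space for CPU time, since every pass over the data repeats the decompression.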

The first batch contains 1,049,527,992 entries, the second batch contains 711,043,331 entries, and the third and last batch contains 1,029,303,062 entries. The following table lists each feature name with its description and an example value from the extracted dataset.
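
As a quick consistency check on the figures quoted above, the per-batch entry counts and extracted sizes can be totaled and combined into a rough average size per entry. This is plain arithmetic on the stated numbers; treating 1 GB as 10^9 bytes is an assumption.

```python
# Entry counts and approximate extracted sizes as stated in the description.
entries = {
    "first": 1_049_527_992,
    "second": 711_043_331,
    "third": 1_029_303_062,
}
extracted_gb = {"first": 302, "second": 203, "third": 297}

total_entries = sum(entries.values())    # 2,789,874,385 entries
total_gb = sum(extracted_gb.values())    # 802 GB, i.e. "approximately 800 GB"

# Rough average bytes per entry, assuming 1 GB = 10**9 bytes.
bytes_per_entry = {
    batch: extracted_gb[batch] * 10**9 / entries[batch] for batch in entries
}
```

Under that assumption each batch averages a little under 290 bytes per entry, which is useful when estimating how many rows fit in memory at once.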

Feature	Description	Example Value
ip.src	Source IP address in the packet	a05d4ecc38da01406c9635ec694917e969622160e728495e3169f62822444e17
ip.dst	Destination IP address in the packet	a52db0d87623d8a25d0db324d74f0900deb5ca4ec8ad9f346114db134e040ec5
frame.time_epoch	Epoch time of the frame	1676165569.930869
arp.hw.type	Hardware type	1
arp.hw.size	Hardware size	6
arp.proto.size	Protocol size	4
arp.opcode	Opcode	2
data.len	Length	2713
eth.dst.lg	Destination LG bit	1
eth.dst.ig	Destination IG bit	1
eth.src.lg	Source LG bit	1
eth.src.ig	Source IG bit	1
frame.offset_shift	Time shift for this packet	0
frame.len	Frame length on the wire	1208
frame.cap_len	Frame length stored into the capture file	215
frame.marked	Frame is marked	0
frame.ignored	Frame is ignored	0
frame.encap_type	Encapsulation type	1
gre	Generic Routing Encapsulation	'Generic Routing Encapsulation (IP)'
ip.version	Version	6
ip.hdr_len	Header length	24
ip.dsfield.dscp	Differentiated Services Codepoint	56
ip.dsfield.ecn	Explicit Congestion Notification	2
ip.len	Total length	614
ip.flags.rb	Reserved bit	0
ip.flags.df	Don't fragment	1
ip.flags.mf	More fragments	0
ip.frag_offset	Fragment offset	0
ip.ttl	Time to live	31
ip.proto	Protocol	47
ip.checksum.status	Header checksum status	2
tcp.srcport	TCP source port	53425
tcp.flags	Flags	0x00000098
tcp.flags.ns	Nonce	0
tcp.flags.cwr	Congestion Window Reduced (CWR)	1
udp.srcport	UDP source port	64413
udp.dstport	UDP destination port	54087
udp.stream	Stream index	1345
udp.length	Length	225
udp.checksum.status	Checksum status	3
packet_type	Type of the packet which is either "benign" or "malicious"	0
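
Several of the example values in the table are encoded and need conversion before use. The sketch below decodes two of them, assuming values are stored as text in the same form as the examples and that `tcp.flags` follows the standard TCP header bit layout. The helper names `parse_epoch` and `decode_tcp_flags` are hypothetical.

```python
from datetime import datetime, timezone
from typing import List

# Standard TCP flag bit masks; assumed to apply to the tcp.flags column.
TCP_FLAG_BITS = {
    0x001: "FIN", 0x002: "SYN", 0x004: "RST", 0x008: "PSH",
    0x010: "ACK", 0x020: "URG", 0x040: "ECE", 0x080: "CWR",
    0x100: "NS",
}


def parse_epoch(value: str) -> datetime:
    """Convert a frame.time_epoch string to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(float(value), tz=timezone.utc)


def decode_tcp_flags(value: str) -> List[str]:
    """Return the names of the flags set in a hex string such as '0x00000098'."""
    bits = int(value, 16)  # int() accepts the "0x" prefix when base is 16
    return [name for mask, name in sorted(TCP_FLAG_BITS.items()) if bits & mask]
```

With the example values from the table, `parse_epoch("1676165569.930869")` falls on 2023-02-12 (UTC), and `decode_tcp_flags("0x00000098")` returns `["PSH", "ACK", "CWR"]`, which agrees with the example `tcp.flags.cwr` value of 1.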

Furthermore, in compliance with the GDPR and to ensure the privacy of individuals, all IP addresses present in the dataset have been anonymized through hashing. This anonymization process helps protect the identity of individuals while preserving the integrity and utility of the dataset for research and model development purposes.
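
The record does not state which hash function was used or whether a salt or key was applied. The example `ip.src` and `ip.dst` values are 64 hexadecimal characters long, which matches the length of a SHA-256 digest, so the sketch below uses SHA-256 purely as an illustration of hash-based pseudonymization. It is not the authors' procedure, and `pseudonymize_ip` and its `salt` parameter are hypothetical.

```python
import hashlib


def pseudonymize_ip(ip: str, salt: bytes = b"") -> str:
    """Map an IP address string to a 64-character hex token.

    SHA-256 and the optional salt are assumptions for illustration only;
    the dataset description does not specify the actual scheme.
    """
    return hashlib.sha256(salt + ip.encode("utf-8")).hexdigest()
```

Because a deterministic hash maps the same input to the same token, packets from one address can still be grouped together after anonymization, provided the same salt is used throughout.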

Please note that while the dataset provides valuable insights and a solid foundation for machine learning tasks, it is not a substitute for extensive real-world data collection. However, it serves as a valuable resource for researchers, practitioners, and enthusiasts in the machine learning community, offering a compliant and anonymized dataset for developing and validating custom models in a specific problem domain.

By leveraging the validation dataset for machine learning model evaluation and custom model training, users can accelerate their research and development efforts, building upon the knowledge gained from my thesis while contributing to the advancement of the field.

Files

Files (16.6 GB)

Name	MD5	Size
validation_data-batch0-anonymized.csv.bz2	efa51d98de787c513e51322e3163fbe6	6.3 GB
validation_data-batch1-anonymized.csv.bz2	732f2798ff5b3218cf73d00483c42d20	4.0 GB
validation_data-batch2-anonymized.csv.bz2	8be33ac338d293ac837361216373fdf5	6.3 GB
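
The MD5 checksums listed with the files above can be used to confirm that a download completed without corruption before spending time on extraction. The following is a minimal sketch using Python's standard library; the helper names `md5_of_file` and `verify` are hypothetical, and the expected digests are copied from the file list.

```python
import hashlib
import os

# Digests as published in the file list of this record.
EXPECTED_MD5 = {
    "validation_data-batch0-anonymized.csv.bz2": "efa51d98de787c513e51322e3163fbe6",
    "validation_data-batch1-anonymized.csv.bz2": "732f2798ff5b3218cf73d00483c42d20",
    "validation_data-batch2-anonymized.csv.bz2": "8be33ac338d293ac837361216373fdf5",
}


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str) -> bool:
    """Compare a downloaded file's digest with the published value."""
    expected = EXPECTED_MD5[os.path.basename(path)]
    return md5_of_file(path) == expected
```

A `False` result from `verify` suggests the download should be repeated.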

Additional details

European Commission
IDUNN – A Cognitive Detection System for Cybersecure Operational Technologies 101021911

	All versions	This version
Views	557	548
Downloads	76	76
Data volume	457.7 GB	457.7 GB
