Federated Learning for Distributed Intrusion Detection Systems in Public Networks - Validation Dataset
Description
This dataset was prepared and used as the validation set during the evaluation phase of "Meta IDS" to assess the performance of various machine learning models. It is now available to users and researchers who need a diverse dataset for training and testing their own custom models.
The validation dataset comprises labeled entries, each indicating whether a packet is "malicious" or "benign." It covers complex patterns that are commonly encountered in real-world applications. The dataset is designed to be representative, encompassing the edge and fog layers that are in contact with the cloud layer, which enables thorough testing and evaluation of different models. Each sample is labeled with its ground truth, providing a reference for evaluating model performance.
For easier distribution and storage, the dataset has been split into three batches, each provided as an individual compressed file that can be downloaded and managed separately.
To extract the data, follow these instructions:
- Download and install bzip2 (if not already installed) from the official website or your package manager.
- Place the compressed dataset file in a directory of your choice.
- Open a terminal or command prompt and navigate to the directory where the compressed dataset file is located.
- Execute the following command to uncompress the dataset, replacing "filename.bz2" with the actual name of the compressed dataset file:
  bzip2 -d filename.bz2
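Where installing bzip2 is not convenient, the same extraction step can be sketched with Python's standard-library bz2 module. This is a minimal sketch and not part of the original instructions; the function name decompress_bz2 and the file names in the usage note are hypothetical. Unlike bzip2 -d, which by default removes the compressed file after extracting it, this sketch leaves the input in place.

```python
import bz2
import shutil


def decompress_bz2(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress src_path into dst_path without loading it into memory."""
    # bz2.open yields decompressed bytes; copyfileobj moves them chunk by chunk.
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return dst_path
```

A call such as decompress_bz2("filename.bz2", "filename") would write the extracted data next to the compressed file, so both copies occupy disk space until the compressed one is deleted.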
Once uncompressed, the dataset is available in its original format for further exploration, analysis, and model training. Extraction requires approximately 800 GB of storage in total: approximately 302 GB for the first batch, 203 GB for the second, and 297 GB for the third.
The first batch contains 1,049,527,992 entries, the second batch 711,043,331 entries, and the third and last batch 1,029,303,062 entries. The following table lists each feature name with its description and an example value from the extracted dataset.
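Because the extracted batches are large, it can help to check free disk space before starting. The sketch below adds up the per-batch figures quoted above and compares them with the free space on the target filesystem. It assumes decimal gigabytes (1 GB = 10**9 bytes), which the source does not specify, and the helper name has_room is hypothetical.

```python
import shutil

# Approximate extracted sizes quoted above, in GB (assumed decimal: 1 GB = 10**9 bytes).
BATCH_SIZES_GB = {"batch 1": 302, "batch 2": 203, "batch 3": 297}


def has_room(path, required_gb):
    """Return True if the filesystem holding `path` has at least required_gb free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 10**9


total_gb = sum(BATCH_SIZES_GB.values())  # 802, i.e. roughly 800 GB
print(f"All batches: ~{total_gb} GB; enough free space here: {has_room('.', total_gb)}")
```

If the total does not fit, the batches can be checked and extracted one at a time, since each is distributed as its own compressed file.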
Feature | Description | Example Value |
---|---|---|
ip.src | Source IP address in the packet | a05d4ecc38da01406c9635ec694917e969622160e728495e3169f62822444e17 |
ip.dst | Destination IP address in the packet | a52db0d87623d8a25d0db324d74f0900deb5ca4ec8ad9f346114db134e040ec5 |
frame.time_epoch | Epoch time of the frame | 1676165569.930869 |
arp.hw.type | Hardware type | 1 |
arp.hw.size | Hardware size | 6 |
arp.proto.size | Protocol size | 4 |
arp.opcode | Opcode | 2 |
data.len | Length | 2713 |
eth.dst.lg | Destination LG bit | 1 |
eth.dst.ig | Destination IG bit | 1 |
eth.src.lg | Source LG bit | 1 |
eth.src.ig | Source IG bit | 1 |
frame.offset_shift | Time shift for this packet | 0 |
frame.len | Frame length on the wire | 1208 |
frame.cap_len | Frame length stored into the capture file | 215 |
frame.marked | Frame is marked | 0 |
frame.ignored | Frame is ignored | 0 |
frame.encap_type | Encapsulation type | 1 |
gre | Generic Routing Encapsulation | 'Generic Routing Encapsulation (IP)' |
ip.version | Version | 6 |
ip.hdr_len | Header length | 24 |
ip.dsfield.dscp | Differentiated Services Codepoint | 56 |
ip.dsfield.ecn | Explicit Congestion Notification | 2 |
ip.len | Total length | 614 |
ip.flags.rb | Reserved bit | 0 |
ip.flags.df | Don't fragment | 1 |
ip.flags.mf | More fragments | 0 |
ip.frag_offset | Fragment offset | 0 |
ip.ttl | Time to live | 31 |
ip.proto | Protocol | 47 |
ip.checksum.status | Header checksum status | 2 |
tcp.srcport | TCP source port | 53425 |
tcp.flags | Flags | 0x00000098 |
tcp.flags.ns | Nonce | 0 |
tcp.flags.cwr | Congestion Window Reduced (CWR) | 1 |
udp.srcport | UDP source port | 64413 |
udp.dstport | UDP destination port | 54087 |
udp.stream | Stream index | 1345 |
udp.length | Length | 225 |
udp.checksum.status | Checksum status | 3 |
packet_type | Type of the packet, either "benign" or "malicious" | 0 |
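The label column can be tallied without extracting a batch first, by decompressing on the fly. The sketch below is only an illustration: the source does not state the on-disk file format, so it assumes comma-separated text with a header row whose column names match the feature names in the table. The function name count_labels and its limit parameter are hypothetical.

```python
import bz2
import csv
from collections import Counter
from itertools import islice


def count_labels(path, label_column="packet_type", limit=None):
    """Tally label values from a bz2-compressed CSV, reading it row by row."""
    counts = Counter()
    # Text mode decompresses on the fly, so the batch is never fully extracted.
    with bz2.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle)  # assumes a header row (not confirmed by the source)
        for row in islice(reader, limit):  # limit=None reads every row
            counts[row[label_column]] += 1
    return counts
```

Passing a small limit first, for example count_labels("filename.bz2", limit=100000), is a cheap way to confirm that the assumed format holds before scanning a batch with hundreds of millions of entries.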
Furthermore, in compliance with the GDPR and to protect the privacy of individuals, all IP addresses in the dataset have been anonymized through hashing. This protects the identity of individuals while preserving the integrity and utility of the dataset for research and model development.
Please note that while the dataset provides a solid foundation for machine learning tasks, it is not a substitute for extensive real-world data collection. It nonetheless offers researchers, practitioners, and enthusiasts in the machine learning community a compliant, anonymized dataset for developing and validating custom models in this problem domain.
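To illustrate what hashing an address looks like, the sketch below maps an IP string to a hex digest. The ip.src and ip.dst example values in the table are 64 hexadecimal characters long, which is consistent with a 256-bit digest such as SHA-256, but the source does not name the hash function or say whether a salt or key was used. SHA-256 and the optional salt are therefore assumptions, anonymize_ip is a hypothetical helper, and the output is not expected to reproduce the values in the dataset.

```python
import hashlib


def anonymize_ip(address, salt=b""):
    """Map an IP address string to a fixed-length hex digest (assumed scheme: SHA-256)."""
    return hashlib.sha256(salt + address.encode("ascii")).hexdigest()


digest = anonymize_ip("192.0.2.1")  # address from a range reserved for documentation
print(len(digest))  # 64 hex characters
```

Because the same input always yields the same digest, packets from the same host remain linkable across the dataset, which is what keeps hashed addresses usable as features.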
By leveraging the validation dataset for machine learning model evaluation and custom model training, users can accelerate their research and development efforts, building upon the knowledge gained from my thesis while contributing to the advancement of the field.
Files
(16.6 GB)
MD5 checksum | Size |
---|---|
md5:efa51d98de787c513e51322e3163fbe6 | 6.3 GB |
md5:732f2798ff5b3218cf73d00483c42d20 | 4.0 GB |
md5:8be33ac338d293ac837361216373fdf5 | 6.3 GB |
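After downloading, each file can be checked against the md5 value listed for it above. The following is a minimal sketch using Python's hashlib; the helper names md5_of_file and matches_checksum are hypothetical, and the path passed in would be whatever name the downloaded file has.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file by reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so multi-gigabyte files never sit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a listed value, with or without the 'md5:' prefix."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

A mismatch usually indicates an incomplete or corrupted download, in which case the file should be fetched again before extraction.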