Network Traffic Analysis: Data and Code

Moran, Madeline; Honig, Joshua; Ferrell, Nathan; Soni, Shreena; Homan, Sophia; Chan-Tin, Eric

doi:10.5281/zenodo.11623345

Published June 2024 | Version v2

Dataset Open


1. Loyola University Chicago

Code:

Packet_Features_Generator.py & Features.py
- To run this code:
  - pkt_features.py [-h] -i TXTFILE [-x X] [-y Y] [-z Z] [-ml] [-s S] -j
  - -h, --help show this help message and exit
    -i TXTFILE input text file
    -x X Add first X number of total packets as features.
    -y Y Add first Y number of negative packets as features.
    -z Z Add first Z number of positive packets as features.
    -ml Output to text file all websites in the format of websiteNumber1,feature1,feature2,...
    -s S Generate samples using size s.
    -j
- Purpose:
  - Turns a text file containing lists of incoming and outgoing network packet sizes into separate website objects with associated features.
  - Uses Features.py to calculate the features.
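As an illustration of what the -x, -y, and -z options describe, the following is a minimal sketch, not the code in Features.py. It assumes a trace is already available as a list of signed packet sizes, that short traces are padded with zeros, and that the three groups are concatenated in x, y, z order. The function names `first_n` and `packet_features` are hypothetical.

```python
from typing import List


def first_n(values: List[int], n: int, pad: int = 0) -> List[int]:
    """Return the first n values, right-padded with `pad` (an assumption) if too short."""
    head = values[:n]
    return head + [pad] * (n - len(head))


def packet_features(packets: List[int], x: int = 0, y: int = 0, z: int = 0) -> List[int]:
    """Build a feature vector from the first x packets overall,
    the first y negative (incoming) packets, and the first z positive (outgoing) packets."""
    negative = [p for p in packets if p < 0]  # incoming packets
    positive = [p for p in packets if p > 0]  # outgoing packets
    return first_n(packets, x) + first_n(negative, y) + first_n(positive, z)


# Hypothetical trace: signed packet sizes in capture order.
trace = [565, -1500, -1500, 120, -640]
features = packet_features(trace, x=3, y=2, z=3)
print(features)
```

Only two outgoing packets exist in this trace, so the third outgoing slot is filled with the padding value.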
startMachineLearning.sh & machineLearning.py
- To run this code:
  - bash startMachineLearning.sh
  - This script then runs machineLearning.py in a tmux session with the necessary file paths and flags
  - Options (to be edited within this file):
    - --evaluate-only to test 5-fold cross-validation accuracy
    - --test-scaling-normalization to test 6 different combinations of scalers and normalizers
      - Note: once the best combination is determined, it should be added to the data_preprocessing function in machineLearning.py for future use
    - --grid-search to search for the best hyperparameters using grid search
      - Note: the possible hyperparameters must be added to train_model under 'if not evaluateOnly:'
      - Once the best hyperparameters are determined, add them to train_model under 'if evaluateOnly:'
- Purpose:
  - Using the .ml file generated by Packet_Features_Generator.py & Features.py, this program trains a Random Forest classifier on the provided data and reports results using cross-validation. These results include the best scaling and normalization options for each data set as well as the best grid search hyperparameters based on the provided ranges.
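To make the .ml format and the 5-fold evaluation concrete, here is a minimal, standard-library sketch. It is not taken from machineLearning.py: it leaves the Random Forest classifier out entirely and only shows how lines of the form websiteNumber,feature1,feature2,... could be parsed and how samples could be partitioned into five train/test folds. The helper names `parse_ml_line` and `k_fold_indices` and the sample lines are hypothetical, and the real script may shuffle or stratify its folds differently.

```python
from typing import List, Tuple


def parse_ml_line(line: str) -> Tuple[str, List[float]]:
    """Split one 'label,feature1,feature2,...' line into (label, features)."""
    label, *values = line.strip().split(",")
    return label, [float(v) for v in values]


def k_fold_indices(n_samples: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index pairs; every sample is tested exactly once.

    Folds are interleaved (sample i goes to fold i % k) with no shuffling,
    which is a simplification chosen for this sketch.
    """
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = sorted(j for f in folds[:i] + folds[i + 1:] for j in f)
        splits.append((train, test))
    return splits


# Hypothetical .ml content: a label followed by two features per line.
sample_lines = [f"{n % 2},{100 + n},{-1500 + n}" for n in range(10)]
rows = [parse_ml_line(line) for line in sample_lines]
splits = k_fold_indices(len(rows), k=5)
for train_idx, test_idx in splits:
    print(len(train_idx), "train /", len(test_idx), "test")
```

In each round a classifier would be fitted on the training indices and scored on the held-out indices, and the five scores would then be averaged.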

Data

Encrypted network traffic was collected on an isolated computer visiting different Wikipedia and New York Times articles, different Google search queries (collected in the form of their autocomplete results and their results page), and different actions taken on a Virtual Reality headset.
Data for each experiment was stored and analyzed as a txt file, in which each line contains:
- First, a classification number denoting which website, query, or VR action is taking place.
- The remaining numbers in each line denote:
  - the size of a packet,
  - and the direction it is traveling.
- Negative numbers denote incoming packets.
- Positive numbers denote outgoing packets.
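The line format described above could be read with something like the following sketch. The delimiter is not stated here, so the code accepts either whitespace or commas as an assumption, and it assumes the classification number is an integer. The function name `parse_trace_line` and the sample line are hypothetical.

```python
from typing import List, Tuple


def parse_trace_line(line: str) -> Tuple[int, List[int], List[int]]:
    """Parse 'label size size ...' into (label, incoming sizes, outgoing sizes).

    Negative values are incoming packets and positive values are outgoing
    packets; both lists are returned as unsigned sizes in capture order.
    """
    tokens = line.replace(",", " ").split()  # delimiter is an assumption
    label = int(tokens[0])
    packets = [int(t) for t in tokens[1:]]
    incoming = [-p for p in packets if p < 0]
    outgoing = [p for p in packets if p > 0]
    return label, incoming, outgoing


# Hypothetical line: class 7 followed by five signed packet sizes.
label, incoming, outgoing = parse_trace_line("7 565 -1500 -1500 120 -640")
print(label, incoming, outgoing)
```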

Figure 4 Data

This data uses specific lines from the Virtual Reality.txt file.
- The action 'LongText Search' refers to a user searching for "Saint Basils Cathedral" with text in the Wander app.
- The action 'ShortText Search' refers to a user searching for "Mexico" with text in the Wander app.
The .xlsx and .csv files are identical.
Each file includes (from right to left):
- The original packet data,
- each line of data organized from smallest to largest packet size in order to calculate the mean and standard deviation of each packet capture,
- and the final Cumulative Distribution Function (CDF) calculation that generated the Figure 4 graph.
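The steps listed above (sort, summarize, compute the CDF) can be sketched as follows. This is an illustration rather than the spreadsheet's actual formulas: it assumes unsigned packet sizes, uses the sample standard deviation, and computes an empirical CDF in which the i-th smallest of n values maps to i/n. The `capture` values and the function name `empirical_cdf` are hypothetical.

```python
import statistics
from typing import List, Tuple


def empirical_cdf(sizes: List[int]) -> List[Tuple[int, float]]:
    """Return (size, cumulative fraction) pairs for sizes sorted ascending."""
    ordered = sorted(sizes)
    n = len(ordered)
    return [(size, (i + 1) / n) for i, size in enumerate(ordered)]


# Hypothetical packet capture (unsigned packet sizes in bytes).
capture = [120, 1500, 565, 640, 1500]

mean = statistics.mean(capture)
std = statistics.stdev(capture)  # sample standard deviation (an assumption)
cdf = empirical_cdf(capture)

print(f"mean={mean}, std={std:.1f}")
for size, fraction in cdf:
    print(size, fraction)
```

Repeated sizes appear once per packet, so the last entry for a given size carries the cumulative fraction for that size.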

Files

Files (691.7 kB)

- Figure4 Data.csv (575.6 kB), md5:c541190d525cbe7d9979a0b636013480
- Figure4 Data.xlsx (116.1 kB), md5:be995a7fe27f9eae8c4d90ba82c0c2bc

Additional details

Programming language: Python, Linux Kernel Module

