Published March 11, 2026 | Version v3.0.0
Dataset Open

VPN-nonVPN-Dataset

  • Budapest University of Technology and Economics

Description

A flow-level dataset of WireGuard tunnel traffic with matched encrypted-side features and application labels

A flow-level dataset derived from paired captures on both sides of a WireGuard VPN tunnel. Each flow combines NFStream-exported pre-tunnel attributes (including application labels from deep packet inspection) with encrypted-side statistics computed from matched WireGuard transport packets.

Dataset summary

Property                       Value
Capture sessions               2
Total duration                 ~80 hours
Total flows                    226,454 (122,975 + 103,479)
Columns per flow               126
Unique application names       81 / 75 per session
Unique application categories  20 / 19 per session
Protocol split                 ~47% TCP, ~53% UDP
Capture location               Budapest, Hungary
VPN endpoint                   Surfshark WireGuard, Prague, Czech Republic
Active devices                 10

Repository structure

VPN-nonVPN-Dataset

  • Data
    • session_1
      • session1_flows.parquet: Session 1 flow records (122,975 rows)
      • session1_nfstream_inner_flows.parquet: Session 1 inner-side NFStream flow export
      • session1_packet_matches.parquet: Session 1 matched inner–outer packet pairs
    • session_2
      • session2_flows.parquet: Session 2 flow records (103,479 rows)
      • session2_nfstream_inner_flows.parquet: Session 2 inner-side NFStream flow export
      • session2_packet_matches.parquet: Session 2 matched inner–outer packet pairs
  • Code
    • packet_matching.py: Phase 3 inner–outer packet matching
    • flow_matching.py: Phase 3 packet-to-flow assignment and aggregation
    • compress_parquets.py: Converts CSV files to Parquet and compresses them with ZSTD to reduce file size
  • LICENSE
    • Data_LICENSE.txt: CC BY 4.0 license for data files
    • Code_LICENSE.txt: MIT License for scripts
  • Validation
    • Dataset_summary_statistics.ipynb: Reproduces dataset summary statistics and manuscript tables from released data
    • Matching_coverage.ipynb: Reproduces matching coverage results and figures
    • Worked_example_random_row.csv: Worked example flow record exported as a column name/value table
    • checksums_sha256.txt: SHA-256 hashes for raw PCAP inputs and released Parquet files
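
The hashes in checksums_sha256.txt can be checked with standard-library Python. The sketch below assumes the file follows the common sha256sum layout (hex digest, whitespace, relative path on each line); that layout is an assumption and is not stated in this record. The function names are illustrative only.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text):
    """Parse 'HEX  path' lines (assumed sha256sum layout) into {path: hex}."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        hex_digest, _, name = line.partition(" ")
        # sha256sum marks binary mode with a leading '*' before the path.
        expected[name.strip().lstrip("*")] = hex_digest.lower()
    return expected


def verify(checksum_file, root="."):
    """Return {path: True/False/None}; None means the file is not present."""
    results = {}
    expected = parse_checksums(Path(checksum_file).read_text())
    for name, hex_digest in expected.items():
        target = Path(root) / name
        results[name] = sha256_of(target) == hex_digest if target.is_file() else None
    return results
```

Entries whose files are not available locally are reported as None rather than as failures, so a partial download can still be checked.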

Data files

The two flow files (session1_flows.parquet and session2_flows.parquet) share the same 126-column schema. Each row corresponds to one bidirectional flow.

File                    Rows     Capture date  Duration
session1_flows.parquet  122,975  2025-12-01    ~48 h
session2_flows.parquet  103,479  2025-12-04    ~32 h
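
A quick way to confirm that a download matches the figures above is to load a flow file and compare its shape. This is a minimal sketch: it assumes pandas with a Parquet engine such as pyarrow is installed, and that the archive was unpacked so the Data/session_1 and Data/session_2 directories sit in the working directory. The helper names are illustrative.

```python
# Row and column counts as documented in this record.
EXPECTED_ROWS = {
    "session1_flows.parquet": 122_975,
    "session2_flows.parquet": 103_479,
}
EXPECTED_COLUMNS = 126


def shape_matches(file_name, shape):
    """Check a (rows, columns) shape against the documented figures."""
    rows, columns = shape
    return rows == EXPECTED_ROWS[file_name] and columns == EXPECTED_COLUMNS


def load_flows(path):
    """Load one flow file into a DataFrame.

    pandas is imported lazily so shape_matches stays usable without it.
    """
    import pandas as pd  # requires a Parquet engine such as pyarrow

    return pd.read_parquet(path)


# Example usage (the path is an assumption about where the archive was unpacked):
# flows = load_flows("Data/session_1/session1_flows.parquet")
# assert shape_matches("session1_flows.parquet", flows.shape)
```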

Column schema

Each flow record contains 126 columns organized into three groups.

NFStream flow attributes (columns 1-89)

Standard flow fields exported by NFStream, including:

  • Identifiers: id, expiration_id, src_ip, dst_ip, src_port, dst_port, protocol, ip_version
  • Timing: bidirectional_first_seen_ms, bidirectional_last_seen_ms, bidirectional_duration_ms, and per-direction equivalents (src2dst_*, dst2src_*)
  • Volume: bidirectional_packets, bidirectional_bytes, and per-direction equivalents
  • Packet size statistics: bidirectional_min_ps, bidirectional_mean_ps, bidirectional_stddev_ps, bidirectional_max_ps, and per-direction equivalents
  • Inter-arrival time statistics: bidirectional_min_piat_ms, bidirectional_mean_piat_ms, bidirectional_stddev_piat_ms, bidirectional_max_piat_ms, and per-direction equivalents
  • TCP flags: per-flag packet counts (SYN, ACK, FIN, RST, PSH, URG, CWR, ECE) for bidirectional and per-direction
  • SPLT arrays (first 255 packets per flow):
    • splt_direction — packet direction sequence
    • splt_ps — packet size sequence (bytes)
    • splt_piat_ms — inter-arrival time sequence (milliseconds)
  • Application labels:
    • application_name — nDPI-assigned application name (e.g., DNS, TLS.Facebook, QUIC)
    • application_category_name — nDPI-assigned category (e.g., Web, Network, Chat)
    • application_is_guessed, application_confidence
  • DPI metadata: requested_server_name, client_fingerprint, server_fingerprint, user_agent, content_type
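
Because SPLT arrays have a fixed length of 255, flows with fewer packets carry filler positions. NFStream's documentation describes -1 as the value for "no packet", with direction 0 for src2dst and 1 for dst2src. The sketch below trims that filler; whether this release stores the arrays as lists or as strings such as "[0, 1, -1]" is not stated here, so the parser accepts both as an assumption, and the filler convention should be checked against the data before relying on it.

```python
import ast


def parse_splt(value):
    """Return a SPLT value as a list of numbers.

    Accepts a list-like sequence or a string such as "[0, 1, -1]".
    """
    if isinstance(value, str):
        value = ast.literal_eval(value)
    return list(value)


def trim_splt(direction, sizes, piat_ms, filler=-1):
    """Drop trailing filler positions, using the direction array as the guide."""
    direction = parse_splt(direction)
    sizes = parse_splt(sizes)
    piat_ms = parse_splt(piat_ms)
    n = len(direction)
    while n > 0 and direction[n - 1] == filler:
        n -= 1
    return direction[:n], sizes[:n], piat_ms[:n]


def signed_sizes(direction, sizes):
    """Encode direction in the sign: positive for src2dst (0), negative for dst2src (1)."""
    return [s if d == 0 else -s for d, s in zip(direction, sizes)]
```

Signed packet-size sequences of this kind are a common compact input for sequence models, which is why the helper is included.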

Flow assignment metadata (columns 90-94)

  • flow_id — flow identifier
  • flow_start_ms, flow_end_ms — flow time boundaries in milliseconds
  • k5_fwd, k5_rev — forward and reverse 5-tuple keys
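
The exact encoding of k5_fwd and k5_rev is not described in this record. The following sketch only illustrates the general idea of forward and reverse 5-tuple keys; the FiveTuple layout and the helper names are hypothetical and may differ from what the released columns contain.

```python
from typing import NamedTuple, Optional


class FiveTuple(NamedTuple):
    """Hypothetical 5-tuple key layout (not the dataset's documented format)."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: int  # IANA protocol number: 6 = TCP, 17 = UDP


def forward_key(src_ip, src_port, dst_ip, dst_port, protocol):
    """Key in the flow's own direction (src2dst)."""
    return FiveTuple(src_ip, int(src_port), dst_ip, int(dst_port), int(protocol))


def reverse_key(key):
    """Key with the endpoints swapped (dst2src)."""
    return FiveTuple(key.dst_ip, key.dst_port, key.src_ip, key.src_port, key.protocol)


def packet_direction(packet_key, flow_forward_key) -> Optional[str]:
    """Return 'fwd', 'rev', or None when the packet does not belong to the flow."""
    if packet_key == flow_forward_key:
        return "fwd"
    if packet_key == reverse_key(flow_forward_key):
        return "rev"
    return None
```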

Encrypted-side derived features (columns 95-126)

Statistics computed from matched WireGuard tunnel packets. The table below lists the encrypted-side features added to each flow and maps each one to its closest NFStream attribute. This mapping supports structured comparison and feature selection for learning tasks that rely solely on tunnel-side statistics.

Encrypted-side column        Description                                 Nearest NFStream counterpart
matched_packets              Matched packet pairs assigned to flow       bidirectional_packets
outer_bytes                  Sum of outer padded lengths                 bidirectional_bytes
first_matched_time_ms        Earliest inner time of assigned packets     bidirectional_first_seen_ms
last_matched_time_ms         Latest inner time of assigned packets       bidirectional_last_seen_ms
outer_duration_ms            Matched duration on inner timeline          bidirectional_duration_ms
outer_first_matched_time_ms  Earliest outer time of assigned packets     bidirectional_first_seen_ms (inner-side equivalent)
outer_last_matched_time_ms   Latest outer time of assigned packets       bidirectional_last_seen_ms (inner-side equivalent)
outer_capture_duration_ms    Matched duration on outer timeline          bidirectional_duration_ms (outer-side equivalent)
mean_outer_pkt_size          Mean outer padded length                    bidirectional_mean_ps
std_outer_pkt_size           Std dev of outer padded length              bidirectional_stddev_ps
outer_packet_rate            Matched packets per second                  bidirectional_packets / (bidirectional_duration_ms / 1000)
outer_byte_rate              Outer bytes per second                      bidirectional_bytes / (bidirectional_duration_ms / 1000)
outer_bytes_in               Outer padded length sum, inbound            dst2src_bytes
outer_bytes_out              Outer padded length sum, outbound           src2dst_bytes
mean_size_ratio              Mean of outer_padded_length / inner_length  No direct counterpart (cross-view ratio)
std_size_ratio               Std dev of size ratio                       No direct counterpart
max_size_ratio               Maximum size ratio                          No direct counterpart
outer_min_piat_ms_in         Min inter-arrival time, inbound             dst2src_min_piat_ms
outer_mean_piat_ms_in        Mean inter-arrival time, inbound            dst2src_mean_piat_ms
outer_stddev_piat_ms_in      Std dev of inter-arrival time, inbound      dst2src_stddev_piat_ms
outer_max_piat_ms_in         Max inter-arrival time, inbound             dst2src_max_piat_ms
outer_min_piat_ms_out        Min inter-arrival time, outbound            src2dst_min_piat_ms
outer_mean_piat_ms_out       Mean inter-arrival time, outbound           src2dst_mean_piat_ms
outer_stddev_piat_ms_out     Std dev of inter-arrival time, outbound     src2dst_stddev_piat_ms
outer_max_piat_ms_out        Max inter-arrival time, outbound            src2dst_max_piat_ms
outer_min_piat_ms            Min inter-arrival time, pooled              bidirectional_min_piat_ms
outer_mean_piat_ms           Mean inter-arrival time, pooled             bidirectional_mean_piat_ms
outer_stddev_piat_ms         Std dev of inter-arrival time, pooled       bidirectional_stddev_piat_ms
outer_max_piat_ms            Max inter-arrival time, pooled              bidirectional_max_piat_ms
outer_splt_direction         First 255 outer packet directions           splt_direction
outer_splt_ps                First 255 outer packet sizes (bytes)        splt_ps
outer_splt_piat_ms           First 255 outer packet PIAT values (ms)     splt_piat_ms
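
To make the definitions above concrete, the sketch below derives a subset of these columns from a list of matched packets belonging to one flow. It is not the released flow_matching.py. The per-packet record layout is hypothetical, and several choices are assumptions: population standard deviation, inter-arrival times taken from outer timestamps, and rates divided by the duration on the outer timeline.

```python
from statistics import mean, pstdev


def piat_stats(times_ms):
    """Min/mean/stddev/max of gaps between consecutive timestamps (ms)."""
    ordered = sorted(times_ms)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        return {"min": 0.0, "mean": 0.0, "stddev": 0.0, "max": 0.0}
    return {"min": min(gaps), "mean": mean(gaps), "stddev": pstdev(gaps), "max": max(gaps)}


def aggregate(packets):
    """Aggregate matched packets of one flow into encrypted-side features.

    packets: non-empty list of dicts with the hypothetical keys outer_time_ms,
    inner_length, outer_padded_length, and direction ('in' or 'out').
    """
    outer_sizes = [p["outer_padded_length"] for p in packets]
    ratios = [p["outer_padded_length"] / p["inner_length"] for p in packets]
    outer_times = [p["outer_time_ms"] for p in packets]
    duration_ms = max(outer_times) - min(outer_times)
    seconds = duration_ms / 1000
    features = {
        "matched_packets": len(packets),
        "outer_bytes": sum(outer_sizes),
        "outer_capture_duration_ms": duration_ms,
        "mean_outer_pkt_size": mean(outer_sizes),
        "std_outer_pkt_size": pstdev(outer_sizes),
        # Rates are undefined for zero-duration flows; 0.0 is used as a placeholder.
        "outer_packet_rate": len(packets) / seconds if seconds > 0 else 0.0,
        "outer_byte_rate": sum(outer_sizes) / seconds if seconds > 0 else 0.0,
        "outer_bytes_in": sum(p["outer_padded_length"] for p in packets if p["direction"] == "in"),
        "outer_bytes_out": sum(p["outer_padded_length"] for p in packets if p["direction"] == "out"),
        "mean_size_ratio": mean(ratios),
        "std_size_ratio": pstdev(ratios),
        "max_size_ratio": max(ratios),
    }
    # Pooled inter-arrival statistics over both directions.
    for name, value in piat_stats(outer_times).items():
        features[f"outer_{name}_piat_ms"] = value
    return features
```

The per-direction inter-arrival columns would follow the same pattern by calling piat_stats on the inbound and outbound timestamps separately.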

Processing code

The code directory contains the scripts used to produce the dataset from raw PCAP captures. These are provided for transparency and reproducibility.

packet_matching.py

Matches inner-side pre-tunnel packets to outer-side encrypted WireGuard transport data packets using time alignment and a padded-length consistency rule.

Requirements: Python 3.8+, tshark (Wireshark CLI)


python code/packet_matching.py \
    --inner inner_capture.pcap \
    --outer outer_capture.pcap \
    --time-tolerance 15 \
    --output packet_matches.csv
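
The idea behind a padded-length rule can be illustrated as follows. A WireGuard transport data message carries a 16-byte header and a 16-byte authentication tag, and the inner packet is zero-padded to a multiple of 16 bytes before encryption, so the expected encrypted message length can be predicted from the inner packet length. The sketch combines that prediction with a time window and a greedy nearest-in-time choice. It is not the released packet_matching.py: the record layout, the greedy strategy, and the use of milliseconds for the tolerance are assumptions, and MTU-related limits on padding are ignored.

```python
WG_HEADER_BYTES = 16    # type + reserved (4), receiver index (4), counter (8)
WG_AUTH_TAG_BYTES = 16  # Poly1305 authentication tag
WG_PAD_MULTIPLE = 16    # plaintext is padded to a multiple of 16 bytes


def expected_wg_message_length(inner_ip_length):
    """Predicted WireGuard transport message length for an inner IP packet."""
    padded = -(-inner_ip_length // WG_PAD_MULTIPLE) * WG_PAD_MULTIPLE  # ceiling
    return WG_HEADER_BYTES + padded + WG_AUTH_TAG_BYTES


def is_candidate_match(inner, outer, tolerance_ms):
    """inner: (time_ms, ip_length); outer: (time_ms, wg_message_length)."""
    close_in_time = abs(outer[0] - inner[0]) <= tolerance_ms
    return close_in_time and outer[1] == expected_wg_message_length(inner[1])


def match_packets(inner_packets, outer_packets, tolerance_ms):
    """Greedy one-to-one matching: each inner packet takes the nearest unused
    outer candidate in time. Returns a list of (inner_index, outer_index)."""
    used = set()
    pairs = []
    for i, inner in enumerate(inner_packets):
        best = None
        for j, outer in enumerate(outer_packets):
            if j in used or not is_candidate_match(inner, outer, tolerance_ms):
                continue
            gap = abs(outer[0] - inner[0])
            if best is None or gap < best[0]:
                best = (gap, j)
        if best is not None:
            used.add(best[1])
            pairs.append((i, best[1]))
    return pairs
```

The nested loop is quadratic and only suitable for illustration; a real implementation over long captures would restrict the search to a sliding time window.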

flow_matching.py

Assigns matched packets to NFStream-exported flows using 5-tuple keys with temporal and capacity constraints, then aggregates encrypted-side statistics per flow.

Requirements: Python 3.8+, pandas, numpy

python code/flow_matching.py \
    --packets packet_matches.csv \
    --flows nfstream_inner_flow.csv \
    --output session_flows
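
One possible reading of "5-tuple keys with temporal and capacity constraints" is sketched below: a packet is assigned to a flow when its key equals the flow's forward or reverse key, its inner timestamp falls inside the flow's time window, and the flow has not already received as many packets as NFStream counted for it. This interpretation, the dictionary layout, and the first-fit order are assumptions; the released flow_matching.py may resolve these cases differently.

```python
from collections import defaultdict


def assign_packets(flows, packets):
    """Assign matched packets to flows; returns {flow_id: [packet indices]}.

    flows: dicts with flow_id, k5_fwd, k5_rev, flow_start_ms, flow_end_ms,
           bidirectional_packets.
    packets: dicts with key (a 5-tuple key) and inner_time_ms.
    """
    # Index flows by both keys so packets in either direction find them.
    by_key = defaultdict(list)
    for flow in flows:
        by_key[flow["k5_fwd"]].append(flow)
        by_key[flow["k5_rev"]].append(flow)

    # Capacity constraint: a flow accepts at most bidirectional_packets packets.
    remaining = {f["flow_id"]: f["bidirectional_packets"] for f in flows}
    assigned = defaultdict(list)
    for index, packet in enumerate(packets):
        for flow in by_key.get(packet["key"], []):
            in_window = flow["flow_start_ms"] <= packet["inner_time_ms"] <= flow["flow_end_ms"]
            if in_window and remaining[flow["flow_id"]] > 0:
                remaining[flow["flow_id"]] -= 1
                assigned[flow["flow_id"]].append(index)
                break  # first flow that satisfies both constraints wins
    return dict(assigned)
```

Packets that satisfy no flow stay unassigned, which is one way matching coverage below 100% can arise.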

Dataset generation pipeline

The dataset was produced through a three-phase pipeline:

  1. Traffic capture: Paired PCAP captures were taken on the WireGuard tunnel interface (inner, pre-tunnel) and via an inline network TAP (outer, encrypted), with NIC offloads disabled and nanosecond timestamp precision enabled.
  2. PCAP cleaning: Inner captures were filtered to retain only TCP and UDP packets and to remove non-initial IPv4 fragments.
  3. Matching, flow export, and aggregation:
    • packet_matching.py — links inner packets to outer WireGuard transport data packets
    • NFStream — exports flow records from cleaned inner PCAPs with application labeling via nDPI
    • flow_matching.py — assigns matched packets to flows and aggregates encrypted-side features
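
The cleaning rule in step 2 can be expressed directly on IPv4 header bytes. The sketch below is illustrative only: the tools used for cleaning are not named in this record, and the function covers IPv4 alone, raising an error for other IP versions rather than guessing how they were handled.

```python
import struct

TCP, UDP = 6, 17  # IANA protocol numbers


def keep_ipv4_packet(ip_header):
    """Return True for TCP/UDP IPv4 packets that are not non-initial fragments.

    ip_header: bytes starting at the IPv4 header (at least 20 bytes).
    """
    if len(ip_header) < 20:
        raise ValueError("truncated IPv4 header")
    if ip_header[0] >> 4 != 4:
        raise ValueError("not an IPv4 packet; this sketch handles IPv4 only")
    flags_and_offset = struct.unpack("!H", ip_header[6:8])[0]
    fragment_offset = flags_and_offset & 0x1FFF  # lower 13 bits, in 8-byte units
    protocol = ip_header[9]
    # An initial fragment has offset 0 and is kept even if More Fragments is set.
    return protocol in (TCP, UDP) and fragment_offset == 0
```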

Measurement topology

Traffic was captured on a residential broadband connection in Budapest, Hungary. A GL.iNet Flint 2 (GL-MT6000) router served as the WireGuard VPN client gateway. A network TAP was placed inline between this router and the ISP router, mirroring encrypted tunnel traffic to a Linux capture host.

This release differs from the previous version only in file format: all intermediate outputs (NFStream flow exports and packet-matching results) are now distributed as Parquet rather than CSV, reducing repository size while preserving identical underlying data.

Citation

If you use this dataset, please cite:

Razooqi, Y. S., & Pekar, A. (2026). A flow-level dataset of WireGuard tunnel traffic with matched encrypted-side features and application labels. Data in Brief, 112696. https://doi.org/10.1016/j.dib.2026.112696

Razooqi, Y. S., & Pekar, A. (2026). VPN-nonVPN-Dataset (v3.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.18945858

Reuse terms

The Parquet data files are licensed under CC BY 4.0 (see LICENSE/Data_LICENSE.txt), and the scripts are licensed under the MIT License (see LICENSE/Code_LICENSE.txt).

 

Files

Name                    Size    Checksum
VPN-nonVPN-Dataset.zip  1.3 GB  md5:795f2b09f78bd034ab33c6ae690c39a8

Additional details

Dates

Collected
2025-12-01

Software

Repository URL
https://zenodo.org/records/18700746
Programming language
Python