VPN-nonVPN-Dataset
Authors/Creators
Description
A flow-level dataset of WireGuard tunnel traffic with matched encrypted-side features and application labels
A flow-level dataset derived from paired captures on both sides of a WireGuard VPN tunnel. Each flow combines NFStream-exported pre-tunnel attributes (including application labels from deep packet inspection) with encrypted-side statistics computed from matched WireGuard transport packets.
Dataset summary
| Property | Value |
|---|---|
| Capture sessions | 2 |
| Total duration | ~80 hours |
| Total flows | 226,454 (122,975 + 103,479) |
| Columns per flow | 126 |
| Unique application names | 81 / 75 per session |
| Unique application categories | 20 / 19 per session |
| Protocol split | ~47% TCP, ~53% UDP |
| Capture location | Budapest, Hungary |
| VPN endpoint | Surfshark WireGuard, Prague, Czech Republic |
| Active devices | 10 |
Repository structure
VPN-nonVPN-Dataset
- Data
  - session_1
    - session1_flows.parquet: Session 1 flow records (122,975 rows)
    - session1_nfstream_inner_flows.parquet: Session 1 inner-side NFStream flow export
    - session1_packet_matches.parquet: Session 1 matched inner–outer packet pairs
  - session_2
    - session2_flows.parquet: Session 2 flow records (103,479 rows)
    - session2_nfstream_inner_flows.parquet: Session 2 inner-side NFStream flow export
    - session2_packet_matches.parquet: Session 2 matched inner–outer packet pairs
- Code
  - packet_matching.py: Phase 3 inner–outer packet matching
  - flow_matching.py: Phase 3 packet-to-flow assignment and aggregation
  - compress_parquets.py: Converts CSV files to Parquet and compresses them with ZSTD to reduce file size
- LICENSE
  - Data_LICENSE.txt: CC BY 4.0 license for data files
  - Code_LICENSE.txt: MIT License for scripts
- Validation
  - Dataset_summary_statistics.ipynb: Reproduces dataset summary statistics and manuscript tables from released data
  - Matching_coverage.ipynb: Reproduces matching coverage results and figures
  - Worked_example_random_row.csv: Worked example flow record exported as a column name/value table
  - checksums_sha256.txt: SHA-256 hashes for raw PCAP inputs and released Parquet files
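The checksum file can be used to confirm that a download is intact. The sketch below assumes checksums_sha256.txt follows the common `sha256sum` layout (hex digest, whitespace, relative path). That layout is an assumption rather than something this record states, and the helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text):
    """Parse lines of the assumed form '<hex digest>  <relative path>'."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, name = line.split(maxsplit=1)
        # sha256sum marks binary-mode entries with a leading '*'.
        expected[name.lstrip("*")] = digest.lower()
    return expected


def verify(root, checksum_text):
    """Map each listed file to True (match), False (mismatch) or None (missing)."""
    results = {}
    for name, digest in parse_checksums(checksum_text).items():
        target = Path(root) / name
        results[name] = sha256_of(target) == digest if target.is_file() else None
    return results
```

Calling `verify` with the extracted archive root and the text of the checksum file returns one entry per listed file. Raw PCAP inputs that are not part of the download would show up as `None`.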
Data files
The two flow files (session1_flows.parquet and session2_flows.parquet) share the same 126-column schema. Each row corresponds to one bidirectional flow.
| File | Rows | Capture date | Duration |
|---|---|---|---|
| session1_flows.parquet | 122,975 | 2025-12-01 | ~48 h |
| session2_flows.parquet | 103,479 | 2025-12-04 | ~32 h |
Column schema
Each flow record contains 126 columns organized into three groups.
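As a quick sanity check after loading a flow file, the column list can be split into the three groups by position. This is a minimal sketch that assumes the column order in the files follows the 1-based ranges given in the subsection headings; the function and dictionary names are hypothetical.

```python
# Column ranges taken from the subsection headings (1-based, inclusive).
GROUPS = {
    "nfstream": (1, 89),
    "flow_assignment": (90, 94),
    "encrypted_side": (95, 126),
}


def split_columns(columns):
    """Split an ordered list of column names into the three documented groups."""
    columns = list(columns)
    if len(columns) != 126:
        raise ValueError(f"expected 126 columns, got {len(columns)}")
    # Convert each 1-based inclusive range into a Python slice.
    return {name: columns[lo - 1:hi] for name, (lo, hi) in GROUPS.items()}
```

With pandas installed, the input could be the `columns` attribute of a frame returned by `pd.read_parquet` for one of the flow files.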
NFStream flow attributes (columns 1-89)
Standard flow fields exported by NFStream, including:
- Identifiers: id, expiration_id, src_ip, dst_ip, src_port, dst_port, protocol, ip_version
- Timing: bidirectional_first_seen_ms, bidirectional_last_seen_ms, bidirectional_duration_ms, and per-direction equivalents (src2dst_*, dst2src_*)
- Volume: bidirectional_packets, bidirectional_bytes, and per-direction equivalents
- Packet size statistics: bidirectional_min_ps, bidirectional_mean_ps, bidirectional_stddev_ps, bidirectional_max_ps, and per-direction equivalents
- Inter-arrival time statistics: bidirectional_min_piat_ms, bidirectional_mean_piat_ms, bidirectional_stddev_piat_ms, bidirectional_max_piat_ms, and per-direction equivalents
- TCP flags: per-flag packet counts (SYN, ACK, FIN, RST, PSH, URG, CWR, ECE) for bidirectional and per-direction
- SPLT arrays (first 255 packets per flow):
  - splt_direction: packet direction sequence
  - splt_ps: packet size sequence (bytes)
  - splt_piat_ms: inter-arrival time sequence (milliseconds)
- Application labels:
  - application_name: nDPI-assigned application name (e.g., DNS, TLS.Facebook, QUIC)
  - application_category_name: nDPI-assigned category (e.g., Web, Network, Chat)
  - application_is_guessed, application_confidence
- DPI metadata: requested_server_name, client_fingerprint, server_fingerprint, user_agent, content_type
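Flows shorter than 255 packets leave unused SPLT slots. The sketch below assumes NFStream's padding convention of -1 for empty slots and that the three arrays have already been decoded into Python lists of numbers; how the arrays are serialized in the Parquet files is not stated here, so the decoding step is left out. The function names are hypothetical.

```python
def trim_splt(directions, sizes, piats_ms, pad=-1):
    """Drop padded slots and return (direction, size, piat_ms) tuples."""
    return [
        (d, s, t)
        for d, s, t in zip(directions, sizes, piats_ms)
        if d != pad  # assumption: a padded slot carries -1 in the direction array
    ]


def bytes_per_direction(packets):
    """Sum packet sizes over the retained SPLT prefix, keyed by direction code."""
    totals = {}
    for direction, size, _ in packets:
        totals[direction] = totals.get(direction, 0) + size
    return totals
```

The same helpers apply unchanged to the outer_splt_* arrays if those use the same padding value, which should be checked against the data before relying on it.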
Flow assignment metadata (columns 90-94)
- flow_id — flow identifier
- flow_start_ms, flow_end_ms — flow time boundaries in milliseconds
- k5_fwd, k5_rev — forward and reverse 5-tuple keys
Encrypted-side derived features (columns 95-126)
Statistics computed from matched WireGuard tunnel packets. The table below lists the encrypted-side features added to each flow and maps each one to its closest NFStream attribute. This mapping supports structured comparison between the two views and feature selection for learning tasks that rely solely on tunnel-side statistics.
| Encrypted-side column | Description | Nearest NFStream counterpart |
|---|---|---|
| matched_packets | Matched packet pairs assigned to flow | bidirectional_packets |
| outer_bytes | Sum of outer padded lengths | bidirectional_bytes |
| first_matched_time_ms | Earliest inner time of assigned packets | bidirectional_first_seen_ms |
| last_matched_time_ms | Latest inner time of assigned packets | bidirectional_last_seen_ms |
| outer_duration_ms | Matched duration on inner timeline | bidirectional_duration_ms |
| outer_first_matched_time_ms | Earliest outer time of assigned packets | bidirectional_first_seen_ms (inner-side equivalent) |
| outer_last_matched_time_ms | Latest outer time of assigned packets | bidirectional_last_seen_ms (inner-side equivalent) |
| outer_capture_duration_ms | Matched duration on outer timeline | bidirectional_duration_ms (outer-side equivalent) |
| mean_outer_pkt_size | Mean outer padded length | bidirectional_mean_ps |
| std_outer_pkt_size | Std dev of outer padded length | bidirectional_stddev_ps |
| outer_packet_rate | Matched packets per second | bidirectional_packets / (bidirectional_duration_ms / 1000) |
| outer_byte_rate | Outer bytes per second | bidirectional_bytes / (bidirectional_duration_ms / 1000) |
| outer_bytes_in | Outer padded length sum, inbound | dst2src_bytes |
| outer_bytes_out | Outer padded length sum, outbound | src2dst_bytes |
| mean_size_ratio | Mean of outer_padded_length / inner_length | No direct counterpart (cross-view ratio) |
| std_size_ratio | Std dev of size ratio | No direct counterpart |
| max_size_ratio | Maximum size ratio | No direct counterpart |
| outer_min_piat_ms_in | Min inter-arrival time, inbound | dst2src_min_piat_ms |
| outer_mean_piat_ms_in | Mean inter-arrival time, inbound | dst2src_mean_piat_ms |
| outer_stddev_piat_ms_in | Std dev of inter-arrival time, inbound | dst2src_stddev_piat_ms |
| outer_max_piat_ms_in | Max inter-arrival time, inbound | dst2src_max_piat_ms |
| outer_min_piat_ms_out | Min inter-arrival time, outbound | src2dst_min_piat_ms |
| outer_mean_piat_ms_out | Mean inter-arrival time, outbound | src2dst_mean_piat_ms |
| outer_stddev_piat_ms_out | Std dev of inter-arrival time, outbound | src2dst_stddev_piat_ms |
| outer_max_piat_ms_out | Max inter-arrival time, outbound | src2dst_max_piat_ms |
| outer_min_piat_ms | Min inter-arrival time, pooled | bidirectional_min_piat_ms |
| outer_mean_piat_ms | Mean inter-arrival time, pooled | bidirectional_mean_piat_ms |
| outer_stddev_piat_ms | Std dev of inter-arrival time, pooled | bidirectional_stddev_piat_ms |
| outer_max_piat_ms | Max inter-arrival time, pooled | bidirectional_max_piat_ms |
| outer_splt_direction | First 255 outer packet directions | splt_direction |
| outer_splt_ps | First 255 outer packet sizes (bytes) | splt_ps |
| outer_splt_piat_ms | First 255 outer packet PIAT values (ms) | splt_piat_ms |
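The NFStream counterparts of the two rate features are expressions rather than stored columns, and the size ratios are defined per matched packet pair. The sketch below shows one way to derive them for comparison. It takes a flow as a plain dict, the output key names are hypothetical, and returning `None` for zero-duration flows or empty inputs is an assumption about how such cases should be treated, not the authors' rule.

```python
from statistics import fmean


def inner_rates(flow):
    """Compute the NFStream-side rate counterparts for one flow record (a dict)."""
    seconds = flow["bidirectional_duration_ms"] / 1000
    if seconds <= 0:
        # Assumption: zero-duration flows have no defined rate.
        return {"inner_packet_rate": None, "inner_byte_rate": None}
    return {
        "inner_packet_rate": flow["bidirectional_packets"] / seconds,
        "inner_byte_rate": flow["bidirectional_bytes"] / seconds,
    }


def size_ratio_summary(pairs):
    """Mean and max of outer_padded_length / inner_length.

    pairs: iterable of (inner_length, outer_padded_length) per matched packet.
    """
    ratios = [outer / inner for inner, outer in pairs if inner > 0]
    if not ratios:
        return {"mean_size_ratio": None, "max_size_ratio": None}
    return {"mean_size_ratio": fmean(ratios), "max_size_ratio": max(ratios)}
```

The standard deviation is left out on purpose, since the record does not say whether the population or sample form was used.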
Processing code
The code directory contains the scripts used to produce the dataset from raw PCAP captures. These are provided for transparency and reproducibility.
packet_matching.py
Matches inner-side pre-tunnel packets to outer-side encrypted WireGuard transport data packets using time alignment and a padded-length consistency rule.
Requirements: Python 3.8+, tshark (Wireshark CLI)
    python code/packet_matching.py \
        --inner inner_capture.pcap \
        --outer outer_capture.pcap \
        --time-tolerance 15 \
        --output packet_matches.csv
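To illustrate the idea of matching by time alignment and padded length, the following is a simplified sketch, not the logic of packet_matching.py. It relies on the WireGuard transport data format (16-byte header, plaintext padded to a multiple of 16 bytes, 16-byte authentication tag) and ignores the MTU cap on padding. The greedy nearest-in-time strategy, the millisecond unit of the tolerance, and all function names are assumptions made for the example.

```python
WG_HEADER = 16    # type, reserved, receiver index, counter
WG_AUTH_TAG = 16  # Poly1305 tag
WG_PAD = 16       # plaintext is zero-padded to a multiple of 16 bytes


def expected_wg_length(inner_len):
    """Length of the WireGuard transport message carrying an inner IP packet."""
    padded = -(-inner_len // WG_PAD) * WG_PAD  # round up to a multiple of 16
    return WG_HEADER + padded + WG_AUTH_TAG


def match_packets(inner, outer, tolerance_ms):
    """Greedy one-to-one matching (quadratic, for illustration only).

    inner: list of (time_ms, ip_length)
    outer: list of (time_ms, wireguard_message_length)
    Returns a list of (inner_index, outer_index) pairs.
    """
    used = set()
    matches = []
    # Walk inner packets in time order.
    for i, (t_in, length) in sorted(enumerate(inner), key=lambda item: item[1][0]):
        want = expected_wg_length(length)
        best = None
        for j, (t_out, wg_len) in enumerate(outer):
            if j in used or wg_len != want:
                continue  # already taken, or fails the length rule
            gap = abs(t_out - t_in)
            if gap <= tolerance_ms and (best is None or gap < best[0]):
                best = (gap, j)
        if best is not None:
            used.add(best[1])
            matches.append((i, best[1]))
    return matches
```

A production matcher would also need to handle traffic direction and clock offset between the two capture points, which this sketch leaves out.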
flow_matching.py
Assigns matched packets to NFStream-exported flows using 5-tuple keys with temporal and capacity constraints, then aggregates encrypted-side statistics per flow.
Requirements: Python 3.8+, pandas, numpy
    python code/flow_matching.py \
        --packets packet_matches.csv \
        --flows nfstream_inner_flow.csv \
        --output session_flows
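The assignment step can be pictured as follows. This is a minimal sketch under stated assumptions, not the behaviour of flow_matching.py: 5-tuple keys are taken to be hashable values comparable with k5_fwd and k5_rev, the temporal constraint is read as "packet time falls inside the flow window", and the capacity constraint is read as "a flow receives at most bidirectional_packets matched packets". The packet field names ('k5', 'inner_time_ms', 'outer_len') are hypothetical.

```python
from collections import defaultdict


def assign_and_aggregate(packets, flows):
    """Assign matched packets to flows and aggregate two encrypted-side totals.

    packets: dicts with 'k5', 'inner_time_ms', 'outer_len'
    flows:   dicts with 'flow_id', 'k5_fwd', 'k5_rev', 'flow_start_ms',
             'flow_end_ms', 'bidirectional_packets'
    """
    # Index flows under both directions of their 5-tuple.
    by_key = defaultdict(list)
    for flow in flows:
        by_key[flow["k5_fwd"]].append(flow)
        by_key[flow["k5_rev"]].append(flow)

    stats = {f["flow_id"]: {"matched_packets": 0, "outer_bytes": 0} for f in flows}
    for pkt in sorted(packets, key=lambda p: p["inner_time_ms"]):
        for flow in by_key.get(pkt["k5"], []):
            agg = stats[flow["flow_id"]]
            in_window = flow["flow_start_ms"] <= pkt["inner_time_ms"] <= flow["flow_end_ms"]
            has_room = agg["matched_packets"] < flow["bidirectional_packets"]
            if in_window and has_room:
                agg["matched_packets"] += 1
                agg["outer_bytes"] += pkt["outer_len"]
                break  # each packet goes to at most one flow
    return stats
```

Packets that fit no flow window, or that arrive after a flow has reached its capacity, stay unassigned in this sketch.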
Dataset generation pipeline
The dataset was produced through a three-phase pipeline:
- Traffic capture: Paired PCAP captures were taken on the WireGuard tunnel interface (inner, pre-tunnel) and via an inline network TAP (outer, encrypted), with NIC offloads disabled and nanosecond timestamp precision enabled.
- PCAP cleaning: Inner captures were filtered to retain only TCP and UDP packets and to remove non-initial IPv4 fragments.
- Matching, flow export, and aggregation:
  - packet_matching.py: links inner packets to outer WireGuard transport data packets
  - NFStream: exports flow records from cleaned inner PCAPs with application labeling via nDPI
  - flow_matching.py: assigns matched packets to flows and aggregates encrypted-side features
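The PCAP cleaning rule in the second phase can be expressed as a small predicate. The record does not say which tool or filter expression was used, so this is only a sketch of the stated rule; the function name and argument layout are hypothetical.

```python
TCP, UDP = 6, 17  # IANA protocol numbers


def keep_packet(ip_version, protocol, fragment_offset=0):
    """Return True if a packet survives the cleaning rule described above."""
    if protocol not in (TCP, UDP):
        return False  # only TCP and UDP are retained
    if ip_version == 4 and fragment_offset > 0:
        return False  # non-initial IPv4 fragment
    return True
```

An initial IPv4 fragment (offset 0) is kept under this reading, since only non-initial fragments are removed.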
Measurement topology
Traffic was captured on a residential broadband connection in Budapest, Hungary. A GL.iNet Flint 2 (GL-MT6000) router served as the WireGuard VPN client gateway. A network TAP was placed inline between the router and the ISP router, mirroring encrypted tunnel traffic to a Linux capture host.
This release differs from the previous version only in file format: all intermediate outputs (NFStream flow exports and packet-matching results) are now distributed as Parquet rather than CSV, reducing repository size while preserving identical underlying data.
Citation
If you use this dataset, please cite:
Razooqi, Y. S., & Pekar, A. (2026). A flow-level dataset of WireGuard tunnel traffic with matched encrypted-side features and application labels. Data in Brief, 112696. https://doi.org/10.1016/j.dib.2026.112696
Razooqi, Y. S., & Pekar, A. (2026). VPN-nonVPN-Dataset (v3.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.18945858
Reuse terms
Parquet data are licensed under CC BY 4.0 (see LICENSE/Data_LICENSE.txt), and the scripts are licensed under the MIT License (see LICENSE/Code_LICENSE.txt).
Files
| Name | Size | Checksum |
|---|---|---|
| VPN-nonVPN-Dataset.zip | 1.3 GB | md5:795f2b09f78bd034ab33c6ae690c39a8 |
Additional details
Dates
- Collected: 2025-12-01
Software
- Repository URL: https://zenodo.org/records/18700746
- Programming language: Python