Published April 28, 2026 | Version v1
Dataset Open

UCSB netFound Pretraining Data Sampler

  • 1. University of California, Santa Barbara

Description

This is a sampler of the pretraining data for the netFound network foundation model. The data contains network packet headers in .arrow format and excludes payloads and IP addresses, as per the netFound preprocessing pipeline. It is intended to be used with the netFound tokenizer (or any derivative tokenizer).

Please see https://github.com/SNL-UCSB/netFound for usage instructions and the location of the full pretraining dataset.

Data facts:

  • Total collection time of the sampler: 2 hours
  • Total network flows in the sampler: 60,199,405

Files

Files (15.8 GB)

MD5 checksum                            Size
md5:b5a5c1c8316cb30a6c1750ff85921b00    2.0 GB
md5:bc2d13ef3a81c4a5ac04f01f58c68d6c    2.0 GB
md5:199f435a1d2d0cc105b55a2d7771ef16    2.0 GB
md5:fa40c8bc94a4fbf356051fa0551a5ddd    2.0 GB
md5:0728ddab77524fa4c794ff10773833de    2.0 GB
md5:fc3619ae4b80a61288c8f92fb7b24cac    2.0 GB
md5:f244a2efd43a3c5e51caff3dd8637083    2.0 GB
md5:21cbdfc040317273f69f826dcaf88759    2.0 GB
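
Since each file is listed with an MD5 checksum, a download can be checked for corruption before use. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and because the listing does not show file names, matching each local file to its checksum is left to the reader.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file against a listed value such as 'md5:b5a5c1c8...'."""
    # Accept the value with or without the 'md5:' prefix.
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A result of `False` from `matches_listed_md5` suggests a truncated or corrupted download, or that the file was paired with the wrong checksum. Reading in chunks keeps memory use flat even for the 2.0 GB files.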

Additional details

Software

Repository URL
https://github.com/SNL-UCSB/netFound
Programming language
Python
Development Status
Active