Published April 28, 2026 | Version v1
Dataset
Open
UCSB netFound Pretraining Data Sampler
Authors/Creators
Description
This is a sampler of the pretraining data for the netFound network foundation model. The data contains network packet headers in .arrow format and excludes payloads and IP addresses, as per the netFound preprocessing pipeline. It is intended to be used with the netFound tokenizer (or any derivative tokenizer).
See https://github.com/SNL-UCSB/netFound for usage instructions and the location of the full pretraining dataset.
Data facts:
- Total collection time of the sampler: 2 hours
- Total network flows in the sampler: 60,199,405
Files
(15.8 GB)
| Checksum | Size |
|---|---|
| md5:b5a5c1c8316cb30a6c1750ff85921b00 | 2.0 GB |
| md5:bc2d13ef3a81c4a5ac04f01f58c68d6c | 2.0 GB |
| md5:199f435a1d2d0cc105b55a2d7771ef16 | 2.0 GB |
| md5:fa40c8bc94a4fbf356051fa0551a5ddd | 2.0 GB |
| md5:0728ddab77524fa4c794ff10773833de | 2.0 GB |
| md5:fc3619ae4b80a61288c8f92fb7b24cac | 2.0 GB |
| md5:f244a2efd43a3c5e51caff3dd8637083 | 2.0 GB |
| md5:21cbdfc040317273f69f826dcaf88759 | 2.0 GB |
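Each file is listed with an MD5 checksum, so a download can be verified before use. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `matches_listing` are hypothetical, as is any local file name passed to them, since the file names are not shown in this listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_listing("downloaded.arrow", "md5:b5a5c1c8316cb30a6c1750ff85921b00")` returns `True` only if the local file (here given a placeholder name) hashes to the first checksum in the table. Reading in chunks keeps memory use flat even for 2.0 GB files.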
Additional details
Software
- Repository URL: https://github.com/SNL-UCSB/netFound
- Programming language: Python
- Development status: Active