MLNX Job Placement Failure Dataset for Simulated Datacenter Clusters with Reconfigurable Optical Networks
Overview
This dataset contains cluster-level snapshots and job placement outcomes generated using a simulated large-scale datacenter environment.
The data is intended for training and evaluating machine learning models that predict whether a job submission will succeed or fail given the current cluster state and job resource request.
The dataset was produced as part of the MLSysOps project (EU Horizon Europe) and supports research on:
- job admission control,
- failure prediction,
- resource fragmentation,
- and network feasibility in modern datacenter architectures.
Each data sample represents a single scheduling decision and includes both:
- detailed cluster state features, and
- the observed outcome of the placement attempt.
System Context
Simulated Datacenter Architecture
The dataset is generated using a proprietary datacenter simulator modeling a hierarchical cluster composed of Scalable Units (SUs).
Cluster configuration:
- 32 Scalable Units (SUs)
- 32 servers per SU (1024 servers total)
- 8 leaf switches per SU
- 8 GPUs per server
- Leaf switches interconnected via a reconfigurable optical circuit switch (OCS)
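The totals implied by this configuration can be checked with a few lines of arithmetic. The sketch below is illustrative only: the constant names are hypothetical (they are not taken from the simulator), and the GPU and leaf-switch totals are simply derived from the counts listed above.

```python
# Topology constants copied from the cluster configuration list.
NUM_SUS = 32
SERVERS_PER_SU = 32
LEAF_SWITCHES_PER_SU = 8
GPUS_PER_SERVER = 8

# Derived totals.
total_servers = NUM_SUS * SERVERS_PER_SU
total_leaf_switches = NUM_SUS * LEAF_SWITCHES_PER_SU
total_gpus = total_servers * GPUS_PER_SERVER

print(total_servers, total_leaf_switches, total_gpus)  # 1024 256 8192
```

The server and leaf-switch totals (1024 and 256) coincide with the lengths of the two vector features, f18_su_server_bitmap and f19_leaf_up.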
Failure Modes Captured
Each job placement attempt can result in:
- Successful placement
- Failure due to insufficient servers
- Failure due to insufficient or infeasible uplink connectivity
While server insufficiency can be determined via simple capacity checks, uplink infeasibility is more complex, as it depends on:
- current optical circuit configurations,
- contention between jobs,
- and connectivity constraints of the OCS fabric.
The dataset explicitly captures these outcomes to support learning-based approaches for failure prediction.
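To illustrate what a "simple capacity check" for server insufficiency might look like, here is a minimal sketch. It is not the simulator's logic. It assumes that a 1 in the server bitmap marks a server in use and that each requested node occupies one whole server; the function name `has_enough_servers` is hypothetical.

```python
import numpy as np

TOTAL_SERVERS = 1024  # 32 SUs x 32 servers per SU


def has_enough_servers(server_bitmap, requested_nodes):
    """Return True if the count of free servers covers the request.

    Assumptions (not confirmed by the dataset description):
    a 1 marks a used server, and one requested node needs one server.
    """
    bitmap = np.asarray(server_bitmap)
    if bitmap.shape != (TOTAL_SERVERS,):
        raise ValueError(f"expected a bitmap of length {TOTAL_SERVERS}")
    free = TOTAL_SERVERS - int(bitmap.sum())
    return free >= requested_nodes


# Toy cluster state: the first 1000 servers are busy, 24 are free.
bitmap = np.zeros(TOTAL_SERVERS, dtype=np.int8)
bitmap[:1000] = 1

print(has_enough_servers(bitmap, 16))  # True
print(has_enough_servers(bitmap, 32))  # False
```

Passing this check does not imply that a placement succeeds: a job can still fail because of uplink infeasibility, which a count of free servers cannot capture.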
Dataset Structure
- Format: Apache Parquet
- Granularity: One row per scheduling decision
- Each row contains:
  - Job request features
  - Cluster state features (scalar + vector)
  - Ground-truth placement outcome label
Rows are treated as independent samples.
Ground-Truth Labels
The dataset includes a label column, `l1_failed`, encoding the observed outcome of the job placement:
| Value | Meaning |
|---|---|
| `0` | Job placement succeeded |
| `1` | Job placement failed due to insufficient servers |
| `2` | Job placement failed due to insufficient uplinks / infeasible network connectivity |
Notes:
- Labels `1` and `2` both indicate job failure, but with different root causes.
- This encoding allows:
  - binary failure prediction,
  - failure cause analysis,
  - and future multi-class modeling.
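As a sketch of how the three-valued label can serve both binary and cause-level analysis, the snippet below derives two views from the `l1_failed` column. The derived column names (`failed`, `outcome`) and the strings in `LABEL_NAMES` are hypothetical choices for illustration, not part of the dataset.

```python
import pandas as pd

# Hypothetical readable names for the three label values.
LABEL_NAMES = {0: "success", 1: "failed_servers", 2: "failed_uplinks"}


def add_label_views(df, label_col="l1_failed"):
    """Return a copy of df with a binary target and a named outcome."""
    out = df.copy()
    # Binary target: any non-zero label counts as a failure.
    out["failed"] = (out[label_col] != 0).astype("int8")
    # Named outcome, convenient for failure cause analysis.
    out["outcome"] = out[label_col].map(LABEL_NAMES)
    return out


toy = pd.DataFrame({"l1_failed": [0, 1, 2, 0]})
views = add_label_views(toy)
print(views)
```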
Feature Description
Scalar Cluster Features
These features summarize utilization, imbalance, and fragmentation across the cluster:
| Column | Description |
|---|---|
| `f1_event_type` | The recorded event: add, failed_server, failed_uplink |
| `f2_mean_util` | Mean server utilization |
| `f3_diff_max_min_util` | Utilization imbalance across SUs |
| `f4_cv_util` | Coefficient of variation of server utilization |
| `f5_ratio_max_to_mean_workload` | Workload skew across SUs |
| `f6_mean_uplink_util` | Mean uplink utilization |
| `f7_diff_max_min_uplink_util` | Uplink utilization imbalance |
| `f8_cv_uplink_util` | Coefficient of variation of uplink utilization |
| `f9_mean_combined_util` | Combined compute and network utilization |
| `f10_resource_imbalance` | Compute vs network mismatch |
| `f11_bottleneck_ratio` | Network-to-compute utilization ratio |
| `f12_frag_spread_sus` | Fragmentation due to SU spread |
| `f13_frag_wasted` | Fragmentation due to wasted capacity |
| `f14_frag_su_sparseness` | Intra-SU sparseness |
| `f15_total_servers_used` | Total servers in use |
| `f16_total_sus_used` | Number of active SUs |
| `f17_total_uplink_utilized` | Total uplink usage |
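The exact formulas behind these scalar features are not given here. As a rough illustration of the kind of statistics involved, the sketch below computes a mean, a max-min spread, a coefficient of variation, and a max-to-mean ratio from a per-SU utilization vector. Everything in it is an assumption: the function name, the per-SU input, the use of the population standard deviation, and the mapping to the `f2` to `f5` columns.

```python
import numpy as np


def utilization_summary(per_su_util):
    """Summary statistics over a vector of per-SU utilizations.

    Illustrative only; not the simulator's definitions.
    """
    util = np.asarray(per_su_util, dtype=float)
    mean = util.mean()
    return {
        "mean_util": mean,
        "diff_max_min_util": util.max() - util.min(),
        # Population standard deviation (ddof=0) divided by the mean.
        "cv_util": util.std() / mean if mean > 0 else 0.0,
        "ratio_max_to_mean": util.max() / mean if mean > 0 else 0.0,
    }


# Extreme case: one fully used SU out of 32, all others idle.
util = np.zeros(32)
util[0] = 1.0
summary = utilization_summary(util)
print(summary)
```

For this extreme case the sketch yields a coefficient of variation of sqrt(31), roughly 5.568, and a max-to-mean ratio of 32. Both coincide with the maxima reported for `f4_cv_util` and `f5_ratio_max_to_mean_workload` in the statistical summary, which is suggestive but does not confirm the simulator's formulas.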
Vector Features
| Feature | Description |
|---|---|
| `f18_su_server_bitmap` | Binary vector (length 1024) indicating per-server usage |
| `f19_leaf_up` | Vector (length 256) indicating leaf switch uplink utilization |
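The vector lengths match the topology (32 SUs x 32 servers = 1024, and 32 SUs x 8 leaf switches = 256), so a per-SU view is a natural way to inspect them. The sketch below reshapes both vectors accordingly. The element ordering is an assumption: it treats the vectors as SU-major (the first 32 bitmap entries belong to SU 0, and so on), which the description does not confirm. The function name `per_su_views` is hypothetical.

```python
import numpy as np

NUM_SUS, SERVERS_PER_SU, LEAVES_PER_SU = 32, 32, 8


def per_su_views(server_bitmap, leaf_up):
    """Reshape the flat vectors into (SU, server) and (SU, leaf) arrays.

    Assumes SU-major ordering of the flat vectors.
    """
    servers = np.asarray(server_bitmap).reshape(NUM_SUS, SERVERS_PER_SU)
    leaves = np.asarray(leaf_up).reshape(NUM_SUS, LEAVES_PER_SU)
    return servers, leaves


# Toy state: under the assumed layout, SU 0 is full and SU 1 is half full.
bitmap = np.zeros(1024, dtype=np.int8)
bitmap[:48] = 1
leaf_up = np.zeros(256)

servers, leaves = per_su_views(bitmap, leaf_up)
used_per_su = servers.sum(axis=1)
print(used_per_su[:3].tolist())  # [32, 16, 0]
```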
Job Request Feature
| Column | Type | Description |
|---|---|---|
| `f20_requested_nodes` | int / float | Number of nodes requested by the job |
Data Collection Methodology
- Environment: Simulated datacenter
- Workloads: Synthetic job traces with varying sizes and arrival patterns
- Placement policy: Simulator-internal scheduling logic
- Labeling: Determined by placement outcome (success or failure cause)
The simulator executes job placement attempts under varying load, fragmentation, and network conditions to generate diverse training examples.
The simulator itself is not publicly released. Only the resulting dataset is provided.
Statistical Summary
The dataset contains a total of 1,062,943 rows, each corresponding to a single job placement attempt in the simulated cluster.
The table below summarizes the distribution of all numeric columns, including the ground-truth label.
Column Summary and Data Types
| Descriptor | Type | Count | Mean | Std | Min | 25% | 50% (Median) | 75% | Max |
|---|---|---|---|---|---|---|---|---|---|
| `l1_failed` | int32 | 1,062,943 | 0.6198 | 0.7127 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| `f2_mean_util` | float32 | 1,062,943 | 0.9129 | 0.1089 | 0.0078 | 0.8906 | 0.9404 | 0.9717 | 1.0 |
| `f3_diff_max_min_util` | float32 | 1,062,943 | 0.5843 | 0.3332 | 0.0 | 0.2813 | 0.5625 | 1.0 | 1.0 |
| `f4_cv_util` | float32 | 1,062,943 | 0.1982 | 0.3047 | 0.0 | 0.0712 | 0.1446 | 0.2500 | 5.5678 |
| `f5_ratio_max_to_mean_workload` | float32 | 1,062,943 | 1.1865 | 1.2299 | 1.0 | 1.0291 | 1.0633 | 1.1228 | 32.0 |
| `f6_mean_uplink_util` | float32 | 1,062,943 | 0.5637 | 0.1156 | 0.0 | 0.5195 | 0.5840 | 0.6367 | 0.9598 |
| `f7_diff_max_min_uplink_util` | float32 | 1,062,943 | 0.9016 | 0.1717 | 0.0 | 0.8125 | 0.9063 | 0.9688 | 2.0 |
| `f8_cv_uplink_util` | float32 | 1,062,943 | 0.4681 | 0.3022 | 0.0 | 0.3526 | 0.4258 | 0.5076 | 3.8730 |
| `f9_mean_combined_util` | float32 | 1,062,943 | 0.7383 | 0.1002 | 0.0039 | 0.7139 | 0.7563 | 0.7915 | 0.9716 |
| `f10_resource_imbalance` | float32 | 1,062,943 | 0.3493 | 0.1010 | 0.0001 | 0.2803 | 0.3350 | 0.4043 | 0.8926 |
| `f11_bottleneck_ratio` | float32 | 1,062,943 | 0.6137 | 0.1137 | 0.0 | 0.5636 | 0.6345 | 0.6894 | 1.7378 |
| `f12_frag_spread_sus` | float32 | 1,062,943 | 1.0713 | 0.0818 | 1.0 | 1.0261 | 1.0524 | 1.0922 | 4.0 |
| `f13_frag_wasted` | float32 | 1,062,943 | 0.0713 | 0.0818 | 0.0 | 0.0261 | 0.0524 | 0.0922 | 3.0 |
| `f14_frag_su_sparseness` | float32 | 1,062,943 | 0.0177 | 0.0176 | 0.0 | 0.0065 | 0.0135 | 0.0235 | 0.2589 |
| `f15_total_servers_used` | int64 | 1,062,943 | 934.86 | 111.47 | 8 | 912 | 963 | 995 | 1024 |
| `f16_total_sus_used` | int64 | 1,062,943 | 31.11 | 3.10 | 1 | 31 | 32 | 32 | 32 |
| `f17_total_uplink_utilized` | int64 | 1,062,943 | 4617.66 | 946.63 | 0 | 4256 | 4784 | 5216 | 7863 |
| `f20_requested_nodes` | int64 | 1,062,943 | 54.96 | 38.77 | 8 | 20 | 44 | 87 | 128 |
Ground-truth label distribution note:
The `l1_failed` column encodes job outcomes as:
- `0`: success
- `1`: failure due to insufficient servers
- `2`: failure due to insufficient uplinks / infeasible connectivity
Both 1 and 2 correspond to job failures.
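The summary table reports only the mean and quartiles of `l1_failed`, not per-class counts. A sketch for obtaining counts and shares from a loaded DataFrame follows; the function name `label_distribution` is hypothetical, and the toy frame stands in for the real data, so its numbers say nothing about the actual class balance.

```python
import pandas as pd


def label_distribution(df, label_col="l1_failed"):
    """Per-class row counts and shares for the outcome label."""
    counts = df[label_col].value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Toy stand-in for the real dataset.
toy = pd.DataFrame({"l1_failed": [0, 0, 1, 2]})
dist = label_distribution(toy)
print(dist)
```

On the real data, the same call applied to the DataFrame returned by `pd.read_parquet` gives the class balance needed to choose evaluation metrics.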
Working with the Data
Loading the Dataset (Python)
```python
import pandas as pd

df = pd.read_parquet("final_merged.parquet")
print(df.head())
```
Loading Selected Columns
```python
import pandas as pd
# Load only the listed columns
cols = ["f20_requested_nodes", "f2_mean_util", "l1_failed"]
df = pd.read_parquet("final_merged.parquet", columns=cols)
```
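Once the data is loaded, a common next step is to separate the features from the target. The sketch below builds a scalar feature table and a binary target. It rests on two assumptions: that `f1_event_type` mirrors the placement outcome (its listed values are add, failed_server, failed_uplink) and should therefore be excluded to avoid label leakage, and that the vector columns are handled separately. The names `make_binary_xy` and `EXCLUDED` are hypothetical.

```python
import pandas as pd

LABEL_COL = "l1_failed"
# Assumption: f1_event_type records the outcome itself, so keeping it
# would leak the label. The two vector columns are set aside here.
EXCLUDED = ["f1_event_type", "f18_su_server_bitmap", "f19_leaf_up"]


def make_binary_xy(df):
    """Split df into scalar features X and a binary failure target y."""
    drop = [c for c in EXCLUDED + [LABEL_COL] if c in df.columns]
    X = df.drop(columns=drop)
    y = (df[LABEL_COL] != 0).astype("int8")
    return X, y


# Toy stand-in with a subset of the documented columns.
toy = pd.DataFrame({
    "f1_event_type": ["add", "failed_server"],
    "f2_mean_util": [0.50, 0.99],
    "f20_requested_nodes": [8, 64],
    "l1_failed": [0, 1],
})
X, y = make_binary_xy(toy)
print(list(X.columns), y.tolist())
```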
Tools and Documentation
- Apache Parquet specification: https://parquet.apache.org/docs
- Pandas Parquet I/O: https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html
- PyArrow Parquet support: https://arrow.apache.org/docs/python/parquet.html
Intended Use
This dataset is intended for:
- machine learning research on job failure prediction,
- benchmarking admission-control models,
- studying resource fragmentation and network feasibility,
- offline evaluation of scheduling heuristics.
It is not intended to represent any specific production datacenter.
Limitations
- Data is generated from a simulator, not a production system.
- The cluster topology is fixed and may not generalize to other architectures.
- Temporal dependencies between jobs are not explicitly modeled.
- Network behavior is abstracted and may differ from real optical fabrics.
Citation
If you use this dataset in your research, please cite it using the citation provided by Zenodo (available in the right sidebar of the dataset record).
Acknowledgements & Funding
This work is part of the MLSysOps project and is funded by the European Union's Horizon Europe research and innovation programme under grant agreement No. 101092912.
More information: https://mlsysops.eu/