MLNX Job Placement Failure Dataset for Simulated Datacenter Clusters with Reconfigurable Optical Networks

Patras, Alexandros; Syrivelis, Dimitris; Terzenidis, Nikolaos

doi:10.5281/zenodo.18485585

Published February 4, 2026 | Version 1.0.0

Dataset Open

1. University of Thessaly

Contributors

Data curator:

Aslanidis, Theodoros¹

1. University College Dublin

📌 Overview

This dataset contains cluster-level snapshots and job placement outcomes generated using a simulated large-scale datacenter environment.
The data is intended for training and evaluating machine learning models that predict whether a job submission will succeed or fail given the current cluster state and job resource request.

The dataset was produced as part of the MLSysOps project (EU Horizon Europe) and supports research on:

job admission control,
failure prediction,
resource fragmentation,
and network feasibility in modern datacenter architectures.

Each data sample represents a single scheduling decision and includes both:

detailed cluster state features, and
the observed outcome of the placement attempt.

🏢 System Context

Simulated Datacenter Architecture

The dataset is generated using a proprietary datacenter simulator modeling a hierarchical cluster composed of Scalable Units (SUs).

Cluster configuration:

32 Scalable Units (SUs)
32 servers per SU (1024 servers total)
8 leaf switches per SU
8 GPUs per server
Leaf switches interconnected via a reconfigurable optical circuit switch (OCS)
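
As a quick sanity check, the totals implied by this configuration can be derived with a few lines of arithmetic. The constant names below are illustrative and not part of the dataset; the resulting server and leaf-switch counts match the documented lengths of the two vector features (1024 and 256).

```python
# Topology constants taken from the cluster configuration listed above.
NUM_SUS = 32
SERVERS_PER_SU = 32
LEAF_SWITCHES_PER_SU = 8
GPUS_PER_SERVER = 8

total_servers = NUM_SUS * SERVERS_PER_SU              # 32 * 32 = 1024
total_leaf_switches = NUM_SUS * LEAF_SWITCHES_PER_SU  # 32 * 8  = 256
total_gpus = total_servers * GPUS_PER_SERVER          # 1024 * 8 = 8192

print(total_servers, total_leaf_switches, total_gpus)
```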

Failure Modes Captured

Each job placement attempt can result in:

Successful placement
Failure due to insufficient servers
Failure due to insufficient or infeasible uplink connectivity

While server insufficiency can be determined via simple capacity checks,
uplink infeasibility is more complex, as it depends on:

current optical circuit configurations,
contention between jobs,
and connectivity constraints of the OCS fabric.

The dataset explicitly captures these outcomes to support learning-based approaches for failure prediction.
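
The "simple capacity check" for server insufficiency might look like the sketch below. It assumes that one requested node corresponds to one server and that the number of servers in use is known before the attempt (for example from `f15_total_servers_used`). The function name is hypothetical, and the simulator's actual check is not published and may apply further constraints.

```python
TOTAL_SERVERS = 32 * 32  # 32 SUs x 32 servers per SU


def fails_capacity_check(requested_nodes: int, servers_in_use: int,
                         total_servers: int = TOTAL_SERVERS) -> bool:
    """Return True when the request exceeds the number of free servers.

    Assumption: one requested node occupies exactly one server.
    """
    free_servers = total_servers - servers_in_use
    return requested_nodes > free_servers


# With 963 servers in use, 61 servers remain free.
print(fails_capacity_check(128, 963))  # True: 128 > 61
print(fails_capacity_check(44, 963))   # False: 44 <= 61
```

No comparably simple rule exists for uplink infeasibility, which is why that outcome is the more interesting prediction target.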

📂 Dataset Structure

Format: Apache Parquet
Granularity: One row per scheduling decision
Each row contains:
1. Job request features
2. Cluster state features (scalar + vector)
3. Ground-truth placement outcome label

Rows are treated as independent samples.

🏷️ Ground-Truth Labels

The dataset includes a label column, `l1_failed`, encoding the observed outcome of the job placement:

Value	Meaning
`0`	Job placement succeeded
`1`	Job placement failed due to insufficient servers
`2`	Job placement failed due to insufficient uplinks / infeasible network connectivity

Notes:

Labels 1 and 2 both indicate job failure, but with different root causes.
This encoding allows:
- binary failure prediction,
- failure cause analysis,
- and future multi-class modeling.
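
A minimal sketch of deriving a binary target and a readable outcome name from the three-valued label follows. The label column name `l1_failed` is taken from the statistical summary; the tiny frame, the `failed_any` and `outcome` columns, and the `CAUSES` mapping are illustrative assumptions.

```python
import pandas as pd

# Tiny hand-made frame standing in for the real data; only the label is needed.
df = pd.DataFrame({"l1_failed": [0, 1, 2, 0, 2]})

# Binary target: 1 for any failure (labels 1 and 2), 0 for success.
df["failed_any"] = (df["l1_failed"] > 0).astype("int8")

# Readable names for failure-cause analysis (names chosen for illustration).
CAUSES = {0: "success", 1: "insufficient_servers", 2: "insufficient_uplinks"}
df["outcome"] = df["l1_failed"].map(CAUSES)

print(df)
```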

📊 Feature Description

Scalar Cluster Features

These features summarize utilization, imbalance, and fragmentation across the cluster:

Column	Description
`f1_event_type`	The recorded event: add, failed_server, failed_uplink
`f2_mean_util`	Mean server utilization
`f3_diff_max_min_util`	Utilization imbalance across SUs
`f4_cv_util`	Coefficient of variation of server utilization
`f5_ratio_max_to_mean_workload`	Workload skew across SUs
`f6_mean_uplink_util`	Mean uplink utilization
`f7_diff_max_min_uplink_util`	Uplink utilization imbalance
`f8_cv_uplink_util`	Coefficient of variation of uplink utilization
`f9_mean_combined_util`	Combined compute and network utilization
`f10_resource_imbalance`	Compute vs network mismatch
`f11_bottleneck_ratio`	Network-to-compute utilization ratio
`f12_frag_spread_sus`	Fragmentation due to SU spread
`f13_frag_wasted`	Fragmentation due to wasted capacity
`f14_frag_su_sparseness`	Intra-SU sparseness
`f15_total_servers_used`	Total servers in use
`f16_total_sus_used`	Number of active SUs
`f17_total_uplink_utilized`	Total uplink usage
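
The exact formulas behind these features are not published. As a hedged illustration, the sketch below computes three imbalance measures from a per-SU utilization vector using standard definitions: max minus min, population coefficient of variation, and max-to-mean ratio. The function and key names are hypothetical. In the extreme case where a single SU out of 32 carries load, these definitions give a coefficient of variation of sqrt(31) ≈ 5.5678 and a max-to-mean ratio of 32, which coincides with the maxima reported for `f4_cv_util` and `f5_ratio_max_to_mean_workload` in the statistical summary. That agreement is consistent with these definitions but does not confirm them.

```python
import math
from statistics import mean, pstdev


def imbalance_metrics(per_su_util):
    """Standard imbalance measures over a per-SU utilization vector.

    Assumption: population standard deviation and a non-zero mean.
    """
    mu = mean(per_su_util)
    if mu == 0:
        raise ValueError("metrics are undefined for an empty cluster")
    return {
        "diff_max_min": max(per_su_util) - min(per_su_util),
        "cv": pstdev(per_su_util) / mu,
        "ratio_max_to_mean": max(per_su_util) / mu,
    }


# Extreme case: only one of the 32 SUs carries any load.
one_busy_su = [0.25] + [0.0] * 31
metrics = imbalance_metrics(one_busy_su)
print(metrics)
print(math.sqrt(31))  # value the CV should match in this extreme case
```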

Vector Features

Feature	Description
`f18_su_server_bitmap`	Binary vector (length 1024) indicating per-server usage
`f19_leaf_up`	Vector (length 256) indicating leaf switch uplink utilization

Job Request Feature

Column	Type	Description
`f20_requested_nodes`	int / float	Number of nodes requested by the job

🧪 Data Collection Methodology

Environment: Simulated datacenter
Workloads: Synthetic job traces with varying sizes and arrival patterns
Placement policy: Simulator-internal scheduling logic
Labeling: Determined by placement outcome (success or failure cause)

The simulator executes job placement attempts under varying load, fragmentation, and network conditions to generate diverse training examples.

⚠️ The simulator itself is not publicly released. Only the resulting dataset is provided.

📊 Statistical Summary

The dataset contains a total of 1,062,943 rows, each corresponding to a single job placement attempt in the simulated cluster.

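The table below summarizes the distribution of the numeric scalar columns, including the ground-truth label. The categorical `f1_event_type` column and the vector columns `f18_su_server_bitmap` and `f19_leaf_up` are not included.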

Column Summary and Data Types

Descriptor	Type	Count	Mean	Std	Min	25%	50% (Median)	75%	Max
`l1_failed`	int32	1,062,943	0.6198	0.7127	0.0	0.0	0.0	1.0	2.0
`f2_mean_util`	float32	1,062,943	0.9129	0.1089	0.0078	0.8906	0.9404	0.9717	1.0
`f3_diff_max_min_util`	float32	1,062,943	0.5843	0.3332	0.0	0.2813	0.5625	1.0	1.0
`f4_cv_util`	float32	1,062,943	0.1982	0.3047	0.0	0.0712	0.1446	0.2500	5.5678
`f5_ratio_max_to_mean_workload`	float32	1,062,943	1.1865	1.2299	1.0	1.0291	1.0633	1.1228	32.0
`f6_mean_uplink_util`	float32	1,062,943	0.5637	0.1156	0.0	0.5195	0.5840	0.6367	0.9598
`f7_diff_max_min_uplink_util`	float32	1,062,943	0.9016	0.1717	0.0	0.8125	0.9063	0.9688	2.0
`f8_cv_uplink_util`	float32	1,062,943	0.4681	0.3022	0.0	0.3526	0.4258	0.5076	3.8730
`f9_mean_combined_util`	float32	1,062,943	0.7383	0.1002	0.0039	0.7139	0.7563	0.7915	0.9716
`f10_resource_imbalance`	float32	1,062,943	0.3493	0.1010	0.0001	0.2803	0.3350	0.4043	0.8926
`f11_bottleneck_ratio`	float32	1,062,943	0.6137	0.1137	0.0	0.5636	0.6345	0.6894	1.7378
`f12_frag_spread_sus`	float32	1,062,943	1.0713	0.0818	1.0	1.0261	1.0524	1.0922	4.0
`f13_frag_wasted`	float32	1,062,943	0.0713	0.0818	0.0	0.0261	0.0524	0.0922	3.0
`f14_frag_su_sparseness`	float32	1,062,943	0.0177	0.0176	0.0	0.0065	0.0135	0.0235	0.2589
`f15_total_servers_used`	int64	1,062,943	934.86	111.47	8	912	963	995	1024
`f16_total_sus_used`	int64	1,062,943	31.11	3.10	1	31	32	32	32
`f17_total_uplink_utilized`	int64	1,062,943	4617.66	946.63	0	4256	4784	5216	7863
`f20_requested_nodes`	int64	1,062,943	54.96	38.77	8	20	44	87	128

Ground-truth label distribution note:
The l1_failed column encodes job outcomes as:

0: success
1: failure due to insufficient servers
2: failure due to insufficient uplinks / infeasible connectivity

Both 1 and 2 correspond to job failures.
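
Because `l1_failed` takes the values 0, 1, and 2, its mean in the table above is not the failure rate: it equals the share of label 1 plus twice the share of label 2. The sketch below shows how per-class shares and the overall failure rate could be computed instead. It uses a small synthetic series rather than the real data, and the variable names are illustrative.

```python
import pandas as pd

# Synthetic labels standing in for the l1_failed column.
labels = pd.Series([0, 0, 1, 2, 2, 0, 1, 0], name="l1_failed")

# Share of each outcome class, ordered by label value.
class_share = labels.value_counts(normalize=True).sort_index()

# Fraction of attempts that failed for any reason.
failure_rate = (labels > 0).mean()

# Mean of the raw label: share(1) + 2 * share(2), not a failure rate.
label_mean = labels.mean()

print(class_share)
print(failure_rate, label_mean)
```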

🧰 Working with the Data

Loading the Dataset (Python)

import pandas as pd

# Load the full dataset; data.parquet is the Parquet file listed under Files.
df = pd.read_parquet("data.parquet")
print(df.head())

Loading Selected Columns

# Read only the listed columns, which avoids loading the large vector columns.
cols = ["f20_requested_nodes", "f2_mean_util", "l1_failed"]
df = pd.read_parquet("data.parquet", columns=cols)

Tools and Documentation

Apache Parquet specification: https://parquet.apache.org/docs

Pandas Parquet I/O: https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html

PyArrow Parquet support: https://arrow.apache.org/docs/python/parquet.html

🎯 Intended Use

This dataset is intended for:

machine learning research on job failure prediction,
benchmarking admission-control models,
studying resource fragmentation and network feasibility,
offline evaluation of scheduling heuristics.

It is not intended to represent any specific production datacenter.

⚠️ Limitations

Data is generated from a simulator, not a production system.
The cluster topology is fixed and may not generalize to other architectures.
Temporal dependencies between jobs are not explicitly modeled.
Network behavior is abstracted and may differ from real optical fabrics.

📜 Citation

If you use this dataset in your research, please cite it using the citation provided by Zenodo (available in the right sidebar of the dataset record).

🤝 Acknowledgements & Funding

This work is part of the MLSysOps project and is funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No. 101092912.

More information: https://mlsysops.eu/

Files

Files (61.8 MB)

Name	MD5	Size
data.parquet	7cc8ddab6536cdf17745ea52f00ab32e	61.7 MB
data_sample.csv	11b526b84a30cd250a3aee72bf2a60ef	66.9 kB
dataset_statistics.csv	ae0c357236fa8388f9cd2c69d3c5d428	2.4 kB
generate_stats.py	b940de0adf232a4176d3de3dbffa02dd	1.9 kB
README.md	27eacfc9d273b6f6e5891e63848b495b	9.9 kB

Additional details

European Commission
MLSysOps - Machine Learning for Autonomic System Operation in the Heterogeneous Edge-Cloud Continuum 101092912
