Published February 4, 2026 | Version 1.0.0
Dataset Open

MLNX Job Placement Failure Dataset for Simulated Datacenter Clusters with Reconfigurable Optical Networks

Contributors

  • University College Dublin

Description

📌 Overview

This dataset contains cluster-level snapshots and job placement outcomes generated using a simulated large-scale datacenter environment.
The data is intended for training and evaluating machine learning models that predict whether a job submission will succeed or fail given the current cluster state and job resource request.

The dataset was produced as part of the MLSysOps project (EU Horizon Europe) and supports research on:

  • job admission control,
  • failure prediction,
  • resource fragmentation,
  • and network feasibility in modern datacenter architectures.

Each data sample represents a single scheduling decision and includes both:

  • detailed cluster state features, and
  • the observed outcome of the placement attempt.

🏢 System Context

Simulated Datacenter Architecture

The dataset is generated using a proprietary datacenter simulator modeling a hierarchical cluster composed of Scalable Units (SUs).

Cluster configuration:

  • 32 Scalable Units (SUs)
  • 32 servers per SU (1024 servers total)
  • 8 leaf switches per SU
  • 8 GPUs per server
  • Leaf switches interconnected via a reconfigurable optical circuit switch (OCS)
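
The totals implied by this configuration follow from simple multiplication. The sketch below only restates the listed numbers; the constant names are illustrative and are not taken from the simulator.

```python
# Topology constants as listed in the cluster configuration.
NUM_SUS = 32
SERVERS_PER_SU = 32
LEAF_SWITCHES_PER_SU = 8
GPUS_PER_SERVER = 8

# Derived totals.
total_servers = NUM_SUS * SERVERS_PER_SU               # 1024
total_leaf_switches = NUM_SUS * LEAF_SWITCHES_PER_SU   # 256
total_gpus = total_servers * GPUS_PER_SERVER           # 8192

print(total_servers, total_leaf_switches, total_gpus)
```

The server and leaf switch totals (1024 and 256) equal the lengths of the two vector features described later in this record.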

Failure Modes Captured

Each job placement attempt can result in:

  1. Successful placement
  2. Failure due to insufficient servers
  3. Failure due to insufficient or infeasible uplink connectivity

While server insufficiency can be determined via simple capacity checks,
uplink infeasibility is more complex, as it depends on:

  • current optical circuit configurations,
  • contention between jobs,
  • and connectivity constraints of the OCS fabric.

The dataset explicitly captures these outcomes to support learning-based approaches for failure prediction.
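
As an illustration of the "simple capacity check" mentioned above, the following sketch compares the number of free servers with the requested job size. It assumes that one requested node occupies exactly one server, which this record does not state explicitly; the function name is hypothetical.

```python
TOTAL_SERVERS = 1024  # 32 SUs x 32 servers, from the cluster configuration


def has_enough_servers(total_servers_used: int, requested_nodes: int) -> bool:
    """Return True if enough free servers remain for the request.

    ASSUMPTION: one requested node maps to one server.
    """
    free_servers = TOTAL_SERVERS - total_servers_used
    return requested_nodes <= free_servers


print(has_enough_servers(total_servers_used=1000, requested_nodes=20))  # True
print(has_enough_servers(total_servers_used=1010, requested_nodes=20))  # False
```

Passing this check does not imply a successful placement: a job can still fail because of uplink infeasibility, which cannot be decided from server counts alone.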

📂 Dataset Structure

  • Format: Apache Parquet
  • Granularity: One row per scheduling decision
  • Each row contains:
    1. Job request features
    2. Cluster state features (scalar + vector)
    3. Ground-truth placement outcome label

Rows are treated as independent samples.

🏷️ Ground-Truth Labels

The dataset includes a label column, l1_failed, encoding the observed outcome of the job placement:

| Value | Meaning |
| --- | --- |
| 0 | Job placement succeeded |
| 1 | Job placement failed due to insufficient servers |
| 2 | Job placement failed due to insufficient uplinks / infeasible network connectivity |

Notes:

  • Labels 1 and 2 both indicate job failure, but with different root causes.
  • This encoding allows:
    • binary failure prediction,
    • failure cause analysis,
    • and future multi-class modeling.
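
A minimal sketch of how the three-valued label could be collapsed into a binary failure target or mapped to a readable cause. The toy values and the cause strings are made up for illustration; only the column name l1_failed and the meaning of values 0, 1, and 2 come from this record.

```python
import pandas as pd

# Toy label values for illustration only; real values come from the dataset.
labels = pd.Series([0, 1, 2, 0, 2], name="l1_failed")

# Binary target: 0 = success, 1 = any failure (labels 1 and 2 merged).
is_failure = (labels > 0).astype("int8")

# Readable cause names (hypothetical strings, chosen for this example).
CAUSE = {0: "success", 1: "insufficient_servers", 2: "insufficient_uplinks"}
cause = labels.map(CAUSE)

print(is_failure.tolist())
print(cause.tolist())
```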

📊 Feature Description

Scalar Cluster Features

These features summarize utilization, imbalance, and fragmentation across the cluster:

| Column | Description |
| --- | --- |
| f1_event_type | The recorded event: add, failed_server, failed_uplink |
| f2_mean_util | Mean server utilization |
| f3_diff_max_min_util | Utilization imbalance across SUs |
| f4_cv_util | Coefficient of variation of server utilization |
| f5_ratio_max_to_mean_workload | Workload skew across SUs |
| f6_mean_uplink_util | Mean uplink utilization |
| f7_diff_max_min_uplink_util | Uplink utilization imbalance |
| f8_cv_uplink_util | Coefficient of variation of uplink utilization |
| f9_mean_combined_util | Combined compute and network utilization |
| f10_resource_imbalance | Compute vs network mismatch |
| f11_bottleneck_ratio | Network-to-compute utilization ratio |
| f12_frag_spread_sus | Fragmentation due to SU spread |
| f13_frag_wasted | Fragmentation due to wasted capacity |
| f14_frag_su_sparseness | Intra-SU sparseness |
| f15_total_servers_used | Total servers in use |
| f16_total_sus_used | Number of active SUs |
| f17_total_uplink_utilized | Total uplink usage |
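
This record does not give the exact formulas behind these features. The sketch below computes the standard textbook versions of the mean, max-min difference, coefficient of variation, and max-to-mean ratio over a per-SU utilization vector. Both the per-SU granularity and the use of the population standard deviation are assumptions, and the function name is hypothetical.

```python
import numpy as np


def imbalance_stats(per_su_util: np.ndarray) -> dict:
    """Standard imbalance statistics over a per-SU utilization vector."""
    mean = per_su_util.mean()
    std = per_su_util.std()  # population standard deviation (ddof=0), assumed
    return {
        "mean": float(mean),
        "diff_max_min": float(per_su_util.max() - per_su_util.min()),
        "cv": float(std / mean) if mean > 0 else 0.0,
        "ratio_max_to_mean": float(per_su_util.max() / mean) if mean > 0 else 0.0,
    }


# Example: only one of 32 SUs is in use, at 25% utilization.
util = np.zeros(32)
util[0] = 0.25
stats = imbalance_stats(util)
print(stats)
```

With a single SU in use, these definitions give a coefficient of variation of sqrt(31) ≈ 5.5678 and a max-to-mean ratio of 32, which coincide with the maxima reported for f4_cv_util and f5_ratio_max_to_mean_workload in the statistical summary. This is suggestive of per-SU definitions but does not confirm them.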

Vector Features

| Feature | Description |
| --- | --- |
| f18_su_server_bitmap | Binary vector (length 1024) indicating per-server usage |
| f19_leaf_up | Vector (length 256) indicating leaf switch uplink utilization |
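
The vector lengths match the topology (32 SUs × 32 servers = 1024, 32 SUs × 8 leaf switches = 256), so each vector can be viewed as a per-SU matrix. The sketch below uses synthetic stand-in vectors and assumes that entries are ordered SU by SU; neither the ordering nor the value range of f19_leaf_up is documented in this record.

```python
import numpy as np

NUM_SUS, SERVERS_PER_SU, LEAVES_PER_SU = 32, 32, 8

# Synthetic stand-ins for one row's vector features (illustration only).
rng = np.random.default_rng(0)
server_bitmap = rng.integers(0, 2, size=NUM_SUS * SERVERS_PER_SU)  # length 1024
leaf_up = rng.integers(0, 33, size=NUM_SUS * LEAVES_PER_SU)        # length 256, value range assumed

# ASSUMPTION: entries are ordered SU by SU (SU-major layout).
per_su_servers = server_bitmap.reshape(NUM_SUS, SERVERS_PER_SU)
per_su_leaves = leaf_up.reshape(NUM_SUS, LEAVES_PER_SU)

servers_used_per_su = per_su_servers.sum(axis=1)        # shape (32,)
su_utilization = servers_used_per_su / SERVERS_PER_SU   # fraction in [0, 1]
active_sus = int((servers_used_per_su > 0).sum())

print(per_su_servers.shape, per_su_leaves.shape, active_sus)
```

When applied to real rows, each vector cell would first need to be converted to a NumPy array, for example with np.asarray, depending on how the Parquet reader materializes it.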

Job Request Feature

| Column | Type | Description |
| --- | --- | --- |
| f20_requested_nodes | int / float | Number of nodes requested by the job |

🧪 Data Collection Methodology

  • Environment: Simulated datacenter
  • Workloads: Synthetic job traces with varying sizes and arrival patterns
  • Placement policy: Simulator-internal scheduling logic
  • Labeling: Determined by placement outcome (success or failure cause)

The simulator executes job placement attempts under varying load, fragmentation, and network conditions to generate diverse training examples.

⚠️ The simulator itself is not publicly released. Only the resulting dataset is provided.

📊 Statistical Summary

The dataset contains a total of 1,062,943 rows, each corresponding to a single job placement attempt in the simulated cluster.

The table below summarizes the distribution of all numeric columns, including the ground-truth label.

Column Summary and Data Types

| Descriptor | Type | Count | Mean | Std | Min | 25% | 50% (Median) | 75% | Max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| l1_failed | int32 | 1,062,943 | 0.6198 | 0.7127 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| f2_mean_util | float32 | 1,062,943 | 0.9129 | 0.1089 | 0.0078 | 0.8906 | 0.9404 | 0.9717 | 1.0 |
| f3_diff_max_min_util | float32 | 1,062,943 | 0.5843 | 0.3332 | 0.0 | 0.2813 | 0.5625 | 1.0 | 1.0 |
| f4_cv_util | float32 | 1,062,943 | 0.1982 | 0.3047 | 0.0 | 0.0712 | 0.1446 | 0.2500 | 5.5678 |
| f5_ratio_max_to_mean_workload | float32 | 1,062,943 | 1.1865 | 1.2299 | 1.0 | 1.0291 | 1.0633 | 1.1228 | 32.0 |
| f6_mean_uplink_util | float32 | 1,062,943 | 0.5637 | 0.1156 | 0.0 | 0.5195 | 0.5840 | 0.6367 | 0.9598 |
| f7_diff_max_min_uplink_util | float32 | 1,062,943 | 0.9016 | 0.1717 | 0.0 | 0.8125 | 0.9063 | 0.9688 | 2.0 |
| f8_cv_uplink_util | float32 | 1,062,943 | 0.4681 | 0.3022 | 0.0 | 0.3526 | 0.4258 | 0.5076 | 3.8730 |
| f9_mean_combined_util | float32 | 1,062,943 | 0.7383 | 0.1002 | 0.0039 | 0.7139 | 0.7563 | 0.7915 | 0.9716 |
| f10_resource_imbalance | float32 | 1,062,943 | 0.3493 | 0.1010 | 0.0001 | 0.2803 | 0.3350 | 0.4043 | 0.8926 |
| f11_bottleneck_ratio | float32 | 1,062,943 | 0.6137 | 0.1137 | 0.0 | 0.5636 | 0.6345 | 0.6894 | 1.7378 |
| f12_frag_spread_sus | float32 | 1,062,943 | 1.0713 | 0.0818 | 1.0 | 1.0261 | 1.0524 | 1.0922 | 4.0 |
| f13_frag_wasted | float32 | 1,062,943 | 0.0713 | 0.0818 | 0.0 | 0.0261 | 0.0524 | 0.0922 | 3.0 |
| f14_frag_su_sparseness | float32 | 1,062,943 | 0.0177 | 0.0176 | 0.0 | 0.0065 | 0.0135 | 0.0235 | 0.2589 |
| f15_total_servers_used | int64 | 1,062,943 | 934.86 | 111.47 | 8 | 912 | 963 | 995 | 1024 |
| f16_total_sus_used | int64 | 1,062,943 | 31.11 | 3.10 | 1 | 31 | 32 | 32 | 32 |
| f17_total_uplink_utilized | int64 | 1,062,943 | 4617.66 | 946.63 | 0 | 4256 | 4784 | 5216 | 7863 |
| f20_requested_nodes | int64 | 1,062,943 | 54.96 | 38.77 | 8 | 20 | 44 | 87 | 128 |

Ground-truth label distribution note:
The l1_failed column encodes job outcomes as:

  • 0: success
  • 1: failure due to insufficient servers
  • 2: failure due to insufficient uplinks / infeasible connectivity

Both 1 and 2 correspond to job failures.
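
Summary statistics of this kind, along with the share of each outcome class, can be computed with pandas. The sketch below runs on a toy frame with made-up values; with the real data, df would be the frame loaded from the Parquet file.

```python
import pandas as pd

# Toy frame for illustration; replace with the loaded dataset.
df = pd.DataFrame({
    "l1_failed": [0, 0, 1, 2, 0, 2],
    "f20_requested_nodes": [8, 20, 128, 87, 44, 100],
})

# Share of each outcome class (0, 1, 2).
class_share = df["l1_failed"].value_counts(normalize=True).sort_index()

# Count, mean, std, min, quartiles, and max, one row per column.
summary = df.describe().T

# Mean requested job size per outcome class.
mean_size = df.groupby("l1_failed")["f20_requested_nodes"].mean()

print(class_share)
print(summary[["count", "mean", "std", "min", "50%", "max"]])
print(mean_size)
```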

🧰 Working with the Data

Loading the Dataset (Python)

import pandas as pd

df = pd.read_parquet("final_merged.parquet")
print(df.head())

Loading Selected Columns

 
import pandas as pd
# Read only the listed columns; all other columns are skipped.
cols = ["f20_requested_nodes", "f2_mean_util", "l1_failed"]
df = pd.read_parquet("final_merged.parquet", columns=cols)
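
The two vector features hold 1024 + 256 values per row across more than one million rows, so loading only the scalar columns can reduce memory use considerably. The sketch below assumes that every column listed in the feature tables exists in the file under exactly that name; the helper name load_scalars is hypothetical.

```python
import pandas as pd

# Scalar column names as listed in the feature tables of this record.
SCALAR_COLUMNS = [
    "f1_event_type",
    "f2_mean_util",
    "f3_diff_max_min_util",
    "f4_cv_util",
    "f5_ratio_max_to_mean_workload",
    "f6_mean_uplink_util",
    "f7_diff_max_min_uplink_util",
    "f8_cv_uplink_util",
    "f9_mean_combined_util",
    "f10_resource_imbalance",
    "f11_bottleneck_ratio",
    "f12_frag_spread_sus",
    "f13_frag_wasted",
    "f14_frag_su_sparseness",
    "f15_total_servers_used",
    "f16_total_sus_used",
    "f17_total_uplink_utilized",
    "f20_requested_nodes",
    "l1_failed",
]


def load_scalars(path: str = "final_merged.parquet") -> pd.DataFrame:
    """Load every listed column except the two vector features."""
    return pd.read_parquet(path, columns=SCALAR_COLUMNS)
```

Calling load_scalars() with the default path reads the same file as the examples above, without f18_su_server_bitmap and f19_leaf_up.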

Tools and Documentation

Apache Parquet specification: https://parquet.apache.org/docs

Pandas Parquet I/O: https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html

PyArrow Parquet support: https://arrow.apache.org/docs/python/parquet.html

🎯 Intended Use

This dataset is intended for:

  • machine learning research on job failure prediction,
  • benchmarking admission-control models,
  • studying resource fragmentation and network feasibility,
  • offline evaluation of scheduling heuristics.

It is not intended to represent any specific production datacenter.

⚠️ Limitations

  • Data is generated from a simulator, not a production system.
  • The cluster topology is fixed and may not generalize to other architectures.
  • Temporal dependencies between jobs are not explicitly modeled.
  • Network behavior is abstracted and may differ from real optical fabrics.

📜 Citation

If you use this dataset in your research, please cite it using the citation provided by Zenodo (available in the right sidebar of the dataset record).

🤝 Acknowledgements & Funding

This work is part of the MLSysOps project and is funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No. 101092912.

More information: https://mlsysops.eu/

Files

data_sample.csv

Files (61.8 MB)

| Checksum | Size |
| --- | --- |
| md5:7cc8ddab6536cdf17745ea52f00ab32e | 61.7 MB |
| md5:11b526b84a30cd250a3aee72bf2a60ef | 66.9 kB |
| md5:ae0c357236fa8388f9cd2c69d3c5d428 | 2.4 kB |
| md5:b940de0adf232a4176d3de3dbffa02dd | 1.9 kB |
| md5:27eacfc9d273b6f6e5891e63848b495b | 9.9 kB |

Additional details

Funding

European Commission
MLSysOps - Machine Learning for Autonomic System Operation in the Heterogeneous Edge-Cloud Continuum (grant agreement No. 101092912)