MLNX ML Model for Predicting Job Placement Failures in Datacenter Clusters
Overview
This repository contains a trained binary classification model, exported to ONNX, that predicts whether a submitted job will fail or run successfully, given:
- the current state of a simulated datacenter cluster, and
- the resource request of an incoming job.
The model was developed within the MLSysOps research project and is intended for offline analysis, benchmarking, and integration into scheduling or admission-control pipelines.
Problem Statement
Modern large-scale clusters must decide whether to admit a job under uncertainty. Poor placement decisions can lead to job failures, even when aggregate resources appear sufficient.
In this work, a job failure can occur due to two distinct causes:
- Insufficient compute resources (servers): if the cluster does not have enough free servers to satisfy the job request, failure can be determined through a simple availability check.
- Insufficient or infeasible network connectivity (uplinks): even when the total number of uplinks appears sufficient, the job may still fail because the required connectivity cannot be realized.
The latter case arises from the presence of a reconfigurable optical circuit switch (OCS) interconnecting leaf switches. Although OCS-based fabrics provide high bandwidth and flexibility, they introduce topological and temporal constraints: not all feasible matchings between leaf switches can be realized simultaneously, and reconfiguration constraints may prevent forming the necessary end-to-end paths.
As a result, uplink feasibility is not a simple counting problem, but a combinatorial one that depends on:
- the current circuit configuration,
- contention with existing jobs,
- and connectivity constraints imposed by the optical fabric.
Goal:
The model learns to predict whether a job will fail due to either compute insufficiency or network infeasibility, based on a snapshot of the cluster state and the job request.
Dataset
The model was trained and evaluated using a large-scale simulated dataset of job placement attempts.
📎 Dataset repository (Zenodo):
👉 https://zenodo.org/records/18485585
The dataset repository provides:
- detailed system context,
- feature descriptions,
- ground-truth label semantics,
- statistical summaries,
- and usage examples.
⚠️ The dataset is released separately and is required to reproduce training or evaluation results.
Model Summary
- Task: Binary classification (job failure prediction)
- Framework: PyTorch
- Training orchestration: Ray Train / Ray Tune
- Export format: ONNX
- Inference backend: ONNX Runtime
The model consumes tabular features plus fixed-length vectors describing cluster utilization. Although the dataset distinguishes between different failure causes, the released model produces a binary output:
- not failed
- failed
Inputs and Preprocessing
The model expects:
- scalar numeric features describing cluster utilization and fragmentation,
- fixed-length vector features representing server and uplink utilization.
All preprocessing steps are defined in bundle.json, including:
- feature column order,
- normalization parameters (StandardScaler),
- vector dimensions.
⚠️ bundle.json must always be treated as the authoritative source of truth for model inputs.
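As an illustration, bundle-driven preprocessing might look like the following sketch. The `feature_order`, `scaler`, `mean`, and `scale` field names here are assumptions for the example, not the actual bundle.json schema; the shipped file remains authoritative.

```python
import json  # in practice: bundle = json.load(open("model/bundle.json"))
import numpy as np

# Hypothetical bundle contents; the real field names are defined by bundle.json.
bundle = {
    "feature_order": ["cpu_util", "frag_score"],
    "scaler": {"mean": [0.5, 0.2], "scale": [0.25, 0.1]},
}

def preprocess(row, bundle):
    """Order the scalar features and apply the stored StandardScaler parameters."""
    x = np.array([row[name] for name in bundle["feature_order"]], dtype=np.float32)
    mean = np.array(bundle["scaler"]["mean"], dtype=np.float32)
    scale = np.array(bundle["scaler"]["scale"], dtype=np.float32)
    return (x - mean) / scale

features = preprocess({"cpu_util": 0.75, "frag_score": 0.3}, bundle)
```

The key point is that both the column order and the StandardScaler statistics come from the bundle, so the same file drives training-time and inference-time preprocessing.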
Quick Start
Prerequisites
Install the required Python dependencies:
pip install numpy pandas pyarrow onnxruntime
or
pip install -r requirements.txt
Basic Usage
The src/inference_runtime.py script loads the ONNX model and preprocessing bundle, reads rows from a parquet file, and outputs predictions.
Run inference on the first 1000 rows
python src/inference_runtime.py \
--onnx model/model.onnx \
--bundle model/bundle.json \
--parquet model/data.parquet \
--n 1000
Output format (per row):
0 not failed proba=0.023456
1 failed proba=0.987654
2 not failed proba=0.012345
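The per-row lines above can be reproduced from the model's failure probabilities with a small formatting step like this sketch (the 0.5 threshold and label strings mirror the output shown; the probability values are only illustrative):

```python
def format_prediction(row_idx, proba_failed, threshold=0.5):
    # Map the failure probability to a label and render one output line.
    label = "failed" if proba_failed >= threshold else "not failed"
    return f"{row_idx} {label} proba={proba_failed:.6f}"

for i, p in enumerate([0.023456, 0.987654, 0.012345]):
    print(format_prediction(i, p))
```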
Evaluate metrics (if ground-truth labels are available)
If your parquet file includes the ground-truth label column, you can compute evaluation metrics:
python src/inference_runtime.py \
--onnx model/model.onnx \
--bundle model/bundle.json \
--parquet model/data.parquet \
--n 1000 \
--label_col l1_failed
Additional output:
Metrics on loaded rows: accuracy=0.925980 precision=0.933392 recall=0.910949 f1=0.922034
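The reported metrics follow the standard binary-classification definitions, with "failed" as the positive class. A minimal, dependency-free version can be sketched as:

```python
def binary_metrics(y_true, y_pred):
    # Confusion-matrix counts with label 1 = "failed" (positive class).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Toy example with made-up labels and predictions.
acc, prec, rec, f1 = binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```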
Command-Line Arguments
The inference script (inference_runtime.py) supports the following command-line arguments:
| Argument | Required | Default | Description |
|---|---|---|---|
| --onnx | Yes | — | Path to the model.onnx file |
| --bundle | Yes | — | Path to bundle.json containing preprocessing metadata |
| --parquet | Yes | — | Path to the input Parquet file |
| --n | No | 1000 | Number of rows to load from the Parquet file |
| --label_col | No | None | Name of the ground-truth label column (used only for metrics) |
ℹ️ If --label_col is not provided, the script performs inference only and does not compute evaluation metrics.
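The argument handling described above could be reproduced with a standard argparse setup along these lines (a sketch of equivalent behavior; the actual script may differ in details):

```python
import argparse

def build_parser():
    p = argparse.ArgumentParser(description="Run ONNX inference on Parquet rows.")
    p.add_argument("--onnx", required=True, help="Path to the model.onnx file")
    p.add_argument("--bundle", required=True, help="Path to bundle.json with preprocessing metadata")
    p.add_argument("--parquet", required=True, help="Path to the input Parquet file")
    p.add_argument("--n", type=int, default=1000, help="Number of rows to load")
    p.add_argument("--label_col", default=None, help="Ground-truth label column (metrics only)")
    return p

# Parsing an explicit argument list, as the CLI examples above would.
args = build_parser().parse_args(
    ["--onnx", "model/model.onnx", "--bundle", "model/bundle.json", "--parquet", "model/data.parquet"]
)
```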
Notes
- The exact feature column order and normalization parameters are stored in bundle.json.
Model Constraints
The released model is subject to several explicit constraints that must be respected for correct and meaningful use.
Fixed Input Schema
- The model expects a fixed set of input features:
- scalar numeric features,
- a server utilization bitmap of fixed length,
- a leaf-switch utilization vector of fixed length.
- The exact feature order, normalization parameters, and vector lengths are defined in bundle.json.
Fixed Cluster Topology Assumption
- The model is trained assuming a specific cluster architecture:
- 32 Scalable Units (SUs),
- 32 servers per SU (1024 total servers),
- 8 leaf switches per SU (256 total leaf uplinks).
- The server and uplink vectors are not dynamically resizable.
- Applying the model to clusters with different numbers of servers, different SU layouts, or different network topologies requires retraining or careful feature remapping and validation.
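A cheap guard against topology mismatch is to check the input vector lengths before inference. The sizes below follow directly from the 32 SU × 32 server and 32 SU × 8 leaf layout stated above; the function and argument names are illustrative:

```python
import numpy as np

NUM_SUS = 32
SERVERS_PER_SU = 32   # 1024 servers total
LEAVES_PER_SU = 8     # 256 leaf uplinks total

def validate_vectors(server_bitmap, uplink_vector):
    """Reject inputs whose lengths do not match the trained cluster topology."""
    server_bitmap = np.asarray(server_bitmap)
    uplink_vector = np.asarray(uplink_vector)
    if server_bitmap.shape != (NUM_SUS * SERVERS_PER_SU,):
        raise ValueError(f"expected {NUM_SUS * SERVERS_PER_SU} server entries, got {server_bitmap.shape}")
    if uplink_vector.shape != (NUM_SUS * LEAVES_PER_SU,):
        raise ValueError(f"expected {NUM_SUS * LEAVES_PER_SU} uplink entries, got {uplink_vector.shape}")

validate_vectors(np.zeros(1024), np.zeros(256))  # passes for the trained topology
```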
Binary Output Only
- Although the dataset distinguishes between server-related failures and uplink-related failures, the released model produces a binary output only: failed / not failed.
- The model does not indicate why a failure is predicted.
Probabilistic Predictions
- The model outputs a probability of failure, not a deterministic decision.
- The default classification threshold is 0.5, but:
  - different operational settings may require different thresholds,
  - threshold tuning should consider false-positive vs false-negative trade-offs.
- Predictions should be interpreted as risk estimates, not guarantees.
- The model is intended to be used as a decision-support component, not as a standalone scheduler.
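The threshold trade-off can be explored directly on held-out probabilities. This sketch counts false positives and false negatives at a few candidate thresholds (the labels and probabilities are synthetic, purely for illustration):

```python
def fp_fn_at_threshold(y_true, proba_failed, threshold):
    # Binarize at the given threshold, then count FP and FN
    # with label 1 = "failed" (positive class).
    y_pred = [1 if p >= threshold else 0 for p in proba_failed]
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp, fn

y_true = [0, 0, 1, 1, 1]
proba = [0.1, 0.6, 0.4, 0.7, 0.9]
for thr in (0.3, 0.5, 0.7):
    fp, fn = fp_fn_at_threshold(y_true, proba, thr)
    print(f"threshold={thr}: FP={fp} FN={fn}")
```

Raising the threshold trades false positives for false negatives; which direction is preferable depends on the relative cost of rejecting a viable job versus admitting one that fails.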
⚠️ Users integrating this model into larger systems should ensure that all constraints above are satisfied and validated before relying on predictions in operational workflows.
Citation
If you use this model, please cite it using the Zenodo DOI.
Acknowledgements & Funding
This work is part of the MLSysOps project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No. 101092912.
More information about the project is available at https://mlsysops.eu/
Files
- model-job-placemenet-failure-prediction.zip (6.0 MB, md5:390beeb8d021cbe08d1ab4b16692d2f0)
Additional details
Related works
- Is identical to: Software: https://github.com/mlsysops-eu/model-job-placement-failure-prediction (URL)
Software
- Repository URL: https://github.com/mlsysops-eu/model-job-placement-failure-prediction
- Programming language: Python
- Development Status: Active