MLNX ML Model for Predicting Job Placement Failures in Datacenter Clusters
Overview
This repository contains a trained binary classification model, exported to ONNX, that predicts whether a submitted job will fail or run successfully, given:
- the current state of a simulated datacenter cluster, and
- the resource request of an incoming job.
The model was developed within the MLSysOps research project and is intended for offline analysis, benchmarking, and integration into scheduling or admission-control pipelines.
Problem Statement
Modern large-scale clusters must decide whether to admit a job under uncertainty. Poor placement decisions can lead to job failures, even when aggregate resources appear sufficient.
In this work, a job failure can occur due to two distinct causes:
- Insufficient compute resources (servers): if the cluster does not have enough free servers to satisfy the job request, failure can be determined through a simple availability check.
- Insufficient or infeasible network connectivity (uplinks): even when the total number of uplinks appears sufficient, the job may still fail because the required connectivity cannot be realized.
The latter case arises from the presence of a reconfigurable optical circuit switch (OCS) interconnecting leaf switches. Although OCS-based fabrics provide high bandwidth and flexibility, they introduce topological and temporal constraints: not all feasible matchings between leaf switches can be realized simultaneously, and reconfiguration constraints may prevent forming the necessary end-to-end paths.
As a result, uplink feasibility is not a simple counting problem, but a combinatorial one that depends on:
- the current circuit configuration,
- contention with existing jobs,
- and connectivity constraints imposed by the optical fabric.
Goal:
The model learns to predict whether a job will fail due to either compute insufficiency or network infeasibility, based on a snapshot of the cluster state and the job request.
Dataset
The model was trained and evaluated using a large-scale simulated dataset of job placement attempts.
📎 Dataset repository (Zenodo):
👉 https://zenodo.org/records/18485585
The dataset repository provides:
- detailed system context,
- feature descriptions,
- ground-truth label semantics,
- statistical summaries,
- and usage examples.
⚠️ The dataset is released separately and is required to reproduce training or evaluation results.
Model Summary
- Task: Binary classification (job failure prediction)
- Framework: PyTorch
- Training orchestration: Ray Train / Ray Tune
- Export format: ONNX
- Inference backend: ONNX Runtime
The model consumes tabular features plus fixed-length vectors describing cluster utilization. Although the dataset distinguishes between different failure causes, the released model produces a binary output:
- not failed
- failed
Inputs and Preprocessing
The model expects:
- scalar numeric features describing cluster utilization and fragmentation,
- fixed-length vector features representing server and uplink utilization.
All preprocessing steps are defined in bundle.json, including:
- feature column order,
- normalization parameters (StandardScaler),
- vector dimensions.
⚠️ bundle.json must always be treated as the authoritative source of truth for model inputs.
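As an illustration, bundle-driven preprocessing might look like the following sketch. The `feature_order`, `scaler`, `mean`, and `scale` field names here are assumptions for the example, not the actual bundle.json schema; the shipped file remains authoritative.

```python
import json  # in practice: bundle = json.load(open("model/bundle.json"))
import numpy as np

# Hypothetical bundle contents; the real field names are defined by bundle.json.
bundle = {
    "feature_order": ["cpu_util", "frag_score"],
    "scaler": {"mean": [0.5, 0.2], "scale": [0.25, 0.1]},
}

def preprocess(row, bundle):
    """Order the scalar features and apply the stored StandardScaler parameters."""
    x = np.array([row[name] for name in bundle["feature_order"]], dtype=np.float32)
    mean = np.array(bundle["scaler"]["mean"], dtype=np.float32)
    scale = np.array(bundle["scaler"]["scale"], dtype=np.float32)
    return (x - mean) / scale

features = preprocess({"cpu_util": 0.75, "frag_score": 0.3}, bundle)
```

The key point is that both the column order and the StandardScaler statistics come from the bundle, so the same file drives training-time and inference-time preprocessing.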
Quick Start
Prerequisites
Install the required Python dependencies:
pip install numpy pandas pyarrow onnxruntime
or
pip install -r requirements.txt
Basic Usage
The src/inference_runtime.py script loads the ONNX model and preprocessing bundle, reads rows from a parquet file, and outputs predictions.
Run inference on the first 1000 rows
python src/inference_runtime.py \
--onnx model/model.onnx \
--bundle model/bundle.json \
--parquet model/data.parquet \
--n 1000
Output format (per row):
0 not failed proba=0.023456
1 failed proba=0.987654
2 not failed proba=0.012345
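The per-row lines above can be reproduced from the model's failure probabilities with a small formatting step like this sketch (the 0.5 threshold and label strings mirror the output shown; the probability values are only illustrative):

```python
def format_prediction(row_idx, proba_failed, threshold=0.5):
    # Map the failure probability to a label and render one output line.
    label = "failed" if proba_failed >= threshold else "not failed"
    return f"{row_idx} {label} proba={proba_failed:.6f}"

for i, p in enumerate([0.023456, 0.987654, 0.012345]):
    print(format_prediction(i, p))
```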
Evaluate metrics (if ground-truth labels are available)
If your parquet file includes the ground-truth label column, you can compute evaluation metrics:
python src/inference_runtime.py \
--onnx model/model.onnx \
--bundle model/bundle.json \
--parquet model/data.parquet \
--n 1000 \
--label_col l1_failed
Additional output:
Metrics on loaded rows: accuracy=0.925980 precision=0.933392 recall=0.910949 f1=0.922034
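The reported metrics follow the standard binary-classification definitions, with "failed" as the positive class. A minimal, dependency-free version can be sketched as:

```python
def binary_metrics(y_true, y_pred):
    # Confusion-matrix counts with label 1 = "failed" (positive class).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Toy example with made-up labels and predictions.
acc, prec, rec, f1 = binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```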
Command-Line Arguments
The inference script (inference_runtime.py) supports the following command-line arguments:
| Argument | Required | Default | Description |
|---|---|---|---|
| --onnx | Yes | — | Path to the model.onnx file |
| --bundle | Yes | — | Path to bundle.json containing preprocessing metadata |
| --parquet | Yes | — | Path to the input Parquet file |
| --n | No | 1000 | Number of rows to load from the Parquet file |
| --label_col | No | None | Name of the ground-truth label column (used only for metrics) |
ℹ️ If --label_col is not provided, the script performs inference only and does not compute evaluation metrics.
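The argument handling described above could be reproduced with a standard argparse setup along these lines (a sketch of equivalent behavior; the actual script may differ in details):

```python
import argparse

def build_parser():
    p = argparse.ArgumentParser(description="Run ONNX inference on Parquet rows.")
    p.add_argument("--onnx", required=True, help="Path to the model.onnx file")
    p.add_argument("--bundle", required=True, help="Path to bundle.json with preprocessing metadata")
    p.add_argument("--parquet", required=True, help="Path to the input Parquet file")
    p.add_argument("--n", type=int, default=1000, help="Number of rows to load")
    p.add_argument("--label_col", default=None, help="Ground-truth label column (metrics only)")
    return p

# Parsing an explicit argument list, as the CLI examples above would.
args = build_parser().parse_args(
    ["--onnx", "model/model.onnx", "--bundle", "model/bundle.json", "--parquet", "model/data.parquet"]
)
```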
Notes
- The exact feature column order and normalization parameters are stored in bundle.json.
Model Constraints
The released model is subject to several explicit constraints that must be respected for correct and meaningful use.
Fixed Input Schema
- The model expects a fixed set of input features:
- scalar numeric features,
- a server utilization bitmap of fixed length,
- a leaf-switch utilization vector of fixed length.
- The exact feature order, normalization parameters, and vector lengths are defined in bundle.json.
Fixed Cluster Topology Assumption
- The model is trained assuming a specific cluster architecture:
- 32 Scalable Units (SUs),
- 32 servers per SU (1024 total servers),
- 8 leaf switches per SU (256 total leaf uplinks).
- The server and uplink vectors are not dynamically resizable.
- Applying the model to clusters with different numbers of servers, different SU layouts, or different network topologies requires retraining or careful feature remapping and validation.
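A cheap guard against topology mismatch is to check the input vector lengths before inference. The sizes below follow directly from the 32 SU × 32 server and 32 SU × 8 leaf layout stated above; the function and argument names are illustrative:

```python
import numpy as np

NUM_SUS = 32
SERVERS_PER_SU = 32   # 1024 servers total
LEAVES_PER_SU = 8     # 256 leaf uplinks total

def validate_vectors(server_bitmap, uplink_vector):
    """Reject inputs whose lengths do not match the trained cluster topology."""
    server_bitmap = np.asarray(server_bitmap)
    uplink_vector = np.asarray(uplink_vector)
    if server_bitmap.shape != (NUM_SUS * SERVERS_PER_SU,):
        raise ValueError(f"expected {NUM_SUS * SERVERS_PER_SU} server entries, got {server_bitmap.shape}")
    if uplink_vector.shape != (NUM_SUS * LEAVES_PER_SU,):
        raise ValueError(f"expected {NUM_SUS * LEAVES_PER_SU} uplink entries, got {uplink_vector.shape}")

validate_vectors(np.zeros(1024), np.zeros(256))  # passes for the trained topology
```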
Binary Output Only
- Although the dataset distinguishes between server-related failures and uplink-related failures, the released model produces a binary output only: failed / not failed.
- The model does not indicate why a failure is predicted.
Probabilistic Predictions
- The model outputs a probability of failure, not a deterministic decision.
- The default classification threshold is 0.5, but:
  - different operational settings may require different thresholds,
  - threshold tuning should consider false-positive vs false-negative trade-offs.
- Predictions should be interpreted as risk estimates, not guarantees.
- The model is intended to be used as a decision-support component, not as a standalone scheduler.
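The threshold trade-off can be explored directly on held-out probabilities. This sketch counts false positives and false negatives at a few candidate thresholds (the labels and probabilities are synthetic, purely for illustration):

```python
def fp_fn_at_threshold(y_true, proba_failed, threshold):
    # Binarize at the given threshold, then count FP and FN
    # with label 1 = "failed" (positive class).
    y_pred = [1 if p >= threshold else 0 for p in proba_failed]
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp, fn

y_true = [0, 0, 1, 1, 1]
proba = [0.1, 0.6, 0.4, 0.7, 0.9]
for thr in (0.3, 0.5, 0.7):
    fp, fn = fp_fn_at_threshold(y_true, proba, thr)
    print(f"threshold={thr}: FP={fp} FN={fn}")
```

Raising the threshold trades false positives for false negatives; which direction is preferable depends on the relative cost of rejecting a viable job versus admitting one that fails.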
⚠️ Users integrating this model into larger systems should ensure that all constraints above are satisfied and validated before relying on predictions in operational workflows.
Citation
If you use this model, please cite it using the Zenodo DOI.
Acknowledgements & Funding
This work is part of the MLSysOps project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No. 101092912.
More information about the project is available at https://mlsysops.eu/
Files
- model-job-placemenet-failure-prediction.zip (6.0 MB, md5:390beeb8d021cbe08d1ab4b16692d2f0)
Additional details
Related works
- Is identical to: Software: https://github.com/mlsysops-eu/model-job-placement-failure-prediction (URL)
Software
- Repository URL: https://github.com/mlsysops-eu/model-job-placement-failure-prediction
- Programming language: Python
- Development Status: Active