Published February 4, 2026 | Version 1.0.0
Resource type: Model | Access: Open

MLNX ML Model for Predicting Job Placement Failures in Datacenter Clusters

Contributors

  • University College Dublin

Description

Overview

This repository contains a trained binary classification model, exported to ONNX, that predicts whether a submitted job will fail or run successfully, given:

  • the current state of a simulated datacenter cluster, and
  • the resource request of an incoming job.

The model was developed within the MLSysOps research project and is intended for offline analysis, benchmarking, and integration into scheduling or admission-control pipelines.

Problem Statement

Modern large-scale clusters must decide whether to admit a job under uncertainty. Poor placement decisions can lead to job failures, even when aggregate resources appear sufficient.

In this work, a job failure can occur due to two distinct causes:

  1. Insufficient compute resources (servers)
    If the cluster does not have enough free servers to satisfy the job request, failure can be determined through a simple availability check.

  2. Insufficient or infeasible network connectivity (uplinks)
    Even when the total number of uplinks appears sufficient, the job may still fail because the required connectivity cannot be realized.

The latter case arises from the presence of a reconfigurable optical circuit switch (OCS) interconnecting leaf switches. Although OCS-based fabrics provide high bandwidth and flexibility, they introduce topological and temporal constraints: not all feasible matchings between leaf switches can be realized simultaneously, and reconfiguration constraints may prevent forming the necessary end-to-end paths.

As a result, uplink feasibility is not a simple counting problem, but a combinatorial one that depends on:

  • the current circuit configuration,
  • contention with existing jobs,
  • and connectivity constraints imposed by the optical fabric.

Goal:
The model learns to predict whether a job will fail due to either compute insufficiency or network infeasibility, based on a snapshot of the cluster state and the job request.

Dataset

The model was trained and evaluated using a large-scale simulated dataset of job placement attempts.

📎 Dataset repository (Zenodo):
👉 https://zenodo.org/records/18485585

The dataset repository provides:

  • detailed system context,
  • feature descriptions,
  • ground-truth label semantics,
  • statistical summaries,
  • and usage examples.

⚠️ The dataset is released separately and is required to reproduce training or evaluation results.

Model Summary

  • Task: Binary classification (job failure prediction)
  • Framework: PyTorch
  • Training orchestration: Ray Train / Ray Tune
  • Export format: ONNX
  • Inference backend: ONNX Runtime

The model consumes tabular features plus fixed-length vectors describing cluster utilization. Although the dataset distinguishes between different failure causes, the released model produces a binary output:

  • not failed
  • failed

Inputs and Preprocessing

The model expects:

  • scalar numeric features describing cluster utilization and fragmentation,

  • fixed-length vector features representing server and uplink utilization.

All preprocessing steps are defined in bundle.json, including:

  • feature column order,
  • normalization parameters (StandardScaler),
  • vector dimensions.

⚠️ bundle.json must always be treated as the authoritative source of truth for model inputs.
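As an illustration of how the bundle drives preprocessing, the sketch below loads bundle.json, assembles a feature vector in the declared column order, and applies StandardScaler-style normalization. The key names `feature_order`, `scaler_mean`, and `scaler_scale` are assumptions for illustration only; the actual schema in bundle.json remains authoritative.

```python
import json

import numpy as np

def load_and_scale(bundle_path: str, raw: dict) -> np.ndarray:
    """Assemble one feature vector in bundle order and standardize it.

    NOTE: the bundle keys used here ("feature_order", "scaler_mean",
    "scaler_scale") are hypothetical; check your bundle.json for the
    real field names before use.
    """
    with open(bundle_path) as f:
        bundle = json.load(f)
    order = bundle["feature_order"]
    mean = np.asarray(bundle["scaler_mean"], dtype=np.float32)
    scale = np.asarray(bundle["scaler_scale"], dtype=np.float32)
    # Reorder raw values to match training-time column order, then
    # apply the StandardScaler transform: (x - mean) / scale.
    x = np.asarray([raw[name] for name in order], dtype=np.float32)
    return (x - mean) / scale
```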

Quick Start

Prerequisites

Install the required Python dependencies:

pip install numpy pandas pyarrow onnxruntime

or

pip install -r requirements.txt

Basic Usage

The src/inference_runtime.py script loads the ONNX model and preprocessing bundle, reads rows from a parquet file, and outputs predictions.

Run inference on the first 1000 rows

python src/inference_runtime.py \
  --onnx model/model.onnx \
  --bundle model/bundle.json \
  --parquet model/data.parquet \
  --n 1000

Output format (per row):

0    not failed    proba=0.023456
1    failed        proba=0.987654
2    not failed    proba=0.012345

Evaluate metrics (if ground-truth labels are available)

If your parquet file includes the ground-truth label column, you can compute evaluation metrics:

python src/inference_runtime.py \
  --onnx model/model.onnx \
  --bundle model/bundle.json \
  --parquet model/data.parquet \
  --n 1000 \
  --label_col l1_failed

Additional output:

Metrics on loaded rows:
accuracy=0.925980 precision=0.933392 recall=0.910949 f1=0.922034
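For reference, metrics in this form can be reproduced from binary predictions and ground-truth labels with plain NumPy. The helper below is a sketch of the standard definitions (1 = failed, 0 = not failed), not the script's actual implementation.

```python
import numpy as np

def binary_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Accuracy, precision, recall, and F1 for binary labels (1 = failed)."""
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives
    acc = float(np.mean(y_pred == y_true))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}
```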

Command-Line Arguments

The inference script (inference_runtime.py) supports the following command-line arguments:

Argument     Required  Default  Description
--onnx       Yes       —        Path to the model.onnx file
--bundle     Yes       —        Path to bundle.json containing preprocessing metadata
--parquet    Yes       —        Path to the input Parquet file
--n          No        1000     Number of rows to load from the Parquet file
--label_col  No        None     Name of the ground-truth label column (used only for metrics)

ℹ️ If --label_col is not provided, the script performs inference only and does not compute evaluation metrics.

Notes

  • The exact feature column order and normalization parameters are stored in bundle.json.

Model Constraints

The released model is subject to several explicit constraints that must be respected for correct and meaningful use.

Fixed Input Schema

  • The model expects a fixed set of input features:
    • scalar numeric features,
    • a server utilization bitmap of fixed length,
    • a leaf-switch utilization vector of fixed length.
  • The exact feature order, normalization parameters, and vector lengths are defined in bundle.json.

Fixed Cluster Topology Assumption

  • The model is trained assuming a specific cluster architecture:
    • 32 Scalable Units (SUs),
    • 32 servers per SU (1024 total servers),
    • 8 leaf switches per SU (256 total leaf uplinks).
  • The server and uplink vectors are not dynamically resizable.
  • Applying the model to clusters with different numbers of servers, different SU layouts, or different network topologies requires retraining or careful feature remapping and validation.
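The vector lengths implied by this topology can be checked before inference. The sketch below derives them from the numbers above; the function name and error messages are hypothetical, not part of the released code.

```python
# Vector lengths implied by the fixed topology of the released model.
NUM_SUS = 32
SERVERS_PER_SU = 32
LEAVES_PER_SU = 8

SERVER_BITMAP_LEN = NUM_SUS * SERVERS_PER_SU  # 1024 servers total
UPLINK_VECTOR_LEN = NUM_SUS * LEAVES_PER_SU   # 256 leaf uplinks total

def validate_vectors(server_bitmap, uplink_vector) -> None:
    """Raise if the input vectors do not match the released topology."""
    if len(server_bitmap) != SERVER_BITMAP_LEN:
        raise ValueError(f"expected {SERVER_BITMAP_LEN}-entry server bitmap")
    if len(uplink_vector) != UPLINK_VECTOR_LEN:
        raise ValueError(f"expected {UPLINK_VECTOR_LEN}-entry uplink vector")
```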

Binary Output Only

  • Although the dataset distinguishes between server-related failures and uplink-related failures, the released model produces a binary output only:
    • failed
    • not failed
  • The model does not indicate why a failure is predicted.

Probabilistic Predictions

  • The model outputs a probability of failure, not a deterministic decision.
  • The default classification threshold is 0.5, but:
    • different operational settings may require different thresholds,
    • threshold tuning should consider false-positive vs false-negative trade-offs.
  • Predictions should be interpreted as risk estimates, not guarantees.
  • The model is intended to serve as a decision-support component, not as a standalone scheduler.
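As a sketch of how threshold tuning fits in, the helper below (hypothetical, not part of the released code) turns failure probabilities into binary decisions at a configurable threshold; lowering the threshold flags more jobs as risky, trading false positives for fewer missed failures.

```python
import numpy as np

def classify(probas: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Map failure probabilities to 0/1 decisions (1 = predicted failed)."""
    return (probas >= threshold).astype(np.int64)

probas = np.array([0.02, 0.41, 0.87])
# Default threshold flags only the high-risk job.
classify(probas)        # -> [0, 0, 1]
# A more conservative admission policy flags borderline jobs too.
classify(probas, 0.3)   # -> [0, 1, 1]
```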

⚠️ Users integrating this model into larger systems should ensure that all constraints above are satisfied and validated before relying on predictions in operational workflows.

Citation

If you use this model, please cite it using the Zenodo DOI.

Acknowledgements & Funding

This work is part of the MLSysOps project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No. 101092912.

More information about the project is available at https://mlsysops.eu/

Files

model-job-placemenet-failure-prediction.zip (6.0 MB)
md5:390beeb8d021cbe08d1ab4b16692d2f0

Additional details

Funding

European Commission
MLSysOps - Machine Learning for Autonomic System Operation in the Heterogeneous Edge-Cloud Continuum 101092912

Software

Repository URL
https://github.com/mlsysops-eu/model-job-placement-failure-prediction
Programming language
Python
Development Status
Active