Published June 29, 2026 | Version 2.0.0

Fusion-Aware QDQ Placement: Achieving Native Kernel Fusion in ONNX via Graph Reordering

Authors/Creators

  • 1. Core Epoch LLC

Description

Standard quantization tools for the ONNX ecosystem inject QuantizeLinear and DequantizeLinear (QDQ) nodes into computation graphs using a local, node-by-node strategy that ignores the broader graph topology. We identify a critical consequence of this approach: by placing a DequantizeLinear node between a convolution and its subsequent activation (e.g., ReLU), the quantizer severs their contiguity, prevents runtime kernel fusion, and introduces redundant quantization round-trips. We prove that for positively homogeneous activations, i.e. those satisfying f(αx) = αf(x) for all α > 0, such as ReLU and LeakyReLU, and for Clip(0, M) once its upper bound is rescaled by the same factor, the DequantizeLinear node can be safely commuted past the activation when the zero-point is zero, restoring the fusible pattern and eliminating the redundant round-trip. We implement this transformation in Kenosis, a Rust-based ONNX graph optimizer that integrates a comprehensive suite of pipeline-level quantization features, and evaluate the specific impact of this graph placement optimization against a controlled ablation that uses identical quantization parameters but naive QDQ placement. On stock ONNX Runtime 1.24, across three classifier architectures, each evaluated on a 1,000-image validation set, fusion-aware placement achieves up to 1.49x higher throughput than naive placement (corresponding to a 33% reduction in latency) with identical weights and scales, and speedups of up to 2.42x over FP32 baselines. On MobileNetV2, naive placement yields a quantized model that is 12% slower than the FP32 baseline; fusion-aware placement restores a 25% speedup using the same quantized weights.
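The commutation condition stated in the description can be checked numerically. The following is a minimal sketch in plain Python, not code from Kenosis: the helper names (dequantize, commutes), the scale 0.5, and the LeakyReLU slope 0.125 are illustrative assumptions, chosen as powers of two so that floating-point equality is exact. It applies an activation either after or before a per-element DequantizeLinear, x = scale * (q - zero_point), and compares the results. It also illustrates that Clip(0, M) only commutes if its upper bound is rescaled to M / scale on the quantized side.

```python
def dequantize(q, scale, zero_point=0):
    """Per-element DequantizeLinear: x = scale * (q - zero_point)."""
    return scale * (q - zero_point)


def relu(x):
    return max(x, 0)


def leaky_relu(x, slope=0.125):
    # The slope is a hypothetical value, picked only to keep the arithmetic exact.
    return x if x >= 0 else slope * x


def clip(x, lo, hi):
    return min(max(x, lo), hi)


def commutes(act, scale, zero_point, q_values):
    """True if act(DQ(q)) == DQ(act(q)) for every q in q_values."""
    return all(
        act(dequantize(q, scale, zero_point))
        == dequantize(act(q), scale, zero_point)
        for q in q_values
    )


qs = range(-128, 128)  # int8 value range, used here as sample inputs

print(commutes(relu, 0.5, 0, qs))        # True: zero-point is zero
print(commutes(leaky_relu, 0.5, 0, qs))  # True: zero-point is zero
print(commutes(relu, 0.5, 3, qs))        # False: a nonzero zero-point breaks it

# Clip(0, M) in the real domain matches Clip(0, M / scale) applied before
# dequantization; reusing the unscaled bound M does not.
scale, M = 0.5, 6.0
print(all(clip(scale * q, 0.0, M) == scale * clip(q, 0.0, M / scale) for q in qs))  # True
print(all(clip(scale * q, 0.0, M) == scale * clip(q, 0.0, M) for q in qs))          # False
```

The sketch treats the pre-dequantization values as plain numbers, so LeakyReLU may produce non-integer intermediates; how a runtime represents those values inside a fused kernel is outside its scope.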

Notes (English)

v2 (June 2026): Updated license information; revised Software Availability section.

Files

paper.pdf (218.4 kB)
md5:d6e18e718fcaf060c0274cb87095a424

Additional details

Software

Programming language
Rust
Development Status
Active

References

  • Jacob, B., Kligys, S., Chen, B., et al. (2018). Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In Proc. CVPR, pp. 2704–2713.
  • Bai, J., Lu, F., Zhang, K., et al. (2019). ONNX: Open Neural Network Exchange. GitHub. https://github.com/onnx/onnx
  • ONNX Runtime Authors. (2026). ONNX Runtime: Cross-platform, High Performance ML Inferencing Engine. Microsoft. https://onnxruntime.ai
  • NVIDIA Corporation. (2025). TensorRT Developer Guide: Working with Quantized Types. https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/
  • Russakovsky, O., Deng, J., Su, H., et al. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3), 211–252.