Fusion-Aware QDQ Placement: Achieving Native Kernel Fusion in ONNX via Graph Reordering
Description
Standard quantization tools for the ONNX ecosystem inject QuantizeLinear and DequantizeLinear (QDQ) nodes into computation graphs using a local, node-by-node strategy that ignores the broader graph topology. We identify a critical consequence of this approach: by placing a DequantizeLinear node between a Convolution and its subsequent Activation (e.g., ReLU), the quantizer severs their contiguity, prevents runtime kernel fusion, and introduces redundant quantization round-trips. We prove that for activations satisfying f(αx) = αf(x) for all α > 0, including ReLU and LeakyReLU, and for Clip(0, M) once the bound M is rescaled by the same factor, the DequantizeLinear node can be safely commuted past the activation when the zero-point is zero, restoring the fusible pattern and eliminating the redundant round-trip. We implement this transformation in Kenosis, a Rust-based ONNX graph optimizer that integrates a comprehensive suite of pipeline-level quantization features, and evaluate the specific impact of this graph placement optimization against a controlled ablation using identical quantization parameters but naive QDQ placement. On stock ONNX Runtime 1.24, across three classifier architectures, each evaluated on a 1,000-image validation set, fusion-aware placement achieves up to 1.49x the throughput of naive placement (a 33% reduction in latency) with identical weights and scales, and speedups of up to 2.42x over FP32 baselines. On MobileNetV2, naive placement yields a quantized model that is 12% slower than the FP32 baseline; fusion-aware placement restores a 25% speedup using the same quantized weights.
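The commutation condition can be checked numerically. The sketch below is a minimal illustration in plain Python, not code from Kenosis: it models DequantizeLinear per element as scale * (q - zero_point) and compares applying the activation after dequantization with applying it before. All helper names (`dequantize`, `dq_commutes`, and so on) are hypothetical. For Clip(0, M), the sketch assumes the bound is moved into the integer domain as M / scale; that handling is an assumption made here, not something the abstract specifies.

```python
import math

def dequantize(q, scale, zero_point=0):
    # ONNX DequantizeLinear, per element: scale * (q - zero_point)
    return scale * (q - zero_point)

def relu(x):
    return max(x, 0)

def leaky_relu(x, slope=0.01):
    return x if x >= 0 else slope * x

def clip(x, lo, hi):
    return min(max(x, lo), hi)

def dq_commutes(act_float, act_int, scale, zero_point, q_values):
    """Check act_float(DQ(q)) == DQ(act_int(q)) for every q in q_values."""
    return all(
        math.isclose(
            act_float(dequantize(q, scale, zero_point)),
            dequantize(act_int(q), scale, zero_point),
            abs_tol=1e-12,
        )
        for q in q_values
    )

scale = 0.5                # illustrative per-tensor scale, must be > 0
qs = range(-128, 128)      # the int8 value range
M = 6.0                    # illustrative upper bound for Clip(0, M)

# With zero_point == 0 the activation and DequantizeLinear can be swapped.
relu_ok = dq_commutes(relu, relu, scale, 0, qs)
leaky_ok = dq_commutes(leaky_relu, leaky_relu, scale, 0, qs)

# Assumption: the Clip bound is expressed in the integer domain as M / scale.
clip_ok = dq_commutes(
    lambda x: clip(x, 0.0, M),
    lambda q: clip(q, 0.0, M / scale),
    scale, 0, qs,
)

# A nonzero zero-point breaks the identity, e.g. q = -3 with zero_point = 2.
relu_nonzero_zp = dq_commutes(relu, relu, scale, 2, qs)

print(relu_ok, leaky_ok, clip_ok, relu_nonzero_zp)  # True True True False
```

The last check shows why the zero-point condition matters: with zero_point = 2 and q = -3, ReLU after dequantization gives 0, while dequantizing ReLU(q) = 0 gives 0.5 * (0 - 2) = -1.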