Published March 2, 2026 | Version 1.0
Preprint | Open Access

The Latency Floor Model: Predictive Performance Diagnostics for Real-Time AI Inference on Edge Accelerators

Authors/Creators

  • DeviceNexus.ai

Description

Large neural network inference on edge and embedded accelerators is frequently optimized through empirical trial-and-error, despite being fundamentally constrained by hardware performance limits. Practitioners lack simple analytical tools for predicting whether inference latency is limited by memory bandwidth, compute throughput, or system overhead prior to optimization.

We present the Latency Floor Model, a practitioner-oriented diagnostic framework that applies roofline-style analysis [Williams et al., 2009] to real-time AI inference at the granularity of individual inference steps. The framework derives three latency floors from model and hardware specifications: a weight-read bound (R1), a key–value cache bandwidth bound (R2), and a compute throughput bound (R3). By comparing measured latency against these floors, practitioners can classify workloads into bandwidth-bound, compute-bound, or overhead-bound regimes without profiling infrastructure.
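The three floors described above can be sketched as follows. This is a minimal illustration of roofline-style analysis in latency units, not code from the paper; all function names, the 2x overhead threshold, and the example numbers are assumptions for illustration.

```python
# Hypothetical sketch of the three latency floors for one inference step.
# R1: time to read all weights once; R2: time to stream the KV cache;
# R3: time to execute the step's arithmetic at peak throughput.

def latency_floors_ms(weight_bytes, kv_bytes, flops,
                      mem_bw_gbs, compute_tflops):
    """Return (R1, R2, R3) latency floors in milliseconds."""
    r1 = weight_bytes / (mem_bw_gbs * 1e9) * 1e3   # weight-read bound
    r2 = kv_bytes / (mem_bw_gbs * 1e9) * 1e3       # KV-cache bandwidth bound
    r3 = flops / (compute_tflops * 1e12) * 1e3     # compute throughput bound
    return r1, r2, r3

def classify(measured_ms, r1, r2, r3, overhead_factor=2.0):
    """Classify a step by comparing measured latency to its floors.
    The overhead_factor threshold is an illustrative choice."""
    floor = max(r1, r2, r3)
    if measured_ms > overhead_factor * floor:
        return "overhead-bound"
    return "bandwidth-bound" if max(r1, r2) >= r3 else "compute-bound"
```

For example, a 7B-parameter model in FP16 (~14 GB of weights) on a 200 GB/s device has a weight-read floor R1 of 70 ms per step, regardless of any software optimization.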

The primary contribution is not the floor calculations themselves (which follow directly from the roofline model expressed in latency units), but rather the diagnostic workflow built on top: a systematic method for selecting optimization techniques, predicting their impact, and identifying a phase transition where quantization shifts the dominant bottleneck from bandwidth to overhead, causing further quantization to yield diminishing or negative returns.
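The phase transition mentioned above can be made concrete with a toy calculation: shrinking weight bytes lowers the bandwidth floor until a fixed per-step overhead dominates, after which further quantization buys nothing. The bandwidth and overhead figures below are assumed for illustration only.

```python
# Illustrative model of the quantization phase transition: predicted
# step latency is the larger of the weight-read floor and a fixed
# per-step overhead. All numbers are hypothetical.

def predicted_latency_ms(weight_bytes, mem_bw_gbs, overhead_ms):
    bandwidth_floor = weight_bytes / (mem_bw_gbs * 1e9) * 1e3
    return max(bandwidth_floor, overhead_ms)

BW_GBS, OVERHEAD_MS = 400.0, 20.0   # assumed device bandwidth and overhead
for bits, label in [(16, "FP16"), (8, "FP8"), (4, "INT4")]:
    weight_bytes = 7e9 * bits / 8   # 7B parameters at the given precision
    print(label, predicted_latency_ms(weight_bytes, BW_GBS, OVERHEAD_MS))
    # FP8 and INT4 both bottom out at the 20 ms overhead floor:
    # once overhead dominates, halving the weights again yields no speedup.
```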

We validate the framework on two workloads across two NVIDIA edge platforms: a 7B-parameter speech-to-speech transformer and a multi-stage voice+vision inference pipeline. The framework predicted the latency impact of FP8 quantization to within 5% of measured values, correctly predicted before implementation that eight candidate optimization techniques would fail, and identified heterogeneous bottleneck regimes within a single multi-stage pipeline.

Our results suggest that inference optimization on edge hardware can be approached as a bounded diagnostic problem rather than an empirical tuning process.

Files

paper.pdf (795.9 kB)
md5:40ffb4d212ef7ffeb9db758ff4b15733