On-device Semantic Selection Made Low Latency and Memory Efficient with Monolithic Forwarding
Description
Semantic top-K selection with cross-encoder rerankers underpins on-device AI services such as retrieval-augmented generation, agent memory, and personalized recommendation. However, its latency and memory demands dominate end-to-end budgets on edge hardware. Revisiting the objective of top-K selection, we reveal that only relative rankings matter, not exact per-candidate scores. We further observe sequence-level sparsity: relative rankings stabilize early in intermediate layers, creating opportunities to prune candidates before full inference completes.
Building on this insight, we propose monolithic forwarding and develop a training-free inference system, PRISM. By maintaining a global view of all candidates, it reduces latency through progressive cluster pruning. It also bounds peak memory usage by strategically overlapping I/O with computation via a dual-layer sliding window and chunked execution. We evaluate PRISM against state-of-the-art baselines on rerankers from 0.6B to 8B parameters on Apple M2 and NVIDIA RTX 5070 hardware. PRISM consistently reduces latency by up to 89.0% and peak memory by up to 94.9% in microbenchmarks, without any loss in precision. Across three real-world on-device AI applications, PRISM lowers latency by 11.6%–51.0% and peak memory by 18.6%–77.8%, demonstrating substantial improvements in efficiency and deployability.
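The core idea above — that relative rankings stabilize in intermediate layers, so low-ranked candidates can be dropped before inference finishes — can be illustrated with a minimal sketch. Everything here is hypothetical: `layer_score` is a toy stand-in for a cross-encoder's per-layer contribution, and the fixed keep-fraction is not PRISM's actual pruning policy, which the paper describes in full.

```python
import math

def layer_score(candidate, layer):
    # Toy deterministic per-layer partial score; a real reranker would
    # read this off the model's intermediate representations.
    return math.sin(candidate * (layer + 1)) + candidate * 0.01

def progressive_topk(candidates, num_layers, k, keep_frac=0.5):
    """Run candidates layer by layer, pruning the bottom of the running
    ranking after each layer (never below k survivors), instead of
    paying full inference cost for every candidate."""
    alive = {c: 0.0 for c in candidates}
    for layer in range(num_layers):
        for c in alive:
            alive[c] += layer_score(c, layer)
        # Prune: keep the top max(k, keep_frac * |alive|) by running score.
        keep = max(k, int(len(alive) * keep_frac))
        ranked = sorted(alive, key=alive.get, reverse=True)[:keep]
        alive = {c: alive[c] for c in ranked}
    return sorted(alive, key=alive.get, reverse=True)[:k]
```

Because survivors shrink geometrically, later (more expensive) layers touch only a small fraction of the original candidate set — the source of the latency savings claimed above.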
Files (257.7 MB)

| Name | Size |
|---|---|
| md5:51488e7d59d55b5d6d774cad6e14bcf4 | 257.7 MB |
Additional details

Software
- Repository URL: https://ipads.se.sjtu.edu.cn:1312/opensource/monolithic_forwarding_ae
- Programming language: Python