LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

Yang, Shang; Guo, Junxian; Tang, Haotian; Hu, Qinghao; Xiao, Guangxuan; Tang, Jiaming; Lin, Yujun; Liu, Zhijian; Lu, Yao; Han, Song

doi:10.5281/zenodo.14989916

Published February 20, 2025 | Version v1

Conference paper | Open access

1. Massachusetts Institute of Technology
2. Shanghai Jiao Tong University
3. Nvidia (United States)

Large language models (LLMs) have shown remarkable potential in processing long sequences, yet efficiently serving these long-context models remains challenging due to the quadratic computational complexity of attention in the prefilling stage and the large memory footprint of the KV cache in the decoding stage. To address these issues, we introduce LServe, an efficient system that accelerates long-sequence LLM serving via hybrid sparse attention. This method unifies different hardware-friendly, structured sparsity patterns for both prefilling and decoding attention into a single framework, where computations on less important tokens are skipped block-wise. LServe demonstrates the compatibility of static and dynamic sparsity in long-context LLM attention. This design enables multiplicative speedups by combining these optimizations. Specifically, we convert half of the attention heads to nearly free streaming heads in both the prefilling and decoding stages. Additionally, we find that only a constant number of KV pages is required to preserve long-context capabilities, irrespective of context length. We then design a hierarchical KV page selection policy that dynamically prunes KV pages based on query-centric similarity. On average, LServe accelerates LLM prefilling by up to 2.9x and decoding by 1.3-2.1x over vLLM, maintaining long-context accuracy. The code is released at https://github.com/mit-han-lab/omniserve.
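The page-selection idea in the abstract can be illustrated with a small sketch. The abstract only states that KV pages are pruned by query-centric similarity under a constant page budget; the scoring rule below (an upper bound on the query-key dot product computed from per-page, per-channel min/max key statistics) is an assumption chosen for illustration, and the sketch is flat rather than hierarchical. The names `page_stats`, `page_score`, and `select_pages` are hypothetical and do not come from the LServe code base.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
PageStat = Tuple[List[float], List[float]]  # (per-channel min, per-channel max)


def page_stats(keys: Sequence[Vector], page_size: int) -> List[PageStat]:
    """Group keys into fixed-size pages and record per-channel min/max per page."""
    stats: List[PageStat] = []
    for start in range(0, len(keys), page_size):
        page = keys[start:start + page_size]
        channels = list(zip(*page))  # transpose: one tuple per channel
        stats.append(([min(c) for c in channels], [max(c) for c in channels]))
    return stats


def page_score(query: Vector, kmin: Vector, kmax: Vector) -> float:
    """Upper bound on q.k over all keys whose channels lie in [kmin, kmax]."""
    # Assumed criterion: per channel, take whichever extreme maximizes q_i * k_i.
    return sum(max(q * lo, q * hi) for q, lo, hi in zip(query, kmin, kmax))


def select_pages(query: Vector, stats: Sequence[PageStat], budget: int) -> List[int]:
    """Return indices of the `budget` highest-scoring pages, in position order."""
    scores = [page_score(query, lo, hi) for lo, hi in stats]
    ranked = sorted(range(len(stats)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy example: six 2-channel keys, two keys per page, fixed budget of two pages.
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.8], [-1.0, 0.0], [-0.5, -0.5]]
query = [1.0, 0.0]
stats = page_stats(keys, page_size=2)
selected = select_pages(query, stats, budget=2)
```

In this sketch the budget stays fixed as the number of pages grows, which mirrors the abstract's claim that a constant number of KV pages suffices; attention would then be computed only over the keys in the selected pages.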

Files

Name	Size	Checksum
lserve-artifact-docker.zip	124.5 MB	md5:753c2d0472c9c99ce1f97f0aa1ccb615
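Since the record lists an MD5 checksum for the artifact, a downloaded copy can be checked against it before use. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify_artifact` are hypothetical, and the local path passed to them is whatever location the archive was saved to.

```python
import hashlib

# Checksum as listed on this record for lserve-artifact-docker.zip.
EXPECTED_MD5 = "753c2d0472c9c99ce1f97f0aa1ccb615"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file at `path` matches the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify_artifact("lserve-artifact-docker.zip")` returns `True` only if the local file is byte-identical to the published one. MD5 is adequate for detecting a corrupted download, though it is not a defense against deliberate tampering.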

Additional details

Accepted: 2025-02-11

Repository URL: https://github.com/mit-han-lab/omniserve

@article{yang2025lserve,
  title={LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention},
  author={Yang*, Shang and Guo*, Junxian and Tang, Haotian and Hu, Qinghao and Xiao, Guangxuan and Tang, Jiaming and Lin, Yujun and Liu, Zhijian and Lu, Yao and Han, Song},
  journal={arXiv preprint arXiv:2502.14866},
  year={2025}
}

	All versions	This version
Views	152	152
Downloads	52	52
Data volume	6.5 GB	6.5 GB
