Published May 7, 2024 | Version 0.0.1
Software Open

vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention

Description

vAttention is a simple, performant, and portable dynamic memory manager for serving large language models (LLMs). Leveraging CUDA support for demand paging, vAttention stores the KV cache in contiguous virtual memory and allocates physical memory on demand. We also introduce several LLM-specific optimizations that address the latency and fragmentation challenges that arise when demand paging is used to serve LLMs on GPUs. vAttention supports various attention kernels out of the box and significantly improves LLM serving throughput compared with the state-of-the-art PagedAttention-based kernels of FlashAttention and FlashInfer.
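
The idea in the description, reserving contiguous virtual memory up front and backing it with physical memory only as a request grows, can be illustrated with a small toy model. This is a sketch for intuition only, not vAttention's implementation or API: the class `VirtualKVBuffer`, its fields, and the page and token sizes are all hypothetical, and no GPU memory is touched.

```python
from dataclasses import dataclass


@dataclass
class VirtualKVBuffer:
    """Toy model of one request's KV cache (hypothetical, for illustration).

    Virtual space is reserved for the longest possible context, while
    physical pages are counted as mapped only when tokens need them.
    """

    max_tokens: int        # size of the virtual reservation, in tokens
    bytes_per_token: int   # assumed per-token KV footprint
    page_bytes: int        # assumed physical page granularity
    mapped_pages: int = 0  # physical pages currently backing the buffer

    @property
    def virtual_bytes(self) -> int:
        return self.max_tokens * self.bytes_per_token

    @property
    def physical_bytes(self) -> int:
        return self.mapped_pages * self.page_bytes

    def grow_to(self, num_tokens: int) -> int:
        """Map just enough pages to cover num_tokens; return pages newly mapped."""
        if num_tokens > self.max_tokens:
            raise ValueError("request exceeds the reserved virtual range")
        needed_bytes = num_tokens * self.bytes_per_token
        needed_pages = -(-needed_bytes // self.page_bytes)  # ceiling division
        new_pages = max(0, needed_pages - self.mapped_pages)
        self.mapped_pages += new_pages
        return new_pages


buf = VirtualKVBuffer(max_tokens=32_768, bytes_per_token=1_024, page_bytes=65_536)
first = buf.grow_to(100)   # a 100-token prompt maps the first pages
second = buf.grow_to(101)  # one more token fits in the pages already mapped
print(first, second, buf.physical_bytes, buf.virtual_bytes)
```

Because the virtual range is contiguous, an attention kernel in this model could index the buffer as an ordinary array, while physical usage tracks the tokens actually stored rather than the maximum context length.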

Files

vattention_artifact_asplos25.zip (36.4 MB)
md5:06138e79a3269baa77179b3e99a02f74

Additional details

Software

Repository URL
https://github.com/microsoft/vattention
Programming language
Python
Development Status
Active