vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
Authors/Creators
Description
vAttention is a simple, performant, and more portable dynamic memory manager for serving large language models. Leveraging CUDA support for demand paging, vAttention stores the KV cache in contiguous virtual memory and allocates physical memory on demand. In doing so, we also introduce various LLM-specific optimizations to address the latency and fragmentation challenges that arise when using demand paging to serve LLMs on GPUs. vAttention supports various attention kernels out of the box and significantly improves LLM serving throughput compared to the state-of-the-art PagedAttention-based kernels of FlashAttention and FlashInfer.
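The core idea in the description (reserve a contiguous virtual range up front, back it with physical memory only as the KV cache grows) can be illustrated with a small toy model. This is a sketch under stated assumptions, not vAttention's code or API: the class `ToyVirtualKVCache`, its fields, and the page and per-token sizes are all hypothetical, and it only does bookkeeping in plain Python. On a real GPU the corresponding steps would go through the CUDA driver's virtual memory management calls (for example `cuMemAddressReserve`, `cuMemCreate`, and `cuMemMap`), which this sketch does not invoke.

```python
from dataclasses import dataclass


@dataclass
class ToyVirtualKVCache:
    """Toy model of one request's KV cache (hypothetical, for illustration only).

    The virtual range is sized for the maximum context length up front;
    physical pages are counted as "mapped" only when tokens need them.
    """
    max_tokens: int        # size of the virtual reservation, in tokens
    bytes_per_token: int   # assumed per-token KV footprint
    page_bytes: int        # assumed physical page size
    mapped_pages: int = 0  # physical pages currently backing the range

    @property
    def virtual_bytes(self) -> int:
        # Virtual address space reserved, independent of actual usage.
        return self.max_tokens * self.bytes_per_token

    def pages_needed(self, num_tokens: int) -> int:
        used = num_tokens * self.bytes_per_token
        return -(-used // self.page_bytes)  # ceiling division

    def grow_to(self, num_tokens: int) -> int:
        """Map just enough pages to hold num_tokens.

        Returns the number of newly mapped pages (0 if already covered).
        """
        if num_tokens > self.max_tokens:
            raise ValueError("request exceeds the reserved virtual range")
        new_pages = max(0, self.pages_needed(num_tokens) - self.mapped_pages)
        self.mapped_pages += new_pages
        return new_pages


cache = ToyVirtualKVCache(max_tokens=4096, bytes_per_token=1024,
                          page_bytes=64 * 1024)
first = cache.grow_to(100)   # prefill of 100 tokens
second = cache.grow_to(128)  # still fits in the pages already mapped
third = cache.grow_to(129)   # crosses a page boundary
```

Because the virtual range stays contiguous, an attention kernel in this model would see one flat buffer per request; only the count of backing pages changes as decoding proceeds.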
Files
| Name | Size | Checksum |
|---|---|---|
| vattention_artifact_asplos25.zip | 36.4 MB | md5:06138e79a3269baa77179b3e99a02f74 |
Additional details
Software
- Repository URL: https://github.com/microsoft/vattention
- Programming language: Python
- Development Status: Active