Efficient Large Language Model Inference on Consumer-Grade Multi-GPU Systems: A Comprehensive Survey and Implementation of the Ember Engine
Description
Deploying large language model (LLM) inference efficiently has become a core challenge in
artificial intelligence. Mainstream inference frameworks are optimized primarily for
datacenter-grade hardware and offer limited support for consumer-grade multi-GPU systems. This
paper systematically surveys key techniques in LLM inference optimization, including
FlashAttention, PagedAttention, KV cache management, and multi-GPU parallelism strategies, and
analyzes in depth the technical characteristics and limitations of existing inference frameworks.
Building on this foundation, we propose and implement Ember, a lightweight CUDA inference engine
optimized specifically for consumer-grade multi-GPU systems. Ember employs a Pipeline Parallelism
strategy combined with a Chunked Prefill with Overlap technique that hides PCIe communication
behind computation. Experimental results show that on a dual NVIDIA RTX 3080 Ti configuration,
Ember achieves a 1.16× dual-GPU speedup, whereas llama.cpp's Layer Split strategy achieves only
1.01×, validating the effectiveness of our approach. Compared with ExLlamaV3's Tensor Parallel
strategy, Ember's Pipeline Parallel approach achieves comparable scaling efficiency while
exhibiting lower growth in time-to-first-token latency. This work provides practical guidance for
LLM inference on consumer-grade hardware and demonstrates the commercial potential of deployments
in this setting.
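To make the Chunked Prefill with Overlap idea concrete, the sketch below illustrates the general pattern in PyTorch-style Python: the prompt is split into chunks, and while the second pipeline stage (on the second GPU) consumes one chunk's activations, the first stage computes the next chunk and a side CUDA stream carries the PCIe transfer. This is an illustrative sketch only, not Ember's actual implementation; `stage0`, `stage1`, and `chunk_len` are hypothetical placeholders for the pipeline stages and chunk size.

```python
import torch

@torch.no_grad()
def chunked_prefill(stage0, stage1, tokens, chunk_len=512):
    """Run the prefill phase through a two-stage pipeline: `stage0` lives on
    cuda:0, `stage1` on cuda:1. Inter-GPU transfers are enqueued on a
    dedicated stream so they overlap with stage-0 compute on the next chunk."""
    copy_stream = torch.cuda.Stream(device="cuda:0")
    outputs, pending = [], None  # pending = (activations on cuda:1, ready event)
    for start in range(0, tokens.size(1), chunk_len):
        chunk = tokens[:, start:start + chunk_len].to("cuda:0")
        act = stage0(chunk)  # stage-0 compute on cuda:0's default stream
        # Enqueue the cuda:0 -> cuda:1 (PCIe) copy on the side stream so the
        # next iteration's stage0() call overlaps the transfer.
        copy_stream.wait_stream(torch.cuda.default_stream("cuda:0"))
        with torch.cuda.stream(copy_stream):
            act.record_stream(copy_stream)  # keep `act` alive until the copy lands
            moved = act.to("cuda:1", non_blocking=True)
            ready = torch.cuda.Event()
            ready.record(copy_stream)
        if pending is not None:  # drain the previous chunk through stage 1
            prev, prev_ready = pending
            torch.cuda.default_stream("cuda:1").wait_event(prev_ready)
            outputs.append(stage1(prev))
        pending = (moved, ready)
    if pending is not None:  # flush the final in-flight chunk
        prev, prev_ready = pending
        torch.cuda.default_stream("cuda:1").wait_event(prev_ready)
        outputs.append(stage1(prev))
    return torch.cat(outputs, dim=1)
```

Ember's actual engine is written in C++/CUDA (per the repository metadata below), but the scheduling pattern is the same: keep the PCIe link busy while both GPUs compute, so the interconnect cost is hidden rather than exposed.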
Files
| Name | md5 | Size |
|---|---|---|
| Efficient Large Language Model Inference on Consumer-Grade Multi-GPU Systems: A Comprehensive Survey and Implementation of the Ember Engine.pdf | 4c32b6f0b1cf400a73ddd39cbddfad5b | 405.1 kB |
Additional details
Software
- Repository URL: https://github.com/dongchany/ember
- Programming languages: C++, CUDA, Python
- Development status: Active