Published February 4, 2026 | Version 1
Preprint Open

Efficient Large Language Model Inference on Consumer-Grade Multi-GPU Systems: A Comprehensive Survey and Implementation of the Ember Engine

Authors/Creators

  • Hawai'i Pacific University

Description

Efficiently deploying large language model (LLM) inference has become a core challenge in artificial
intelligence. Mainstream inference frameworks are optimized primarily for datacenter-grade hardware
and offer only limited support for consumer-grade multi-GPU systems. This paper systematically
surveys key techniques in LLM inference optimization, including FlashAttention, PagedAttention, KV
Cache management, and multi-GPU parallelism strategies, while providing an in-depth analysis of the
technical characteristics and limitations of existing inference frameworks. Building upon this
foundation, we propose and implement Ember—a lightweight CUDA inference engine specifically
optimized for consumer-grade multi-GPU systems. Ember employs a Pipeline Parallelism strategy
combined with a Chunked Prefill with Overlap technique to hide PCIe communication behind
computation. Experimental results demonstrate that on a dual NVIDIA RTX 3080 Ti configuration,
Ember achieves a 1.16× dual-GPU speedup, whereas llama.cpp's Layer Split strategy achieves only
1.01×, validating the effectiveness of the proposed approach. Compared with ExLlamaV3's Tensor
Parallel strategy, Ember's Pipeline Parallel approach achieves comparable scaling efficiency while
exhibiting slower growth in time-to-first-token latency. This research provides practical guidance
for LLM inference on consumer-grade hardware and demonstrates the commercial potential of this field.
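The benefit of overlapping chunked prefill with inter-GPU transfers can be sketched with a toy
two-stage schedule simulation. All timings, the fixed per-transfer PCIe cost, and the function
names below are illustrative assumptions for exposition, not Ember's actual implementation or
measured costs.

```python
# Toy model of 2-stage pipeline prefill on two GPUs connected by PCIe.
# Illustrative assumptions only: per-token compute cost and a fixed
# per-transfer latency; not Ember's real scheduler.

def sequential_prefill(total_tokens, stage_ms_per_tok, xfer_ms):
    # No chunking (layer-split style): stage 0 processes the whole
    # prompt, activations cross PCIe, then stage 1 processes it.
    return (total_tokens * stage_ms_per_tok      # stage 0 compute
            + xfer_ms                            # PCIe transfer
            + total_tokens * stage_ms_per_tok)   # stage 1 compute

def chunked_prefill(total_tokens, chunk, stage_ms_per_tok, xfer_ms):
    # Chunked prefill with overlap: stage 0 computes chunk k+1 while
    # chunk k's activations cross PCIe and stage 1 works on them, so
    # most transfer and stage-1 time hides behind stage-0 compute.
    n_chunks = -(-total_tokens // chunk)  # ceil division
    t0_done = 0.0   # when stage 0 finishes its current chunk
    t1_done = 0.0   # when stage 1 finishes its current chunk
    for _ in range(n_chunks):
        t0_done += chunk * stage_ms_per_tok
        # Stage 1 starts once its input has arrived AND it is free.
        start1 = max(t0_done + xfer_ms, t1_done)
        t1_done = start1 + chunk * stage_ms_per_tok
    return t1_done

seq = sequential_prefill(2048, 0.05, 8.0)
ovl = chunked_prefill(2048, 256, 0.05, 8.0)
print(f"sequential: {seq:.1f} ms, chunked+overlap: {ovl:.1f} ms")
# With these assumed costs, the overlapped schedule finishes the
# prefill in roughly the time of one pipeline stage plus one chunk.
```

Note that even though chunking pays the fixed transfer latency once per chunk instead of once per
prompt, the overlapped schedule still wins because all but the last chunk's transfer and stage-1
compute are hidden behind stage-0 compute.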

Files

Efficient Large Language Model Inference on Consumer-Grade Multi-GPU Systems: A Comprehensive Survey and Implementation of the Ember Engine.pdf

Additional details

Software

Repository URL
https://github.com/dongchany/ember
Programming language
C++, CUDA, Python
Development Status
Active