Efficient Large Language Model Inference on Consumer-Grade Multi-GPU Systems: A Comprehensive Survey and Implementation of the Ember Engine
Description
Deploying large language model (LLM) inference efficiently has become a core challenge in
artificial intelligence. Mainstream inference frameworks are optimized primarily for
datacenter-grade hardware and offer limited support for consumer-grade multi-GPU systems. This
paper systematically surveys key techniques in LLM inference optimization, including
FlashAttention, PagedAttention, KV cache management, and multi-GPU parallelism strategies, and
analyzes in depth the technical characteristics and limitations of existing inference frameworks.
Building on this foundation, we propose and implement Ember, a lightweight CUDA inference engine
optimized specifically for consumer-grade multi-GPU systems. Ember employs a Pipeline Parallelism
strategy combined with a Chunked Prefill with Overlap technique that hides PCIe communication
behind computation. Experimental results show that on a dual NVIDIA RTX 3080 Ti configuration,
Ember achieves a 1.16× dual-GPU speedup, whereas llama.cpp's Layer Split strategy achieves only
1.01×, validating the effectiveness of our approach. Compared with ExLlamaV3's Tensor Parallel
strategy, Ember's Pipeline Parallel approach achieves comparable scaling efficiency while
exhibiting lower growth in time-to-first-token latency. This work provides practical guidance for
LLM inference on consumer-grade hardware and demonstrates the commercial potential of deployments
in this setting.
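To make the Chunked Prefill with Overlap idea concrete, the sketch below illustrates the general pattern in PyTorch-style Python: the prompt is split into chunks, and while the second pipeline stage (on the second GPU) consumes one chunk's activations, the first stage computes the next chunk and a side CUDA stream carries the PCIe transfer. This is an illustrative sketch only, not Ember's actual implementation; `stage0`, `stage1`, and `chunk_len` are hypothetical placeholders for the pipeline stages and chunk size.

```python
import torch

@torch.no_grad()
def chunked_prefill(stage0, stage1, tokens, chunk_len=512):
    """Run the prefill phase through a two-stage pipeline: `stage0` lives on
    cuda:0, `stage1` on cuda:1. Inter-GPU transfers are enqueued on a
    dedicated stream so they overlap with stage-0 compute on the next chunk."""
    copy_stream = torch.cuda.Stream(device="cuda:0")
    outputs, pending = [], None  # pending = (activations on cuda:1, ready event)
    for start in range(0, tokens.size(1), chunk_len):
        chunk = tokens[:, start:start + chunk_len].to("cuda:0")
        act = stage0(chunk)  # stage-0 compute on cuda:0's default stream
        # Enqueue the cuda:0 -> cuda:1 (PCIe) copy on the side stream so the
        # next iteration's stage0() call overlaps the transfer.
        copy_stream.wait_stream(torch.cuda.default_stream("cuda:0"))
        with torch.cuda.stream(copy_stream):
            act.record_stream(copy_stream)  # keep `act` alive until the copy lands
            moved = act.to("cuda:1", non_blocking=True)
            ready = torch.cuda.Event()
            ready.record(copy_stream)
        if pending is not None:  # drain the previous chunk through stage 1
            prev, prev_ready = pending
            torch.cuda.default_stream("cuda:1").wait_event(prev_ready)
            outputs.append(stage1(prev))
        pending = (moved, ready)
    if pending is not None:  # flush the final in-flight chunk
        prev, prev_ready = pending
        torch.cuda.default_stream("cuda:1").wait_event(prev_ready)
        outputs.append(stage1(prev))
    return torch.cat(outputs, dim=1)
```

Ember's actual engine is written in C++/CUDA (per the repository metadata below), but the scheduling pattern is the same: keep the PCIe link busy while both GPUs compute, so the interconnect cost is hidden rather than exposed.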
Files
| Name | md5 | Size |
|---|---|---|
| Efficient Large Language Model Inference on Consumer-Grade Multi-GPU Systems: A Comprehensive Survey and Implementation of the Ember Engine.pdf | 4c32b6f0b1cf400a73ddd39cbddfad5b | 405.1 kB |
Additional details
Software
- Repository URL: https://github.com/dongchany/ember
- Programming languages: C++, CUDA, Python
- Development status: Active