Published February 1, 2026 | Version v1
Journal article | Open Access

Bit-Width Quantization and Prompt Optimization: Achieving 90% Energy Savings in Large Language Models

Description

In the rapidly evolving field of Large Language Models (LLMs), rapid scaling has posed significant challenges, including exorbitant energy consumption, prohibitively expensive deployment, and a substantial impact on environmental sustainability. A major contributor is the colossal size of LLMs, which typically contain billions of parameters, combined with the need to run them in resource-scarce or edge environments. Our research investigates a practical, immediately applicable way to improve the energy efficiency of LLMs by combining low-bit-width quantization with streamlined prompting techniques.

We tested this approach on Llama-based models ranging from hundreds of millions to over one billion parameters, applying 4-bit post-training quantization combined with structured prompt and query optimization across this spectrum of models. Using a controlled A/B testing framework, we evaluated task accuracy, latency, and power consumption for the baseline and optimized configurations. Because we measured the actual power draw of our hardware, we summarized each configuration's performance with an accuracy-per-watt metric, i.e., task accuracy divided by average power consumption. Our results show that 4-bit quantization alone eliminates a significant portion of memory usage and energy consumption, while prompt optimization further reduces token-level inference cost. Used in tandem, the two techniques yielded a 90% reduction in energy consumption with statistically insignificant losses in accuracy on the tasks we evaluated.
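To make the evaluation concrete, the sketch below loads a Llama-family checkpoint with 4-bit post-training quantization (via Hugging Face Transformers and bitsandbytes) and computes an accuracy-per-watt figure from measured GPU power draw. This is a minimal sketch under stated assumptions: the article does not specify its exact quantization or power-measurement tooling, and the model name, toy evaluation loop, prompts, and NVML-based power sampling shown here are illustrative placeholders rather than the authors' actual setup.

import torch
import pynvml
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Assumption: a small Llama-family checkpoint; the article does not name its exact models.
MODEL_NAME = "meta-llama/Llama-3.2-1B"

# 4-bit post-training quantization via bitsandbytes (NF4 weights, bf16 compute).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, quantization_config=bnb_config, device_map="auto"
)

# GPU power readings via NVML; one possible way to obtain actual hardware power usage.
pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

def power_watts():
    # NVML reports power in milliwatts.
    return pynvml.nvmlDeviceGetPowerUsage(gpu) / 1000.0

def evaluate(prompts, expected):
    """Run greedy generation over a toy task; return accuracy, mean power, accuracy-per-watt."""
    correct, power_samples = 0, []
    for prompt, target in zip(prompts, expected):
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
        answer = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        correct += int(target.lower() in answer.lower())
        power_samples.append(power_watts())
    accuracy = correct / len(prompts)
    mean_power = sum(power_samples) / len(power_samples)
    return accuracy, mean_power, accuracy / mean_power

# Illustrative baseline (verbose) vs. prompt-optimized (concise) configurations.
verbose = ["Please think carefully and answer the question: what is the capital of France?"]
concise = ["Capital of France?"]
print("baseline:", evaluate(verbose, ["Paris"]))
print("optimized:", evaluate(concise, ["Paris"]))

In a real measurement, power would be sampled continuously during generation and averaged over the run; the single post-generation reading per prompt above only illustrates how the accuracy-per-watt metric is formed.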

We also verified the effectiveness of this strategy for real-world use, demonstrating that it delivers consistent efficiency benefits when running on severely constrained hardware. A scalability analysis showed that the method remains highly cost-effective even for models with over a billion parameters.

Files

741-Article Text-1684-1-10-20260204.pdf (404.2 kB)
md5:f7edd5a07cf2b48e31d5350cca5a935f