Published April 11, 2025 | Version v1
Dataset | Open Access

OPTIMIZING LLAMA 3.2 1B USING QUANTIZATION TECHNIQUES WITH BITSANDBYTES FOR EFFICIENT AI DEPLOYMENT

  • 1. Dept. of Manufacturing Engineering and Industrial Management, COEP Technological University Pune, India.
  • 2. Dept. of Mechanical Engineering, COEP Technological University, Pune, India.
  • 3. Dept. of Computer Science & IT, COEP Technological University, Pune, India.

Description

Large Language Models (LLMs) have transformed natural language processing, achieving state-of-the-art performance on a wide range of tasks. However, their high computational and memory requirements pose significant challenges for deployment, especially on resource-constrained hardware. In this paper, we conduct a controlled experiment to optimize the LLaMA 3.2 1B model with post-training quantization techniques implemented via the bitsandbytes library. We evaluate multiple precision settings, namely BF16, FP16, INT8, and INT4, and compare their tradeoffs in accuracy, throughput, latency, and resource utilization. Experiments are conducted on a workstation GPU (NVIDIA T1000) for accuracy benchmarking and on a cloud-based GPU (NVIDIA T4 on Google Colab) for performance benchmarking. Our findings show that lower-precision quantization can significantly reduce memory usage and improve throughput with minimal impact on model accuracy, providing valuable insights for efficient AI deployment in production environments.
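For illustration, the sketch below shows what such a setup could look like in practice: loading LLaMA 3.2 1B through the Hugging Face transformers integration of bitsandbytes with an INT4 (NF4) configuration, then taking a rough throughput and latency reading. This is a minimal sketch under stated assumptions, not the authors' benchmark code; the model id meta-llama/Llama-3.2-1B, the prompt, and the generation settings are illustrative.

# Minimal sketch (not the authors' exact benchmark code): INT4 (NF4)
# post-training quantization of LLaMA 3.2 1B via bitsandbytes, using
# the Hugging Face transformers integration.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Llama-3.2-1B"  # assumed Hugging Face model id

# INT4 (NF4) quantization config; for INT8, use load_in_8bit=True instead.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)

# Rough throughput/latency probe: time a single generation call and
# report generated tokens per second.
prompt = "Quantization reduces memory usage by"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=64)
elapsed = time.perf_counter() - start
new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/s, latency {elapsed:.2f} s")
print(tokenizer.decode(output[0], skip_special_tokens=True))

For the BF16 and FP16 baselines, the same from_pretrained call can be made without a quantization_config, passing torch_dtype=torch.bfloat16 or torch.float16 instead.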


Files

538.pdf (605.5 kB)
md5:1f258e9cbe5e8972ed6ee5926a3de40c