Published April 11, 2025 | Version v1
Dataset | Open Access

OPTIMIZING LLAMA 3.2 1B USING QUANTIZATION TECHNIQUES WITH BITSANDBYTES FOR EFFICIENT AI DEPLOYMENT

  • 1. Dept. of Manufacturing Engineering and Industrial Management, COEP Technological University Pune, India.
  • 2. Dept. of Mechanical Engineering, COEP Technological University, Pune, India.
  • 3. Dept. of Computer Science & IT, COEP Technological University, Pune, India.

Description

Large Language Models (LLMs) have transformed natural language processing, achieving state-of-the-art performance on a wide range of tasks. However, their high computational and memory requirements pose significant challenges for deployment, especially on resource-constrained hardware. In this paper, we conduct a controlled experiment to optimize the LLaMA 3.2 1B model with post-training quantization techniques implemented via the bitsandbytes library. We evaluate multiple precision settings, namely BF16, FP16, INT8, and INT4, and compare their tradeoffs in accuracy, throughput, latency, and resource utilization. Experiments are conducted on a workstation GPU (NVIDIA T1000) for accuracy benchmarking and on a cloud-based GPU (NVIDIA T4 on Google Colab) for performance benchmarking. Our findings show that lower-precision quantization can significantly reduce memory usage and improve throughput with minimal impact on model accuracy, providing valuable insights for efficient AI deployment in production environments.
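For illustration, the sketch below shows what such a setup could look like in practice: loading LLaMA 3.2 1B through the Hugging Face transformers integration of bitsandbytes with an INT4 (NF4) configuration, then taking a rough throughput and latency reading. This is a minimal sketch under stated assumptions, not the authors' benchmark code; the model id meta-llama/Llama-3.2-1B, the prompt, and the generation settings are illustrative.

# Minimal sketch (not the authors' exact benchmark code): INT4 (NF4)
# post-training quantization of LLaMA 3.2 1B via bitsandbytes, using
# the Hugging Face transformers integration.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Llama-3.2-1B"  # assumed Hugging Face model id

# INT4 (NF4) quantization config; for INT8, use load_in_8bit=True instead.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)

# Rough throughput/latency probe: time a single generation call and
# report generated tokens per second.
prompt = "Quantization reduces memory usage by"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=64)
elapsed = time.perf_counter() - start
new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/s, latency {elapsed:.2f} s")
print(tokenizer.decode(output[0], skip_special_tokens=True))

For the BF16 and FP16 baselines, the same from_pretrained call can be made without a quantization_config, passing torch_dtype=torch.bfloat16 or torch.float16 instead.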


Files

538.pdf (605.5 kB)
md5:1f258e9cbe5e8972ed6ee5926a3de40c