Published February 13, 2025 | Version v1
Thesis | Open Access

Efficient Mixed-Precision Inference for Vision Transformers

Piotr Kluska
  • 1. IBM Research - Zurich
  • 2. Universitat Politècnica de València

Description

Recent advances in deep learning (DL) have been achieved by scaling the number of model parameters and the amount of training data. The Transformer is the most widely used DL architecture in natural language processing. Its key feature is the attention mechanism, which lets the model focus dynamically on the relevant context in the global latent space. This has led to language models with up to 405 billion parameters, equivalent to 810 gigabytes (GB) of memory in 16-bit floating point (FP16). At this precision, a model of this size requires several accelerator compute nodes. The Transformer architecture has been successfully transferred to the computer vision domain as the Vision Transformer (ViT). Similarly, scaling the number of parameters in the ViT architecture increased the model's predictive power. In addition, modifications to the ViT architecture led to new variants such as the Data-efficient Image Transformer (DeiT), the Swin Transformer, and DeiT3, each of which addressed shortcomings of the original architecture. Nevertheless, these models require substantial computing power and energy to process images at scale.
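For reference, the quoted memory footprint follows directly from the parameter count, since each FP16 value occupies 2 bytes; a quick check in Python:

    params = 405e9                       # 405 billion parameters
    print(f"{params * 2 / 1e9:.0f} GB")  # 2 bytes per FP16 value -> 810 GB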

The emergence of Artificial Intelligence (AI) systems and applications will require DL models to run efficiently and with conscious energy consumption, as it is expected that, by 2040, the energy consumed by computing devices will exceed our energy production capability. To tackle this, we investigate compression methods for ViT architectures that reduce the model's memory footprint and shift computation to more energy-efficient data types. The methods and algorithms presented in this dissertation allow ViT architectures to operate with lower energy consumption and reduced latency while maintaining predictive performance close to that of the reference model.

Firstly, we comprehensively evaluate the effect of post-training quantization on the ViT, DeiT, Swin Transformer, and DeiT3 models. We show that ViT and DeiT3 lose their predictive power after quantization, while DeiT and the Swin Transformer do not. We hypothesize that the regularization applied during training positively affects robustness to quantization. Next, we perform a per-layer analysis using the signal-to-quantization-noise ratio (SQNR), which measures the latent signal passing through a quantized network against the reference FP32 DL model. We show that a correlation exists between the SQNR value and the quantization error. Moreover, we propose a simple yet effective post-training quantization method that uses mixed-precision computation, allowing us to compress the models by up to 90%. As a result, our models approach fully quantized DL models in size while keeping predictive performance close to the FP32 baseline.
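For reference, SQNR compares the power of the FP32 activations (the signal) with the power of the quantization error (the noise), expressed in decibels. A minimal sketch in PyTorch, assuming paired activations captured from the FP32 and quantized models on the same input (an illustration, not the thesis code):

    import torch

    def sqnr_db(fp32_act: torch.Tensor, quant_act: torch.Tensor) -> float:
        # Quantization noise is the deviation of the quantized model's
        # activations from the FP32 reference activations.
        noise = fp32_act - quant_act
        signal_power = fp32_act.pow(2).mean()
        noise_power = noise.pow(2).mean()
        # Ratio of signal power to noise power, in decibels.
        return (10.0 * torch.log10(signal_power / noise_power)).item()

Higher values indicate that the quantized layer preserves the reference signal more faithfully.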

Next, we propose a novel post-training quantization method, Hybrid Quantization (HQ). HQ exploits the fact that ViTs are composed mainly of linear layers. Building on this, we design an automatic algorithm that selects, for each linear layer, either static or dynamic quantization based on the SQNR metric. Our HQ method improves predictive performance over static quantization in 12/12 ViT, 3/6 DeiT, 6/6 DeiT3, and 6/6 Swin Transformer models on the ImageNet1K validation dataset. Furthermore, we evaluate the latency of HQ models in three hardware environments: an Intel Xeon Gold 5218 CPU, a mobile Apple A15 Bionic CPU, and an NVIDIA A100 GPU. We observe average speedups of up to 1.15x, 1.28x, and 1.68x, respectively, over dynamic quantization for ViT models.
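Conceptually, the per-layer selection can be sketched as follows; the SQNR threshold and the assumption that per-layer SQNR values have already been measured (e.g. with sqnr_db above on calibration data) are illustrative, not the exact procedure from the thesis:

    def select_quantization(layer_sqnr_db: dict, threshold_db: float = 20.0) -> dict:
        # Map each linear layer to 'static' or 'dynamic' quantization.
        # layer_sqnr_db: per-layer SQNR (dB) measured under static
        # quantization; threshold_db is a hypothetical cut-off. Layers whose
        # static SQNR is too low fall back to dynamic quantization.
        return {
            name: ("static" if s >= threshold_db else "dynamic")
            for name, s in layer_sqnr_db.items()
        }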

Lastly, we design and implement a mixed-precision attention mechanism in the Triton language. Our mixed-precision attention combines 8-bit integer (INT8) and FP16 computation to achieve higher throughput and better numerical stability than the reference Triton implementation of FlashAttention. We show that a domain-specific language with a compiler can match heavily specialized GPU kernels. Moreover, we open-source our QAttn framework. In this library, we focus on integration with the PyTorch post-training quantization ecosystem, extending it with custom kernels for quantized matrix multiplication and our mixed-precision attention. We show that our kernels improve the throughput of the ViT model by up to 7.34x compared to the FP32 reference model. Moreover, our framework generalizes to newer foundation models such as the Segment Anything Model (SAM): we process over 5x more images per second with the base and large variants, without a drop in mean intersection over union (mIoU) on the COCO2017 validation set.
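The core INT8/FP16 split can be illustrated in plain PyTorch; this is a numerical sketch of the idea, not the fused Triton kernel from QAttn:

    import math
    import torch

    def quantize_int8(x: torch.Tensor):
        # Symmetric per-tensor quantization to INT8.
        scale = x.abs().amax().clamp(min=1e-8) / 127.0
        q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
        return q, scale

    def mixed_precision_attention(q, k, v):
        # Q.K^T scores from INT8 inputs (the integer matmul is simulated
        # here with a float matmul on the quantized values); softmax and
        # the P.V product run in FP16.
        q_i8, sq = quantize_int8(q)
        k_i8, sk = quantize_int8(k)
        scores = (q_i8.float() @ k_i8.float().transpose(-2, -1)) * (sq * sk)
        probs = torch.softmax(scores / math.sqrt(q.shape[-1]), dim=-1)
        return probs.to(torch.float16) @ v.to(torch.float16)

    # Example: q = k = v = torch.randn(1, 8, 197, 64)
    # out = mixed_precision_attention(q, k, v)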

In summary, this thesis holistically addresses the problem of post-training quantization of ViT models. We propose a novel method for quantizing the ViT architecture and open-source the QAttn framework, which implements quantized GPU kernels in Triton and integrates with the PyTorch framework. Our research demonstrates reductions in memory and latency compared to the reference DL model, and our work lays the foundation for further research into the compression of ViT models.

Files (17.9 MB)

Piotr Kluska PhD Thesis.pdf (17.9 MB)
md5:5d177b3848b4f075c60e656f9b590a0c

Additional details

Funding

European Commission
APROPOS – Approximate Computing for Power and Energy Optimisation (grant no. 956090)

Dates

Issued
2025-02-13