Published September 2, 2024 | Version v1
Conference paper · Open Access

Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAs

  • 1. National University of Singapore
  • 2. AMD Research, Dublin
  • 3. Tampere University
  • 4. AMD Research, Germany
  • 5. AMD Research, Ireland

Contributors

  • 1. AMD Research, Ireland
  • 2. National University of Singapore

Description

Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead. Recent works have investigated adopting 8-bit floating-point formats (FP8) in the context of PTQ for model inference. However, floating-point formats smaller than 8 bits, and their accuracy-hardware cost trade-offs relative to integer formats, remain unexplored on FPGAs. In this work, we present minifloats, reduced-precision floating-point formats capable of further reducing the memory footprint, latency, and energy cost of a model while approaching full-precision model accuracy. We implement a custom FPGA-based multiply-accumulate operator library and explore the vast design space, comparing minifloat and integer representations across 3 to 8 bits for both weights and activations. We also examine the applicability of various integer-based quantization techniques to minifloats. Our experiments show that minifloats offer a promising alternative for emerging workloads such as vision transformers.
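To illustrate the kind of reduced-precision float format the abstract refers to, the sketch below rounds a real value to the nearest number representable in a signed minifloat with configurable exponent and mantissa bit widths. This is a hypothetical illustration, not code from the paper's operator library; it assumes an IEEE-style exponent bias, subnormal support, and no reserved inf/NaN encodings (the `quantize_minifloat` name and its parameters are inventions for this example).

```python
import math

def quantize_minifloat(x, exp_bits=4, man_bits=3, bias=None):
    """Round x to the nearest value representable in a signed minifloat
    with exp_bits exponent bits and man_bits mantissa bits.
    Assumes IEEE-style bias and subnormals, no inf/NaN codepoints."""
    if bias is None:
        bias = 2 ** (exp_bits - 1) - 1  # IEEE-style exponent bias
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    e_max = (2 ** exp_bits - 1) - bias  # largest normal exponent
    e_min = 1 - bias                    # smallest normal exponent
    e = math.floor(math.log2(mag))
    e = max(e, e_min)                   # subnormals share exponent e_min
    scale = 2.0 ** (e - man_bits)       # spacing of representable values
    q = round(mag / scale) * scale      # round to nearest grid point
    # Clamp to the largest representable magnitude
    max_val = (2.0 - 2.0 ** (-man_bits)) * 2.0 ** e_max
    q = min(q, max_val)
    return sign * q
```

For example, with a 4-bit exponent and 3-bit mantissa (an E4M3-like format under these assumptions), `quantize_minifloat(0.3, 4, 3)` returns `0.3125`, the nearest representable value; out-of-range inputs saturate at the largest finite magnitude.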

Files

Shedding_the_Bits.pdf (2.1 MB)
md5:19a021beb0835dc0761957a0cc96c493

Additional details

Funding

  • European Commission: APROPOS - Approximate Computing for Power and Energy Optimisation (grant 956090)
  • National Research Foundation: Competitive Research Programme (grant NRF-CRP23-2019-0003)
  • Ministry of Education: Academic Research Fund T1 (grant 251RES1905)

Software

  • Repository URL: https://github.com/Xilinx/brevitas
  • Programming language: Python
  • Development status: Active