Why Quantization Matters

Digital art generated by stable diffusion.
Mon Feb 10 2025
Guillaume Lagrange

Modern deep learning models, such as large language models (LLMs), are heavily constrained by memory bandwidth. GPUs can execute floating-point operations (FLOPs) much faster than they can fetch weights from memory. For instance, an NVIDIA A10 has a peak computation throughput of 125 TFLOPS and a memory bandwidth of 600 GB/s. This means that the GPU can perform approximately 200 FLOPs for each byte of memory it loads [1].

In the context of a single matrix multiplication, processing one input takes 2 FLOPs (1 multiply, 1 add) per weight. Since each FP16 weight occupies 2 bytes, the GPU can execute roughly 400 FLOPs in the time it takes to load a single weight value. Now imagine a model with billions of parameters: the runtime will be dominated by memory reads.

To overcome this bottleneck, we turn to quantization.

By compressing the weights to a lower bitwidth representation (e.g., 4-bit integers), we can dramatically cut memory requirements and improve throughput. For the same matrix multiplication, each weight now occupies only half a byte, so the GPU executes just 100 FLOPs in the time it takes to load a single 4-bit weight value: a 4× improvement!

This example is only theoretical of course. Achieving such a speedup involves many challenges, especially without sacrificing accuracy.
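
To make the arithmetic concrete, here is a tiny back-of-the-envelope sketch. The hardware figures are the ones quoted above; the helper function is purely illustrative and not part of any library.

// Rough estimate of how many FLOPs the GPU can execute in the time it takes
// to load a single weight, for a given weight bitwidth (illustrative only).
fn flops_per_weight_load(peak_flops: f64, bandwidth_bytes_per_s: f64, bits_per_weight: f64) -> f64 {
    // ~208 FLOPs per byte for the A10 (rounded to ~200 in the text above)
    let flops_per_byte = peak_flops / bandwidth_bytes_per_s;
    flops_per_byte * bits_per_weight / 8.0
}

fn main() {
    let (peak, bw) = (125e12, 600e9); // 125 TFLOPS, 600 GB/s (NVIDIA A10)
    println!("FP16: ~{:.0} FLOPs per weight load", flops_per_weight_load(peak, bw, 16.0));
    println!("INT4: ~{:.0} FLOPs per weight load", flops_per_weight_load(peak, bw, 4.0));
}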

What We've Built So Far

To go from a floating-point to a fixed-point representation, we need a scheme to convert floating-point values to integers. A floating-point value x can be expressed approximately as a scalar multiplied by an integer value:

x ≈ x_q * s

where s is a floating-point scale factor and x_q is an integer that represents the quantized version of x. The full range of floating-point values is very large, so simply mapping it to a lower bitwidth representation like INT8 would introduce too much error. In practice, we only need to map the range [alpha, beta] of the data at hand. This range is determined during a calibration step, which aims to include as many values as possible while minimizing the quantization error. Burn currently supports the simple MinMaxCalibration, which computes the range from the minimum and maximum values.

In asymmetric or affine quantization, the range of floating-point values [alpha, beta] is mapped to the full bitwidth range [a, b] (e.g., [-128, 127] for INT8). The scheme is defined by the scale factor s and a zero-point z to map a floating point value to the integer range:

x_q = clamp(round(x / s + z), a, b)

To recover the real-value input we can perform dequantization:

x ≈ s * (x_q - z)
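
As a concrete example, here is a minimal sketch (plain Rust, not Burn's implementation) that derives the affine parameters from a calibrated range [alpha, beta] and applies the formulas above for INT8; the zero-point formula is a common choice rather than something prescribed by the scheme itself.

// Derive affine INT8 parameters from a calibrated range [alpha, beta],
// mapping it to the full integer range [a, b] = [-128, 127].
fn affine_qparams(alpha: f32, beta: f32) -> (f32, f32) {
    let (a, b) = (-128.0f32, 127.0f32);
    let s = (beta - alpha) / (b - a); // scale
    let z = (a - alpha / s).round(); // zero-point (kept as f32 for simplicity)
    (s, z)
}

// x_q = clamp(round(x / s + z), a, b)
fn quantize_affine(x: f32, s: f32, z: f32) -> i8 {
    (x / s + z).round().clamp(-128.0, 127.0) as i8
}

// x ≈ s * (x_q - z)
fn dequantize_affine(x_q: i8, s: f32, z: f32) -> f32 {
    s * (x_q as f32 - z)
}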

In symmetric quantization, the scheme is simplified to restrict the zero-point to 0, leaving only the scaling factor:

x_q = clamp(round(x / s), a, b)

This reduces the computational overhead of dealing with the zero-point offset. The real-value input can be recovered with dequantization:

x ≈ x_q * s
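
The symmetric variant of the same sketch only needs a scale; a common choice (an assumption here, since the text above does not prescribe one) is to divide the largest absolute value of the calibrated range by 127.

// Symmetric INT8 quantization: the zero-point is fixed to 0, so only a scale
// is needed. The integer range is restricted to [-127, 127] to stay symmetric.
fn symmetric_scale(alpha: f32, beta: f32) -> f32 {
    alpha.abs().max(beta.abs()) / 127.0
}

// x_q = clamp(round(x / s), a, b)
fn quantize_symmetric(x: f32, s: f32) -> i8 {
    (x / s).round().clamp(-127.0, 127.0) as i8
}

// x ≈ x_q * s
fn dequantize_symmetric(x_q: i8, s: f32) -> f32 {
    x_q as f32 * s
}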

So far, we have explained quantization with a single set of parameters. In practice, quantization can be applied with increasing granularity. Per-tensor quantization is the simplest form, in which a single set of parameters is shared by all the elements of a tensor. But parameters can also be defined for individual segments of a tensor (e.g., per output channel).
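
To illustrate the difference in granularity, the sketch below (illustrative only, not Burn's API) computes one symmetric scale per output channel of a weight matrix, instead of a single scale shared by the whole tensor.

// One symmetric scale per output channel (row) of a weight matrix,
// rather than a single per-tensor scale.
fn per_channel_scales(weights: &[Vec<f32>]) -> Vec<f32> {
    weights
        .iter()
        .map(|row| row.iter().fold(0.0f32, |m, &w| m.max(w.abs())) / 127.0)
        .collect()
}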

Burn currently supports per-tensor quantization to 8-bit integers, with both symmetric and affine quantization schemes.

As such, floating-point tensors can be converted to their quantized representation, and vice-versa:


// Quantize the tensor with the given quantization parameters
let x_q = x.quantize(&scheme, qparams);

// Dynamically compute the quantization parameters to quantize the tensor
let x_q = x.quantize_dynamic(&scheme);

// Dequantize the values
let x = x_q.dequantize();
  

This enables users to efficiently store and load quantized weights, reducing memory usage.


// Quantization config
let mut quantizer = Quantizer {
    calibration: MinMaxCalibration {},
    scheme: QuantizationScheme::PerTensorSymmetric(QuantizationType::QInt8),
};

// Quantize the weights
let model = model.quantize_weights(&mut quantizer);
  

More details can be found in the Burn book's section on quantization [2].

What's Next

At this time, tensor operations are not performed directly on the quantized inputs. This means that quantized tensors must be dequantized to floating-point precision before any computations can be performed. So while 8-bit quantization is a good starting point, it's quite inefficient to repeatedly apply dequantization at runtime.

A more efficient implementation should load the quantized values from memory, perform floating-point arithmetic on the dequantized values, and store the result back in memory in its quantized form. In the best-case scenario, when all inputs are quantized, the same operation is performed using integer arithmetic, skipping dequantization altogether.
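
As a rough illustration of what such a fused operation looks like, here is a plain-Rust sketch (not one of Burn's kernels) of a dot product with on-the-fly dequantization, assuming symmetric per-tensor quantization, followed by the all-integer variant.

// Weights stay quantized in memory (i8 + scale) and are dequantized on the
// fly inside the inner loop, against floating-point activations.
fn dot_q8_f32(weights_q: &[i8], w_scale: f32, activations: &[f32]) -> f32 {
    weights_q
        .iter()
        .zip(activations)
        .map(|(&w, &a)| (w as f32 * w_scale) * a)
        .sum()
}

// When the activations are quantized too, the accumulation stays in integer
// arithmetic and a single rescale is applied at the end.
fn dot_q8_q8(weights_q: &[i8], w_scale: f32, activations_q: &[i8], a_scale: f32) -> f32 {
    let acc: i32 = weights_q
        .iter()
        .zip(activations_q)
        .map(|(&w, &a)| w as i32 * a as i32)
        .sum();
    acc as f32 * w_scale * a_scale
}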

Our next steps focus on enabling efficient inference without unnecessary conversions:

  • Lower bitwidth quantization: We're working on 4-bit integer support, further reducing memory bandwidth requirements and improving performance.
  • Efficient kernels: On-the-fly dequantization keeps weights compressed in GPU memory and decompresses them just before computation when mixing quantized weights with floating-point inputs. When all inputs are quantized, dequantization can be skipped entirely in favor of integer arithmetic.
  • Better granularity: Beyond per-tensor quantization, more fine-grained control can be applied to reduce quantization error and improve model accuracy.

By enabling high-performance quantized operations at lower bitwidths (e.g., 4-bit), we're building the foundation for efficient, scalable inference on modern GPUs and accelerators.

In the future, we'll explore the addition of quantization strategies such as GPTQ [3], AWQ [4], and SpinQuant [5], allowing Burn users to fine-tune and deploy optimized models seamlessly.

References

[1] MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models
[2] The Burn Book: Quantization (Beta)
[3] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
[4] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
[5] SpinQuant: LLM quantization with learned rotations
