Q

Quantization

Reducing the precision of model weights (e.g., from 32-bit to 4-bit) to decrease memory usage and increase speed with minimal quality loss.

In-Depth Explanation

Quantization reduces the numerical precision of model weights, typically from 32-bit or 16-bit floating point to 8-bit or 4-bit integers. This dramatically reduces memory requirements and can speed up inference.

How quantization works:

  • Map continuous weight values to a small set of discrete levels
  • Store the weights using fewer bits
  • Dequantize during computation, or compute directly in lower precision (see the sketch after this list)
  • Trade-off: precision vs efficiency
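A minimal sketch of this mapping, assuming symmetric per-tensor INT8 quantization in NumPy; the function names and the toy weight tensor are illustrative, not taken from any particular library:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float weights onto discrete INT8 levels (symmetric, per-tensor)."""
    scale = np.abs(weights).max() / 127.0                    # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for computation."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)                 # stand-in for model weights
q, scale = quantize_int8(w)

print("stored bytes:", q.nbytes, "vs original:", w.nbytes)   # 16 vs 64 bytes (4x smaller)
print("max round-trip error:", np.abs(w - dequantize(q, scale)).max())
```

The round-trip error is the precision given up in exchange for the smaller footprint; because it is small relative to the spread of the weights, quality loss is usually modest.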

Quantization types:

  • Post-training quantization (PTQ): Quantize a model after it has been trained, with no further training
  • Quantization-aware training (QAT): Simulate quantization during training so the model learns to compensate for the reduced precision
  • Dynamic quantization: Quantize activations on the fly at inference time (illustrated in the sketch after this list)
  • Static quantization: Use pre-computed (calibrated) quantization parameters for activations
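As a concrete example of post-training dynamic quantization, PyTorch can convert the linear layers of an already-trained model in a single call. A rough sketch, assuming a recent PyTorch install; the toy model is a stand-in for a real trained checkpoint:

```python
import torch
import torch.nn as nn

# Toy "trained" model: in practice this would be loaded from a real checkpoint.
model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)
model.eval()

# Post-training dynamic quantization: Linear weights are stored as INT8 and
# activations are quantized on the fly at inference; no retraining or calibration data.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized_model(x).shape)   # same interface, smaller and faster Linear layers
```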

Common precisions:

  • FP32: Standard 32-bit training precision
  • FP16/BF16: Half precision (common for inference)
  • INT8: 8-bit integer (4x smaller than FP32)
  • INT4/NF4: 4-bit integer or NormalFloat (8x smaller than FP32); see the footprint arithmetic after this list
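To make the size differences concrete, here is the footprint arithmetic for a hypothetical 7-billion-parameter model at each precision, ignoring the small overhead of scales and zero-points that quantized formats also store:

```python
params = 7_000_000_000          # hypothetical 7B-parameter model

bits_per_weight = {"FP32": 32, "FP16/BF16": 16, "INT8": 8, "INT4/NF4": 4}

for name, bits in bits_per_weight.items():
    gigabytes = params * bits / 8 / 1e9
    print(f"{name:>10}: {gigabytes:5.1f} GB")

# FP32 ~28 GB, FP16 ~14 GB, INT8 ~7 GB, INT4 ~3.5 GB:
# the 4x and 8x savings quoted above, before quantization metadata.
```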

Business Context

Quantization can reduce a model's memory footprint by 4-8x relative to full precision, enabling deployment on cheaper hardware or edge devices.

How Clever Ops Uses This

We use quantization to make AI deployment more cost-effective for Australian businesses, running capable models on modest infrastructure.

Example Use Case

"Running a quantised model on a laptop instead of requiring cloud GPUs, enabling offline AI capabilities."


Need Expert Help?

Understanding is the first step. Let our experts help you implement AI solutions for your business.
