Quantization
Reducing the precision of model weights (e.g., from 32-bit to 4-bit) to decrease memory usage and increase speed with minimal quality loss.
In-Depth Explanation
Quantization reduces the numerical precision of model weights, typically from 32-bit or 16-bit floating point to 8-bit or 4-bit integers. This dramatically reduces memory requirements and can speed up inference.
How quantization works (sketched in code after this list):
- Map continuous weight values to discrete levels
- Store weights with fewer bits
- Dequantise during computation (or compute in lower precision)
- Trade-off: precision vs efficiency
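To make the mapping concrete, here is a minimal sketch of symmetric "absmax" quantisation in NumPy. The function names are illustrative rather than from any particular library: the scale factor maps the largest-magnitude weight onto the top of the int8 range, and dequantising recovers an approximation of the original values.

```python
import numpy as np

def quantise_int8(weights: np.ndarray):
    """Map float32 weights onto the discrete int8 grid [-127, 127]."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantise(q: np.ndarray, scale: float) -> np.ndarray:
    """Store fewer bits, but expand back to float32 for computation."""
    return q.astype(np.float32) * scale

w = np.random.randn(6).astype(np.float32)
q, scale = quantise_int8(w)
print(w)
print(dequantise(q, scale))  # close to w, but not identical
```

Running this shows the round-trip error: the dequantised weights are close to, but not exactly, the originals. That small gap is the precision-versus-efficiency trade-off in miniature.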
Quantization types:
- Post-training quantization (PTQ): Quantise after training
- Quantization-aware training (QAT): Train with quantisation in mind
- Dynamic quantization: Quantise activations on-the-fly (see the PyTorch sketch after this list)
- Static quantization: Pre-computed quantization parameters
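As one concrete example of dynamic post-training quantisation, PyTorch ships a helper that converts a trained model's linear layers to int8 weights while quantising activations on-the-fly at inference time. A minimal sketch, assuming a recent PyTorch with the torch.ao.quantization namespace (the toy model stands in for a real trained one):

```python
import torch
import torch.nn as nn

# A trained FP32 model would normally go here; a toy stack stands in for it.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 8))

quantised = torch.ao.quantization.quantize_dynamic(
    model,
    {nn.Linear},        # which layer types to quantise
    dtype=torch.qint8,  # store weights as 8-bit integers
)

x = torch.randn(1, 128)
print(quantised(x).shape)  # same interface as the original model
```

Dynamic quantisation needs no calibration data, which is why it is often the first PTQ technique teams try for CPU deployments.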
Common precisions (memory footprints worked out below):
- FP32: 32-bit floating point, the standard training precision
- FP16/BF16: Half precision (2x smaller than FP32; common for inference)
- INT8: 8-bit integer (4x smaller than FP32)
- INT4/NF4: 4-bit integer / 4-bit NormalFloat (8x smaller than FP32)
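The size reductions follow directly from the bit widths. A back-of-the-envelope calculation for a hypothetical 7-billion-parameter model (weights only; real deployments also need memory for activations and any KV cache):

```python
params = 7_000_000_000  # a hypothetical 7B-parameter model

for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("INT8", 8), ("INT4/NF4", 4)]:
    gib = params * bits / 8 / 1024**3  # bits -> bytes -> GiB
    print(f"{name:>9}: {gib:5.1f} GiB")
# FP32 ~26.1 GiB, FP16 ~13.0 GiB, INT8 ~6.5 GiB, INT4 ~3.3 GiB
```

This is where the 4-8x figure below comes from: INT8 weights take a quarter, and INT4 weights an eighth, of the FP32 footprint.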
Business Context
Quantization can reduce model size by 4-8x, enabling deployment on cheaper hardware or edge devices.
How Clever Ops Uses This
We use quantization to make AI deployment more cost-effective for Australian businesses, running capable models on modest infrastructure.
Example Use Case
"Running a quantised model on a laptop instead of requiring cloud GPUs, enabling offline AI capabilities."
Related Terms
Inference
Using a trained model to make predictions or generate outputs on new data.
Parameters
The learned values (weights and biases) in a neural network that determine its behaviour.
QLoRA (Quantized LoRA)
An even more efficient fine-tuning technique that combines quantisation with LoRA.
Related Resources
Learning Centre
Guides, articles, and resources on AI and automation.
AI & Automation Services
Explore our full AI automation service offering.
AI Readiness Assessment
Check if your business is ready for AI automation.
