Reducing the precision of model weights (e.g., from 32-bit to 4-bit) to decrease memory usage and increase speed with minimal quality loss.
Quantization reduces the numerical precision of model weights, typically from 32-bit or 16-bit floating point to 8-bit or 4-bit integers. This dramatically reduces memory requirements and can speed up inference.
How quantization works:
Each floating-point weight is mapped to a small integer using a scale factor (and, in asymmetric schemes, a zero point). At inference time the integers are scaled back to approximate floating-point values, so the model runs with slightly less precise weights in exchange for a much smaller footprint.
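A minimal sketch of the idea in Python/NumPy, using a per-tensor symmetric scheme (real libraries use more refined per-channel or block-wise variants):

```python
# Illustrative per-tensor symmetric int8 quantization -- not any library's exact scheme.
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float weights to int8 values plus one scale factor."""
    scale = np.abs(weights).max() / 127.0          # largest magnitude maps to +/-127
    q = np.round(weights / scale).astype(np.int8)  # store compact integers
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for inference."""
    return q.astype(np.float32) * scale

weights = np.random.randn(1000).astype(np.float32)
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

print(f"memory: {weights.nbytes} bytes -> {q.nbytes} bytes")            # 4000 -> 1000
print(f"mean absolute error: {np.abs(weights - restored).mean():.5f}")  # small rounding loss
```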
Quantization types:
The two broad approaches are post-training quantization (PTQ), which converts the weights of an already-trained model, and quantization-aware training (QAT), which simulates low precision during training so the model learns to tolerate it.
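As one concrete PTQ example, PyTorch includes a dynamic quantization utility that converts a model's linear layers to int8 with no retraining; a brief sketch, assuming PyTorch is installed and using a small stand-in model:

```python
import torch
import torch.nn as nn

# A small stand-in for a trained model; any nn.Module with Linear layers works.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Post-training dynamic quantization: weights of the listed layer types are
# converted to int8; activations are quantized on the fly at runtime.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface as before, smaller int8 weights inside
```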
Common precisions:
FP32 (full precision), FP16/BF16 (half precision), INT8 and INT4. Each halving of bit width roughly halves the memory needed to store the weights.
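Back-of-the-envelope weight storage for a hypothetical 7-billion-parameter model at each of these precisions (the 7B figure is only an illustrative assumption):

```python
# Rough weight-storage cost for a hypothetical 7B-parameter model.
params = 7e9
for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("INT8", 8), ("INT4", 4)]:
    gigabytes = params * bits / 8 / 1e9
    print(f"{name:>9}: {gigabytes:5.1f} GB")
# FP32 ~28 GB, FP16 ~14 GB, INT8 ~7 GB, INT4 ~3.5 GB (4x and 8x smaller than FP32).
```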
Quantization can reduce model size by 4-8x, enabling deployment on cheaper hardware or edge devices.
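In practice, many deployments simply load a model in 4-bit form at startup. A sketch using the Hugging Face transformers and bitsandbytes libraries; the model name is only an example, and a GPU with enough memory for the 4-bit weights is assumed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-v0.1"  # example model; substitute your own

# Ask transformers to quantize the weights to 4-bit NF4 as they are loaded.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on available GPU(s)/CPU automatically
)

inputs = tokenizer("Quantization lets us run this on modest hardware because",
                   return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0]))
```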
We use quantization to make AI deployment more cost-effective for Australian businesses, running capable models on modest infrastructure.
"Running a quantised model on a laptop instead of requiring cloud GPUs, enabling offline AI capabilities."