Quantization Guide
Quantization reduces the precision of a model's weights, shrinking file size and VRAM requirements while preserving most of the model's capability. It's what makes it possible to run 70B parameter models on consumer hardware.
What is Quantization?
AI models store their knowledge as billions of numerical weights. At full precision (FP16), each weight takes 16 bits (2 bytes). A 70B parameter model at FP16 needs ~140GB of memory — far more than any consumer GPU has.
Quantization compresses these weights to fewer bits per weight. For example, Q4_K_M uses roughly 4.8 bits per weight, reducing that 140GB model to around 40GB — small enough to fit in a Mac with 48GB unified memory.
The key insight is that most weights don't need full precision. Quantization schemes such as the K-quants from the GGML/llama.cpp ecosystem (the Q*_K formats in the table below) use mixed precision, keeping the most important weights at higher precision while compressing less important ones more aggressively.
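To make the arithmetic concrete, here is a minimal sketch in plain Python (no external libraries) that reproduces the ~140GB and ~40GB figures; the function name `model_size_gb` is illustrative, and the 4.83 bits/weight value for Q4_K_M comes from the table below.

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone: parameters x bits per weight, in GB."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9  # bits -> bytes -> GB

# A 70B-parameter model at FP16 (16 bits/weight) vs. Q4_K_M (~4.83 bits/weight).
print(model_size_gb(70, 16))    # ~140 GB
print(model_size_gb(70, 4.83))  # ~42 GB of raw weights
```

The small gap between ~42GB here and the ~40GB quoted above is just rounding and the decimal-versus-binary gigabyte distinction.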
Quantization Formats
| Format | Bits/Weight | VRAM Multiplier (GB per B params) | Quality Impact | Best For |
|---|---|---|---|---|
| IQ4_XS | 4.25 | 0.48x | Comparable to Q4_K_S in quality despite smaller size, thanks to importance-weighted quantization. May show slightly more degradation on some tasks. | Users with limited VRAM who want the smallest practical model size while maintaining reasonable quality. Good for constrained hardware environments. |
| Q4_K_S | 4.58 | 0.52x | Slightly more quality loss than Q4_K_M, but still very usable for most tasks. | Users who need a slightly smaller model than Q4_K_M and can tolerate minimal additional quality loss. |
| Q4_K_M | 4.83 | 0.55x | Minimal quality loss for most tasks. Widely considered the best default choice for local inference. | General-purpose local inference. Best default choice for most users and hardware configurations. |
| Q5_K_S | 5.53 | 0.63x | Low quality loss. Slightly more degradation than Q5_K_M but still superior to Q4 variants. | Users who want Q5-level quality with a slightly smaller footprint than Q5_K_M. |
| Q5_K_M | 5.68 | 0.65x | Very low quality loss. Noticeably better than Q4 quantizations for tasks requiring nuance and accuracy. | Users with sufficient VRAM who want better quality than Q4 without the full cost of Q6 or Q8. |
| Q6_K | 6.56 | 0.75x | Very minimal quality loss. Nearly indistinguishable from the original model for most tasks. | Users with ample VRAM who prioritize quality and want meaningful compression over F16/F32. |
| Q8_0 | 8.5 | 0.95x | Near-lossless. Virtually indistinguishable from the full-precision model in quality. | Users who want maximum quality with some size reduction, and have sufficient VRAM to accommodate larger models. |
| F16 | 16 | 1.9x | No quality loss. This is the original model precision as trained and released. | Users who need full model fidelity and have sufficient VRAM. Often used as a baseline for quality comparisons. |
| F32 | 32 | 3.8x | No quality loss. Maximum numerical precision, though practically identical to F16 for inference. | Training, fine-tuning, or research purposes. Rarely used for inference due to excessive memory requirements with no practical quality benefit over F16. |
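For scripting, the two numeric columns translate directly into a lookup table. The sketch below transcribes the Bits/Weight and VRAM Multiplier values from the table above into a Python dict and prints the approximate on-disk weight size for a hypothetical 70B model; the dict name and the 70B example are illustrative, not part of any library.

```python
# Bits/weight and VRAM multiplier per format, transcribed from the table above.
QUANT_FORMATS = {
    "IQ4_XS": (4.25, 0.48),
    "Q4_K_S": (4.58, 0.52),
    "Q4_K_M": (4.83, 0.55),
    "Q5_K_S": (5.53, 0.63),
    "Q5_K_M": (5.68, 0.65),
    "Q6_K":   (6.56, 0.75),
    "Q8_0":   (8.5,  0.95),
    "F16":    (16,   1.9),
    "F32":    (32,   3.8),
}

params_b = 70  # model size in billions of parameters (example value)
for name, (bits, _mult) in QUANT_FORMATS.items():
    disk_gb = params_b * bits / 8  # params (B) x bytes per weight = GB of weights
    print(f"{name:7s} ~{disk_gb:6.1f} GB on disk")
```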
Which Quantization Should I Use?
Best Overall: Q4_K_M
The sweet spot for most users. Excellent quality-to-size ratio with minimal perceptible quality loss. This is the default recommendation for most models.
Best Quality: Q8_0
Near-lossless quality. Use this if you have enough VRAM and want the best possible output. Recommended for smaller models (7B-14B) where VRAM isn't a concern.
Maximum Compression: IQ4_XS
When every GB counts. Uses importance matrix quantization to preserve the most critical weights. Slight quality degradation but enables running larger models.
Full Precision: F16
No quantization — use the model exactly as released. Only practical for small models (under 14B) unless you have enterprise-grade hardware.
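If you want these recommendations in script form, a minimal sketch might look like the following; the priority keywords are shorthand for the four headings above, not an established convention.

```python
def recommend_quant(priority: str = "default") -> str:
    """Map this guide's recommendations to a quantization format name.

    Priorities (shorthand for the headings above):
      default        -> Q4_K_M  (best overall)
      quality        -> Q8_0    (best quality, needs more VRAM)
      compression    -> IQ4_XS  (maximum compression)
      full_precision -> F16     (no quantization)
    """
    table = {
        "default": "Q4_K_M",
        "quality": "Q8_0",
        "compression": "IQ4_XS",
        "full_precision": "F16",
    }
    return table[priority]

print(recommend_quant())           # Q4_K_M
print(recommend_quant("quality"))  # Q8_0
```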
VRAM Estimation Formula
VRAM (GB) ≈ Parameters (B) × VRAM Multiplier + 1.5 GB overhead
For example, a 70B model at Q4_K_M (multiplier 0.55):
70 × 0.55 + 1.5 = 40.0 GB VRAM needed
Note: Actual VRAM usage varies with context length (the KV cache grows as the context fills), batch size, and runtime overhead; the fixed 1.5 GB term is a modest safety margin, so leave extra headroom for long contexts.
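As a sanity check, the formula and the multipliers from the table translate into a few lines of Python; the helper name and the 48GB budget example are illustrative assumptions, not part of any tool.

```python
def estimate_vram_gb(params_billion: float, multiplier: float, overhead_gb: float = 1.5) -> float:
    """VRAM (GB) ~= parameters (B) x VRAM multiplier + fixed overhead (the formula above)."""
    return params_billion * multiplier + overhead_gb

# The worked example above: a 70B model at Q4_K_M (multiplier 0.55).
print(estimate_vram_gb(70, 0.55))  # 40.0

# Which formats fit a 48GB budget (e.g. a Mac with 48GB unified memory) for a 70B model?
budget_gb = 48
for name, mult in [("IQ4_XS", 0.48), ("Q4_K_M", 0.55), ("Q5_K_M", 0.65), ("Q6_K", 0.75), ("Q8_0", 0.95)]:
    need = estimate_vram_gb(70, mult)
    print(f"{name:7s} needs ~{need:.1f} GB -> {'fits' if need <= budget_gb else 'too big'}")
```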