explainers Feb 5, 2026

What is Quantization? How It Lets You Run AI on a Regular GPU

A plain-English explanation of AI model quantization — what Q4_K_M, Q8_0, and the newer formats actually mean, and how to pick the right one for your hardware.

If you’ve looked into running AI models locally, you’ve run into terms like Q4_K_M, Q8_0, GGUF, and IQ4_XS. These are quantization formats, and they’re the reason you can run a model that should need 16 GB of VRAM on an 8 GB GPU. Here’s what’s actually going on.

The Core Problem

Take Gemma 3 12B — it has 12 billion parameters. Each parameter is stored as a 16-bit floating point number. That’s 2 bytes per parameter:

12 billion parameters × 2 bytes = 24 GB

That’s more VRAM than most GPUs have, and this isn’t even a large model. A 70B model needs 140 GB in full precision. Clearly, something has to give.
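
The arithmetic is just parameter count times bytes per parameter. A quick sketch of the sizes above:

```python
def full_precision_gb(params_billions, bytes_per_param=2):
    """Raw weight storage in GB: params_billions * 1e9 params, bytes each, / 1e9 bytes-per-GB."""
    return params_billions * bytes_per_param

print(f"Gemma 3 12B at FP16: {full_precision_gb(12):.0f} GB")   # 24 GB
print(f"70B model at FP16:   {full_precision_gb(70):.0f} GB")   # 140 GB
```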

What Quantization Does

Quantization shrinks each parameter by using fewer bits. Instead of a 16-bit float, you store it as a 4-bit or 8-bit integer. The model gets smaller, uses less VRAM, and runs faster — at the cost of some precision.
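
A toy sketch of the basic idea: pick a scale, map each float to a small integer, and accept the rounding error. (Real formats like Q4_K_M work block-wise with smarter bit allocation, so this is only illustrative.)

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: one shared scale, int8 codes."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    return [round(w / scale) for w in weights], scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

weights = [0.823, -1.27, 0.054, 0.331]
codes, scale = quantize_int8(weights)   # each code fits in one byte instead of two
restored = dequantize(codes, scale)
# Restored values are close but not exact; that is the "some precision" being traded away
errors = [abs(w - r) for w, r in zip(weights, restored)]
```

Rounding error per weight is bounded by half the scale, which is why a well-chosen scale (and smaller blocks, in real formats) keeps the loss small.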

Here’s what that looks like in practice for a 12B model:

| Format | Bits per Weight | Model Size | Quality |
|--------|-----------------|------------|---------|
| F16    | 16 bits  | ~24 GB   | Original (no loss) |
| Q8_0   | 8 bits   | ~12.7 GB | Nearly identical to original |
| Q6_K   | 6.5 bits | ~10 GB   | Very slight loss |
| Q5_K_M | 5.5 bits | ~8.5 GB  | Slight loss |
| Q4_K_M | 4.8 bits | ~7.5 GB  | Minor loss — the default for a reason |
| IQ4_XS | 4.3 bits | ~6.7 GB  | Noticeable on demanding tasks |
| IQ3_M  | 3.4 bits | ~5.7 GB  | Noticeable on most tasks |

The bottom line: Q4_K_M cuts the VRAM requirement by roughly 70% with quality loss that most people can’t consistently detect in normal use. That’s why it’s the standard.
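
The size column is roughly parameters × bits ÷ 8, plus some overhead for embeddings, metadata, and layers kept at higher precision, which is why a quick estimate lands slightly under the table's figures. (Q8_0's effective rate is closer to 8.5 bits per weight once its per-block scales are counted.)

```python
def quantized_size_gb(params_billions, bits_per_weight):
    """Approximate weight storage in GB: params * bits / 8 bits-per-byte."""
    return params_billions * bits_per_weight / 8

for fmt, bpw in [("F16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("IQ3_M", 3.4)]:
    print(f"{fmt:7s} ~{quantized_size_gb(12, bpw):.1f} GB")
```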

The Format Names Decoded

These names come from llama.cpp and the GGUF file format, which is how most local models are packaged:

  • Q = Quantized
  • The number (4, 5, 6, 8) = roughly how many bits per weight
  • K = K-quant method (smarter quantization that allocates more bits to important layers)
  • M / S / L = Medium, Small, or Large variant (how aggressively bits are allocated across layers)
  • IQ = Importance-weighted quantization (an even smarter technique that weights bits by parameter importance)
  • F16 / F32 = Full precision, no quantization
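
The scheme is regular enough to pull apart mechanically. A hypothetical parser sketch (llama.cpp's real type list is longer, with names like Q4_0 and IQ2_XXS, so the grammar here is simplified):

```python
import re

def parse_quant(name):
    """Split a quant name like 'Q4_K_M', 'Q8_0', or 'IQ4_XS' into its parts."""
    m = re.fullmatch(r"(I?)Q(\d+)(_K)?(?:_(XXS|XS|S|M|L|0))?", name)
    if not m:
        raise ValueError(f"unrecognized quant name: {name}")
    iq, bits, k, variant = m.groups()
    return {
        "importance_weighted": iq == "I",
        "bits": int(bits),
        "k_quant": k is not None,
        "variant": variant,  # S/M/L size variant, or "0" for the legacy formats
    }

parse_quant("Q4_K_M")   # bits=4, k_quant=True, variant="M"
parse_quant("IQ4_XS")   # importance_weighted=True, variant="XS"
```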

The Ones You’ll Actually Use

Q4_K_M — The default in Ollama. The right choice for most people most of the time. Benchmarks on Llama 3.1 8B show 92% perplexity retention with a 1.6% quality drop from FP16. In normal conversation, the difference is undetectable.

Q8_0 — Roughly double the VRAM of Q4, with quality that’s essentially identical to the original (~99.96% retention). Upgrade to this if you have VRAM headroom.

Q5_K_M — The middle ground. Worth it when Q4 feels slightly off on complex reasoning but Q8 doesn’t fit.

Q6_K — Close to Q8 quality in a smaller package. Good if you need to squeeze a model in tight.

IQ4_XS — Uses importance-weighted quantization for a slight edge over Q4_K_M at the same size. Inference is ~20% slower though, so only use it if size is the binding constraint.

Where Quality Loss Actually Shows Up

Q4_K_M vs FP16 is nearly invisible for:

  • Conversation and chat
  • Following instructions
  • Summarization
  • Basic coding tasks

You might notice it on:

  • Multi-step math and logic chains
  • Nuanced creative writing
  • Tasks that require recalling obscure facts
  • Very long context windows

Below Q4 (IQ3_M, Q3_K, etc.), the degradation gets real. These work for basic conversation but start failing on anything demanding. Use them when you absolutely can’t fit Q4.

The Practical Decision

Here’s the framework:

  1. Pick the largest model that fits at Q4_K_M. A bigger model at Q4 almost always beats a smaller model at Q8. A 14B at Q4 is smarter than an 8B at Q8.
  2. If VRAM is left over, bump the quantization. Go Q5_K_M, Q6_K, or Q8_0.
  3. Don’t go below Q4 unless you have to. Q3 and below is where quality degrades noticeably.
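
The three steps translate directly into code. A sketch, using approximate bits-per-weight figures from the table above and a made-up 2 GB reserve for context:

```python
# Quant formats ordered best-quality-first, with approximate bits per weight
QUANTS = [("Q8_0", 8.5), ("Q6_K", 6.5), ("Q5_K_M", 5.5), ("Q4_K_M", 4.8)]
HEADROOM_GB = 2  # rough reserve for KV cache and runtime overhead

def weights_gb(params_b, bpw):
    return params_b * bpw / 8

def pick(vram_gb, model_sizes_b):
    # Step 1: largest model that fits at Q4_K_M
    fits = [m for m in model_sizes_b
            if weights_gb(m, 4.8) + HEADROOM_GB <= vram_gb]
    if not fits:
        return None  # step 3 territory: consider sub-Q4 formats as a last resort
    model = max(fits)
    # Step 2: spend leftover VRAM on a better quant (first fit wins)
    for quant, bpw in QUANTS:
        if weights_gb(model, bpw) + HEADROOM_GB <= vram_gb:
            return model, quant

pick(16, [8, 14, 27])  # the 14B fits at Q4, and leftover VRAM buys a better quant
```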

Example for a 16 GB GPU:

  • Go-to: Qwen3 14B at Q4_K_M (~10 GB) — leaves room for context
  • Quality upgrade: Llama 3.1 8B at Q8_0 (~9.5 GB) — smaller model but near-perfect quality
  • Pushing it: Gemma 3 27B at IQ3_M (~15 GB) — fits, but quality takes a hit

Rule of thumb: Biggest model, Q4_K_M. Leftover VRAM? Move up to a higher-bit quant. Still tight? Try KV cache quantization before dropping model size.

What’s New: KV Cache Quantization

This is a recent Ollama feature that helps with long conversations. When you chat with a model, the conversation history (the KV cache) also takes up VRAM. Normally it’s stored in FP16, but you can quantize it:

OLLAMA_KV_CACHE_TYPE=q8_0    # ~50% KV cache savings, minimal quality loss
OLLAMA_KV_CACHE_TYPE=q4_0    # ~75% savings, some quality loss on very long contexts

This is separate from model quantization — it compresses the conversation memory, not the model weights. Useful when you’re running a large model and need longer conversations.
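
The cache grows linearly with context length, which is why long conversations eat VRAM. A rough sizing sketch, using Llama 3.1 8B's published architecture (32 layers, 8 KV heads, head dimension 128) as the model-specific inputs:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value):
    """Per token, each layer caches K and V: n_kv_heads * head_dim values apiece."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value / 1e9

shape = (32, 8, 128, 32_768)     # Llama 3.1 8B at a 32K-token context
fp16 = kv_cache_gb(*shape, 2)    # default FP16 cache
q8 = kv_cache_gb(*shape, 1)      # q8_0: half the bytes
print(f"FP16 cache: {fp16:.1f} GB, q8_0 cache: {q8:.1f} GB")
```

At 32K tokens that is a few GB of VRAM on top of the weights, so halving it is a meaningful win.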

What’s New: Dynamic Quantization

Tools like Unsloth Dynamic 2.0 now quantize each layer of a model independently, picking the optimal bit-width per layer instead of applying the same quantization everywhere. This squeezes better quality out of fewer bits. Unsloth’s 1-bit dynamic GGUF of DeepSeek V3.1 reportedly outperforms GPT-4.1 on coding benchmarks, which is remarkable for something that fits in 192 GB (down from 671 GB at full precision).

This stuff matters most for very large models on limited hardware. For standard 7B–14B models on a decent GPU, regular Q4_K_M is still perfectly fine.
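
The per-layer idea is easy to illustrate with toy numbers (the layer groups and bit assignments below are invented, not Unsloth's actual recipe): keep sensitive layers at high precision, quantize the bulk aggressively, and the average bits per weight stays low.

```python
# (layer group, params in billions, assigned bits) - invented for illustration
layers = [
    ("embeddings",  1.0, 8),  # sensitive: kept near full quality
    ("attention",   3.0, 4),
    ("mlp",         7.5, 2),  # bulk of the weights: quantized hard
    ("output head", 0.5, 8),
]

total_params = sum(p for _, p, _ in layers)                      # billions
avg_bpw = sum(p * b for _, p, b in layers) / total_params
size_gb = total_params * avg_bpw / 8                             # weights only
print(f"average {avg_bpw:.2f} bits/weight, ~{size_gb:.1f} GB")
```

Uniform 2-bit quantization of the same 12B weights would be smaller still, but the high-precision embeddings and output head are what keep quality from collapsing.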

How Ollama Handles It

Ollama defaults to Q4_K_M. Flash attention is enabled by default, which helps with memory efficiency. To run a specific quantization:

ollama run gemma3:12b          # Q4_K_M default
ollama run gemma3:12b-q8_0     # Explicitly Q8

Check available quantizations for any model on its Ollama library page, or see our model pages which list VRAM requirements for every quantization.

Macs and Unified Memory

Apple Silicon Macs use unified memory — the CPU and GPU share the same pool. A MacBook Pro with 48 GB of unified memory effectively has 48 GB of “VRAM.” This means you can load models that would require expensive desktop GPUs. The M5 chips (shipping since late 2025) are about 20–27% faster at token generation versus M4, and the upcoming M5 Max should push things further.

The tradeoff: Apple’s GPU is still slower per-token than a dedicated NVIDIA card. But the ability to load a 70B model on a laptop remains hard to beat.

Next Steps