explainers Feb 5, 2026

What is Quantization? How It Lets You Run AI on a Regular GPU

A plain-English explanation of AI model quantization — what Q4_K_M, Q8_0, and the newer formats actually mean, and how to pick the right one for your hardware.

If you’ve looked into running AI models locally, you’ve run into terms like Q4_K_M, Q8_0, GGUF, and IQ4_XS. These are quantization formats, and they’re the reason you can run a model that should need 16 GB of VRAM on an 8 GB GPU. Here’s what’s actually going on.

The Core Problem

Take Gemma 3 12B — it has 12 billion parameters. Each parameter is stored as a 16-bit floating point number. That’s 2 bytes per parameter:

12 billion parameters × 2 bytes = 24 GB

That’s more VRAM than most GPUs have, and this isn’t even a large model. A 70B model needs 140 GB in full precision. Clearly, something has to give.
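
The arithmetic is just parameter count times bytes per parameter. A quick sketch of the sizes above:

```python
def full_precision_gb(params_billions, bytes_per_param=2):
    """Raw weight storage in GB: params_billions * 1e9 params, bytes each, / 1e9 bytes-per-GB."""
    return params_billions * bytes_per_param

print(f"Gemma 3 12B at FP16: {full_precision_gb(12):.0f} GB")   # 24 GB
print(f"70B model at FP16:   {full_precision_gb(70):.0f} GB")   # 140 GB
```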

What Quantization Does

Quantization shrinks each parameter by using fewer bits. Instead of a 16-bit float, you store it as a 4-bit or 8-bit integer. The model gets smaller, uses less VRAM, and runs faster — at the cost of some precision.
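
A toy sketch of the basic idea: pick a scale, map each float to a small integer, and accept the rounding error. (Real formats like Q4_K_M work block-wise with smarter bit allocation, so this is only illustrative.)

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: one shared scale, int8 codes."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    return [round(w / scale) for w in weights], scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

weights = [0.823, -1.27, 0.054, 0.331]
codes, scale = quantize_int8(weights)   # each code fits in one byte instead of two
restored = dequantize(codes, scale)
# Restored values are close but not exact; that is the "some precision" being traded away
errors = [abs(w - r) for w, r in zip(weights, restored)]
```

Rounding error per weight is bounded by half the scale, which is why a well-chosen scale (and smaller blocks, in real formats) keeps the loss small.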

Here’s what that looks like in practice for a 12B model:

| Format | Bits per Weight | Model Size | Quality |
|--------|-----------------|------------|---------|
| F16    | 16 bits  | ~24 GB   | Original (no loss) |
| Q8_0   | 8 bits   | ~12.7 GB | Nearly identical to original |
| Q6_K   | 6.5 bits | ~10 GB   | Very slight loss |
| Q5_K_M | 5.5 bits | ~8.5 GB  | Slight loss |
| Q4_K_M | 4.8 bits | ~7.5 GB  | Minor loss — the default for a reason |
| IQ4_XS | 4.3 bits | ~6.7 GB  | Noticeable on demanding tasks |
| IQ3_M  | 3.4 bits | ~5.7 GB  | Noticeable on most tasks |

The bottom line: Q4_K_M cuts the VRAM requirement by roughly 70% with quality loss that most people can’t consistently detect in normal use. That’s why it’s the standard.
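
The size column is roughly parameters × bits ÷ 8, plus some overhead for embeddings, metadata, and layers kept at higher precision, which is why a quick estimate lands slightly under the table's figures. (Q8_0's effective rate is closer to 8.5 bits per weight once its per-block scales are counted.)

```python
def quantized_size_gb(params_billions, bits_per_weight):
    """Approximate weight storage in GB: params * bits / 8 bits-per-byte."""
    return params_billions * bits_per_weight / 8

for fmt, bpw in [("F16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("IQ3_M", 3.4)]:
    print(f"{fmt:7s} ~{quantized_size_gb(12, bpw):.1f} GB")
```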

The Format Names Decoded

These names come from llama.cpp and the GGUF file format, which is how most local models are packaged:

  • Q = Quantized
  • The number (4, 5, 6, 8) = roughly how many bits per weight
  • K = K-quant method (smarter quantization that allocates more bits to important layers)
  • M / S / L = Medium, Small, or Large variant (how aggressively bits are allocated across layers)
  • IQ = Importance-weighted quantization (an even smarter technique that weights bits by parameter importance)
  • F16 / F32 = Full precision, no quantization
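
The scheme is regular enough to pull apart mechanically. A hypothetical parser sketch (llama.cpp's real type list is longer, with names like Q4_0 and IQ2_XXS, so the grammar here is simplified):

```python
import re

def parse_quant(name):
    """Split a quant name like 'Q4_K_M', 'Q8_0', or 'IQ4_XS' into its parts."""
    m = re.fullmatch(r"(I?)Q(\d+)(_K)?(?:_(XXS|XS|S|M|L|0))?", name)
    if not m:
        raise ValueError(f"unrecognized quant name: {name}")
    iq, bits, k, variant = m.groups()
    return {
        "importance_weighted": iq == "I",
        "bits": int(bits),
        "k_quant": k is not None,
        "variant": variant,  # S/M/L size variant, or "0" for the legacy formats
    }

parse_quant("Q4_K_M")   # bits=4, k_quant=True, variant="M"
parse_quant("IQ4_XS")   # importance_weighted=True, variant="XS"
```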

The Ones You’ll Actually Use

Q4_K_M — The default in Ollama. The right choice for most people most of the time. Benchmarks on Llama 3.1 8B show 92% perplexity retention with a 1.6% quality drop from FP16. In normal conversation, the difference is undetectable.

Q8_0 — Roughly double the VRAM of Q4, with quality that’s essentially identical to the original (~99.96% retention). Upgrade to this if you have VRAM headroom.

Q5_K_M — The middle ground. Worth it when Q4 feels slightly off on complex reasoning but Q8 doesn’t fit.

Q6_K — Close to Q8 quality in a smaller package. Good if you need to squeeze a model in tight.

IQ4_XS — Uses importance-weighted quantization for a slight edge over Q4_K_M at the same size. Inference is ~20% slower though, so only use it if size is the binding constraint.

Where Quality Loss Actually Shows Up

Q4_K_M vs FP16 is nearly invisible for:

  • Conversation and chat
  • Following instructions
  • Summarization
  • Basic coding tasks

You might notice it on:

  • Multi-step math and logic chains
  • Nuanced creative writing
  • Tasks that require recalling obscure facts
  • Very long context windows

Below Q4 (IQ3_M, Q3_K, etc.), the degradation gets real. These work for basic conversation but start failing on anything demanding. Use them when you absolutely can’t fit Q4.

The Practical Decision

Here’s the framework:

  1. Pick the largest model that fits at Q4_K_M. A bigger model at Q4 almost always beats a smaller model at Q8. A 14B at Q4 is smarter than an 8B at Q8.
  2. If VRAM is left over, bump the quantization. Go Q5_K_M, Q6_K, or Q8_0.
  3. Don’t go below Q4 unless you have to. Q3 and below is where quality degrades noticeably.
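
The three steps translate directly into code. A sketch, using approximate bits-per-weight figures from the table above and a made-up 2 GB reserve for context:

```python
# Quant formats ordered best-quality-first, with approximate bits per weight
QUANTS = [("Q8_0", 8.5), ("Q6_K", 6.5), ("Q5_K_M", 5.5), ("Q4_K_M", 4.8)]
HEADROOM_GB = 2  # rough reserve for KV cache and runtime overhead

def weights_gb(params_b, bpw):
    return params_b * bpw / 8

def pick(vram_gb, model_sizes_b):
    # Step 1: largest model that fits at Q4_K_M
    fits = [m for m in model_sizes_b
            if weights_gb(m, 4.8) + HEADROOM_GB <= vram_gb]
    if not fits:
        return None  # step 3 territory: consider sub-Q4 formats as a last resort
    model = max(fits)
    # Step 2: spend leftover VRAM on a better quant (first fit wins)
    for quant, bpw in QUANTS:
        if weights_gb(model, bpw) + HEADROOM_GB <= vram_gb:
            return model, quant

pick(16, [8, 14, 27])  # the 14B fits at Q4, and leftover VRAM buys a better quant
```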

Example for a 16 GB GPU:

  • Go-to: Qwen3 14B at Q4_K_M (~10 GB) — leaves room for context
  • Quality upgrade: Llama 3.1 8B at Q8_0 (~9.5 GB) — smaller model but near-perfect quality
  • Pushing it: Gemma 3 27B at IQ3_M (~15 GB) — fits, but quality takes a hit

Rule of thumb: Biggest model, Q4_K_M. Leftover VRAM? Move up to a higher-bit quant. Still tight? Try KV cache quantization before dropping model size.

What’s New: KV Cache Quantization

This is a recent Ollama feature that helps with long conversations. When you chat with a model, the conversation history (the KV cache) also takes up VRAM. Normally it’s stored in FP16, but you can quantize it:

OLLAMA_KV_CACHE_TYPE=q8_0    # ~50% KV cache savings, minimal quality loss
OLLAMA_KV_CACHE_TYPE=q4_0    # ~75% savings, some quality loss on very long contexts

This is separate from model quantization — it compresses the conversation memory, not the model weights. Useful when you’re running a large model and need longer conversations.
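
The cache grows linearly with context length, which is why long conversations eat VRAM. A rough sizing sketch, using Llama 3.1 8B's published architecture (32 layers, 8 KV heads, head dimension 128) as the model-specific inputs:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value):
    """Per token, each layer caches K and V: n_kv_heads * head_dim values apiece."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value / 1e9

shape = (32, 8, 128, 32_768)     # Llama 3.1 8B at a 32K-token context
fp16 = kv_cache_gb(*shape, 2)    # default FP16 cache
q8 = kv_cache_gb(*shape, 1)      # q8_0: half the bytes
print(f"FP16 cache: {fp16:.1f} GB, q8_0 cache: {q8:.1f} GB")
```

At 32K tokens that is a few GB of VRAM on top of the weights, so halving it is a meaningful win.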

What’s New: Dynamic Quantization

Tools like Unsloth Dynamic 2.0 now quantize each layer of a model independently, picking the optimal bit-width per layer instead of applying the same quantization everywhere. This squeezes better quality out of fewer bits. Unsloth’s 1-bit dynamic GGUF of DeepSeek V3.1 reportedly outperforms GPT-4.1 on coding benchmarks, which is remarkable for something that fits in 192 GB (down from 671 GB at full precision).

This stuff matters most for very large models on limited hardware. For standard 7B–14B models on a decent GPU, regular Q4_K_M is still perfectly fine.
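
The per-layer idea is easy to illustrate with toy numbers (the layer groups and bit assignments below are invented, not Unsloth's actual recipe): keep sensitive layers at high precision, quantize the bulk aggressively, and the average bits per weight stays low.

```python
# (layer group, params in billions, assigned bits) - invented for illustration
layers = [
    ("embeddings",  1.0, 8),  # sensitive: kept near full quality
    ("attention",   3.0, 4),
    ("mlp",         7.5, 2),  # bulk of the weights: quantized hard
    ("output head", 0.5, 8),
]

total_params = sum(p for _, p, _ in layers)                      # billions
avg_bpw = sum(p * b for _, p, b in layers) / total_params
size_gb = total_params * avg_bpw / 8                             # weights only
print(f"average {avg_bpw:.2f} bits/weight, ~{size_gb:.1f} GB")
```

Uniform 2-bit quantization of the same 12B weights would be smaller still, but the high-precision embeddings and output head are what keep quality from collapsing.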

How Ollama Handles It

Ollama defaults to Q4_K_M. Flash attention is enabled by default, which helps with memory efficiency. To run a specific quantization:

ollama run gemma3:12b          # Q4_K_M default
ollama run gemma3:12b-q8_0     # Explicitly Q8

Check available quantizations for any model on its Ollama library page, or see our model pages which list VRAM requirements for every quantization.

Macs and Unified Memory

Apple Silicon Macs use unified memory — the CPU and GPU share the same pool. A MacBook Pro with 48 GB of unified memory effectively has 48 GB of “VRAM.” This means you can load models that would require expensive desktop GPUs. The M5 chips (shipping since late 2025) are about 20–27% faster at token generation versus M4, and the upcoming M5 Max should push things further.

The tradeoff: Apple’s GPU is still slower per-token than a dedicated NVIDIA card. But the ability to load a 70B model on a laptop remains hard to beat.

Next Steps