What AI Models Can You Run with 8, 12, 16, 24, or 32 GB VRAM?
A practical breakdown of which AI models fit at each VRAM tier. From budget 8 GB cards to the RTX 5090, find the best Ollama models for your exact amount of VRAM.
VRAM is the single number that decides what you can run locally. GPU speed affects how fast tokens come out, but if the model doesn’t fit in VRAM, it either won’t load or will spill into system RAM and crawl. Here’s exactly what fits at every common VRAM tier in 2026, with specific models you can pull from Ollama right now.
Every VRAM figure below comes from our compatibility engine, which tracks actual memory requirements for each model and quantization level. You can check any combination yourself with our compatibility checker.
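Those figures follow a simple pattern: weight size is roughly parameter count times the quant's effective bits per weight, plus headroom for the KV cache and runtime buffers. Here is a minimal back-of-envelope sketch of that arithmetic in Python. It is a simplification, not the compatibility engine's exact math, so its outputs only approximate the figures below.

```python
# Rough VRAM estimate for a quantized model: weights plus a flat allowance
# for KV cache and runtime buffers. Real usage also depends on context length
# and the runtime, so treat these as ballpark numbers only.

def estimate_vram_gb(params_billions: float, bits_per_weight: float, overhead_gb: float = 1.5) -> float:
    weights_gb = params_billions * bits_per_weight / 8
    return round(weights_gb + overhead_gb, 1)

# Approximate effective bits per weight for common quant levels
Q4, Q8 = 4.5, 8.5

print(estimate_vram_gb(8, Q4))    # Llama 3.1 8B at Q4  -> ~6 GB
print(estimate_vram_gb(14, Q4))   # Phi-4 14B at Q4     -> ~9 GB
print(estimate_vram_gb(32, Q4))   # Qwen 3 32B at Q4    -> ~20 GB
```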
8 GB — Entry Level
Cards: RTX 4060, RTX 5050, RX 7600
Eight gigabytes gets you into the game, but only with smaller models at aggressive quantization. The models worth running:
- Gemma 3 4B at Q4 — 5.0 GB VRAM. Multimodal (text + images), 128K context, and surprisingly capable for its size. The best option at this tier.
- Qwen 3 8B at Q4 — 7.5 GB VRAM. Tight fit, but it loads. Strong at coding and math, with hybrid thinking mode. You’ll want to close other GPU-using apps first.
- Llama 3.1 8B at Q4 — 6.3 GB VRAM. The reliable workhorse. Good all-around, 128K context, tool-use support.
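If you want to try one of these right away, here is a minimal sketch that pulls and queries a model through the official ollama Python client. It assumes the Ollama server is running locally and the ollama package is installed; the model tag is illustrative, so check the Ollama library for the exact quant you want.

```python
# Pull and chat with an 8 GB-friendly model via the ollama Python client
# (pip install ollama). Assumes the Ollama server is already running.
import ollama

MODEL = "llama3.1:8b"  # illustrative tag; the default pull is a Q4 quant

ollama.pull(MODEL)  # downloads the model if it isn't already local

response = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": "Explain what Q4 quantization trades away."}],
)
print(response["message"]["content"])
```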
Skip anything above 8B parameters at this tier. Models like Gemma 3 12B need 10.5 GB at Q4 — that’s an automatic no.
Bottom line: Functional for experimenting and light tasks. You’ll feel the ceiling fast.
See the full list: Best models for 8 GB VRAM
12 GB — The Minimum Good Experience
Cards: RTX 3060 12GB, Intel Arc B580 (the RX 9060 XT only comes in 8 GB and 16 GB versions, so it skips this tier)
Twelve gigabytes is where local AI starts feeling practical. You can run 8B models at Q8 (noticeably better quality than Q4) or step up to 14B models:
- Qwen 3 8B at Q8 — 11.5 GB VRAM. Higher quality output than Q4 with room to spare. The best daily driver at this tier.
- Gemma 3 12B at Q4 — 10.5 GB VRAM. Multimodal, excellent all-rounder, one of the most popular models on Ollama.
- Phi-4 14B at Q4 — 9.9 GB VRAM. Punches well above its weight on math and reasoning benchmarks (84.8 MMLU).
- DeepSeek R1 14B at Q4 — 9.9 GB VRAM. Chain-of-thought reasoning distilled from the full 671B model. Great for complex problems.
The RTX 3060 12GB is still one of the best value cards for local AI in 2026: inference is slower than on newer cards, but 12 GB of VRAM at that price is hard to beat.
Bottom line: 14B models at Q4 give you genuinely useful output for coding, writing, and reasoning tasks.
See the full list: Best models for 12 GB VRAM
16 GB — The Sweet Spot
Cards: RTX 4060 Ti 16GB, RTX 5060, RX 9060 XT 16GB
Sixteen gigabytes is where most people should aim. You get 14B models at full Q8 quality, or you can stretch into larger models at Q4:
- Qwen 3 14B at Q4 — 12.0 GB VRAM. One of the strongest mid-range models available. Hybrid thinking mode, tool calling, excellent at coding and math.
- DeepSeek R1 14B at Q8 — 16.0 GB VRAM. Full quality chain-of-thought reasoning. A major step up from Q4 for complex problems.
- Phi-4 14B at Q8 — 16.0 GB VRAM. Near-perfect quality on a card that costs under $400.
- Qwen 2.5 Coder 14B at Q4 — 12.0 GB VRAM. Dedicated coding model. If you write code daily, this is worth having alongside a general-purpose model.
- Gemma 3 12B at Q8 — 16.0 GB VRAM. Full quality multimodal with vision support.
At 16 GB you can also run multiple smaller models simultaneously. Keep a coding model and a general-purpose model loaded — Ollama handles switching between them.
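As a concrete sketch of that setup, the snippet below keeps two smaller models warm side by side. It assumes a recent Ollama release (the keep_alive request option and the OLLAMA_MAX_LOADED_MODELS server setting both exist in current versions, but check your release notes if unsure); the model tags are illustrative.

```python
# Keep a coding model and a general-purpose model resident at the same time.
# Start the server allowing two loaded models, e.g.:
#   OLLAMA_MAX_LOADED_MODELS=2 ollama serve
import ollama

CODER = "qwen2.5-coder:7b"   # illustrative tags: two ~5-6 GB models fit side by side in 16 GB
GENERAL = "llama3.1:8b"

def ask(model: str, prompt: str) -> str:
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        keep_alive="30m",  # keep the model in VRAM for 30 minutes after the last request
    )
    return response["message"]["content"]

print(ask(CODER, "Write a Python function that merges two sorted lists."))
print(ask(GENERAL, "Summarize the difference between Q4 and Q8 quantization."))
```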
Bottom line: The best balance of price, performance, and model access. If you’re buying a GPU specifically for local AI, target this tier.
See the full list: Best models for 16 GB VRAM
24 GB — Serious Tier
Cards: RTX 3090, RTX 3090 Ti, RTX 4090
Twenty-four gigabytes opens the door to 27B–32B models that compete with frontier APIs on many tasks:
- Qwen 3 32B at Q4 — 23.0 GB VRAM. Near-frontier performance across coding, math, reasoning, and creative writing. Hybrid thinking mode lets it match 70B-class models on hard problems.
- DeepSeek R1 32B at Q4 — 20.7 GB VRAM. The best reasoning model you can run on a single consumer GPU. Excels at multi-step math proofs and algorithmic problems.
- Qwen 3 30B-A3B (MoE) at Q4 — 22.0 GB VRAM. A mixture-of-experts model: 30B total parameters, but only 3B activate per token. Loads like a 30B model but generates tokens as fast as a 3B model. The speed is remarkable (see the sketch below for why).
- Gemma 3 27B at Q4 — 20.0 GB VRAM. Multimodal flagship that competes with 65B+ models on benchmarks.
- Qwen 3.5 27B at Q4 — 19.0 GB VRAM. 256K context, 201-language support, vision, and tool use. One of the most versatile models at any size.
The RTX 4090 is still the card to beat at this tier in 2026: 24 GB of VRAM plus 1,008 GB/s of memory bandwidth means fast tokens from large models. The RTX 3090 is the budget alternative: same 24 GB, slower bandwidth, but often available used for a fraction of the price.
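To see why bandwidth matters, and why the 30B-A3B MoE model feels so quick, here is a rough upper-bound sketch. It assumes token generation is memory-bandwidth bound, i.e. each new token requires streaming the active weights through the GPU once; real throughput is lower once KV-cache reads and kernel overhead are counted, so treat these as ceilings, not predictions.

```python
# Upper bound on generation speed if decoding is purely memory-bandwidth bound:
# tokens/s <= bandwidth / size of the weights read per token.

def max_tokens_per_sec(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    return round(bandwidth_gb_s / active_weights_gb, 1)

RTX_4090_BW = 1008  # GB/s
RTX_3090_BW = 936   # GB/s

print(max_tokens_per_sec(RTX_4090_BW, 20.7))  # dense 32B at Q4 on a 4090 -> ~49 tok/s ceiling
print(max_tokens_per_sec(RTX_3090_BW, 20.7))  # same model on a 3090      -> ~45 tok/s ceiling
print(max_tokens_per_sec(RTX_4090_BW, 1.7))   # 30B-A3B MoE: only ~3B active params (~1.7 GB at Q4) read per token
```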
Bottom line: This is where local AI gets genuinely competitive with cloud APIs for most tasks.
See the full list: Best models for 24 GB VRAM
32 GB — The New Ceiling
Cards: RTX 5090
The RTX 5090 brought 32 GB to consumer desktops. That extra 8 GB over the 4090 makes a real difference:
- Qwen 3 32B at Q8 — fits now. Full quality, no compromise. At 24 GB you’re stuck with Q4.
- Gemma 3 27B — Q8 needs about 34 GB, which is still too tight even at this tier. Stick with Q4 or Q5 for the best experience.
- Llama 3.3 70B at Q4 — 43.5 GB VRAM. Still doesn’t fully fit, but Ollama can offload layers to system RAM. With a fast CPU and enough system RAM, you’ll get usable (if slower) inference from a 70B model. Expect partial GPU offload rather than full VRAM-only operation.
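A minimal sketch of that partial offload through the ollama Python client is below. The num_gpu option (the number of layers kept on the GPU) is a standard Ollama parameter, but the layer count shown here is an illustrative guess rather than a tuned value; by default Ollama chooses the split automatically.

```python
# Run a 70B model that doesn't fully fit in 32 GB by pinning how many layers
# stay on the GPU; the remainder runs from system RAM on the CPU.
import ollama

response = ollama.chat(
    model="llama3.3:70b",     # default tag is a Q4 quant, ~43 GB of weights
    messages=[{"role": "user", "content": "Outline a plan to refactor a monolith into services."}],
    options={"num_gpu": 40},  # illustrative: roughly half the layers in VRAM; tune to leave 1-2 GB of headroom
)
print(response["message"]["content"])
```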
The real advantage of 32 GB is running 27B–32B models at Q8 instead of Q4. The quality difference is noticeable, especially for nuanced writing, complex code, and reasoning chains. You’re getting the full model intelligence instead of a compressed approximation.
Bottom line: If you can afford the RTX 5090, you get the best single-GPU local AI experience available on consumer hardware.
See the full list: Best models for 32 GB VRAM
Quick Reference Table
| VRAM | Best Models | Quantization | Experience |
|---|---|---|---|
| 8 GB | Gemma 3 4B, Llama 3.1 8B | Q4 | Basic — small models only |
| 12 GB | Qwen 3 8B, Phi-4 14B, DeepSeek R1 14B | Q8 (8B) / Q4 (14B) | Good — usable for real work |
| 16 GB | Qwen 3 14B, DeepSeek R1 14B | Q8 (14B) | Great — sweet spot for most users |
| 24 GB | Qwen 3 32B, DeepSeek R1 32B, Gemma 3 27B | Q4 (32B) | Excellent — competitive with cloud APIs |
| 32 GB | Qwen 3 32B at Q8, 70B with offloading | Q8 (32B) | Best consumer single-GPU experience |
Not Sure What You Have?
Check your GPU’s specs on its hardware page — every GPU in our database shows exactly which models fit at each quantization level. Or use the compatibility checker to test any model against any hardware instantly.
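If you only need the raw number, here is a quick local check for NVIDIA cards. It assumes nvidia-smi is installed and on your PATH (it ships with the NVIDIA driver); AMD and Intel users should check their vendor tools or OS settings instead.

```python
# Print each NVIDIA GPU's name and total VRAM using nvidia-smi.
import subprocess

result = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
for line in result.stdout.strip().splitlines():
    name, vram = (part.strip() for part in line.split(","))
    print(f"{name}: {vram}")  # e.g. "NVIDIA GeForce RTX 4090: 24564 MiB"
```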