Feb 10, 2026

How to Run Llama Locally: From Llama 3 to Llama 4

Step-by-step guide to running Meta's Llama models on your own hardware with Ollama — covering Llama 3.x, Llama 4 Scout, and how to pick the right model for your GPU.

Meta’s Llama family has grown a lot. Llama 3 is still everywhere. Llama 4 brought mixture-of-experts and multimodal input. And you can run all of it on your own machine without paying for API calls or sending data to someone else’s server. Here’s how to get set up.

What’s Available Now

Quick overview of where the Llama lineup stands in early 2026:

Llama 4 (released April 2025) — Meta’s latest. Uses a mixture-of-experts (MoE) architecture, meaning only a fraction of the model’s total parameters are active per token. Natively multimodal (text + images).

  • Scout: 109B total params, 17B active. 10M token context window. Fits on a single H100-class GPU (~65 GB at Q4).
  • Maverick: 400B total params, 17B active. 1M token context window. Needs a multi-GPU server or very large unified memory (~245 GB at Q4).

Llama 3.x (2024) — Still widely used and well-supported:

  • 3.2 1B/3B — Lightweight models for edge devices
  • 3.1 8B — The workhorse. Runs on 8 GB GPUs.
  • 3.3 70B — Comparable to Llama 3.1 405B on many benchmarks

All of these are on Ollama and ready to run.

VRAM Requirements

| Model | Quantization | VRAM Needed | Notes |
|---|---|---|---|
| Llama 3.2 1B | Q4_K_M | ~1.5 GB | Fits basically anywhere |
| Llama 3.2 3B | Q4_K_M | ~2.5 GB | Light tasks, edge devices |
| Llama 3.1 8B | Q4_K_M | ~5.5 GB | Best starting point |
| Llama 3.1 8B | Q8_0 | ~9.5 GB | Higher quality if VRAM allows |
| Llama 3.3 70B | Q4_K_M | ~43 GB | Needs 48 GB+ VRAM or CPU offload |
| Llama 4 Scout | Q4_K_M | ~65 GB | MoE: all 109B weights must be in memory; the 17B active count helps speed, not size |
| Llama 4 Maverick | Q4_K_M | ~245 GB | Multi-GPU server territory |

Not sure if your hardware can handle a specific model? Use our compatibility checker.
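As a sanity check, you can estimate weight memory yourself: total parameters times bits per weight, divided by 8, plus 10–20% headroom for the KV cache and runtime overhead. A minimal sketch (the `estimate_vram_gb` helper is illustrative, and ~4.8 bits/weight for Q4_K_M is an approximation):

```shell
# Rough weight-memory estimate in GB: params (billions) * bits per weight / 8.
# Add ~10-20% on top for KV cache and runtime overhead.
estimate_vram_gb() {
  # $1 = total parameters in billions, $2 = bits per weight (Q4_K_M is ~4.8)
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_vram_gb 8 4.8     # Llama 3.1 8B at Q4_K_M
estimate_vram_gb 109 4.8   # Llama 4 Scout at Q4_K_M
```

Note that for MoE models it is the total parameter count that sets the memory footprint; the active parameter count only determines how much compute each token costs.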

Step 1: Install Ollama

Ollama handles model downloads, VRAM management, and gives you a CLI to chat or serve an API. As of early 2026, it also has a native desktop app for macOS and Windows with a chat UI, conversation history, and drag-and-drop for images and PDFs.

macOS

brew install ollama

Or grab the desktop app from ollama.com.

Linux

curl -fsSL https://ollama.com/install.sh | sh

Windows

Download the installer from ollama.com/download.

Verify the install:

ollama --version

You should see something like ollama version 0.16.x.

Step 2: Run Your First Model

One command to download and start chatting:

ollama run llama3.1:8b

First run downloads the model (~4.7 GB at Q4_K_M). After that, it starts instantly. Type your message, hit Enter, and you’ll get a response. Type /bye to exit.

If you have the memory for it (~65 GB at Q4, which in practice means a 96 GB+ Mac or a multi-GPU rig), you can try Llama 4 Scout instead:

ollama run llama4:scout

Scout is multimodal — you can drag an image into the Ollama desktop app and ask questions about it, or use the API to send images alongside text.
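Sending images through the API works by base64-encoding the file into the request body: Ollama's /api/generate endpoint accepts an "images" array alongside the prompt. A sketch (the `make_request` helper and the file name are illustrative):

```shell
# Build a JSON body for Ollama's /api/generate endpoint; multimodal models
# accept an "images" array of base64-encoded files alongside the prompt.
make_request() {
  # $1 = model, $2 = prompt, $3 = base64-encoded image data
  printf '{"model":"%s","prompt":"%s","images":["%s"],"stream":false}' "$1" "$2" "$3"
}

# With `ollama serve` running, encode an image and POST the request:
#   IMG=$(base64 < photo.jpg | tr -d '\n')
#   curl -s http://localhost:11434/api/generate \
#        -d "$(make_request llama4:scout 'Describe this picture.' "$IMG")"
make_request llama4:scout "Describe this picture." "BASE64DATA"
```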

Step 3: Pick the Right Model for Your Hardware

Here’s the honest recommendation based on what you’re working with:

8 GB VRAM (RTX 3060 8GB, RTX 4060, M1/M2 8GB Mac)

Stick with Llama 3.1 8B at Q4_K_M. It fits with a bit of headroom and is genuinely capable for conversation, summarization, and light coding tasks.

ollama run llama3.1:8b

12–16 GB VRAM (RTX 3060 12GB, RTX 4060 Ti 16GB, RTX 5070, M-series Mac 16GB)

You have options. Llama 3.1 8B at Q8_0 (~9.5 GB) buys noticeably better quality than Q4, and dense models in the 12–14B range such as Gemma 3 12B or Qwen3 14B fit comfortably at Q4. (Llama 4 Scout does not fit here: its 17B active parameters cut compute per token, but all ~109B weights still have to be loaded.)

ollama run gemma3:12b

24 GB VRAM (RTX 3090, RTX 5090 32GB, M-series Mac 24GB+)

Run Gemma 3 27B at Q4, or load Qwen3-30B-A3B (30B MoE, 3B active) for a different flavor. At 24 GB you've got real flexibility.

ollama run qwen3:30b

48 GB+ (dual GPUs, M4 Max 64/128GB, Mac Studio)

Llama 3.3 70B at Q4 (~43 GB) if you want the biggest dense Llama model. On a 96 GB or 128 GB Mac, Llama 4 Scout at Q4 (~65 GB) also becomes viable; Maverick (~245 GB at Q4) remains multi-GPU server territory.

ollama run llama3.3:70b

Step 4: Tweak Quantization

Ollama defaults to Q4_K_M, which is the right choice for most people. If you want higher quality and have VRAM to spare:

ollama run llama3.1:8b-q8_0     # Higher quality, ~2x the VRAM

You can also enable KV cache quantization to save memory on long conversations. Set this environment variable before starting Ollama:

OLLAMA_KV_CACHE_TYPE=q8_0       # Saves ~50% KV cache memory
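The variable has to be set in the environment of the server process itself, not just the shell you chat from. From a terminal on Linux or macOS, that looks like (config sketch; add it to your shell profile or service definition if you want it to persist):

```shell
export OLLAMA_KV_CACHE_TYPE=q8_0   # quantize the KV cache to 8-bit
ollama serve                       # start the server with the setting applied
```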

More on quantization in our quantization guide and explainer post.

Step 5: Useful Commands

ollama list              # See downloaded models
ollama pull llama4:scout # Download without starting a chat
ollama rm llama3.1:8b    # Delete a model to free disk space
ollama ps                # See running models and VRAM usage
ollama serve             # Start the API server (port 11434)

Using Llama with Other Apps

Ollama exposes an OpenAI-compatible API at http://localhost:11434. This works with a growing list of tools:

  • Open WebUI — ChatGPT-style web interface for your local models. Supports conversations, file uploads, and multi-model switching.
  • Continue.dev — AI coding assistant in VS Code, powered by your local model instead of a cloud API.
  • LangChain / LlamaIndex — Build AI apps and agents that call your local Ollama instance.
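Anything that speaks the OpenAI chat-completions format only needs its base URL pointed at the local server. A sketch with curl (assumes `ollama serve` is running and llama3.1:8b has been pulled; the server check is just a guard so the script degrades gracefully):

```shell
# Query Ollama's OpenAI-compatible endpoint (the path mirrors OpenAI's API,
# so existing clients only need their base URL changed to localhost:11434/v1).
BODY='{"model":"llama3.1:8b","messages":[{"role":"user","content":"Say hello in five words."}]}'

# Only fire the request if a local server is actually listening:
if curl -sf http://localhost:11434/api/version > /dev/null 2>&1; then
  curl -s http://localhost:11434/v1/chat/completions \
       -H "Content-Type: application/json" -d "$BODY"
else
  echo "ollama serve is not running on port 11434"
fi
```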

If you want privacy guarantees, set OLLAMA_NO_CLOUD=1 to ensure nothing leaves your machine.

Troubleshooting

“Error: model requires more memory than available”

Your GPU can’t fit the model. Try a smaller quantization or a smaller model:

ollama run llama3.1:8b-q4_0    # Smaller quant
ollama run llama3.2:3b          # Smaller model

Slow responses

Usually means partial CPU offloading. Run ollama ps to check VRAM usage. If the model is split across GPU and CPU, either use a smaller model or a lower quantization to fit entirely in VRAM.

“Error: could not find GPU”

Update your GPU drivers:

  • NVIDIA: Latest Game Ready or Studio drivers from nvidia.com
  • AMD: ROCm on Linux (note: RX 9070 series not reliably supported yet as of Feb 2026)
  • Intel: Requires IPEX-LLM, not natively supported by Ollama

Beyond Llama

Llama is a solid choice but it’s not the only game in town anymore. Worth checking out:

  • Qwen 3 — Alibaba’s family. Overtook Llama in total downloads. Strong at coding and multilingual tasks.
  • Gemma 3 — Google’s open model. The 12B and 27B variants punch above their weight.
  • DeepSeek R1 — Excellent at reasoning and math tasks.

All available through Ollama. Browse our full model list to compare.