Llama 3.2 Vision 11B
by Meta · llama-3 family
11B parameters
Tags: text-generation, vision, reasoning, multilingual, summarization
Llama 3.2 Vision 11B is Meta's multimodal model for understanding both text and images. It handles visual reasoning, image captioning, document understanding, and visual question answering alongside standard text generation tasks. Built on the Llama 3.2 architecture with a 128K-token context window, it offers vision capabilities in a relatively compact 11B-parameter model, which makes local deployment practical on consumer hardware with enough VRAM (about 8.5 GB for the recommended Q4_K_M quantization).
Quick Start with Ollama
ollama run llama3.2-vision:11b

| Field | Value |
|---|---|
| Creator | Meta |
| Parameters | 11B |
| Architecture | transformer-decoder |
| Context | 128K tokens |
| Released | Sep 25, 2024 |
| License | Llama 3.2 Community License |
| Ollama | llama3.2-vision:11b |
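
Beyond the interactive quick start, the model can also be queried programmatically. The following is a minimal sketch, assuming an Ollama server is already running locally on its default port (11434) and that the model has been pulled under the name listed above. The helper names `build_payload` and `describe_image` are illustrative, not part of any library.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"
# Ollama model name taken from the spec table.
MODEL = "llama3.2-vision:11b"


def build_payload(image_bytes: bytes, prompt: str, model: str = MODEL) -> dict:
    """Build a non-streaming generate request carrying one base64-encoded image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def describe_image(path: str, prompt: str = "Describe this image.") -> str:
    """Send an image file to the local Ollama server and return the reply text."""
    with open(path, "rb") as f:
        payload = build_payload(f.read(), prompt)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # Supplying `data` makes urllib issue a POST request.
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling `describe_image("photo.jpg")` (a hypothetical file path) would return the model's description as a string. Only the standard library is used here, so nothing needs installing beyond Ollama itself.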
Quantization Options
| Format | File Size | VRAM Required | Quality | Ollama Tag |
|---|---|---|---|---|
| Q4_K_M (recommended) | 6 GB | 8.5 GB | | 11b-q4_K_M |
| Q8_0 | 11.5 GB | 14 GB | | 11b-q8_0 |
| F16 | 22 GB | 26 GB | | 11b-fp16 |
Compatible Hardware
Q4_K_M requires 8.5 GB VRAM
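
To make the VRAM figures from the quantization table easier to apply, here is a small sketch that picks a quantization for a given amount of free VRAM. It assumes that the highest-precision format that fits is the one you want. That is a simplification, since it ignores extra memory used by long contexts or other processes. The function name `pick_quantization` is hypothetical.

```python
from typing import Optional, Tuple

# (format, VRAM required in GB, Ollama tag), copied from the quantization table.
QUANTIZATIONS = [
    ("Q4_K_M", 8.5, "11b-q4_K_M"),
    ("Q8_0", 14.0, "11b-q8_0"),
    ("F16", 26.0, "11b-fp16"),
]


def pick_quantization(available_vram_gb: float) -> Optional[Tuple[str, float, str]]:
    """Return the largest listed quantization that fits, or None if none does."""
    fitting = [q for q in QUANTIZATIONS if q[1] <= available_vram_gb]
    if not fitting:
        return None
    # Assumption: more VRAM required means higher precision, so prefer the largest fit.
    return max(fitting, key=lambda q: q[1])
```

For example, `pick_quantization(16)` selects the Q8_0 entry, while a card with 8 GB falls just short of the 8.5 GB needed for Q4_K_M and gets `None`.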
Benchmark Scores
MMLU: 73.0