Llama 3.2 Vision 11B
Llama 3.2 Community License · Meta · 11B · transformer-decoder
2024-09-25 · 131K context
11B params
Use Cases
chat · vision · reasoning · multilingual · summary
About this model
Llama 3.2 Vision 11B is Meta's multimodal model capable of understanding both text and images. It can perform visual reasoning, image captioning, document understanding, and visual question answering alongside standard text generation tasks.
Built on the Llama 3.2 architecture with a 128K-token context window (131,072 tokens, listed as 131K in the summary above), this model packs vision capabilities into 11B parameters, which makes local deployment feasible on consumer GPUs with enough VRAM to hold the weights.
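To make "sufficient VRAM" more concrete, the sketch below estimates the memory needed for the weights alone at a few bit widths. It is a rough, back-of-the-envelope calculation: the 11B parameter count comes from this page, while the function name `weight_memory_gb` and the 16/8/4-bit widths are illustrative assumptions rather than options this page lists. Real usage will be higher, since the KV cache, activations, and runtime overhead are not counted.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 11e9  # parameter count as listed on this page

# Bit widths are illustrative assumptions, not a list of published builds.
estimates = {bits: weight_memory_gb(PARAMS, bits) for bits in (16, 8, 4)}

for bits, gb in estimates.items():
    print(f"{bits}-bit weights: ~{gb:.1f} GB")
```

Under these assumptions, halving the bit width halves the weight footprint, which is why lower-precision builds are the usual route to fitting a model of this size on a single consumer GPU.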
Benchmarks
MMLU: 73.0