
Llama 3.2 Vision 11B

Llama 3.2 Community License

Meta · 11B · transformer-decoder

2024-09-25 · 131K context · 11B params

Use Cases

chat vision reasoning multilingual summary

Quantization Options

Quant                  Bits  VRAM     Quality    Status
Q4_K_M (recommended)   4     8.5 GB   Good
Q8_0                   8     14.0 GB  Good
F16                    16    26.0 GB  Excellent
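
As a rough illustration of how the table might be used, the sketch below picks the highest-precision quantization whose listed VRAM figure fits a given GPU. The VRAM values are copied from the table; the function name pick_quant and the 1 GB headroom default are assumptions made for this example, not part of any tool.

```python
# VRAM requirements in GB, copied from the quantization table.
QUANT_VRAM_GB = {"Q4_K_M": 8.5, "Q8_0": 14.0, "F16": 26.0}


def pick_quant(available_vram_gb, headroom_gb=1.0):
    """Return the largest listed quantization that fits, or None.

    headroom_gb is a hypothetical safety margin for the desktop, other
    processes, and so on; adjust it for your own setup.
    """
    budget = available_vram_gb - headroom_gb
    # Pair each option as (vram, name) so max() selects the largest footprint.
    fitting = [(vram, name) for name, vram in QUANT_VRAM_GB.items() if vram <= budget]
    return max(fitting)[1] if fitting else None


if __name__ == "__main__":
    for vram in (8, 12, 16, 24, 32):
        print(f"{vram} GB GPU -> {pick_quant(vram)}")
```

With the default headroom, a 12 GB card lands on Q4_K_M and a 16 GB card on Q8_0, while an 8 GB card fits none of the listed options.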

About this model

Llama 3.2 Vision 11B is Meta's multimodal model that accepts both text and images. It handles visual reasoning, image captioning, document understanding, and visual question answering alongside standard text generation. Built on the Llama 3.2 architecture, it has a 128K-token context window (131,072 tokens, shown as 131K in the header). At 11B parameters it is compact enough for local deployment on consumer hardware: the quantization options list about 8.5 GB of VRAM for Q4_K_M, rising to 26.0 GB for F16.
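
To make "sufficient VRAM" more concrete, here is a back-of-the-envelope sketch of the raw weight footprint at each precision. It assumes exactly 11e9 parameters, the nominal bit widths from the Bits column, and 1 GB = 10^9 bytes; real k-quant files such as Q4_K_M average somewhat more than their nominal bits per weight, so treat the output as a lower bound rather than a measurement. The helper name weight_gb is hypothetical.

```python
PARAMS = 11e9  # "11B params" from the model card (assumed exact here)


def weight_gb(bits_per_weight, params=PARAMS):
    """Raw weight storage in GB (10^9 bytes).

    Ignores the KV cache, activations, and runtime overhead.
    """
    return params * bits_per_weight / 8 / 1e9


# (name, nominal bits, VRAM listed on the model card in GB)
LISTED = [("Q4_K_M", 4, 8.5), ("Q8_0", 8, 14.0), ("F16", 16, 26.0)]

for name, bits, listed_gb in LISTED:
    raw = weight_gb(bits)
    print(f"{name}: ~{raw:.1f} GB of weights, {listed_gb - raw:.1f} GB of the listed figure is other usage")
```

Under these assumptions the weights alone account for roughly 5.5, 11, and 22 GB, which suggests the listed VRAM figures include a few gigabytes for context and runtime overhead.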

Benchmarks

MMLU: 73.0