Nemotron Ultra 253B

by NVIDIA · nemotron family

253B

parameters

text-generation · code-generation · reasoning · multilingual · tool-use · math · creative-writing · summarization

Nemotron Ultra 253B is NVIDIA's most capable open-weight reasoning model, derived from Llama 3.1 405B and compressed to 253B parameters using Neural Architecture Search (NAS). It delivers state-of-the-art performance on math, coding, and complex reasoning benchmarks while fitting on a single 8xH100 node at FP8 precision. The model features a dual-mode system supporting both standard chat and explicit chain-of-thought reasoning, toggled via system prompt. It supports a 128K context window and excels at tool calling, RAG, and agentic workflows. With multilingual support for English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai, it is one of the most versatile open-weight models available.
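The dual-mode toggle can be sketched as a chat request builder. NVIDIA's model card documents the system prompts "detailed thinking on" / "detailed thinking off" as the switch; the payload shape below targets Ollama's `/api/chat` endpoint, and the model tag is taken from the listing on this page.

```python
# Sketch: toggling Nemotron's dual-mode reasoning via the system prompt.
# "detailed thinking on"/"off" follows NVIDIA's documented convention;
# the payload shape matches Ollama's /api/chat request format.

def build_chat_request(user_prompt: str, reasoning: bool) -> dict:
    """Build an Ollama /api/chat payload with the reasoning toggle set."""
    system = "detailed thinking on" if reasoning else "detailed thinking off"
    return {
        "model": "nemotron-ultra:253b",
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_prompt},
        ],
        "stream": False,
    }

req = build_chat_request("Prove that sqrt(2) is irrational.", reasoning=True)
print(req["messages"][0]["content"])  # detailed thinking on
```

POST the resulting dict as JSON to `http://localhost:11434/api/chat` on a machine where the model is pulled; with reasoning on, the reply contains an explicit chain-of-thought section before the final answer.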

Quick Start with Ollama

ollama run nemotron-ultra:253b-q4_K_M
Resources: Ollama · Hugging Face · Official Page · Research Paper
Creator: NVIDIA
Parameters: 253B
Architecture: transformer-decoder
Context: 128K tokens
Released: Apr 7, 2025
License: NVIDIA Open Model License
Ollama: nemotron-ultra:253b

Quantization Options

| Format | File Size | VRAM Required | Ollama Tag |
|---|---|---|---|
| Q4_K_M (recommended) | 151 GB | 155 GB | 253b-q4_K_M |
| Q8_0 | 269 GB | 275 GB | 253b-q8_0 |
| F16 | 506 GB | 508 GB | 253b-fp16 |
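The file sizes above follow directly from a back-of-the-envelope estimate: parameters × bits-per-weight ÷ 8. The bits-per-weight values below are approximate community figures for each GGUF format, not official numbers, so the results only roughly match the table.

```python
# Weight-memory estimate: parameters x bits-per-weight / 8 bytes.
# Bits-per-weight are approximate (Q4_K_M mixes 4- and 6-bit blocks,
# Q8_0 carries per-block scales), so expect a few GB of slack.

PARAMS = 253e9
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5, "F16": 16.0}

def weight_gb(fmt: str) -> float:
    """Estimated on-disk weight size in GB for a given quantization."""
    return PARAMS * BITS_PER_WEIGHT[fmt] / 8 / 1e9

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt}: ~{weight_gb(fmt):.0f} GB")
```

F16 works out to exactly 2 bytes per parameter (253 × 2 = 506 GB), matching the table; the quantized formats land within a couple of GB of the listed sizes. The extra headroom in the VRAM column covers the KV cache and activations.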

Compatible Hardware

Q4_K_M requires 155 GB VRAM

| Hardware | VRAM | Type | Fit | Est. Speed |
|---|---|---|---|---|
| Mac Studio M4 Ultra 512GB | 512 GB | mac | Runs | ~5 tok/s |
| Mac Pro M2 Ultra 192GB | 192 GB | mac | Runs | ~5 tok/s |
| Mac Studio M4 Ultra 192GB | 192 GB | mac | Runs | ~5 tok/s |
| Mac Studio M4 Max 128GB | 128 GB | mac | CPU Offload | ~1 tok/s |
| MacBook Pro M4 Max 128GB | 128 GB | mac | CPU Offload | ~1 tok/s |
| MacBook Pro M5 Max 128GB | 128 GB | mac | CPU Offload | ~1 tok/s |
101 hardware device(s) cannot run this model at Q4_K_M.
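The fit column above reduces to a simple threshold check: the full Q4_K_M model (155 GB) must fit in VRAM or unified memory to run at full speed; otherwise layers spill to CPU with a roughly 5× slowdown. A minimal sketch of that logic, with the threshold taken from the table:

```python
# Fit classifier implied by the table: compare available memory against
# the Q4_K_M requirement. The 155 GB threshold comes from the table above;
# the two-way Runs/CPU Offload split is illustrative.

REQUIRED_GB = 155  # Q4_K_M VRAM requirement

def fit(vram_gb: float) -> str:
    """Classify whether a device runs the model fully in memory."""
    return "Runs" if vram_gb >= REQUIRED_GB else "CPU Offload"

print(fit(192))  # Runs
print(fit(128))  # CPU Offload
```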

Benchmark Scores

88.0
mmlu
97.0
math500
76.0
gpqa