Nemotron Ultra 253B

by NVIDIA · nemotron family

253B

parameters

text-generation · code-generation · reasoning · multilingual · tool-use · math · creative-writing · summarization

Nemotron Ultra 253B is NVIDIA's most capable open-weight reasoning model, derived from Llama 3.1 405B and compressed to 253B parameters using Neural Architecture Search (NAS). It delivers state-of-the-art performance on math, coding, and complex reasoning benchmarks while fitting on a single 8xH100 node at FP8 precision. The model features a dual-mode system supporting both standard chat and explicit chain-of-thought reasoning, toggled via system prompt. It supports a 128K context window and excels at tool calling, RAG, and agentic workflows. With multilingual support for English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai, it is one of the most versatile open-weight models available.
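The dual-mode toggle can be sketched as a chat request builder. NVIDIA's model card documents the system prompts "detailed thinking on" / "detailed thinking off" as the switch; the payload shape below targets Ollama's `/api/chat` endpoint, and the model tag is taken from the listing on this page.

```python
# Sketch: toggling Nemotron's dual-mode reasoning via the system prompt.
# "detailed thinking on"/"off" follows NVIDIA's documented convention;
# the payload shape matches Ollama's /api/chat request format.

def build_chat_request(user_prompt: str, reasoning: bool) -> dict:
    """Build an Ollama /api/chat payload with the reasoning toggle set."""
    system = "detailed thinking on" if reasoning else "detailed thinking off"
    return {
        "model": "nemotron-ultra:253b",
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_prompt},
        ],
        "stream": False,
    }

req = build_chat_request("Prove that sqrt(2) is irrational.", reasoning=True)
print(req["messages"][0]["content"])  # detailed thinking on
```

POST the resulting dict as JSON to `http://localhost:11434/api/chat` on a machine where the model is pulled; with reasoning on, the reply contains an explicit chain-of-thought section before the final answer.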

Quick Start with Ollama

ollama run nemotron-ultra:253b-q4_K_M
Resources: Ollama · Hugging Face · Official Page · Research Paper
Creator: NVIDIA
Parameters: 253B
Architecture: transformer-decoder
Context: 128K tokens
Released: Apr 7, 2025
License: NVIDIA Open Model License
Ollama: nemotron-ultra:253b

Quantization Options

| Format | File Size | VRAM Required | Ollama Tag |
|---|---|---|---|
| Q4_K_M (recommended) | 151 GB | 155 GB | 253b-q4_K_M |
| Q8_0 | 269 GB | 275 GB | 253b-q8_0 |
| F16 | 506 GB | 508 GB | 253b-fp16 |
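The file sizes above follow directly from a back-of-the-envelope estimate: parameters × bits-per-weight ÷ 8. The bits-per-weight values below are approximate community figures for each GGUF format, not official numbers, so the results only roughly match the table.

```python
# Weight-memory estimate: parameters x bits-per-weight / 8 bytes.
# Bits-per-weight are approximate (Q4_K_M mixes 4- and 6-bit blocks,
# Q8_0 carries per-block scales), so expect a few GB of slack.

PARAMS = 253e9
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5, "F16": 16.0}

def weight_gb(fmt: str) -> float:
    """Estimated on-disk weight size in GB for a given quantization."""
    return PARAMS * BITS_PER_WEIGHT[fmt] / 8 / 1e9

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt}: ~{weight_gb(fmt):.0f} GB")
```

F16 works out to exactly 2 bytes per parameter (253 × 2 = 506 GB), matching the table; the quantized formats land within a couple of GB of the listed sizes. The extra headroom in the VRAM column covers the KV cache and activations.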

Compatible Hardware

Q4_K_M requires 155 GB VRAM

| Hardware | VRAM | Type | Fit | Est. Speed |
|---|---|---|---|---|
| Mac Studio M4 Ultra 512GB | 512 GB | mac | Runs | ~5 tok/s |
| Mac Pro M2 Ultra 192GB | 192 GB | mac | Runs | ~5 tok/s |
| Mac Studio M4 Ultra 192GB | 192 GB | mac | Runs | ~5 tok/s |
| Mac Studio M4 Max 128GB | 128 GB | mac | CPU Offload | ~1 tok/s |
| MacBook Pro M4 Max 128GB | 128 GB | mac | CPU Offload | ~1 tok/s |
| MacBook Pro M5 Max 128GB | 128 GB | mac | CPU Offload | ~1 tok/s |
101 hardware device(s) cannot run this model at Q4_K_M.
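The fit column above reduces to a simple threshold check: the full Q4_K_M model (155 GB) must fit in VRAM or unified memory to run at full speed; otherwise layers spill to CPU with a roughly 5× slowdown. A minimal sketch of that logic, with the threshold taken from the table:

```python
# Fit classifier implied by the table: compare available memory against
# the Q4_K_M requirement. The 155 GB threshold comes from the table above;
# the two-way Runs/CPU Offload split is illustrative.

REQUIRED_GB = 155  # Q4_K_M VRAM requirement

def fit(vram_gb: float) -> str:
    """Classify whether a device runs the model fully in memory."""
    return "Runs" if vram_gb >= REQUIRED_GB else "CPU Offload"

print(fit(192))  # Runs
print(fit(128))  # CPU Offload
```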

Benchmark Scores

88.0
mmlu
97.0
math500
76.0
gpqa