Introducing DiffusionGemma, an experimental open 26B Mixture of Experts model that moves beyond traditional sequential generation to process and generate entire blocks of text simultaneously.
DiffusionGemma unlocks new value for developers:
- Generates 1,000+ tokens/sec on an NVIDIA H100 and 700+ tokens/sec on an RTX 5090;
- Optimizes non-linear workflows like code infilling, inline editing, and real-time self-correction;
- Fits comfortably within the 18GB VRAM limits of high-end dedicated consumer GPUs when quantized;
- Supports native integration with MLX, vLLM, Hugging Face, and Unsloth, with advanced NVIDIA NVFP4 kernel optimization.
See the official webpage.
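
As a rough sanity check on the 18GB figure, the weight footprint of a 26B-parameter model can be estimated from the bits stored per parameter. The sketch below is back-of-the-envelope arithmetic only: the bit widths are illustrative assumptions rather than published DiffusionGemma settings, the helper name `weight_memory_gb` is hypothetical, and activations, caches, and runtime overhead are not counted. It uses the total parameter count because a Mixture of Experts model generally keeps every expert's weights in memory, even though only some are active per token.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 26e9  # total parameter count stated in the announcement

# Assumed storage formats (illustrative, not confirmed settings).
# The last entry models a 4-bit format that also stores one 8-bit
# scale factor per block of 16 weights: 4 + 8/16 = 4.5 bits per weight.
ASSUMED_FORMATS = {
    "16-bit": 16.0,
    "8-bit": 8.0,
    "4-bit": 4.0,
    "4-bit with block scales": 4.5,
}

for name, bits in ASSUMED_FORMATS.items():
    print(f"{name}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions, only the 4-bit variants (roughly 13 to 15 GB of weights) leave headroom under an 18GB budget, which is consistent with the "when quantized" qualifier.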

