Block-diffusion models
Block-diffusion models generate text by iteratively denoising whole blocks of tokens in parallel instead of sampling one token at a time. The mechanism:
- A causal encoder fills the KV cache with the prompt.
- The model refines a block (a “canvas”) of mask tokens over a handful of bidirectional passes.
- It commits the block and repeats.
Because each pass updates many tokens at once and a block needs only a handful of passes, decode throughput is higher than that of a comparable autoregressive model.
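The steps above can be sketched as a toy loop. This is a minimal illustration and not the model's actual algorithm: the names `refine_once` and `generate` are hypothetical, the "prediction" is copied from a known reference sequence, and the rule of unmasking half of the remaining positions per pass is an assumption standing in for the real acceptance criterion.

```python
import math

MASK = "<mask>"  # stand-in for the mask token


def refine_once(canvas, reference):
    """One toy refinement pass: unmask half of the still-masked positions.

    A real model would predict every position in parallel and keep the ones it
    accepts; here the values are simply copied from `reference` (assumption).
    """
    masked = [i for i, tok in enumerate(canvas) if tok == MASK]
    keep = masked[: max(1, math.ceil(len(masked) / 2))]
    refined = list(canvas)
    for i in keep:
        refined[i] = reference[i]
    return refined


def generate(reference, block_size):
    """Fill `reference` block by block and count the refinement passes used."""
    committed, passes = [], 0
    for start in range(0, len(reference), block_size):
        block_ref = reference[start:start + block_size]
        canvas = [MASK] * len(block_ref)   # fresh canvas of mask tokens
        while MASK in canvas:              # denoising loop for this block
            canvas = refine_once(canvas, block_ref)
            passes += 1
        committed.extend(canvas)           # commit the block, then repeat
    return committed, passes


tokens = [f"t{i}" for i in range(16)]
output, passes = generate(tokens, block_size=8)
print(output == tokens, passes, len(tokens))  # True 8 16
```

In this toy run, 16 tokens are committed in 8 passes instead of 16 one-token steps. Each pass processes a whole block, so pass counts alone do not determine the real speedup.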
Currently supported:
- DiffusionGemma (google/diffusiongemma-26B-A4B-it), a 26B-A4B MoE (Mixture of Experts) model with vision input, built on the Gemma 4 architecture.
Quick start
No special flags or APIs: block-diffusion models are detected automatically and served through the standard endpoints.
    mistralrs run -m google/diffusiongemma-26B-A4B-it

    mistralrs serve -p 1234 -m google/diffusiongemma-26B-A4B-it

    curl http://localhost:1234/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "default",
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "max_tokens": 1024
      }'

The standard chat-completion API works unchanged. See the Python example (vision input).
    from mistralrs import ChatCompletionRequest, MultimodalArchitecture, Runner, Which

    runner = Runner(
        which=Which.MultimodalPlain(
            model_id="google/diffusiongemma-26B-A4B-it",
            arch=MultimodalArchitecture.DiffusionGemma,
        )
    )

    response = runner.send_chat_completion_request(
        ChatCompletionRequest(
            model="default",
            messages=[{"role": "user", "content": "Why is the sky blue?"}],
            max_tokens=1024,
        )
    )
    print(response.choices[0].message.content)

The standard chat API works unchanged. See the Rust example (streaming, shows block-at-a-time output).
    use mistralrs::{MultimodalModelBuilder, TextMessageRole, TextMessages};

    // The lines below use `.await?`, so they belong inside an async function
    // that returns a Result.
    let model = MultimodalModelBuilder::new("google/diffusiongemma-26B-A4B-it")
        .build()
        .await?;

    let messages = TextMessages::new().add_message(TextMessageRole::User, "Why is the sky blue?");
    let response = model.send_chat_request(messages).await?;
    println!("{}", response.choices[0].message.content.as_ref().unwrap());

What behaves differently
- Streaming is bursty. Output arrives one block (256 tokens by default, set by the checkpoint’s canvas_length) at a time, after that block’s denoising loop converges, rather than token by token.
- Sampling is the diffusion schedule. The temperature ramp, entropy-bound acceptance, and stopping thresholds come from the checkpoint’s generation_config.json. Request-level temperature, top_p, and penalties are ignored. max_tokens still caps output length.
- Stats split differently. Prompt T/s measures the encoder prefill alone; decode T/s is the effective denoising throughput (committed tokens over denoising time).
- Thinking is on by default. DiffusionGemma’s channel-tag reasoning is parsed into the reasoning field, like other thinking models.
- Tool calling works through the model’s native format, including calls spanning block boundaries. Grammar-constrained generation is NOT enforced during denoising, so tool_choice: required, named tools, and JSON schema outputs are unconstrained: the model relies on its trained formatting instead. (Per-token grammars are incompatible with parallel token refinement.)
- Concurrency batches by context length. Concurrent requests with equal context lengths batch together and denoise their blocks in lockstep, amortizing the per-block MoE computation across requests. Requests with different prompt lengths run as separate groups.
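Because grammar constraints are not enforced during denoising, a client may want to check tool-call arguments itself before acting on them. The following is a minimal sketch, assuming the arguments arrive as a JSON string. The helper `parse_tool_arguments`, the name-to-type schema format, and the `city` argument are hypothetical and not part of mistral.rs.

```python
import json


def parse_tool_arguments(raw, required):
    """Parse raw tool-call arguments and check required keys and types.

    `required` maps argument name -> expected Python type (hypothetical format).
    Returns (args, errors); `args` is None when the text is not a JSON object.
    """
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return None, ["arguments must be a JSON object"]

    errors = []
    for name, expected in required.items():
        if name not in args:
            errors.append(f"missing argument: {name}")
        elif not isinstance(args[name], expected):
            errors.append(f"wrong type for argument: {name}")
    return args, errors


schema = {"city": str}  # hypothetical schema for an illustrative weather tool

ok_args, ok_errors = parse_tool_arguments('{"city": "Paris"}', schema)
bad_args, bad_errors = parse_tool_arguments('{"city": "Paris"', schema)  # truncated JSON

print(ok_args, ok_errors)   # {'city': 'Paris'} []
print(bad_args is None, len(bad_errors))  # True 1
```

When validation fails, one option is to retry the request or return the error list to the model in a follow-up message; which is appropriate depends on the application.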
See also: model family notes and the supported models reference.