
Use speculative decoding

Speculative decoding lets a smaller assistant model propose several future tokens, which the target model then verifies in parallel. mistral.rs exposes this through its generic multi-token prediction (MTP) API.

| Mode | Target models | Assistant model | Status | Guide |
| --- | --- | --- | --- | --- |
| MTP | Gemma 4 | Gemma 4 assistant checkpoints | Supported with PagedAttention | Gemma 4 MTP |

Legacy target/draft speculative decoding has been removed. New speculative decoding features should use the MTP proposer/target path.

Use --mtp-model with an assistant model id or path:

mistralrs run -m <target-model> \
    --mtp-model <assistant-model-or-path> \
    --mtp-n-predict 6

--mtp-n-predict controls how many assistant tokens are proposed per step. If it is omitted, mistral.rs reads num_assistant_tokens from the assistant generation_config.json and falls back to 6.
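The precedence described above can be sketched in a few lines of Python. This is only an illustration of the documented order (explicit value, then num_assistant_tokens from generation_config.json, then 6). The helper name resolve_n_predict and its arguments are hypothetical and are not part of the mistral.rs API.

```python
import json
from pathlib import Path
from typing import Optional

DEFAULT_N_PREDICT = 6  # documented fallback value


def resolve_n_predict(explicit: Optional[int], assistant_dir: Path) -> int:
    """Hypothetical helper: mirror the documented lookup order for n_predict."""
    # 1. An explicitly supplied value (e.g. --mtp-n-predict) wins.
    if explicit is not None:
        return explicit

    # 2. Otherwise look for num_assistant_tokens in the assistant's
    #    generation_config.json, if that file exists.
    config_path = assistant_dir / "generation_config.json"
    if config_path.is_file():
        config = json.loads(config_path.read_text())
        value = config.get("num_assistant_tokens")
        if isinstance(value, int):
            return value

    # 3. Fall back to the default.
    return DEFAULT_N_PREDICT
```

For example, resolve_n_predict(None, Path("<assistant-model-or-path>")) would return the value stored in the assistant's generation_config.json when the key is present, and 6 otherwise.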

In Python, Runner accepts mtp_model and mtp_n_predict:

from mistralrs import Runner, Which

runner = Runner(
    which=Which.Plain(model_id="<target-model>"),
    mtp_model="<assistant-model-or-path>",
    mtp_n_predict=6,
)

In Rust, builders that load text, multimodal, or auto-detected models accept an MTP config:

use mistralrs::{ModelBuilder, MtpConfig};

let model = ModelBuilder::new("<target-model>")
    .with_mtp_config(MtpConfig {
        model: "<assistant-model-or-path>".to_string(),
        n_predict: Some(6),
    })
    .build()
    .await?;

with_mtp_model("<assistant-model-or-path>", Some(6)) is equivalent for common cases.

MTP remains exact because every token is verified by the target model before it is emitted, so the assistant affects speed rather than output. The throughput gain depends on how many proposed tokens the target accepts per step and on the cost of the target verification pass.
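To make the propose/verify idea concrete, here is a minimal toy sketch in Python for the greedy case. It is not the mistral.rs implementation. The functions speculative_step, generate, toy_target, and toy_assistant are hypothetical, the "models" are simple arithmetic rules, and the acceptance rule (accept while the assistant's token matches the target's greedy choice, then emit one target token) is an assumption based on the common form of speculative decoding. A real engine verifies all proposed positions in a single batched target pass; the loop below checks them one by one for readability.

```python
from typing import Callable, List

# A "model" here is just a function from a token context to the next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, assistant: NextToken,
                     context: List[int], n_predict: int) -> List[int]:
    """Return the tokens emitted by one propose/verify step (greedy case)."""
    # The assistant proposes n_predict tokens autoregressively.
    proposed: List[int] = []
    for _ in range(n_predict):
        proposed.append(assistant(context + proposed))

    emitted: List[int] = []
    for token in proposed:
        expected = target(context + emitted)
        if token != expected:
            # First mismatch: emit the target's token and stop this step.
            emitted.append(expected)
            return emitted
        emitted.append(token)

    # Every proposal was accepted; the verification pass also yields one
    # extra target token (assumed here, as in common implementations).
    emitted.append(target(context + emitted))
    return emitted


def generate(target: NextToken, assistant: NextToken,
             prompt: List[int], max_new: int, n_predict: int) -> List[int]:
    """Run propose/verify steps until max_new tokens have been produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target, assistant, out, n_predict))
    return out[len(prompt):len(prompt) + max_new]


def toy_target(ctx: List[int]) -> int:
    # Toy target rule: a deterministic function of the last token.
    return (ctx[-1] * 3 + 1) % 7


def toy_assistant(ctx: List[int]) -> int:
    # Toy assistant: agrees with the target except after token 5.
    return 0 if ctx[-1] == 5 else toy_target(ctx)
```

Because every emitted token is either an accepted proposal that equals the target's choice or the target's own token, generate(toy_target, toy_assistant, ...) produces the same sequence as decoding with toy_target alone. A step emits between 1 and n_predict + 1 tokens, which is why the acceptance rate drives the speedup.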

Gemma 4 MTP requires PagedAttention. Non-paged KV-cache MTP is disabled while that path is developed separately.

MTP supports batched generation and constrained decoding.