Skip to content

Use speculative decoding

Speculative decoding pairs a small fast model that drafts tokens with a large target model that verifies them in a single forward pass.

Speculative decoding is exposed through the SDKs only. It is not configurable via the CLI or the TOML config.

Runner accepts which_draft (a Which) and speculative_gamma (default 32):

from mistralrs import Runner, Which
runner = Runner(
which=Which.Plain(model_id="Qwen/Qwen3-32B"),
which_draft=Which.Plain(model_id="Qwen/Qwen3-0.6B"),
speculative_gamma=4,
in_situ_quant="4",
)

TextSpeculativeBuilder is the entry point.

Throughput gain depends on acceptance rate, the fraction of draft tokens the target accepts. Same-family draft/target pairings tend to have higher acceptance rates than cross-family pairings.

Both models load together and share GPU memory.