Model notes
Gemma 4
Section titled “Gemma 4”Strict tool grammar. Gemma 4 enforces tool call format through constrained decoding (llguidance). Enabled by default.
Multimodal inputs. Gemma 4 E4B and E2B accept audio, image, and video parts in the same message. No separate variant or flag required.
HF license acceptance. Gemma requires accepting a license on the Hugging Face model page before download. mistralrs login with an HF token suffices after acceptance.
Qwen3-VL
Section titled “Qwen3-VL”Video frame sampling controls. Per-request sampling overrides are not currently exposed.
Prefix cache with mixed-modal inputs. In multi-turn conversations reusing prefix cache entries, pixel inputs are narrowed per turn by grid count, not image count.
Qwen3 Next
Section titled “Qwen3 Next”Hybrid attention. Qwen3 Next mixes linear attention layers with standard softmax attention. The cost profile at long contexts differs from a pure softmax model.
DeepSeek V2 and V3
Section titled “DeepSeek V2 and V3”Multi-head Latent Attention (MLA). DeepSeek’s attention variant produces smaller KV caches than standard attention. For unexpected behavior under paged attention, try --paged-attn off as a sanity check.
Phi 3.5 MoE
Section titled “Phi 3.5 MoE”MoE routing. Phi 3.5 MoE routes per-token to experts. Outputs for the same seed vary across quantization levels due to router sensitivity to numerical noise. See the explanation on quantization tradeoffs.
Gemma 3n
Section titled “Gemma 3n”MatFormer slice. Gemma 3n is MatFormer-trained. Select a size variant with --matformer-config-path and --matformer-slice-name (or the matching SDK/TOML fields). Without configuration, the default slice loads. See the MatFormer guide.
Llama 4
Section titled “Llama 4”Context length. Llama 4 Scout supports up to 10M tokens. Using the full context requires paged attention with a large memory budget, generally with multi-GPU tensor parallelism.
License variants. FLUX.1-schnell is permissive. FLUX.1-dev requires accepting a license on Hugging Face.
Quantization sensitivity. Diffusion models are more sensitive to quantization than language models.
Reporting issues
Section titled “Reporting issues”File an issue on GitHub, with a reproducer when possible.