
Model notes

Strict tool grammar. Gemma 4 enforces tool call format through constrained decoding (llguidance). Enabled by default.
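
To illustrate the general idea behind constrained decoding, here is a toy sketch in Python. It does not use llguidance or any mistral.rs API; the vocabulary, the `GRAMMAR` state machine, and the function name are all hypothetical. At each step, tokens the grammar forbids are masked out before the best token is picked, so the output is well-formed even when the model's raw scores favor something else.

```python
import math

# Toy vocabulary (hypothetical); a real tokenizer has many thousands of entries.
VOCAB = ["{", "}", '"name"', ":", '"get_weather"', "hello", "<eos>"]

# Hypothetical grammar as a state machine: state -> {allowed token: next state}.
GRAMMAR = {
    0: {"{": 1},
    1: {'"name"': 2},
    2: {":": 3},
    3: {'"get_weather"': 4},
    4: {"}": 5},
    5: {"<eos>": 6},
}


def constrained_greedy_decode(logits_per_step):
    """Greedy decoding restricted to tokens the grammar allows in each state."""
    state, output = 0, []
    for logits in logits_per_step:
        allowed = GRAMMAR.get(state)
        if not allowed:
            break  # grammar has reached its final state
        # Forbidden tokens get -inf so they can never be selected.
        masked = [
            score if token in allowed else -math.inf
            for token, score in zip(VOCAB, logits)
        ]
        best = VOCAB[max(range(len(VOCAB)), key=masked.__getitem__)]
        output.append(best)
        state = allowed[best]
    return output


# The unconstrained scores prefer the token "hello" at every step.
biased_logits = [[0.1, 0.1, 0.1, 0.1, 0.1, 5.0, 0.1]] * 6
tokens = constrained_greedy_decode(biased_logits)
print("".join(t for t in tokens if t != "<eos>"))  # prints {"name":"get_weather"}
```

A production engine compiles a full grammar and computes the mask over the real vocabulary, but the masking step is the same in spirit.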

Multimodal inputs. Gemma 4 E4B and E2B accept audio, image, and video parts in the same message. No separate variant or flag required.

HF license acceptance. Gemma requires accepting a license on the Hugging Face model page before download. After acceptance, authenticating through mistralrs login with an HF token is sufficient.

Video frame sampling controls. Per-request overrides for video frame sampling are not currently exposed.

Prefix cache with mixed-modal inputs. In multi-turn conversations that reuse prefix cache entries, pixel inputs are narrowed on each turn by grid count, not by image count.
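
The difference between narrowing by grid count and by image count can be shown with a small sketch. This is one reading of the note above, not the engine's implementation: it assumes pixel inputs are stacked one row per grid and that each image may contribute a different number of grids. The function name and data layout are hypothetical.

```python
def narrow_pixel_inputs(pixel_rows, grids_per_image, num_cached_images):
    """Drop the rows belonging to images already covered by the prefix cache.

    Assumption: pixel_rows holds one entry per grid, in image order.
    """
    cached_rows = sum(grids_per_image[:num_cached_images])
    return pixel_rows[cached_rows:]


# Three images that were split into 4, 1, and 2 grids respectively.
grids_per_image = [4, 1, 2]
pixel_rows = [
    f"img{i}_grid{g}" for i, n in enumerate(grids_per_image) for g in range(n)
]

# The first two images are already in the prefix cache.
remaining = narrow_pixel_inputs(pixel_rows, grids_per_image, num_cached_images=2)
print(remaining)  # prints ['img2_grid0', 'img2_grid1']

# Slicing by image count instead would keep rows that belong to cached images.
wrong = pixel_rows[2:]
print(len(wrong))  # prints 5
```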

Hybrid attention. Qwen3 Next mixes linear attention layers with standard softmax attention. The cost profile at long contexts differs from a pure softmax model.
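
A rough way to see why the cost profile differs: softmax attention layers keep a KV cache that grows with context length, while linear attention layers keep a fixed-size state. The sketch below compares the two under made-up dimensions. The layer counts, head sizes, and state size are illustrative assumptions, not Qwen3 Next's actual configuration.

```python
def decode_cache_elements(context_len, softmax_layers, linear_layers,
                          kv_heads=8, head_dim=128,
                          linear_state_elements=8 * 128 * 128):
    """Elements held in memory while decoding one sequence (all sizes hypothetical)."""
    # Softmax layers store keys and values for every past token.
    kv_cache = softmax_layers * 2 * kv_heads * head_dim * context_len
    # Linear attention layers store a state whose size ignores context length.
    recurrent_state = linear_layers * linear_state_elements
    return kv_cache + recurrent_state


for context_len in (4_096, 262_144):
    pure = decode_cache_elements(context_len, softmax_layers=48, linear_layers=0)
    hybrid = decode_cache_elements(context_len, softmax_layers=12, linear_layers=36)
    print(context_len, pure, hybrid, round(pure / hybrid, 2))
```

Under these assumptions the hybrid model's memory still grows with context, but only through its softmax layers, so the gap to a pure softmax model widens toward the ratio of softmax layer counts.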

Multi-head Latent Attention (MLA). DeepSeek’s attention variant produces smaller KV caches than standard attention. If you see unexpected behavior under paged attention, try --paged-attn off as a sanity check.
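
The cache saving can be estimated with simple arithmetic. Standard multi-head attention caches a key and a value vector per head for every token, while MLA caches a single compressed latent plus a small positional key shared across heads. The dimensions below are illustrative assumptions chosen to resemble a large MLA model; check the model's own config for real values.

```python
def mha_cache_elements_per_token(n_heads, head_dim):
    """Standard attention: one key and one value vector per head."""
    return 2 * n_heads * head_dim


def mla_cache_elements_per_token(kv_latent_dim, rope_dim):
    """MLA: one compressed KV latent plus one shared positional key."""
    return kv_latent_dim + rope_dim


# Illustrative dimensions (assumed, per layer).
mha = mha_cache_elements_per_token(n_heads=128, head_dim=128)
mla = mla_cache_elements_per_token(kv_latent_dim=512, rope_dim=64)
print(mha, mla, round(mha / mla, 1))  # prints 32768 576 56.9
```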

MoE routing. Phi 3.5 MoE routes each token to a subset of experts. Outputs for the same seed can vary across quantization levels because the router is sensitive to numerical noise. See the explanation on quantization tradeoffs.
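
A toy example of that sensitivity: when two experts have nearly equal router scores, a perturbation on the order of quantization error is enough to change which experts are selected. The logits, the noise values, and the top-2 choice here are made up for illustration.

```python
def top_k_experts(logits, k=2):
    """Return the indices of the k highest-scoring experts, sorted ascending."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return sorted(ranked[:k])


# Experts 1 and 2 are nearly tied (hypothetical values).
router_logits = [1.2000, 0.8010, 0.8000, -0.5000]

# Small perturbation standing in for quantization error.
noise = [0.0, -0.002, 0.002, 0.0]
noisy_logits = [logit + n for logit, n in zip(router_logits, noise)]

print(top_k_experts(router_logits))  # prints [0, 1]
print(top_k_experts(noisy_logits))   # prints [0, 2]
```

Once a token is routed to a different expert, every later token sees different hidden states, so a small early divergence can change the rest of the generation.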

MatFormer slice. Gemma 3n is MatFormer-trained. Select a size variant with --matformer-config-path and --matformer-slice-name (or the matching SDK/TOML fields). Without configuration, the default slice loads. See the MatFormer guide.

Context length. Llama 4 Scout supports up to 10M tokens. Using the full context requires paged attention with a large memory budget, generally with multi-GPU tensor parallelism.
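
A back-of-the-envelope estimate shows why the memory budget matters. The sketch below computes a naive upper bound on KV cache size, assuming every layer caches keys and values for the full context in 16-bit precision. The layer count, KV head count, and head dimension are illustrative assumptions, not a verified Llama 4 Scout configuration, and a model that limits attention span in some layers would need less.

```python
def kv_cache_bytes(tokens, n_layers, n_kv_heads, head_dim, bytes_per_element=2):
    """Naive KV cache size: keys and values for every token in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * tokens * bytes_per_element


# Assumed dimensions for illustration only.
total = kv_cache_bytes(tokens=10_000_000, n_layers=48, n_kv_heads=8, head_dim=128)
print(f"{total / 2**40:.2f} TiB")  # prints 1.79 TiB
```

Even as a rough bound for a single sequence, a figure of this size exceeds the memory of any single current GPU, which is why the full context generally calls for multiple devices.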

License variants. FLUX.1-schnell is permissive. FLUX.1-dev requires accepting a license on Hugging Face.

Quantization sensitivity. Diffusion models are more sensitive to quantization than language models.

File an issue on GitHub, with a reproducer when possible.