Throughput tuning
A handful of flags control how many requests run at once and how much repeated work the engine can skip. The defaults are reasonable for a single-user server; tune them when traffic is concurrent or shares long prompts.
    mistralrs serve -m Qwen/Qwen3-4B \
      --max-seqs 64 \
      --pa-memory-fraction 0.9

--pa-memory-fraction sizes the paged attention block pool; see When paged attention matters below.
Measure with mistralrs bench -m <model> before and after changing anything; flags that help one workload can hurt another.
Concurrency: max running sequences
Cap the number of sequences running at once. Waiting requests queue until a slot frees. Raise the cap for high-concurrency serving; lower it to bound per-request latency or memory.
CLI:

    mistralrs serve -m Qwen/Qwen3-4B --max-seqs 64

Python:

    from mistralrs import Runner, Which

    runner = Runner(which=Which.Plain(model_id="Qwen/Qwen3-4B"), max_seqs=64)

Rust:

    let model = mistralrs::ModelBuilder::new("Qwen/Qwen3-4B")
        .with_max_num_seqs(64)
        .build()
        .await?;

The default differs by surface:
| Surface | Flag / argument | Default |
|---|---|---|
| CLI | --max-seqs | 32 |
| Python | Runner(max_seqs=...) | 16 |
| Rust | .with_max_num_seqs(...) | 32 |
More running sequences need more KV cache. With paged attention, size the block pool to match: one of --pa-context-len, --pa-memory-mb, or --pa-memory-fraction (default: 90% of available VRAM). If the pool runs out of blocks mid-generation, the scheduler preempts a running sequence, frees its blocks, and requeues it, which costs recomputation.
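As a rough way to check whether a block pool matches a concurrency target, the sketch below estimates KV cache bytes per token with the standard formula (one key and one value vector per layer) and divides a pool budget by the per-sequence cost. The model shape and the 8 GiB budget are hypothetical values chosen for illustration, not read from any real model config, and the estimate ignores block-size rounding and activation memory.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_resident_seqs(pool_bytes: int, tokens_per_seq: int, per_token: int) -> int:
    """How many sequences of tokens_per_seq tokens fit in the pool at once."""
    return pool_bytes // (tokens_per_seq * per_token)


# Hypothetical model shape (assumption, for illustration only).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

# Assume 8 GiB is left for the KV block pool after weights are loaded.
pool_bytes = 8 * 1024**3

# Number of 4096-token sequences the pool can hold simultaneously.
fit = max_resident_seqs(pool_bytes, tokens_per_seq=4096, per_token=per_token)
print(f"{per_token} bytes/token, {fit} full-length sequences fit")
```

Under these assumptions only 16 full-length sequences fit, so a cap of 64 running sequences would rely on most of them staying well short of 4096 tokens; otherwise the pool runs dry and the preemption path described above kicks in.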
When paged attention matters
Concurrency beyond a few sequences is the main reason to care about paged attention. It is the difference between per-sequence contiguous caches sized for the maximum context and a shared block pool with a fixed budget:
- With paged attention (default on CUDA): continuous batching against the block pool, block-level prefix sharing, and the CUDA decode-graph and FlashInfer (NVIDIA’s attention-kernel library) fast paths.
- Without it (Metal, CPU, or --paged-attn off): the default scheduler is first-come-first-served and batches running sequences by equal length; sequences at other lengths wait their turn and accumulate priority.
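The memory side of this contrast can be illustrated with a small calculation. The sketch below counts reserved token slots for a few sequences under each scheme; the block size of 32 and the 4096-token maximum context are assumptions for the example, not values taken from the engine.

```python
import math


def contiguous_slots(seq_lens, max_context):
    """Contiguous caches reserve max_context slots per sequence, whatever its length."""
    return len(seq_lens) * max_context


def paged_slots(seq_lens, block_size):
    """A block pool hands out whole blocks, so each sequence rounds up to a block multiple."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


# Four in-flight sequences of very different lengths (illustrative values).
seq_lens = [100, 900, 2500, 40]

contiguous = contiguous_slots(seq_lens, max_context=4096)
paged = paged_slots(seq_lens, block_size=32)  # block size is an assumed value

print(f"contiguous: {contiguous} slots, paged: {paged} slots")
```

The gap between the two numbers is headroom that a block pool can spend on additional concurrent sequences instead of unused padding.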
Prefix caching
Prefix caching skips prefill for tokens the engine has already processed, which is most valuable for multi-turn chat and shared system prompts.
--prefix-cache-n sets how many cached prefixes stay on device (default 16, 0 disables). --no-kv-cache disables the KV cache entirely and prefix caching with it.
CLI:

    mistralrs serve -m Qwen/Qwen3-4B --prefix-cache-n 32

Python:

    from mistralrs import Runner, Which

    runner = Runner(
        which=Which.Plain(model_id="Qwen/Qwen3-4B"),
        prefix_cache_n=32,
        no_kv_cache=False,
    )

Rust:

    let model = mistralrs::ModelBuilder::new("Qwen/Qwen3-4B")
        .with_prefix_cache_n(Some(32))
        .build()
        .await?;

.with_no_kv_cache() disables the KV cache (and prefix caching with it).
The mechanism depends on the attention mode:
- Paged attention: sequences sharing a token prefix reuse the same reference-counted cache blocks; matching blocks of a new request are not recomputed.
- Non-paged: completed sequences’ caches are kept whole; a new request whose tokens extend a cached sequence resumes from it. Beyond --prefix-cache-n device entries, the oldest caches are evicted (dropped, not offloaded); a later matching request re-prefills.
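To make the paged case concrete, here is a toy model of block-level prefix sharing with reference counts. It is a sketch of the general idea only, not the engine's implementation: the ToyBlockPool class, its method names, and the block size of 4 are all hypothetical.

```python
from collections import Counter

BLOCK_SIZE = 4  # toy value for readability; real block sizes are larger


class ToyBlockPool:
    """Tracks which full blocks are cached and how many sequences reference each."""

    def __init__(self):
        # Key: the whole token prefix ending at a block boundary, so a block
        # only matches when every token before it matches too.
        self.refcount = Counter()

    def _block_keys(self, tokens):
        full = len(tokens) - len(tokens) % BLOCK_SIZE  # ignore the partial tail block
        return [tuple(tokens[:end]) for end in range(BLOCK_SIZE, full + 1, BLOCK_SIZE)]

    def admit(self, tokens):
        """Register a sequence; return how many prompt tokens need no prefill."""
        reused = 0
        for key in self._block_keys(tokens):
            if self.refcount[key] > 0:
                reused += BLOCK_SIZE  # block already cached: share it
            self.refcount[key] += 1
        return reused

    def release(self, tokens):
        """Drop a sequence's references; blocks with no users are freed."""
        for key in self._block_keys(tokens):
            self.refcount[key] -= 1
            if self.refcount[key] == 0:
                del self.refcount[key]


pool = ToyBlockPool()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]  # shared by both requests

first = pool.admit(system_prompt + [9, 10])
second = pool.admit(system_prompt + [11, 12, 13, 14])
print(first, second)
```

The first request computes everything; the second reuses the two blocks covering the shared system prompt and only prefills its own suffix. Keying blocks by the full prefix is one simple way to guarantee that a reused block was computed under an identical context.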
Memory planning for automatic device mapping
Automatic device mapping reserves activation and KV memory for a worst case you declare: --max-seq-len (default 4096) and --max-batch-size (default 1). If you expect long prompts or large batches, raise these so layers are not placed too greedily; if memory is tight, the defaults already keep reservations small. See distributed inference for manual placement.
CUDA defaults worth knowing
On CUDA, two decode fast paths are on by default and need no tuning: FlashInfer paged kernels and CUDA decode graphs. They exist as env-var switches (MISTRALRS_FLASHINFER_DECODE, MISTRALRS_CUDA_GRAPHS) only for debugging and benchmarking comparisons; details are on the paged attention page.
Quantization is the other big throughput lever, traded against quality: see quantize a model.