Throughput tuning

A handful of flags control how many requests run at once and how much repeated work the engine can skip. The defaults are reasonable for a single-user server; tune them when traffic is concurrent or shares long prompts.

mistralrs serve -m Qwen/Qwen3-4B \
  --max-seqs 64 \
  --pa-memory-fraction 0.9

--pa-memory-fraction sizes the paged attention block pool; see When paged attention matters below.

Measure with mistralrs bench -m <model> before and after changing anything; flags that help one workload can hurt another.

Concurrency: max running sequences

This setting caps the number of sequences running at once. Waiting requests queue until a slot frees. Raise it for high-concurrency serving; lower it to bound per-request latency or memory.

mistralrs serve -m Qwen/Qwen3-4B --max-seqs 64

from mistralrs import Runner, Which

runner = Runner(which=Which.Plain(model_id="Qwen/Qwen3-4B"), max_seqs=64)

let model = mistralrs::ModelBuilder::new("Qwen/Qwen3-4B")
    .with_max_num_seqs(64)
    .build()
    .await?;

The default differs by surface:

| Surface | Flag / argument | Default |
|---|---|---|
| CLI | --max-seqs | 32 |
| Python | Runner(max_seqs=...) | 16 |
| Rust | .with_max_num_seqs(...) | 32 |

More running sequences need more KV cache. With paged attention, size the block pool to match: one of --pa-context-len, --pa-memory-mb, or --pa-memory-fraction (default: 90% of available VRAM). If the pool runs out of blocks mid-generation, the scheduler preempts a running sequence, frees its blocks, and requeues it, which costs recomputation.
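
To get a feel for how concurrency translates into KV cache demand, the back-of-the-envelope arithmetic below may help. It is only a sketch and not how mistral.rs sizes its pool: the helper names are made up for illustration, and the layer count, KV head count, head dimension, and dtype width are hypothetical placeholders to be replaced with the values from your model's config.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    # Each layer stores one key vector and one value vector per KV head per token.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_mib(max_seqs: int, tokens_per_seq: int, per_token_bytes: int) -> float:
    # Worst case: every running sequence holds tokens_per_seq tokens at the same time.
    return max_seqs * tokens_per_seq * per_token_bytes / (1024 ** 2)


# Hypothetical model shape (placeholders, not read from any real config).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token

# 64 running sequences, each assumed to reach 4096 tokens.
print(kv_cache_mib(max_seqs=64, tokens_per_seq=4096, per_token_bytes=per_token))  # 32768.0 MiB
```

If an estimate like this is well above what the block pool can hold, full load will likely trigger preemption, so either give the pool a larger budget or lower the sequence cap.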

When paged attention matters

Concurrency beyond a few sequences is the main reason to care about paged attention. It is the difference between per-sequence contiguous caches sized for the maximum context and a shared block pool with a fixed budget:

With paged attention (default on CUDA): continuous batching against the block pool, block-level prefix sharing, and the CUDA decode-graph and FlashInfer (NVIDIA's attention-kernel library) fast paths.
Without it (Metal, CPU, or --paged-attn off): the default scheduler is first-come-first-served and batches running sequences by equal length; sequences at other lengths wait their turn and accumulate priority.

Prefix caching

Prefix caching skips prefill for tokens the engine has already processed, which is most valuable for multi-turn chat and shared system prompts.

--prefix-cache-n sets how many cached prefixes stay on device (default 16, 0 disables). --no-kv-cache disables the KV cache entirely and prefix caching with it.

mistralrs serve -m Qwen/Qwen3-4B --prefix-cache-n 32

from mistralrs import Runner, Which

runner = Runner(
    which=Which.Plain(model_id="Qwen/Qwen3-4B"),
    prefix_cache_n=32,
    no_kv_cache=False,
)

let model = mistralrs::ModelBuilder::new("Qwen/Qwen3-4B")
    .with_prefix_cache_n(Some(32))
    .build()
    .await?;

.with_no_kv_cache() disables the KV cache (and prefix caching with it).

The mechanism depends on the attention mode:

Paged attention: sequences sharing a token prefix reuse the same reference-counted cache blocks; matching blocks of a new request are not recomputed.
Non-paged: completed sequences' caches are kept whole; a new request whose tokens extend a cached sequence resumes from it. Beyond --prefix-cache-n device entries, the oldest caches are evicted (dropped, not offloaded); a later matching request re-prefills.
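
The non-paged behavior can be pictured with a toy model. The sketch below is an illustration only, not the engine's implementation: `ToyPrefixCache` and its methods are hypothetical names, "oldest" is assumed to mean oldest by insertion, and a string stands in for the real KV cache.

```python
from collections import OrderedDict


class ToyPrefixCache:
    """Toy whole-sequence prefix cache with oldest-first eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        # Maps a completed sequence's tokens to a stand-in for its KV cache.
        self.entries = OrderedDict()

    def insert(self, tokens):
        if self.capacity == 0:
            return  # a capacity of 0 disables caching in this toy
        key = tuple(tokens)
        self.entries[key] = f"kv-for-{len(key)}-tokens"
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # drop the oldest entry outright

    def reusable_tokens(self, tokens):
        """Return how many leading tokens of a request could skip prefill."""
        best = 0
        for cached in self.entries:
            n = len(cached)
            # A cached sequence helps only if the request extends it exactly.
            if n <= len(tokens) and tuple(tokens[:n]) == cached:
                best = max(best, n)
        return best


cache = ToyPrefixCache(capacity=2)
cache.insert([1, 2, 3])   # e.g. a finished first chat turn
cache.insert([7, 8])
print(cache.reusable_tokens([1, 2, 3, 4, 5]))  # 3

cache.insert([9])         # over capacity: [1, 2, 3] is evicted
print(cache.reusable_tokens([1, 2, 3, 4, 5]))  # 0, so the request re-prefills
```

The second lookup shows why the cache size matters for multi-turn chat: once the earlier turn's entry is evicted, the follow-up request pays the full prefill again.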

Memory planning for automatic device mapping

Automatic device mapping reserves activation and KV memory for a worst case you declare: --max-seq-len (default 4096) and --max-batch-size (default 1). If you expect long prompts or large batches, raise these so layers are not placed too greedily; if memory is tight, the defaults already keep reservations small. See distributed inference for manual placement.

CPU threads and affinity

CPU inference uses Candle's thread pools. On Linux systems that expose heterogeneous CPU capacity through sysfs, mistral.rs defaults the CPU worker count to the number of high-capacity cores. On homogeneous systems it falls back to the physical core count.

Set RAYON_NUM_THREADS or CANDLE_NUM_THREADS to override the worker count:

CANDLE_NUM_THREADS=10 mistralrs bench --cpu -m meta-llama/Llama-3.2-3B-Instruct

For CPU affinity on Linux, use CANDLE_CPU_MASK with cpulist syntax. When no thread-count env var is set, the mask size also becomes the default worker count:

CANDLE_CPU_MASK=15-19 mistralrs bench --cpu -m meta-llama/Llama-3.2-3B-Instruct
CANDLE_CPU_MASK=5-9,15-19 mistralrs serve --cpu -m meta-llama/Llama-3.2-3B-Instruct
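
To check how many workers a given mask implies, a cpulist string can be expanded by hand. The helper below is a minimal sketch with a hypothetical name (`parse_cpulist`); it handles only single ids and inclusive ranges, not the stride form that Linux cpulists also allow, and it is not the parser mistral.rs or Candle uses.

```python
def parse_cpulist(spec):
    """Expand cpulist syntax such as "5-9,15-19" into a sorted list of CPU ids."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if end < start:
                raise ValueError(f"descending range in cpulist: {part!r}")
            cpus.update(range(start, end + 1))  # ranges are inclusive
        else:
            cpus.add(int(part))
    return sorted(cpus)


for spec in ("15-19", "5-9,15-19"):
    cpus = parse_cpulist(spec)
    print(spec, "->", len(cpus), "CPUs:", cpus)
# 15-19 -> 5 CPUs: [15, 16, 17, 18, 19]
# 5-9,15-19 -> 10 CPUs: [5, 6, 7, 8, 9, 15, 16, 17, 18, 19]
```

With no thread-count variable set, the two masks in the commands above would therefore default to 5 and 10 workers respectively.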

Hard affinity is not enabled by default because it can help one CPU/model pair and hurt another. To try the automatic high-capacity CPU mask without spelling out a list, set CANDLE_CPU_AFFINITY=1. The full list is in the environment variables reference.

CUDA defaults worth knowing

On CUDA, two decode fast paths are on by default and need no tuning: FlashInfer paged kernels and CUDA decode graphs. Their env-var switches (MISTRALRS_FLASHINFER_DECODE, MISTRALRS_CUDA_GRAPHS) exist only for debugging and benchmark comparisons; details are on the paged attention page.

Quantization is the other big throughput lever, traded against quality: see quantize a model.