
Sampling parameters

Sampling parameters control how the engine selects the next token from the model's probability distribution. They are set per request on every surface:

curl http://localhost:1234/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "default",
"messages": [{"role": "user", "content": "Write a haiku."}],
"temperature": 0.7,
"top_p": 0.9
}'

If a request sets no temperature, decoding is greedy: the most likely token is always picked and the probability filters below never run.

When a request is sampled, the engine applies filters in this order:

  1. Penalties, on the raw logits: DRY first, then frequency/presence/repetition in one pass, then logit bias.
  2. Custom logits processors (Rust SDK only).
  3. Temperature scaling and softmax. Temperature absent or 0.0 means greedy argmax, skipping step 4.
  4. On the resulting probabilities: top-k, then top-p, then min-p.

Penalties therefore act on the raw logits, before temperature is applied, whereas top-k, top-p, and min-p act afterwards, on probabilities rather than logits.
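The ordering above can be sketched in a few lines of Python. This is an illustrative model of the documented order, not the engine's code: `next_token_probs`, `logit_stages`, and `prob_filters` are hypothetical names, and renormalizing after the probability filters is an assumption.

```python
import math

def softmax(logits, temperature):
    # Divide by temperature, then normalize (max-subtraction for stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_probs(logits, temperature=None, logit_stages=(), prob_filters=()):
    """Hypothetical helper: returns the distribution the sampler would draw from."""
    logits = list(logits)
    # Steps 1-2: penalties, logit bias, custom processors all see raw logits.
    for stage in logit_stages:
        logits = stage(logits)
    # Step 3: temperature absent or ~0 means greedy argmax; step 4 is skipped.
    if temperature is None or temperature < 1e-7:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    probs = softmax(logits, temperature)
    # Step 4: top-k, top-p, min-p run on probabilities, in the order given.
    for keep in prob_filters:
        probs = keep(probs)
    total = sum(probs)  # assumption: survivors are renormalized before sampling
    return [p / total for p in probs]
```

Note that in this sketch the logit stages still run on the greedy path; only the probability filters are skipped.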

Temperature scales the logits before sampling. Higher values flatten the distribution; lower values sharpen it. 1.0 leaves the model's own distribution unchanged. Unset, 0.0, or any value below 1e-7 is treated as greedy.
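A small sketch of the effect, assuming the conventional formulation in which logits are divided by the temperature before softmax (`softmax_with_temperature` is a hypothetical name used only for illustration):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Assumed formulation: p_i = exp(z_i / T) / sum_j exp(z_j / T)
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)    # more mass on the top token
neutral = softmax_with_temperature(logits, 1.0)  # plain softmax
flat = softmax_with_temperature(logits, 2.0)     # mass spread more evenly
```

The top token's probability shrinks as the temperature rises, which is what "flattening" means here.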

Top-k keeps only the k most likely tokens. top_k <= 0 disables it.
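As an illustration, top-k on a probability vector might look like the following. `top_k_filter` is a hypothetical helper; how the engine breaks ties and whether it renormalizes the survivors are not specified here.

```python
def top_k_filter(probs, k):
    # k <= 0 disables the filter, as described above.
    if k <= 0 or k >= len(probs):
        return list(probs)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(ranked[:k])
    # Zero out everything outside the k most likely tokens.
    return [p if i in keep else 0.0 for i, p in enumerate(probs)]
```

For example, `top_k_filter([0.1, 0.5, 0.4], 2)` keeps the two most likely tokens and zeroes the first.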

Top-p (nucleus sampling) keeps the smallest set of tokens whose cumulative probability exceeds p. Values outside (0.0, 1.0) disable it.
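A minimal sketch of nucleus filtering under the description above. `top_p_filter` is a hypothetical name, and the exact boundary handling (strictly exceeding p versus reaching it) is an assumption taken from the wording of this page.

```python
def top_p_filter(probs, p):
    # Values outside (0.0, 1.0) disable the filter.
    if not (0.0 < p < 1.0):
        return list(probs)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in ranked:
        keep.add(i)
        cumulative += probs[i]
        if cumulative > p:  # smallest prefix whose mass exceeds p
            break
    return [q if i in keep else 0.0 for i, q in enumerate(probs)]
```

With probabilities `[0.5, 0.3, 0.15, 0.05]` and `p = 0.7`, the first two tokens carry 0.8 of the mass, so only they survive.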

Min-p scales with the model's confidence: tokens below min_p times the top token's probability are dropped. When the model is confident, min-p filters more tokens; when uncertain, fewer. Values outside (0.0, 1.0) disable it.
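The confidence-dependent behaviour is easiest to see on two toy distributions. The helper below is a hypothetical sketch of the rule as stated, not the engine's implementation.

```python
def min_p_filter(probs, min_p):
    # Values outside (0.0, 1.0) disable the filter.
    if not (0.0 < min_p < 1.0):
        return list(probs)
    threshold = min_p * max(probs)  # cutoff scales with the top token
    return [q if q >= threshold else 0.0 for q in probs]

confident = [0.90, 0.05, 0.03, 0.02]  # threshold 0.09 at min_p = 0.1
uncertain = [0.30, 0.30, 0.20, 0.20]  # threshold 0.03 at min_p = 0.1

kept_confident = sum(1 for q in min_p_filter(confident, 0.1) if q > 0)
kept_uncertain = sum(1 for q in min_p_filter(uncertain, 0.1) if q > 0)
```

With the same `min_p`, the confident distribution keeps a single candidate while the uncertain one keeps all four.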

Three distinct parameters discourage repetition; all see the full token context so far:

  • presence_penalty: flat logit subtraction for any token that has appeared at all (OpenAI-compatible).
  • frequency_penalty: logit subtraction proportional to a token's occurrence count (OpenAI-compatible).
  • repetition_penalty: llama.cpp-style multiplicative penalty on seen tokens; positive logits are divided by it, negative logits multiplied. 1.0 disables it. This is a separate parameter, not another name for presence_penalty.
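The difference between the three penalties can be sketched as follows. `apply_repetition_penalties` is a hypothetical helper; applying the multiplicative penalty before the two subtractions follows llama.cpp's ordering and is an assumption, since the order within the engine's single pass is not stated above.

```python
from collections import Counter

def apply_repetition_penalties(logits, context, presence_penalty=0.0,
                               frequency_penalty=0.0, repetition_penalty=1.0):
    """Return penalized logits; `context` is the list of token ids seen so far."""
    out = list(logits)
    for token, count in Counter(context).items():
        # repetition_penalty: divide positive logits, multiply negative ones.
        if out[token] > 0:
            out[token] /= repetition_penalty
        else:
            out[token] *= repetition_penalty
        # presence_penalty: flat subtraction, independent of the count.
        out[token] -= presence_penalty
        # frequency_penalty: subtraction proportional to the count.
        out[token] -= frequency_penalty * count
    return out

# Token 0 was seen twice, token 1 once, token 2 never.
penalized = apply_repetition_penalties(
    [2.0, -2.0, 1.0], context=[0, 0, 1],
    presence_penalty=0.5, frequency_penalty=0.25, repetition_penalty=2.0,
)
```

Unseen tokens are left untouched, and with the defaults (0.0, 0.0, 1.0) the logits pass through unchanged.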

DRY ("Don't Repeat Yourself") penalizes tokens that would extend a sequence already present in the preceding text. It is off by default (dry_multiplier: 0).

  • dry_multiplier: penalty strength; nonzero enables DRY.
  • dry_base: exponent base for penalty growth with match length. Default 1.75.
  • dry_allowed_length: match length tolerated before the penalty applies. Default 2.
  • dry_sequence_breakers: strings that reset matching. Default ["\n", ":", "\"", "*"].
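To show how these parameters interact, here is a rough sketch that assumes the commonly described DRY formulation, in which the penalty is `dry_multiplier * dry_base ** (match_length - dry_allowed_length)` once the match reaches the allowed length. The function name, the threshold comparison, and the treatment of breakers as whole tokens are all assumptions; the engine's implementation may differ.

```python
def dry_penalties(context, multiplier=0.0, base=1.75, allowed_length=2,
                  breakers=frozenset()):
    """Map each candidate token to the amount to subtract from its logit."""
    penalties = {}
    n = len(context)
    if multiplier == 0 or n == 0:
        return penalties  # DRY is disabled
    for i in range(1, n):
        # context[i] is a candidate: count how many tokens before it match
        # the end of the context, stopping at any sequence breaker.
        k = 0
        while (k < i and context[i - 1 - k] == context[n - 1 - k]
               and context[n - 1 - k] not in breakers):
            k += 1
        if k >= allowed_length:  # assumed threshold rule
            penalty = multiplier * base ** (k - allowed_length)
            token = context[i]
            penalties[token] = max(penalties.get(token, 0.0), penalty)
    return penalties

# The context ends in "a b", which was earlier followed by "c".
short_match = dry_penalties(["a", "b", "c", "x", "a", "b"], multiplier=0.8)
# A three-token match is penalized more strongly than a two-token one.
long_match = dry_penalties([1, 2, 3, 4, 9, 1, 2, 3], multiplier=0.8)
```

In this sketch the penalty grows geometrically with match length, so long verbatim repeats are suppressed much harder than short ones.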

Interactive mode (mistralrs run) exposes slash commands that persist for the rest of the session:

/temperature 0.7    set sampling temperature, range [0.0, 2.0]; 0 means greedy
/topk 40            set top-k, a positive integer
/topp 0.9           set top-p, in (0.0, 1.0]

Until overridden, interactive mode takes its initial sampling parameters from the model's generation_config.json when present; otherwise it uses temperature 0.1, top-k 32, top-p 0.1, min-p 0.05.

The random seed is set at engine startup, not per request: --seed on the CLI, seed= on the Python Runner. With the same seed, prompt, and parameters, output is reproducible.
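The reproducibility guarantee can be illustrated with a toy sampler. Python's `random.Random` stands in for the engine's random number generator here, and `sample_tokens` is a hypothetical helper.

```python
import random

def sample_tokens(probs, seed, n=5):
    # A generator seeded once, as at engine startup, yields the same draws
    # for the same distribution every time.
    rng = random.Random(seed)
    return rng.choices(range(len(probs)), weights=probs, k=n)

probs = [0.5, 0.3, 0.2]
first_run = sample_tokens(probs, seed=42)
second_run = sample_tokens(probs, seed=42)  # identical to first_run
```

Changing the seed, the distribution, or the number of draws breaks the match, which mirrors the requirement that seed, prompt, and parameters all stay the same.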