Sampling parameters
Sampling parameters control how the engine selects the next token from the model's probability distribution. They are set per request on every surface:
    curl http://localhost:1234/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "default",
        "messages": [{"role": "user", "content": "Write a haiku."}],
        "temperature": 0.7,
        "top_p": 0.9
      }'

If a request sets no temperature, decoding is greedy: the most likely token is always picked and the probability filters below never run.
Application order
When a request is sampled, the engine applies filters in this order:
1. Penalties, on the raw logits: DRY first, then frequency/presence/repetition in one pass, then logit bias.
2. Custom logits processors (Rust SDK only).
3. Temperature scaling and softmax. Temperature absent or 0.0 means greedy argmax, skipping step 4.
4. On the resulting probabilities: top-k, then top-p, then min-p.
Penalties therefore act before temperature, and the top-k/top-p/min-p trio act after it, on probabilities rather than logits.
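The ordering can be sketched in plain Python. This is an illustrative model of the stages described above, not the engine's code: the function name `next_token_distribution` is hypothetical, only logit bias stands in for the penalty stage, custom logits processors are omitted, and renormalizing once at the end is an assumption.

```python
import math


def next_token_distribution(logits, temperature=None, top_k=0, top_p=1.0,
                            min_p=0.0, logit_bias=None):
    """Return {token_id: probability} after the documented stages (sketch)."""
    logits = list(logits)

    # Stage 1: penalties act on raw logits. Only logit bias is modelled here.
    for token_id, bias in (logit_bias or {}).items():
        logits[token_id] += bias

    # Stage 3: absent or near-zero temperature means greedy argmax,
    # and the probability filters of stage 4 never run.
    if temperature is None or temperature < 1e-7:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = {i: e / total for i, e in enumerate(exps)}

    # Stage 4: top-k, then top-p, then min-p, all on probabilities.
    ranked = sorted(probs, key=probs.get, reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    if 0.0 < top_p < 1.0:
        kept, cumulative = [], 0.0
        for token_id in ranked:
            kept.append(token_id)
            cumulative += probs[token_id]
            if cumulative > top_p:
                break
        ranked = kept
    if 0.0 < min_p < 1.0:
        threshold = min_p * probs[ranked[0]]
        ranked = [t for t in ranked if probs[t] >= threshold]

    # Assumption: survivors are renormalized once, after all filters.
    mass = sum(probs[t] for t in ranked)
    return {t: probs[t] / mass for t in ranked}
```

Calling `next_token_distribution([1.0, 3.0, 2.0])` with no temperature returns a single token with probability 1.0, which mirrors the greedy shortcut; passing `temperature=1.0, top_k=2` leaves the two most likely tokens.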
Temperature, top-p, top-k, min-p
Temperature scales the logit distribution before sampling. Higher temperature flattens it; lower temperature sharpens it. 0.0 (or unset) is greedy; 1.0 matches the model's training distribution. Values below 1e-7 are treated as greedy.
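As a small illustration of the flattening and sharpening effect, the sketch below divides logits by the temperature before a softmax. The helper name `softmax_with_temperature` and the example logits are made up for this example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)    # lower temperature
neutral = softmax_with_temperature(logits, 1.0)  # distribution unchanged
flat = softmax_with_temperature(logits, 2.0)     # higher temperature

# The top token's share shrinks as the temperature rises.
print(sharp[0] > neutral[0] > flat[0])
```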
Top-k keeps only the k most likely tokens. top_k <= 0 disables it.
Top-p (nucleus sampling) keeps the smallest set of tokens whose cumulative probability exceeds p. Values outside (0.0, 1.0) disable it.
Min-p scales with the model's confidence: tokens below min_p times the top token's probability are dropped. When the model is confident, min-p filters more tokens; when uncertain, fewer. Values outside (0.0, 1.0) disable it.
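The way min-p adapts to confidence is easiest to see on two toy distributions. The function `min_p_filter` and the probabilities below are hypothetical; the sketch only follows the rule stated above (drop tokens below min_p times the top probability) and does not renormalize.

```python
def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p times the top one."""
    if not 0.0 < min_p < 1.0:
        return dict(probs)  # values outside (0.0, 1.0) disable the filter
    threshold = min_p * max(probs.values())
    return {token: p for token, p in probs.items() if p >= threshold}


confident = {"the": 0.90, "a": 0.06, "an": 0.03, "this": 0.01}
uncertain = {"the": 0.30, "a": 0.28, "an": 0.22, "this": 0.20}

kept_confident = min_p_filter(confident, 0.1)  # threshold is about 0.09
kept_uncertain = min_p_filter(uncertain, 0.1)  # threshold is about 0.03
```

With the same `min_p` of 0.1, the confident distribution keeps only its top token while the uncertain one keeps all four.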
Penalties
Three distinct parameters discourage repetition; all see the full token context so far:
- presence_penalty: flat logit subtraction for any token that has appeared at all (OpenAI-compatible).
- frequency_penalty: logit subtraction proportional to a token's occurrence count (OpenAI-compatible).
- repetition_penalty: llama.cpp-style multiplicative penalty on seen tokens; positive logits are divided by it, negative logits multiplied. 1.0 disables it. This is a separate parameter, not another name for presence_penalty.
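A rough sketch of how the three penalties differ, written from the descriptions above. The function `apply_penalties` is hypothetical, and applying the additive penalties before the multiplicative one within the pass is an assumption rather than something this page specifies.

```python
from collections import Counter


def apply_penalties(logits, context, presence_penalty=0.0,
                    frequency_penalty=0.0, repetition_penalty=1.0):
    """Return penalized logits for token ids that appear in `context`."""
    counts = Counter(context)
    out = list(logits)
    for token_id, count in counts.items():
        # Presence: flat subtraction. Frequency: scaled by occurrence count.
        out[token_id] -= presence_penalty + frequency_penalty * count
        # Repetition: divide positive logits, multiply negative ones.
        # Assumption: this sees the already-shifted logit.
        if out[token_id] > 0:
            out[token_id] /= repetition_penalty
        else:
            out[token_id] *= repetition_penalty
    return out


logits = [2.0, 2.0, -1.0]
context = [0, 0, 2]  # token 0 seen twice, token 2 once, token 1 never

additive = apply_penalties(logits, context,
                           presence_penalty=0.5, frequency_penalty=0.25)
multiplicative = apply_penalties(logits, context, repetition_penalty=2.0)
```

Token 1 never appears in the context, so its logit is untouched in both results; token 0 is pushed down harder than token 2 by the frequency term because it occurs twice.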
DRY (Don't Repeat Yourself)
DRY penalizes continuing token sequences that would reproduce spans from the preceding text. Off by default (dry_multiplier: 0).
- dry_multiplier: penalty strength; nonzero enables DRY.
- dry_base: exponent base for penalty growth with match length. Default 1.75.
- dry_allowed_length: match length tolerated before the penalty applies. Default 2.
- dry_sequence_breakers: strings that reset matching. Default ["\n", ":", "\"", "*"].
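The sketch below shows one way these four parameters could interact. It assumes the formula from the original DRY sampler proposal, penalty = multiplier * base^(match_length - allowed_length), applied once the match length reaches the allowed length; this page does not state the exact formula, so treat both the formula and the hypothetical function `dry_penalties` as illustrative. Breakers are compared per token here, whereas the parameter takes strings.

```python
def dry_penalties(context, dry_multiplier=0.0, dry_base=1.75,
                  dry_allowed_length=2, breakers=frozenset()):
    """Return {token: penalty} for tokens that would extend a repeated span."""
    if dry_multiplier == 0 or not context:
        return {}
    penalties = {}
    last = len(context) - 1
    for i in range(last):
        # Count how many tokens ending at position i match the context's
        # tail, walking backwards and stopping at any sequence breaker.
        n = 0
        while (n <= i and context[i - n] == context[last - n]
               and context[i - n] not in breakers):
            n += 1
        if n >= dry_allowed_length:
            # The token that followed the earlier span would repeat it.
            candidate = context[i + 1]
            penalty = dry_multiplier * dry_base ** (n - dry_allowed_length)
            penalties[candidate] = max(penalties.get(candidate, 0.0), penalty)
    return penalties


# "a b c" occurred earlier and was followed by "x"; the text ends in "a b c".
context = ["a", "b", "c", "x", "a", "b", "c"]
result = dry_penalties(context, dry_multiplier=0.8)
```

Here the tail matches an earlier span of length 3, one more than the default allowed length, so only `"x"` receives a penalty and it is scaled by a single factor of `dry_base`.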
Setting parameters
Interactive mode (mistralrs run) exposes slash commands that persist for the rest of the session:

    /temperature 0.7    set sampling temperature, range [0.0, 2.0]; 0 means greedy
    /topk 40            set top-k, a positive integer
    /topp 0.9           set top-p, in (0.0, 1.0]

Until overridden, interactive mode seeds its sampling from the model's generation_config.json when present, else temperature 0.1, top-k 32, top-p 0.1, min-p 0.05.
All parameters are top-level JSON fields on /v1/chat/completions and /v1/completions:

    {
      "model": "default",
      "messages": [{"role": "user", "content": "Write a haiku."}],
      "temperature": 0.7,
      "top_k": 40,
      "top_p": 0.9,
      "min_p": 0.05,
      "presence_penalty": 0.5,
      "repetition_penalty": 1.1,
      "dry_multiplier": 0.8
    }

The same names are fields on ChatCompletionRequest:

    request = ChatCompletionRequest(
        model="default",
        messages=[{"role": "user", "content": "Write a haiku."}],
        temperature=0.7,
        top_k=40,
        top_p=0.9,
        min_p=0.05,
        presence_penalty=0.5,
        repetition_penalty=1.1,
    )

RequestBuilder has per-parameter setters, or pass a whole SamplingParams:

    let request = RequestBuilder::new()
        .add_message(TextMessageRole::User, "Write a haiku.")
        .set_sampler_temperature(0.7)
        .set_sampler_topk(40)
        .set_sampler_topp(0.9)
        .set_sampler_presence_penalty(0.5);

The random seed is set at engine startup, not per request: --seed on the CLI, seed= on the Python Runner. With the same seed, prompt, and parameters, output is reproducible.