Runner
Runner.__init__
__init__(
    which: Which,
    max_seqs: int = 16,
    no_kv_cache: bool = False,
    prefix_cache_n: int = 16,
    token_source: str = 'cache',
    speculative_gamma: int = 32,
    which_draft: Which | None = None,
    chat_template: str | None = None,
    jinja_explicit: str | None = None,
    num_device_layers: list[str] | None = None,
    in_situ_quant: str | None = None,
    anymoe_config: AnyMoeConfig | None = None,
    pa_gpu_mem: int | None = None,
    pa_gpu_mem_usage: float | None = None,
    pa_ctxt_len: int | None = None,
    pa_blk_size: int | None = None,
    pa_cache_type: PagedCacheType | None = None,
    no_paged_attn: bool = False,
    paged_attn: bool = False,
    seed: int | None = None,
    enable_search: bool = False,
    search_embedding_model: str | None = None,
    search_callback: Callable[[str], list[dict[str, str]]] | None = None,
    tool_callbacks: Mapping[str, Callable[[str, dict], str]] | None = None,
    mcp_client_config: McpClientConfigPy | None = None,
    code_execution_config: CodeExecutionConfig | None = None,
) -> None
Load a model.
Parameters
- `which`: the model to load, or the target model in the case of speculative decoding.
- `max_seqs`: how many sequences may be running at any time.
- `no_kv_cache`: disables the KV cache.
- `prefix_cache_n`: the number of sequences to hold in the device prefix cache; others are evicted to CPU.
- `token_source`: where to load the HF token from. The format is “literal:”, “env:”, or “path:” followed by the corresponding value, “cache” to use a cached token, or “none” to use no token.
- `speculative_gamma`: the `gamma` parameter for speculative decoding, the number of draft tokens to generate before calling the target model. Ignored if `which_draft` is not specified.
- `which_draft`: the draft model to load. Setting this parameter causes a speculative decoding model to be loaded, with `which` as the target (higher quality) model and `which_draft` as the draft (lower quality) model.
- `chat_template`: an optional JINJA chat template as a JSON file. The template should take `messages`, `add_generation_prompt`, `bos_token`, `eos_token`, and `unk_token` as inputs. It is used if the automatic deserialization fails. If the value ends with `.json` (i.e., it is a file), that template is loaded.
- `jinja_explicit`: an explicit JINJA chat template file. If specified, this overrides all other chat templates.
- `num_device_layers`: the number of layers to load and run on each device. Each element has the format ORD:NUM, where ORD is the device ordinal and NUM is the corresponding number of layers. Note: this is deprecated in favor of automatic device mapping.
- `in_situ_quant`: the optional in-situ quantization for the model.
- `anymoe_config`: the AnyMoE config. If set, the model is loaded as an AnyMoE model.
- `pa_gpu_mem`: GPU memory to allocate for the KV cache with PagedAttention, in MB.
- `pa_gpu_mem_usage`: fraction of GPU memory to utilize after allocation of the KV cache with PagedAttention, from 0 to 1. If not set and the device is CUDA, it defaults to 0.9.
- `pa_ctxt_len`: total context length to allocate the KV cache for (the total number of tokens the KV cache can hold). This is the default setting, and it defaults to the `max-seq-len` specified after the model type.
- `pa_blk_size`: the block size (number of tokens per block) for PagedAttention. If not set and the device is CUDA, it defaults to 32.
- `pa_cache_type`: the PagedAttention KV cache type (auto or f8e4m3). Defaults to auto.
- `no_paged_attn`: disables PagedAttention on CUDA. Because PagedAttention is already disabled on Metal, this only applies on CUDA.
- `paged_attn`: enables PagedAttention on Metal. Because PagedAttention is already enabled on CUDA, this only applies on Metal.
- `seed`: the seed used to ensure reproducible random number generation.
- `enable_search`: enables searching compatible with the OpenAI `web_search_options` setting. This loads the selected search embedding reranker (EmbeddingGemma by default).
- `search_embedding_model`: which built-in search embedding model to load (currently "embedding_gemma").
- `search_callback`: custom Python callable to perform web searches. It should accept a query string and return a list of dicts with the keys “title”, “description”, “url”, and “content”.
- `tool_callbacks`: mapping from tool name to a Python callable invoked for generic tool calls. Each callable receives the tool name and a dict of arguments and should return the tool output as a string.
- `code_execution_config`: enables the built-in Python code execution tool. Pass a `CodeExecutionConfig` to configure the interpreter, per-call timeout, and working directory. Per request, set `ChatCompletionRequest.enable_code_execution=True`.
PagedAttention is supported on CUDA and Metal. It is activated automatically on CUDA but not on Metal. Among the KV cache sizing parameters, the priority is `pa_ctxt_len` > `pa_gpu_mem_usage` > `pa_gpu_mem`.
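The callable shapes described for `search_callback` and `tool_callbacks` can be sketched as follows. This is a minimal illustration under stated assumptions: the names `my_search`, `get_time`, and the `"get_time"` tool are hypothetical, and the returned data is a placeholder. Only the call signatures and the required result keys come from the parameter descriptions.

```python
import json
from collections.abc import Callable, Mapping


def my_search(query: str) -> list[dict[str, str]]:
    """Hypothetical search callback: takes a query, returns result dicts."""
    # A real callback would query a search backend; this returns a
    # placeholder so that the four required keys are visible.
    return [
        {
            "title": f"Result for {query}",
            "description": "Placeholder description",
            "url": "https://example.com",
            "content": "Placeholder page content",
        }
    ]


def get_time(name: str, args: dict) -> str:
    """Hypothetical tool callback: receives the tool name and its arguments."""
    # The tool output must be a string, so structured data is serialized.
    return json.dumps({"tool": name, "timezone": args.get("timezone", "UTC")})


# Mapping from tool name to callable, in the shape `tool_callbacks` expects.
tool_callbacks: Mapping[str, Callable[[str, dict], str]] = {"get_time": get_time}
```

These objects could then be passed to `Runner.__init__` as `search_callback=my_search` and `tool_callbacks=tool_callbacks`.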
Runner.send_chat_completion_request
send_chat_completion_request(request: ChatCompletionRequest, model_id: str | None = None) -> ChatCompletionResponse | Iterator[ChatCompletionChunkResponse]
Send a chat completion request to the mistral.rs engine, returning the response object or a generator over chunk objects.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| request | ChatCompletionRequest | required | The chat completion request. |
| model_id | str \| None | None | Optional model ID to send the request to. If None, uses the default model. |
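Because the return value is either a single response object or a generator over chunks, calling code usually needs to handle both. The helper below is a sketch, not part of the library: `collect_text` is a hypothetical name, and the attribute paths `choices[0].message.content` and `choices[0].delta.content` are assumptions based on the OpenAI-style response layout rather than anything stated on this page.

```python
def collect_text(result) -> str:
    """Return the full text from either a response object or a chunk iterator."""
    # Assumption: a non-streaming response exposes `choices` directly;
    # anything else is treated as an iterator over chunk objects.
    if hasattr(result, "choices"):
        return result.choices[0].message.content or ""
    parts = []
    for chunk in result:
        # Assumption: each chunk carries its text in `choices[0].delta.content`,
        # which may be None for chunks that carry no text.
        piece = chunk.choices[0].delta.content
        if piece:
            parts.append(piece)
    return "".join(parts)
```

Under those assumptions, `collect_text(runner.send_chat_completion_request(request))` would yield the same string whether or not the request streams.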
Runner.send_completion_request
send_completion_request(request: CompletionRequest, model_id: str | None = None) -> CompletionResponse
Send a completion request to the mistral.rs engine, returning the response object.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| request | CompletionRequest | required | The completion request. |
| model_id | str \| None | None | Optional model ID to send the request to. If None, uses the default model. |
Runner.send_embedding_request
send_embedding_request(request: EmbeddingRequest, model_id: str | None = None) -> list[list[float]]
Generate embeddings for the supplied inputs and return one embedding vector per input.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| request | EmbeddingRequest | required | The embedding request. |
| model_id | str \| None | None | Optional model ID to send the request to. If None, uses the default model. |
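Since the result is a plain `list[list[float]]` with one vector per input, it can be compared without extra dependencies. The following is a minimal sketch of one common use, cosine similarity ranking; `cosine_similarity` and `most_similar` are hypothetical helper names, not library functions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query: list[float], candidates: list[list[float]]) -> int:
    """Index of the candidate vector closest to `query`."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```

With embeddings for a query and several documents in hand, `most_similar(query_vec, doc_vecs)` picks the best-matching document index.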
Runner.generate_image
generate_image(prompt: str, response_format: ImageGenerationResponseFormat, height: int = 720, width: int = 1280, model_id: str | None = None, save_file: str | None = None) -> ImageGenerationResponse
Generate an image.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| prompt | str | required | The image generation prompt. |
| response_format | ImageGenerationResponseFormat | required | The response format (Url or B64Json). |
| height | int | 720 | Image height in pixels. |
| width | int | 1280 | Image width in pixels. |
| model_id | str \| None | None | Optional model ID to send the request to. If None, uses the default model. |
| save_file | str \| None | None | Optional path where the PNG is written when response_format is Url. Defaults to an auto-generated filename. |
Runner.generate_audio
generate_audio(prompt: str, model_id: str | None = None) -> SpeechGenerationResponse
Generate audio from a (model-specific) prompt. The response contains the PCM data, the sampling rate, and the number of channels.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| prompt | str | required | The audio generation prompt. |
| model_id | str \| None | None | Optional model ID to send the request to. If None, uses the default model. |
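The returned PCM data, sampling rate, and channel count are enough to write a playable file with the standard library. The sketch below makes two assumptions that this page does not confirm: that the samples are floats in the range [-1.0, 1.0], and that multi-channel audio is interleaved. `write_wav` is a hypothetical helper that takes the three values directly, so it does not depend on the field names of `SpeechGenerationResponse`.

```python
import struct
import wave


def write_wav(path: str, pcm: list[float], rate: int, channels: int) -> None:
    """Write float PCM samples (assumed to lie in [-1.0, 1.0]) as 16-bit WAV."""
    # Clamp each sample, then scale it to the signed 16-bit range.
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in pcm]
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        # Pack all samples as little-endian signed 16-bit integers.
        wav.writeframes(struct.pack(f"<{len(ints)}h", *ints))
```

If the samples turn out to be integers already, the clamping and scaling step would need to be dropped.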
Runner.send_re_isq
send_re_isq(dtype: str, model_id: str | None = None) -> None
Send a request to re-ISQ the model. If the model was loaded as GGUF or GGML, nothing happens.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| dtype | str | required | The ISQ dtype (e.g., “Q4K”, “Q8_0”). |
| model_id | str \| None | None | Optional model ID to re-ISQ. If None, uses the default model. |
Runner.tokenize_text
tokenize_text(text: str, add_special_tokens: bool, enable_thinking: bool | None, model_id: str | None = None) -> list[int]
Tokenize some text, returning raw tokens.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| text | str | required | The text to tokenize. |
| add_special_tokens | bool | required | Whether to add special tokens. |
| enable_thinking | bool \| None | required | Enables thinking for models that support this configuration. |
| model_id | str \| None | None | Optional model ID to use for tokenization. If None, uses the default model. |
Runner.detokenize_text
detokenize_text(tokens: list[int], skip_special_tokens: bool, model_id: str | None = None) -> str
Detokenize some tokens, returning text.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| tokens | list[int] | required | The tokens to detokenize. |
| skip_special_tokens | bool | required | Whether to skip special tokens. |
| model_id | str \| None | None | Optional model ID to use for detokenization. If None, uses the default model. |
Runner.max_sequence_length
max_sequence_length(model_id: str | None = None) -> int | None
Return the maximum supported sequence length for the current or specified model, or None when the concept does not apply (such as diffusion or speech models).
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| model_id | str \| None | None | Optional model ID to query. If None, uses the default model. |
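`tokenize_text` and `max_sequence_length` can be combined to check a prompt against the model's limit before sending it. The helper below is a sketch: `fits_in_context` and its `reserve` parameter are hypothetical, `runner` is assumed to be a `Runner` instance, and the choice of `add_special_tokens=True` with `enable_thinking=None` is only illustrative.

```python
from typing import Optional


def fits_in_context(runner, text: str, reserve: int = 0,
                    model_id: Optional[str] = None) -> bool:
    """Check whether `text` plus `reserve` output tokens fits the model's limit."""
    limit = runner.max_sequence_length(model_id)
    if limit is None:
        # The model has no sequence-length concept (e.g. diffusion or speech).
        return True
    # Arguments in signature order: text, add_special_tokens, enable_thinking, model_id.
    tokens = runner.tokenize_text(text, True, None, model_id)
    return len(tokens) + reserve <= limit
```

Note that a chat request is rendered through a chat template before tokenization, so the count for raw text is an approximation of what a chat completion would consume.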
Runner.list_models
list_models() -> list[str]
List all available model IDs (aliases if configured).
Returns: A list of model ID strings.
Runner.get_default_model_id
get_default_model_id() -> str | None
Get the current default model ID.
Returns: The default model ID, or None if no default is set.
Runner.set_default_model_id
set_default_model_id(model_id: str) -> None
Set the default model ID. The model must already be loaded.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| model_id | str | required | The model ID to set as default. |
Raises: ValueError: If the model ID is not found.
Runner.is_model_loaded
is_model_loaded(model_id: str) -> bool
Check if a model is currently loaded in memory.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| model_id | str | required | The model ID to check. |
Returns: True if the model is loaded, False otherwise.
Runner.unload_model
unload_model(model_id: str) -> None
Unload a model from memory while preserving its configuration for later reload. The model can be reloaded manually with reload_model() or automatically when a request is sent to it.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| model_id | str | required | The model ID to unload. |
Runner.reload_model
reload_model(model_id: str) -> None
Manually reload a previously unloaded model.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| model_id | str | required | The model ID to reload. |
Runner.list_models_with_status
list_models_with_status() -> list[tuple[str, str]]
List all models with their current status.
Returns: A list of (model_id, status) tuples where status is one of:
- “loaded”: Model is loaded and ready
- “unloaded”: Model is unloaded but can be reloaded
- “reloading”: Model is currently being reloaded
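The `(model_id, status)` tuples are easy to regroup for display or monitoring. A minimal sketch, using a hypothetical helper name `group_by_status`:

```python
from collections import defaultdict


def group_by_status(entries: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (model_id, status) tuples into {status: [model_id, ...]}."""
    grouped = defaultdict(list)
    for model_id, status in entries:
        grouped[status].append(model_id)
    # Return a plain dict so missing statuses raise KeyError instead of
    # silently creating empty lists.
    return dict(grouped)
```

For example, `group_by_status(runner.list_models_with_status()).get("unloaded", [])` would give the models that are currently unloaded, assuming `runner` is a `Runner` instance.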
Runner.list_unloaded_models
list_unloaded_models() -> list[str]
List model IDs that are currently unloaded (but can be reloaded).
Runner.get_model_status
get_model_status(model_id: str) -> str | None
Get the status of a model: “loaded”, “unloaded”, “reloading”, or None if not found.
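The status strings can drive an explicit warm-up step before latency-sensitive requests. The helper below is a sketch with a hypothetical name, `ensure_loaded`, and it assumes `runner` is a `Runner` instance; raising `KeyError` for an unknown model is a design choice of this example, not library behavior.

```python
def ensure_loaded(runner, model_id: str) -> str:
    """Reload `model_id` if it is unloaded; return the status seen beforehand."""
    status = runner.get_model_status(model_id)
    if status is None:
        raise KeyError(f"unknown model: {model_id}")
    if status == "unloaded":
        # Trigger the reload explicitly instead of waiting for the first request.
        runner.reload_model(model_id)
    return status
```

Since an unloaded model is also reloaded automatically when a request is sent to it, a helper like this is only useful for paying the reload cost ahead of time.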
Runner.remove_model
remove_model(model_id: str) -> None
Remove a model by ID in multi-model mode.
Runner.send_chat_completion_request_to_model
send_chat_completion_request_to_model(request: ChatCompletionRequest, model_id: str) -> ChatCompletionResponse | Iterator[ChatCompletionChunkResponse]
Send a chat completion request to a specific model, returning the response object or a generator over chunk objects.
Runner.send_completion_request_to_model
send_completion_request_to_model(request: CompletionRequest, model_id: str) -> CompletionResponse
Send a completion request to a specific model.
Runner.export_session
export_session(session_id: str, model_id: str | None = None) -> str | None
Export an agentic session by ID as a JSON string.
Returns None if the session does not exist.
Runner.import_session
import_session(session_id: str, session_json: str, model_id: str | None = None) -> None
Import an agentic session from a JSON string.
Replaces any existing session with the same ID.
Runner.delete_session
delete_session(session_id: str, model_id: str | None = None) -> bool
Delete an agentic session. Returns whether the session existed.
Runner.list_session_ids
list_session_ids(model_id: str | None = None) -> list[str]
List all stored agentic session IDs.
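The four session methods together allow a simple backup and restore of all sessions. The following is a sketch under the assumption that `runner` is a `Runner` instance; `export_all_sessions` and `import_all_sessions` are hypothetical helper names.

```python
from typing import Optional


def export_all_sessions(runner, model_id: Optional[str] = None) -> dict[str, str]:
    """Collect every stored session as {session_id: session_json}."""
    sessions = {}
    for session_id in runner.list_session_ids(model_id):
        payload = runner.export_session(session_id, model_id)
        # export_session returns None if the session no longer exists.
        if payload is not None:
            sessions[session_id] = payload
    return sessions


def import_all_sessions(runner, sessions: dict[str, str],
                        model_id: Optional[str] = None) -> None:
    """Restore sessions; an existing session with the same ID is replaced."""
    for session_id, payload in sessions.items():
        runner.import_session(session_id, payload, model_id)
```

The resulting dict holds only strings, so it can be written to disk with `json.dump` and restored later.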
Runner.find_file
find_file(file_id: str) -> File | None
Look up a produced file by ID. Returns the full body even if the file was wire-truncated in the response payload.
Generated from mistralrs-pyo3/mistralrs.pyi.