Agentic runtime for apps
mistral.rs can act as a local-first runtime for agent applications. A single runtime request can include:
- Model generation (chat-completion responses and chunks).
- Server-side tool execution.
- Python code execution, sandboxed by default on Linux and macOS.
- Shell execution, sandboxed by default on Linux and macOS.
- OpenAI-compatible Skills.
- OpenAI-compatible file inputs.
- Web search.
- Generated images or video frames from tools.
- Persistent session state.
The most complete app-facing event stream today is /v1/chat/completions with stream: true. It emits normal OpenAI-compatible chunks plus mistral.rs agentic_tool_call_progress Server-Sent Events (SSE).
| Runtime part | What mistral.rs provides |
|---|---|
| Model output | Chat-completion responses and streaming chunks. |
| Tool execution | Built-in search, code execution, shell, OpenAI-compatible Skills, MCP (Model Context Protocol) tools, callbacks, or HTTP tool dispatch. |
| Generated media | Captured images and video frames from tools as base64 fields. |
| Files | User-provided input files plus generated output files in the same /v1/files registry. |
| Session state | Reusable session_id values for multi-turn tool and code state. |
Use this when an app wants inference and tool execution in one process rather than running its own tool loop around a model server. Built-in runtime tools are strict by default; whether an action may run at all is governed by permissions and approvals.
How the loop runs
The server-side loop engages for a chat request when any of these hold:
- The request sets web_search_options (advertises the web search tools).
- The request includes tools: [{"type":"code_interpreter","container":{"type":"auto"}}] on a server or runner with code execution enabled.
- The request includes tools: [{"type":"shell","environment":{"type":"container_auto"}}] on the Responses API, or the SDK request enables shell.
- The request carries tools and server-side executors exist for them (SDK tool_callbacks or connected MCP tools).
- The request sets max_tool_rounds, or the server has a --tool-dispatch-url.
Otherwise the request is dispatched normally: the model’s tool_calls field is returned to the client and the client runs the next round (the standard OpenAI-compatible flow).
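The conditions above can be read as a single boolean check. The sketch below is only an illustration of that reading, not the server's code: ServerConfig and loop_engages are hypothetical names, and the shell check is simplified because it ignores which API surface the request arrived on.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerConfig:
    """Hypothetical summary of server-side state; not a mistral.rs type."""
    code_execution_enabled: bool = False
    has_tool_executors: bool = False  # SDK tool_callbacks or connected MCP tools
    tool_dispatch_url: Optional[str] = None


def loop_engages(request: dict, server: ServerConfig) -> bool:
    """Return True if any of the listed conditions holds for this request."""
    tools = request.get("tools") or []
    tool_types = {t.get("type") for t in tools if isinstance(t, dict)}
    return any([
        "web_search_options" in request,
        "code_interpreter" in tool_types and server.code_execution_enabled,
        # Simplification: the API surface (Responses vs. SDK) is not modeled.
        "shell" in tool_types,
        bool(tools) and server.has_tool_executors,
        "max_tool_rounds" in request,
        server.tool_dispatch_url is not None,
    ])
```

A request that matches none of these falls through to the standard client-driven flow described above.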
Each round:
- The engine runs inference. The result either contains tool calls or does not.
- No tool calls: the loop exits and the response is forwarded to the client.
- The loop emits a progress event with phase calling and the tool arguments.
- The tool is executed through one of the paths above (built-in search, code execution, shell, file helpers, a registered callback, or a POST to the dispatch URL). If the model returns more than one tool call, only the first is executed and a warning is logged.
- The loop emits a progress event with phase complete and the structured result.
- The message history is extended with the assistant’s tool-call message and a tool-role response, so the next inference pass sees the outcome.
- If the round counter reaches the cap, the loop exits without another tool opportunity.
The cap and dispatch URL are configured on the tool calling page. At termination, the expanded message list is written back to the session, so the next request with the same session id sees the synthesized tool messages as history.
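The round structure can be simulated in a few lines. This is a minimal sketch with hypothetical names (run_tool_loop, infer, execute_tool) and simplified message shapes. It assumes the loop stops right after the round that reaches the cap and returns the last model reply; what the real server returns at the cap is not specified here.

```python
from typing import Callable, Optional


def run_tool_loop(
    messages: list,
    infer: Callable[[list], dict],
    execute_tool: Callable[[dict], str],
    max_tool_rounds: int,
) -> tuple:
    """Simulate the rounds; returns (last reply, expanded history, events)."""
    history = list(messages)
    events: list = []
    reply: Optional[dict] = None
    rounds_done = 0
    while rounds_done < max_tool_rounds:
        reply = infer(history)
        calls = reply.get("tool_calls") or []
        if not calls:
            break  # no tool calls: the reply goes straight to the client
        call = calls[0]  # only the first tool call in a reply is executed
        events.append({"round": rounds_done, "phase": "calling", "data": call})
        result = execute_tool(call)
        events.append({"round": rounds_done, "phase": "complete", "data": result})
        # Extend the history so the next inference pass sees the outcome.
        history.append({"role": "assistant", "tool_calls": [call]})
        history.append({"role": "tool", "content": result})
        rounds_done += 1  # assumption: the loop ends once this reaches the cap
    return reply, history, events
```

The returned history plays the role of the expanded message list that is written back to the session at termination.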
HTTP run stream
Start a server with the tools your app is allowed to use:

    mistralrs serve --agent -m google/gemma-4-E4B-it

(--agent enables search, code execution, and shell; see build an agent.)
Send a streaming chat-completions request:
    {
      "model": "default",
      "stream": true,
      "messages": [
        {"role": "user", "content": "Use Python to plot sin(x), then explain the chart."}
      ],
      "tools": [{"type": "code_interpreter", "container": {"type": "auto"}}],
      "web_search_options": {},
      "max_tool_rounds": 4,
      "session_id": "analysis-demo"
    }

Model output arrives as standard chat-completion chunks. Tool progress arrives as named SSE events with round, an opaque tool_name for correlation, phase (calling or complete), and tool-type-specific data:
    event: agentic_tool_call_progress
    data: {"type":"agentic_tool_call_progress","round":0,"tool_name":"<tool identifier>","phase":"calling","data":{"tool_type":"code_execution","code":"print('hello')"}}

Complete events carry tool-type-specific payloads:
- Code execution: stdout, stderr, images_base64, video_frames_base64, working_directory, execution_time_ms.
- Shell: commands, stdout, stderr, exit_code, timed_out, and status.
- Web search: query, results_count.
- Custom tools: arguments, content.
The full event tables are in the HTTP API reference. Non-streaming responses include the same information as an agentic_tool_calls array.
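On the client side, the stream can be split by SSE event name. The sketch below is a small standard-library parser, not part of any mistral.rs SDK: iter_sse_events and group_stream are hypothetical helpers, unnamed events are filed under the SSE default name "message", and the stream is assumed to end with the OpenAI-style [DONE] sentinel.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[tuple]:
    """Yield (event_name, data) pairs from the text lines of an SSE stream."""
    name, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:  # a blank line terminates one event
            if data:
                yield name, "\n".join(data)
            name, data = "message", []
        elif line.startswith(":"):  # comment or keep-alive line
            continue
        elif line.startswith("event:"):
            name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # SSE strips a single leading space after the colon.
            data.append(value[1:] if value.startswith(" ") else value)
    if data:
        yield name, "\n".join(data)


def group_stream(lines: Iterable[str]) -> dict:
    """Group decoded JSON payloads by event name until [DONE] is seen."""
    grouped: dict = {}
    for name, data in iter_sse_events(lines):
        if data == "[DONE]":  # assumed OpenAI-compatible end-of-stream marker
            break
        grouped.setdefault(name, []).append(json.loads(data))
    return grouped
```

With this grouping, model chunks land under "message" and tool progress under "agentic_tool_call_progress"; any other named event keeps its own key. When reading from an HTTP response, decode each line from bytes to str before passing it in.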
A File is a typed output produced by a tool, typically code execution or shell. Each file has a stable id, a name, a format, a mime type, a size in bytes, and either an inline body or a reference for fetching it. Files are first-class on the wire: they ride alongside the model transcript rather than being buried inside tool output strings.
Declare required outputs on the request to give the model a contract:
    {
      "model": "default",
      "messages": [
        {"role": "user", "content": "Generate a sin(x) plot and a CSV of the samples."}
      ],
      "tools": [{"type": "code_interpreter", "container": {"type": "auto"}}],
      "files": [
        {"name": "plot.png", "format": "png"},
        {"name": "samples.csv", "format": "csv", "description": "x, sin(x) columns"}
      ]
    }

Chat Completions and Anthropic Messages carry produced files in a top-level files array; when streaming, each file is emitted as soon as it is produced via a file_produced SSE event. Each agentic_tool_calls[*] record gains a file_ids field listing the files attributable to that round, so apps can correlate files with the tool that wrote them.
Responses follows the OpenAI artifact shape: produced files are attached to assistant output_text content as container_file_citation annotations. The same bytes remain available through GET /v1/files/{id}/content; OpenAI-style clients can also fetch them through GET /v1/containers/{container_id}/files/{file_id}/content.
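For the Chat Completions shape, correlating files with tool calls is a join between the top-level files array and each record's file_ids. The sketch below assumes a non-streaming response already parsed into a dict and assumes the JSON keys id and name on each file entry; files_by_tool_call is a hypothetical helper, and the exact schema is in the HTTP API reference.

```python
def files_by_tool_call(response: dict) -> dict:
    """Map each agentic_tool_calls index to the names of the files it produced."""
    # Assumed keys: each entry in "files" has "id" and "name".
    names = {f["id"]: f["name"] for f in response.get("files", [])}
    result = {}
    for index, call in enumerate(response.get("agentic_tool_calls", [])):
        # Fall back to the raw id if a file is not listed at the top level.
        result[index] = [names.get(fid, fid) for fid in call.get("file_ids", [])]
    return result
```

An app can use the result to render each produced file next to the tool call that wrote it.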
User-provided files use OpenAI-compatible request shapes: upload with POST /v1/files, reference file_id, or attach inline file_data. Responses also supports input_file.file_url.
Text-like UTF-8 input files get bounded decoded previews. When agentic tools are active, the model can request additional slices if the preview is not enough. Binary files are metadata-only in prompt context, but are still downloadable and mounted into shell/code workdirs when those tools are active. See OpenAI-compatible file inputs.
Behavior worth designing around:
- Inline vs fetched: bodies up to 8 MB are inlined (text or data_base64); larger bodies are elided from the wire and fetched via GET /v1/files/{id}/content. is_truncated() on the SDK File reports an elided body.
- Context preview: input files expose decoded text previews of up to 4096 chars per file and 32768 chars per request. Agent-produced text outputs expose a 1024-byte preview. Agentic runs can inspect more text when the relevant file-access tool is available.
- Undeclared outputs: the Python executor and shell tools accept an outputs parameter for files the model wrote but the request did not declare. Shell also advertises mistralrs_surface_outputs, which lets the model surface files created in earlier shell calls. Files declared via request.files are surfaced regardless; missing declared files come back as error placeholders. Files written but not named in outputs, mistralrs_surface_outputs, or request.files remain internal to the session.
The exact file schema, metadata endpoint, and content-endpoint status codes are in the HTTP API reference.
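The inline-versus-fetched rule suggests a single accessor on the client. This is a minimal sketch over a file entry parsed from JSON: file_bytes is a hypothetical helper, the id key is an assumption, and fetch is a caller-supplied function that performs the HTTP GET for a given path so the sketch stays independent of any HTTP library.

```python
import base64
from typing import Callable


def file_bytes(file: dict, fetch: Callable[[str], bytes]) -> bytes:
    """Return a file's bytes from the inline body, or fetch them if elided."""
    if file.get("text") is not None:
        return file["text"].encode("utf-8")  # inline text body
    if file.get("data_base64") is not None:
        return base64.b64decode(file["data_base64"])  # inline binary body
    # Neither inline field is present: the body was elided from the wire.
    return fetch(f"/v1/files/{file['id']}/content")
```

Passing the fetch function in keeps authentication, base URL, and retry policy in the app's own HTTP layer.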
Sessions
Use session_id when your app needs continuity across requests: message history, tool records, media, and code-execution state. Session behavior, the export/import/delete endpoints, and lifetime rules live in persist sessions.
SDK boundaries
| Surface | Current behavior |
|---|---|
| HTTP | Best surface for live model chunks, tool-progress timelines, files, and agent approval events. |
| Rust SDK | Supports request input files via InputFile and RequestBuilder::with_input_file(...); Model::stream_chat_request yields raw Response::AgenticToolCallProgress events. |
| Python SDK | Supports request input files via InputFile, plus agentic requests, callbacks, code execution, shell, local skill mounts, and sessions. The streaming iterator currently yields model chunks; use HTTP SSE for the full timeline. |
| Web UI | Renders code execution, shell, search, reasoning blocks, generated media, and approval cards inline. |
Full examples: Rust file inputs, Python file inputs, server file inputs, Rust agent, Rust agent streaming, Python agentic tools, HTTP tool rounds, and server Skills.
Security
Code and shell execution run with the permissions of the configured subprocess, inside the sandbox where enabled. Agent mode defaults to the developer sandbox profile, which keeps writes scoped to the session workdir while allowing common local toolchains to run. For untrusted workloads, set profile = "restricted" and tighter network settings in the TOML config, or use the matching CLI flags. Use agent_permission: "ask" or "deny" when an app needs tighter control over server-executed actions; a server-wide ask or deny cannot be loosened by the request (see permissions and approvals). For untrusted users, run mistral.rs in a container or VM, use a low-privilege user, and constrain network access.