Serve an OpenAI-compatible API
mistralrs serve puts a local model behind OpenAI-compatible endpoints under /v1. OpenAI SDKs and compatible clients work unchanged with http://localhost:1234/v1 as the base URL.
```bash
mistralrs serve -m Qwen/Qwen3-4B
```

Then send a request:

```bash
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [
      {"role": "user", "content": "Write a haiku about local inference."}
    ],
    "max_tokens": 128
  }'
```

With a single -m model, the request model is "default" (or omitted). In multi-model serving, use a model id exactly as it appears in GET /v1/models.
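For multi-model serving, the lookup-then-request flow can be sketched with the Python standard library. This is a minimal sketch, assuming the model list follows the usual OpenAI shape (`{"data": [{"id": ...}]}`); the helper names `model_ids`, `fetch_model_ids`, and `chat_request` are illustrative and not part of mistral.rs.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # default port used on this page


def model_ids(models_response: dict) -> list[str]:
    """Extract ids from an OpenAI-style model list (assumed shape: {"data": [{"id": ...}]})."""
    return [entry["id"] for entry in models_response.get("data", [])]


def fetch_model_ids(base_url: str = BASE_URL) -> list[str]:
    """Call GET /v1/models on a running server and return the ids it reports."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        return model_ids(json.load(resp))


def chat_request(model_id: str, prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a Chat Completions request for one model id."""
    body = {
        "model": model_id,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With a server running, `fetch_model_ids()` returns the ids to choose from, and `urllib.request.urlopen(chat_request(ids[0], "Hello"))` sends the request to the first one.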
First time serving a model? The Quickstart walks through installation, Hugging Face authentication for gated models, and the first run.
OpenAI Python client
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-used")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Say hello from mistral.rs."}],
)

print(response.choices[0].message.content)
```

The api_key is required by the client but not validated by the server; see authentication. Set stream=True for token-by-token output (full example).
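To show what token-by-token output looks like on the wire, here is a small sketch that reassembles text from streamed chunks. It assumes the server emits the standard OpenAI server-sent-event framing (`data: {json}` lines, each carrying `choices[].delta.content`, ended by `data: [DONE]`); the `sample` lines are hand-written for illustration, not captured server output.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style Chat Completions SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Hand-written sample chunks (illustrative only).
sample = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))  # prints "Hello"
```

The OpenAI Python client does this parsing itself: with `stream=True`, iterating the return value yields chunks whose `choices[0].delta.content` holds the next piece of text.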
Endpoints
| Endpoint | Purpose |
|---|---|
| GET /v1/models | List loaded models. |
| POST /v1/chat/completions | Chat, streaming, tool calling, multimodal inputs, and mistral.rs agentic extensions. |
| POST /v1/responses | OpenAI Responses API: response objects, polling, background runs, cancellation. |
| POST /v1/skills | Upload Skills for OpenAI-compatible Responses or Anthropic-compatible Messages. |
| GET /v1/skills | List uploaded skills. Anthropic headers return Anthropic-shaped list objects. |
| GET, POST /v1/skills/{skill_id}/versions | List or upload versions of an existing skill. |
| POST /v1/messages | Anthropic Messages API (base URL without /v1). |
| POST /v1/completions | Legacy text completions. |
| POST /v1/embeddings | Embedding generation. |
| POST /v1/images/generations | Image generation. |
| POST /v1/audio/speech | Text to speech. |
| POST /v1/files | Upload OpenAI-compatible user files. |
| GET /v1/files | List uploaded and generated files. |
Every path with full request and response schemas is in the generated HTTP API reference. Streaming events, authentication, and protocol semantics are in the HTTP API reference; field-level compatibility notes (including Responses API restrictions) are in OpenAI compatibility.
A live Swagger UI for the running server is at http://localhost:1234/docs.
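The "base URL without /v1" note on POST /v1/messages reads as: Anthropic-style clients append `/v1/messages` themselves, so they are given the server root. A minimal sketch of that request follows, under several assumptions: the body uses the public Anthropic Messages shape (`model`, `max_tokens`, `messages`), the `x-api-key` and `anthropic-version` headers mirror what Anthropic SDKs send and may not be required by the server, and `"default"` is accepted as the model on this endpoint as it is for Chat Completions. The helper name is illustrative.

```python
import json
import urllib.request

SERVER_ROOT = "http://localhost:1234"  # server root: no /v1 suffix


def anthropic_messages_request(prompt: str, model: str = "default",
                               root: str = SERVER_ROOT) -> urllib.request.Request:
    """Build (but do not send) an Anthropic-style Messages request."""
    body = {
        "model": model,      # assumption: "default" works here as in Chat Completions
        "max_tokens": 128,   # required by the Anthropic Messages schema
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{root}/v1/messages",  # the path a client appends to the root
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-api-key": "not-used",            # assumption: placeholder, as with the OpenAI client
            "anthropic-version": "2023-06-01",  # header Anthropic SDKs send; may be optional here
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends it to a running server.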
Tools, structured output, and agentic features
OpenAI-compatible function tools work on Chat Completions and Responses, including strict: true for JSON-Schema-constrained tool arguments. See tool calling.
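As an illustration of a strict function tool on Chat Completions, the sketch below builds a request body and decodes the tool calls from an assistant message. The `get_weather` tool and the helper names are hypothetical; the layout follows the OpenAI function-tool format, in which `arguments` arrives as a JSON string.

```python
import json


def weather_tool() -> dict:
    """A hypothetical function tool; strict asks for schema-constrained arguments."""
    return {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool name
            "description": "Look up the current weather for a city.",
            "strict": True,
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
                "additionalProperties": False,
            },
        },
    }


def tool_chat_payload(prompt: str, tools: list[dict], model: str = "default") -> dict:
    """Chat Completions body that offers the given tools to the model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
    }


def parse_tool_calls(message: dict) -> list[tuple[str, dict]]:
    """Decode (name, arguments) pairs from an assistant message's tool_calls."""
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls
```

An application would run the named function with the decoded arguments and return the result in a follow-up message with role `tool`.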
response_format with json_schema and the grammar extension constrain output server-side. See structured output.
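A request that constrains output with `response_format` might look like the following sketch. It assumes the OpenAI layout for `json_schema` (a `name` plus a `schema` object); the example schema and helper names are hypothetical, and the grammar extension is not shown.

```python
import json


def json_schema_payload(prompt: str, schema: dict, name: str = "result",
                        model: str = "default") -> dict:
    """Chat Completions body asking for output that matches a JSON Schema."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": name, "schema": schema},
        },
    }


def parse_structured(content: str) -> dict:
    """Decode the assistant message content, which should be a JSON document."""
    return json.loads(content)


# Hypothetical schema for illustration.
haiku_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "lines": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "lines"],
}
payload = json_schema_payload("Write a haiku about local inference.", haiku_schema, name="haiku")
```

Because the constraint is applied server-side, the message content in the response can be handed to `parse_structured` directly.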
Start the server with agentic capabilities to use server-side tools and agentic fields. Chat Completions uses web_search_options for web search and tools: [{"type":"code_interpreter","container":{"type":"auto"}}] for code execution. Responses uses hosted tools in the tools array for web search, code execution, shell, and OpenAI-compatible Skills.
```bash
mistralrs serve --agent -m Qwen/Qwen3-4B
```

For tool timelines, generated files, search, code execution, shell, Skills, and session state, see agentic runtime for apps.
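The two Chat Completions fields named above can be combined in one request body, sketched below. The `code_interpreter` tool entry is copied from this page; sending an empty `web_search_options` object to turn on search with default settings is an assumption, and the helper name is illustrative.

```python
def agentic_chat_payload(prompt: str, web_search: bool = True,
                         code_execution: bool = True, model: str = "default") -> dict:
    """Chat Completions body that opts in to server-side search and code execution."""
    payload: dict = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if web_search:
        # Assumption: an empty object enables web search with default settings.
        payload["web_search_options"] = {}
    if code_execution:
        # Tool entry as given on this page.
        payload["tools"] = [{"type": "code_interpreter", "container": {"type": "auto"}}]
    return payload
```

These fields rely on the server having been started with agentic capabilities, as in the `--agent` command above.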
Configuration
-p/--port (default 1234) and --host (default 0.0.0.0) control the bind address. --no-ui disables the web UI at /ui. All flags are in the CLI reference; the equivalent config file for multi-model, repeatable deployments is the TOML config reference, which also covers CORS, body limits, authentication, and logging.
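One practical consequence of the defaults: 0.0.0.0 is a wildcard bind address, not something a client connects to. The hypothetical helper below derives a client base URL from the bind settings, assuming a client running on the same machine as the server.

```python
def client_base_url(host: str = "0.0.0.0", port: int = 1234) -> str:
    """Map the server's bind address to a base URL a local client can use."""
    # A wildcard bind means "all interfaces"; a local client connects via localhost.
    connect_host = "localhost" if host == "0.0.0.0" else host
    return f"http://{connect_host}:{port}/v1"


print(client_base_url())                      # prints "http://localhost:1234/v1"
print(client_base_url("192.168.1.10", 8080))  # prints "http://192.168.1.10:8080/v1"
```

Clients on other machines would use the server's reachable address in place of localhost.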
Examples
Runnable client scripts live in examples/server/ and render under server examples:
| Example | What it shows |
|---|---|
| chat | Basic Chat Completions request. |
| streaming | Chat Completions streaming. |
| tool_calling | OpenAI-compatible function tools. |
| allowed_tools | OpenAI-compatible allowed_tools function subset selection. |
| openai_response_format | Structured output via response_format. |
| responses | Responses API request. |
| responses_tools | Responses hosted tools: web search and code interpreter. |
| skills | OpenAI-compatible Skills upload and execution. |
| responses_vision | Responses API with image input. |
| web_search | Search through OpenAI-compatible request fields. |
| anthropic_chat | Anthropic Messages request. |
| multi_model_chat | Routing requests across loaded models. |
For Codex and Claude Code setup, see coding agents.