
Serve an OpenAI-compatible API

mistralrs serve puts a local model behind OpenAI-compatible endpoints under /v1. OpenAI SDKs and compatible clients work unchanged with http://localhost:1234/v1 as the base URL.

mistralrs serve -m Qwen/Qwen3-4B

Then send a request:

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [
      {"role": "user", "content": "Write a haiku about local inference."}
    ],
    "max_tokens": 128
  }'

With a single -m model, the request model is "default" (or omitted). In multi-model serving, use a model id exactly as it appears in GET /v1/models.
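
For multi-model serving, the id can be read from GET /v1/models before sending requests. A minimal sketch, assuming the endpoint returns the usual OpenAI list shape ({"object": "list", "data": [{"id": ...}]}); the helper names and the sample ids are hypothetical:

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"


def list_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style model list payload (assumed shape)."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str = BASE_URL) -> list[str]:
    """Call GET /v1/models on a running server and return the ids it reports."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        return list_model_ids(json.load(resp))


# Offline illustration with a made-up payload; real ids come from the server.
sample = {"object": "list", "data": [{"id": "model-a"}, {"id": "model-b"}]}
print(list_model_ids(sample))
```

Against a running server, fetch_model_ids() returns the strings to pass as "model" in later requests.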

First time serving a model? The Quickstart walks through installation, Hugging Face authentication for gated models, and the first run.

The same server works with the OpenAI Python SDK:

from openai import OpenAI

# Point the SDK at the local server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-used")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Say hello from mistral.rs."}],
)
print(response.choices[0].message.content)

The api_key is required by the client but not validated by the server; see authentication. Set stream=True for token-by-token output (full example).
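
With stream=True the SDK yields chunks whose choices[0].delta.content carries the new text. The sketch below shows the same idea without the SDK, assuming the server emits standard OpenAI-style server-sent events (data: {json} lines, ended by data: [DONE]); iter_deltas and the sample lines are illustrative only:

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style chat completion SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample events; a real stream comes from the HTTP response body.
sample_lines = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": " world"}}]}',
    "data: [DONE]",
]
print("".join(iter_deltas(sample_lines)))
```

The generator accepts any iterable of decoded lines, so it can be fed from a streamed HTTP response as well as from a list.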

Endpoint                                  Purpose
GET /v1/models                            List loaded models.
POST /v1/chat/completions                 Chat, streaming, tool calling, multimodal inputs, and mistral.rs agentic extensions.
POST /v1/responses                        OpenAI Responses API: response objects, polling, background runs, cancellation.
POST /v1/skills                           Upload Skills for OpenAI-compatible Responses or Anthropic-compatible Messages.
GET /v1/skills                            List uploaded skills. Anthropic headers return Anthropic-shaped list objects.
GET, POST /v1/skills/{skill_id}/versions  List or upload versions of an existing skill.
POST /v1/messages                         Anthropic Messages API (base URL without /v1).
POST /v1/completions                      Legacy text completions.
POST /v1/embeddings                       Embedding generation.
POST /v1/images/generations               Image generation.
POST /v1/audio/speech                     Text to speech.
POST /v1/files                            Upload OpenAI-compatible user files.
GET /v1/files                             List uploaded and generated files.
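
The Anthropic row says "base URL without /v1" because Anthropic clients append /v1/messages themselves. A hedged sketch of the equivalent raw request: the body fields (model, max_tokens, messages) and the x-api-key and anthropic-version headers follow the public Anthropic Messages API, while using "default" as the model and the server's handling of those headers are assumptions here. build_messages_request is a hypothetical helper:

```python
import json
import urllib.request

ANTHROPIC_BASE_URL = "http://localhost:1234"  # note: no /v1 suffix


def build_messages_request(
    prompt: str, base_url: str = ANTHROPIC_BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) an Anthropic-style POST /v1/messages request."""
    body = {
        "model": "default",  # assumption: same single-model id as Chat Completions
        "max_tokens": 128,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-api-key": "not-used",
            "anthropic-version": "2023-06-01",
        },
        method="POST",
    )


req = build_messages_request("Say hello from mistral.rs.")
print(req.full_url)
```

Passing req to urllib.request.urlopen sends it to a running server.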

Every path with full request and response schemas is in the generated HTTP API reference. Streaming events, authentication, and protocol semantics are in the HTTP API reference; field-level compatibility notes (including Responses API restrictions) are in OpenAI compatibility.

A live Swagger UI for the running server is at http://localhost:1234/docs.

Tools, structured output, and agentic features


OpenAI-compatible function tools work on Chat Completions and Responses, including strict: true for JSON-Schema-constrained tool arguments. See tool calling.
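
A request with a strict function tool might look like the sketch below. The tool itself (get_weather) and the parse_tool_calls helper are hypothetical; the payload and the assistant message follow the OpenAI function-tool format, which this server is described as accepting:

```python
import json

# Hypothetical tool; "strict": True asks for schema-constrained arguments.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
            "additionalProperties": False,
        },
    },
}

payload = {
    "model": "default",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [weather_tool],
}


def parse_tool_calls(message: dict) -> list[tuple[str, dict]]:
    """Return (name, arguments) pairs from an OpenAI-style assistant message."""
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # Arguments arrive as a JSON-encoded string, not as an object.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls


# Hand-written example of an assistant message that carries one tool call.
sample_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_0",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
        }
    ],
}
print(parse_tool_calls(sample_message))
```

The payload is meant for POST /v1/chat/completions; parse_tool_calls would then be applied to choices[0].message of the response.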

response_format with json_schema and the grammar extension constrain output server-side. See structured output.
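
As an illustration of response_format with json_schema, the sketch below builds a request and decodes a reply. The schema, its name, and parse_structured are hypothetical; the response_format shape follows the OpenAI format. The grammar extension is not shown because its fields are not described here:

```python
import json

# Hypothetical schema for illustration.
haiku_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "lines": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "lines"],
    "additionalProperties": False,
}

payload = {
    "model": "default",
    "messages": [
        {"role": "user", "content": "Write a haiku about local inference as JSON."}
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "haiku", "schema": haiku_schema, "strict": True},
    },
}


def parse_structured(content: str) -> dict:
    """Decode the message content and check that the required keys are present."""
    data = json.loads(content)
    missing = [key for key in haiku_schema["required"] if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


# Hand-written stand-in for response.choices[0].message.content.
sample_content = '{"title": "Local", "lines": ["a", "b", "c"]}'
print(parse_structured(sample_content)["title"])
```

The client-side key check is only a cheap sanity test, since the constraint itself is applied server-side.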

Start the server with agentic capabilities to use server-side tools and agentic fields. Chat Completions uses web_search_options for web search and tools: [{"type":"code_interpreter","container":{"type":"auto"}}] for code execution. Responses uses hosted tools in the tools array for web search, code execution, shell, and OpenAI-compatible Skills.

mistralrs serve --agent -m Qwen/Qwen3-4B
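
With --agent enabled, the Chat Completions fields named above can be attached to an ordinary request. A sketch under assumptions: the empty web_search_options object (meaning "use defaults") mirrors the OpenAI field and is not confirmed here for mistral.rs, and agentic_chat_payload is a hypothetical helper:

```python
import json


def agentic_chat_payload(
    prompt: str, web_search: bool = False, code_interpreter: bool = False
) -> dict:
    """Build a Chat Completions body with optional server-side tools."""
    payload = {
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
    }
    if web_search:
        # Assumption: an empty object requests web search with default options.
        payload["web_search_options"] = {}
    if code_interpreter:
        # Tool entry copied from the text above.
        payload["tools"] = [
            {"type": "code_interpreter", "container": {"type": "auto"}}
        ]
    return payload


body = agentic_chat_payload(
    "Plot the first 20 primes.", web_search=True, code_interpreter=True
)
print(json.dumps(body, indent=2))
```

The resulting body is posted to /v1/chat/completions like any other request.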

For tool timelines, generated files, search, code execution, shell, Skills, and session state, see agentic runtime for apps.

-p/--port (default 1234) and --host (default 0.0.0.0) control the bind address. --no-ui disables the web UI at /ui. All flags are in the CLI reference; the equivalent config file for multi-model, repeatable deployments is the TOML config reference, which also covers CORS, body limits, authentication, and logging.
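
Note that 0.0.0.0 is a bind address (listen on all IPv4 interfaces), not an address to connect to. A small sketch of deriving a client base URL from the same host and port values; client_base_url is a hypothetical helper, and the second address is a made-up example:

```python
def client_base_url(host: str = "0.0.0.0", port: int = 1234) -> str:
    """Map the server's bind address to a URL a client can connect to."""
    # A server bound to 0.0.0.0 is reachable locally through the loopback name.
    connect_host = "localhost" if host == "0.0.0.0" else host
    return f"http://{connect_host}:{port}/v1"


print(client_base_url())
print(client_base_url("192.168.1.20", 8080))
```

From another machine, the server's LAN address takes the place of localhost.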

Runnable client scripts live in examples/server/ and render under server examples:

Example                 What it shows
chat                    Basic Chat Completions request.
streaming               Chat Completions streaming.
tool_calling            OpenAI-compatible function tools.
allowed_tools           OpenAI-compatible allowed_tools function subset selection.
openai_response_format  Structured output via response_format.
responses               Responses API request.
responses_tools         Responses hosted tools: web search and code interpreter.
skills                  OpenAI-compatible Skills upload and execution.
responses_vision        Responses API with image input.
web_search              Search through OpenAI-compatible request fields.
anthropic_chat          Anthropic Messages request.
multi_model_chat        Routing requests across loaded models.

For Codex and Claude Code setup, see coding agents.