
Serve a model as an API

mistralrs serve exposes a model over an OpenAI-compatible HTTP API. The model used here is Google’s Gemma 4. If you prefer to stay on the Qwen model from Tutorial 1, substitute Qwen/Qwen3-4B for google/gemma-4-E4B-it throughout and skip the license step below.

Gemma weights are gated on Hugging Face. One-time setup per account:

  1. Open huggingface.co/google/gemma-4-E4B-it, sign in, and accept the license at the top of the page.
  2. Create a read-only access token at huggingface.co/settings/tokens.
  3. Pass the token to mistral.rs:
mistralrs login

The token is saved to ~/.cache/huggingface/token and reused for subsequent downloads. If you have already logged in via huggingface-cli, skip this step; both tools read the same token file.
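To confirm that a token is already in place before starting a large download, a script can check the path given above. The following is a minimal sketch: the helper name has_saved_token is hypothetical, and it only tests that the file exists and is non-empty. It does not verify that the token is valid or that the Gemma license has been accepted.

```python
from pathlib import Path

# Token location stated above; other setups may store it elsewhere.
DEFAULT_TOKEN_PATH = Path.home() / ".cache" / "huggingface" / "token"


def has_saved_token(path: Path = DEFAULT_TOKEN_PATH) -> bool:
    """Return True if a non-empty token file exists at `path`.

    The token itself is never printed or returned.
    """
    try:
        return path.is_file() and path.read_text().strip() != ""
    except OSError:
        # Unreadable file: treat it as "no usable token".
        return False
```

Calling has_saved_token() with no arguments checks the default path. If it returns False, run mistralrs login first.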

mistralrs serve -m google/gemma-4-E4B-it

The first run downloads the weights. When loading completes:

Server listening on http://0.0.0.0:1234

The server binds 0.0.0.0 by default, making it reachable from any host on the network. To restrict it, pass --host 127.0.0.1. The port is configurable with --port.

Leave the server running and open a second terminal.
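Because the first run spends time downloading and loading weights, a script that talks to the server may want to wait until the port accepts connections. The sketch below uses only the standard library. The helper name wait_for_port and its default values are assumptions for illustration, and a successful TCP connect shows only that something is listening on the port, not that it is mistral.rs.

```python
import socket
import time


def wait_for_port(host: str = "127.0.0.1", port: int = 1234,
                  timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll until a TCP connection to host:port succeeds or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For a first run that includes a download, a much larger timeout than the default shown here is likely to be needed.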

The server implements the OpenAI Chat Completions protocol. Request bodies match OpenAI’s, with one difference: the model field is set to default instead of an OpenAI model name.

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [
      {"role": "user", "content": "In one sentence, what is Gemma?"}
    ]
  }'

The response is JSON with a choices array, a usage block, and the generated text in choices[0].message.content.
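The same request can be sent from Python without extra packages. The following sketch mirrors the curl call and pulls out the fields described above. The helper names (build_chat_body, extract_reply, chat) are hypothetical, and the code assumes the server is running on localhost:1234 as in the earlier steps.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # host and port from the steps above


def build_chat_body(prompt: str, model: str = "default") -> dict:
    """Build a Chat Completions request body with a single user message."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def extract_reply(response: dict) -> tuple[str, dict]:
    """Return (generated text, usage block) from a Chat Completions response."""
    text = response["choices"][0]["message"]["content"]
    return text, response.get("usage", {})


def chat(prompt: str, base_url: str = BASE_URL) -> tuple[str, dict]:
    """POST the prompt to the running server and return (text, usage)."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_body(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # Supplying `data` makes urllib send a POST request.
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))
```

With the server up, chat("In one sentence, what is Gemma?") returns the generated text together with the usage block.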

When the server is started with a single -m flag, the model is registered under the reserved name default. With multiple models, each is registered under its configured name and default is not used.

The official openai package works without modification:

pip install openai

Then call the server from Python:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-used")
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "user", "content": "In one sentence, what is Gemma?"}
    ],
)
print(response.choices[0].message.content)

The api_key field is required by the client but not validated by the server.

Streaming uses the OpenAI streaming protocol:

stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write me a haiku about Rust."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
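When the complete text is needed after streaming finishes, the deltas can be accumulated as they arrive. This is a minimal sketch with a hypothetical helper, collect_stream. It skips chunks that have an empty choices list or no content, which the OpenAI streaming protocol permits; whether this server emits such chunks is not stated here, so the guard is defensive.

```python
from typing import Any, Callable, Iterable, Optional


def collect_stream(chunks: Iterable[Any],
                   on_delta: Optional[Callable[[str], None]] = None) -> str:
    """Join the text deltas of a chat-completion stream into one string.

    `chunks` is any iterable of objects shaped like the openai client's
    streaming chunks (chunk.choices[0].delta.content).
    """
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # chunk without choices, e.g. usage-only
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
            if on_delta is not None:
                on_delta(delta)  # e.g. print as tokens arrive
    return "".join(parts)
```

Passing the stream object from the example above, as in collect_stream(stream, on_delta=lambda t: print(t, end="", flush=True)), prints tokens as they arrive and returns the full text at the end.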

The web UI is mounted at /ui by default whenever you run mistralrs serve:

mistralrs serve -m google/gemma-4-E4B-it

Open http://localhost:1234/ui. The UI provides a chat window with markdown rendering, reasoning blocks, and controls for sampling parameters and the system prompt. Pass --no-ui if you want the HTTP endpoints only.

The server implements the Chat Completions, legacy Completions, and Responses APIs. The OpenAI compatibility reference lists the supported and ignored fields.
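The three APIs differ mainly in how the prompt is carried in the request body. The sketch below builds a minimal body for each. It assumes the server uses OpenAI's standard paths and field names (/v1/chat/completions with messages, /v1/completions with prompt, /v1/responses with input), and the helper request_for is hypothetical. Check the compatibility reference for the fields that are actually honored.

```python
def request_for(api: str, prompt: str, model: str = "default") -> tuple[str, dict]:
    """Return (path, JSON body) for one of the three APIs, using OpenAI's shapes."""
    if api == "chat":
        return "/v1/chat/completions", {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }
    if api == "completions":
        # Legacy Completions takes a bare prompt string.
        return "/v1/completions", {"model": model, "prompt": prompt}
    if api == "responses":
        # The Responses API takes the prompt in the `input` field.
        return "/v1/responses", {"model": model, "input": prompt}
    raise ValueError(f"unknown API: {api!r}")
```

In the openai Python client these correspond to client.chat.completions.create, client.completions.create, and, in recent versions of the package, client.responses.create.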

The default model name is special-cased server-side: when the request’s model field is "default" or absent, the server uses the configured default model. GET /v1/models lists the real model id.
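To see which id sits behind default, query the models endpoint. This is a minimal sketch, assuming the response follows OpenAI's list shape (an object whose data array holds entries with an id field). The helper names model_ids and list_models are hypothetical.

```python
import json
import urllib.request


def model_ids(payload: dict) -> list[str]:
    """Extract the model ids from an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url: str = "http://localhost:1234/v1") -> list[str]:
    """GET /v1/models from the running server and return the ids."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        return model_ids(json.load(resp))
```

Any id returned by list_models() can be passed as the model field in place of default.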

  • Tutorial 3: load a model directly inside a Python program, without an HTTP server.
  • Tutorial 5: enable tool calling, web search, and code execution on the running server.
  • The serving guides cover multi-model serving and configuration.