Serve a model as an API

mistralrs serve exposes a model over an OpenAI-compatible HTTP API. This tutorial uses Google’s Gemma 4 (google/gemma-4-E4B-it). To stay on the Qwen model from Tutorial 1 instead, substitute Qwen/Qwen3-4B for google/gemma-4-E4B-it throughout and skip the license step below.

Accepting the Gemma license

Gemma weights are gated on Hugging Face. One-time setup per account:

1. Open huggingface.co/google/gemma-4-E4B-it, sign in, and accept the license at the top of the page.
2. Create a read-only access token at huggingface.co/settings/tokens.
3. Pass the token to mistral.rs:

mistralrs login

The token is saved to ~/.cache/huggingface/token and reused for subsequent downloads. If you have already logged in via huggingface-cli, skip this step, because both tools read the same token file.

Starting the server

mistralrs serve -m google/gemma-4-E4B-it

The first run downloads the weights. When loading completes, the server prints:

Server listening on http://0.0.0.0:1234

The server binds to 0.0.0.0 by default, so it accepts connections on every network interface and is reachable from other hosts on the network. To restrict it to the local machine, pass --host 127.0.0.1. To change the port, pass --port.

Leave the server running and open a second terminal.
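
If you script against the server rather than typing into a second terminal, it can help to wait until the port accepts connections before sending requests. The following is a minimal sketch using only the Python standard library. The helper name wait_for_port is hypothetical and not part of mistral.rs, and a successful connection only shows that something is listening on that port.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper for illustration; it does not inspect the HTTP API.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Open and immediately close a connection to probe the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example (assumes the default port shown above):
# ready = wait_for_port("localhost", 1234)
```

The first download can take far longer than the default 60 seconds, so raise the timeout to suit.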

Sending a request with curl

The server implements the OpenAI Chat Completions protocol. Request bodies match OpenAI’s, with one difference: the model field is set to default rather than to an OpenAI model name.

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [
      {"role": "user", "content": "In one sentence, what is Gemma?"}
    ]
  }'

The response is JSON with a choices array, a usage block, and the generated text in choices[0].message.content.
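
As a rough illustration of that layout, the sketch below pulls the generated text and the usage block out of a response body. The sample JSON is made up for this example and was not captured from a real server. Only choices, usage, and choices[0].message.content come from the description above; the other field names follow the OpenAI format and are assumptions here. The helper name extract_reply is hypothetical.

```python
import json

# Illustrative body only: all values are invented, and fields other than
# `choices`, `usage`, and `message.content` are assumed from the OpenAI format.
SAMPLE_RESPONSE = """
{
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Gemma is a family of open models from Google."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 18, "completion_tokens": 11, "total_tokens": 29}
}
"""


def extract_reply(body: str) -> tuple[str, dict]:
    """Return (generated text, usage block) from a Chat Completions response body."""
    data = json.loads(body)
    choices = data.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    # The generated text lives in choices[0].message.content.
    text = choices[0]["message"]["content"]
    return text, data.get("usage", {})


text, usage = extract_reply(SAMPLE_RESPONSE)
print(text)
print(usage.get("total_tokens"))
```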

When the server is started with a single -m flag, the model is registered under the reserved name default. With multiple models, each is registered under its configured name and default is not used.
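
The same request can be built without curl or any third-party package. The sketch below mirrors the curl command using Python's standard library, assuming the default address and the single-model name default. The helper names build_chat_request and send are hypothetical.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    base_url: str = "http://localhost:1234/v1",
    model: str = "default",
) -> urllib.request.Request:
    """Build (but do not send) a POST to /chat/completions with one user message."""
    payload = {
        "model": model,  # "default" when the server was started with a single -m
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON response. Requires a running server."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


# Example (needs the server from the previous section):
# reply = send(build_chat_request("In one sentence, what is Gemma?"))
# print(reply["choices"][0]["message"]["content"])
```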

Calling it from Python

The official openai package works without modification:

pip install openai

from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-used")

response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "user", "content": "In one sentence, what is Gemma?"}
    ],
)

print(response.choices[0].message.content)

The api_key field is required by the client but not validated by the server.

Streaming uses the OpenAI streaming protocol:

stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write me a haiku about Rust."}],
    stream=True,
)

for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
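
If you need the complete reply as one string rather than printing it piece by piece, the deltas can be accumulated. The sketch below is one way to do that. The helper name collect_stream is hypothetical, and the fake chunks only imitate the attribute shape used in the loop above (choices[0].delta.content) so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        # Skip chunks without choices (an assumption: some servers send these).
        if not chunk.choices:
            continue
        piece = chunk.choices[0].delta.content
        # The delta content can be None, for example on a role-only chunk.
        if piece:
            parts.append(piece)
    return "".join(parts)


def fake_chunk(content):
    """Stand-in object with the same attribute path as a real stream chunk."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=content))])


demo = [fake_chunk(None), fake_chunk("Borrow"), fake_chunk(" checker"), SimpleNamespace(choices=[])]
print(collect_stream(demo))  # Borrow checker
```

With the real client, pass the stream object from the call above: collect_stream(stream).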

The built-in web UI

The web UI is mounted at /ui by default whenever you run mistralrs serve:

mistralrs serve -m google/gemma-4-E4B-it

Open http://localhost:1234/ui. The UI provides a chat window with markdown rendering, reasoning blocks, and controls for sampling parameters and the system prompt. Pass --no-ui if you want the HTTP endpoints only.

Notes

The server implements the Chat Completions, legacy Completions, and Responses APIs. The OpenAI compatibility reference lists the supported and ignored fields.

The default model name is special-cased server-side: when the request’s model field is "default" or absent, the server uses the configured default model. GET /v1/models lists the real model id.
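
To find the real model id programmatically, read the ids out of the /v1/models response. The sketch below assumes the response follows the OpenAI list shape (a data array of objects with an id field), which is not confirmed here. The sample payload is illustrative and the helper name model_ids is hypothetical.

```python
def model_ids(payload: dict) -> list[str]:
    """Return the id of every entry in an OpenAI-style model list payload."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


# Illustrative payload, assuming the OpenAI list format; not captured from a server.
SAMPLE_MODELS = {
    "object": "list",
    "data": [{"id": "google/gemma-4-E4B-it", "object": "model"}],
}

print(model_ids(SAMPLE_MODELS))

# With the openai client from earlier, the equivalent would be:
# ids = [m.id for m in client.models.list()]
```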

Next steps

Tutorial 3: load a model directly inside a Python program, without an HTTP server.
Tutorial 5: enable tool calling, web search, and code execution on the running server.
The serving guides cover multi-model serving and configuration.