Serve a model as an API
mistralrs serve exposes a model over an OpenAI-compatible HTTP API. The model used here is Google’s Gemma 4. If you prefer to stay on the Qwen model from Tutorial 1, substitute Qwen/Qwen3-4B for google/gemma-4-E4B-it throughout and skip the license step below.
Accepting the Gemma license
Gemma weights are gated on Hugging Face. One-time setup per account:
- Open huggingface.co/google/gemma-4-E4B-it, sign in, and accept the license at the top of the page.
- Create a read-only access token at huggingface.co/settings/tokens.
- Pass the token to mistral.rs:
    mistralrs login

The token is saved to ~/.cache/huggingface/token and reused for subsequent downloads. If you have already logged in via huggingface-cli, skip this step; both tools read the same token file.
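To confirm that the login step left a token where later downloads will look, a short check along these lines may help. It is a minimal sketch that assumes the default path quoted above; `token_path` and `has_token` are hypothetical helper names, and a customized Hugging Face cache location would put the file elsewhere.

```python
from pathlib import Path
from typing import Optional


def token_path(home: Optional[Path] = None) -> Path:
    """Return the token location quoted in this tutorial (assumed default)."""
    base = home if home is not None else Path.home()
    return base / ".cache" / "huggingface" / "token"


def has_token(path: Path) -> bool:
    """True if the token file exists and contains non-whitespace text."""
    return path.is_file() and path.read_text().strip() != ""


if __name__ == "__main__":
    path = token_path()
    status = "token found" if has_token(path) else "no token found"
    print(f"{path}: {status}")
```

The script only reports whether a non-empty file exists; it does not check that the token is valid or that the Gemma license has been accepted.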
Starting the server
    mistralrs serve -m google/gemma-4-E4B-it

The first run downloads the weights. When loading completes, the server prints:

    Server listening on http://0.0.0.0:1234

The server binds 0.0.0.0 by default, making it reachable from any host on the network. To restrict it, pass --host 127.0.0.1. The port is configurable with --port.
Leave the server running and open a second terminal.
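Because the first run can spend a while downloading weights, a script in the second terminal may want to wait until the port accepts connections before sending requests. The following is a minimal sketch, assuming the server only opens its port once loading has finished; `wait_for_port` is a hypothetical helper, not part of mistral.rs.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0,
                  interval: float = 0.5) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("127.0.0.1", 1234, timeout=600.0)` would wait up to ten minutes for the port shown in the startup message. A successful connection only shows that the socket is open, not that the model can answer requests.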
Sending a request with curl
The server implements the OpenAI Chat Completions protocol. Request bodies match OpenAI’s, with one difference: the model name is default.

    curl http://localhost:1234/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "default",
        "messages": [
          {"role": "user", "content": "In one sentence, what is Gemma?"}
        ]
      }'

The response is JSON with a choices array, a usage block, and the generated text in choices[0].message.content.
When the server is started with a single -m flag, the model is registered under the reserved name default. With multiple models, each is registered under its configured name and default is not used.
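The same request can be issued with only the Python standard library, which makes the request and response shapes explicit. This is a minimal sketch assuming a single-model server on the default port; `build_chat_request`, `extract_text`, and `ask` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:1234/v1") -> urllib.request.Request:
    """Build the same POST request as the curl command above."""
    body = {
        "model": "default",  # reserved name for a single-model server
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response: dict) -> str:
    """Return the generated text from a Chat Completions response body."""
    return response["choices"][0]["message"]["content"]


def ask(prompt: str) -> str:
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        return extract_text(json.load(resp))
```

With the server running, `print(ask("In one sentence, what is Gemma?"))` would print the model's reply.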
Calling it from Python
The official openai package works without modification:

    pip install openai

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-used")

    response = client.chat.completions.create(
        model="default",
        messages=[
            {"role": "user", "content": "In one sentence, what is Gemma?"}
        ],
    )

    print(response.choices[0].message.content)

The api_key field is required by the client but not validated by the server.
Streaming uses the OpenAI streaming protocol:
    stream = client.chat.completions.create(
        model="default",
        messages=[{"role": "user", "content": "Write me a haiku about Rust."}],
        stream=True,
    )

    for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

The built-in web UI
The web UI is mounted at /ui by default whenever you run mistralrs serve:

    mistralrs serve -m google/gemma-4-E4B-it

Open http://localhost:1234/ui. The UI provides a chat window with markdown rendering, reasoning blocks, and controls for sampling parameters and the system prompt. Pass --no-ui if you want the HTTP endpoints only.
The server implements the Chat Completions, legacy Completions, and Responses APIs. The OpenAI compatibility reference lists the supported and ignored fields.
The default model name is special-cased server-side: when the request’s model field is "default" or absent, the server uses the configured default model. GET /v1/models lists the real model id.
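To see which id sits behind default, the model listing can be fetched and reduced to its ids. The sketch below assumes the listing follows OpenAI's shape (a top-level data array of objects with an id field), which the text above does not spell out; `model_ids` and `fetch_model_ids` are hypothetical helper names.

```python
import json
import urllib.request
from typing import List


def model_ids(listing: dict) -> List[str]:
    """Extract ids from a listing shaped like {"data": [{"id": ...}, ...]}."""
    return [entry["id"] for entry in listing.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:1234/v1") -> List[str]:
    """GET /v1/models from the running server and return the model ids."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        return model_ids(json.load(resp))
```

With the server running, `print(fetch_model_ids())` would print the ids the server reports.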
Next steps
- Tutorial 3: load a model directly inside a Python program, without an HTTP server.
- Tutorial 5: enable tool calling, web search, and code execution on the running server.
- The serving guides cover multi-model serving and configuration.