
Python SDK getting started

The Python SDK loads the model in-process and wraps the same Rust engine that backs the mistralrs binary.

from mistralrs import Runner, Which, ChatCompletionRequest

runner = Runner(
    which=Which.Plain(model_id="Qwen/Qwen3-4B"),
    in_situ_quant="4",
)

response = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {"role": "user", "content": "In one sentence, what is Rust known for?"}
        ],
        max_tokens=256,
    )
)
print(response.choices[0].message.content)

The first run downloads the weights into the Hugging Face cache.

pip install mistralrs covers CPU (Linux x86_64/aarch64, Windows) and Metal (macOS arm64) with a single package; pip picks the wheel for your platform. It requires Python 3.10 or newer.

pip install mistralrs

CUDA wheels are published as GitHub release assets, one per compute capability, because PyPI has no way to select a wheel by GPU architecture. Install with --find-links pointed at the release, and select your GPU’s compute capability through the +smNN suffix on the version (replace 0.8.4 / v0.8.4 with the release you want):

pip install "mistralrs==0.8.4+sm89" \
    --find-links https://github.com/EricLBuehler/mistral.rs/releases/expanded_assets/v0.8.4

Look up your GPU’s compute capability in hardware support, which lists the published wheels per architecture. The wheels bundle the CUDA runtime, so no system toolkit is needed; they use the CUTLASS MoE backend (for the faster cuTile path, use the prebuilt binary).
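The relationship between a compute capability and the wheel's version suffix can be sketched in a few lines of Python. Both helpers below are hypothetical and not part of mistralrs; they assume the suffix is simply "sm" followed by the capability digits (so 8.9 becomes sm89). Whether a wheel is actually published for a given suffix still has to be checked against the release assets.

```python
def cuda_wheel_requirement(compute_capability: str, version: str = "0.8.4") -> str:
    """Build a pip requirement such as 'mistralrs==0.8.4+sm89'.

    Hypothetical helper: assumes the suffix is 'sm' plus the major and
    minor digits of the compute capability.
    """
    major, _, minor = compute_capability.strip().partition(".")
    if not (major.isdigit() and minor.isdigit()):
        raise ValueError(f"expected a value like '8.9', got {compute_capability!r}")
    return f"mistralrs=={version}+sm{major}{minor}"


def find_links_url(version: str = "0.8.4") -> str:
    """URL to pass to --find-links for the given release version."""
    return (
        "https://github.com/EricLBuehler/mistral.rs/releases/"
        f"expanded_assets/v{version}"
    )


if __name__ == "__main__":
    print(cuda_wheel_requirement("8.9"))
    print(find_links_url())
```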

All install paths expose the same from mistralrs import ... API.

Runner owns the loaded model. Construction loads the weights; reuse one Runner for the lifetime of the process to avoid reloading.

Which selects the model loader. Which.Plain(model_id="...") is correct for standard text models. Other variants cover multimodal models (Which.MultimodalPlain), GGUF checkpoints (Which.GGUF), embeddings (Which.Embedding), and LoRA adapters (Which.Lora).

in_situ_quant="4" is the equivalent of the CLI’s --isq 4: it applies ISQ (in-situ quantization), quantizing the weights to 4 bits at load time. Omit it for full precision.
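As a rough illustration of what 4-bit quantization buys, weight memory scales with bits per parameter. The sketch below is back-of-the-envelope arithmetic, not a measurement: it assumes about 4 billion parameters for a 4B model and 16-bit weights in the unquantized case, and it ignores quantization metadata, activations, and the KV cache, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB: params * bits / 8 bytes."""
    return num_params * bits_per_weight / 8 / 2**30


PARAMS = 4e9  # assumption: roughly 4 billion parameters

full = weight_memory_gib(PARAMS, 16)  # assumed 16-bit unquantized weights
quant = weight_memory_gib(PARAMS, 4)  # 4-bit ISQ, ignoring per-block overhead

print(f"16-bit: ~{full:.1f} GiB, 4-bit: ~{quant:.1f} GiB")
```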

Full example: plain.

Set stream=True to receive an iterator of chunks instead of a single response:

from mistralrs import Runner, Which, ChatCompletionRequest

runner = Runner(
    which=Which.Plain(model_id="Qwen/Qwen3-4B"),
    in_situ_quant="4",
)

stream = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[{"role": "user", "content": "Write me a haiku about ownership."}],
        max_tokens=128,
        stream=True,
    )
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()

Each chunk is a ChatCompletionChunkResponse with the OpenAI streaming shape. choices[0].delta.content carries one incremental piece of the reply; it can be None (for example on the final chunk, which carries finish_reason), which is why the example checks delta before printing.
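When the full text is needed after streaming, the same None check applies while accumulating. The helper below is a minimal sketch, not part of mistralrs: collect_stream and fake_chunk are hypothetical names, and the code assumes finish_reason sits on choices[0] as in the OpenAI shape. The fake chunks only mimic the attribute layout so the snippet runs without a model.

```python
from types import SimpleNamespace
from typing import Iterable, Optional, Tuple


def collect_stream(chunks: Iterable) -> Tuple[str, Optional[str]]:
    """Join the non-empty deltas and return (text, finish_reason)."""
    parts = []
    finish_reason = None
    for chunk in chunks:
        choice = chunk.choices[0]
        if choice.delta.content:  # skips None and empty strings
            parts.append(choice.delta.content)
        # Assumption: finish_reason lives on the choice, as in the OpenAI shape.
        reason = getattr(choice, "finish_reason", None)
        if reason is not None:
            finish_reason = reason
    return "".join(parts), finish_reason


def fake_chunk(content, finish_reason=None):
    """Stand-in object with the same attribute layout as a streamed chunk."""
    return SimpleNamespace(
        choices=[
            SimpleNamespace(
                delta=SimpleNamespace(content=content),
                finish_reason=finish_reason,
            )
        ]
    )


demo = [fake_chunk("Borrow"), fake_chunk(" checker"), fake_chunk(None, "stop")]
text, reason = collect_stream(demo)
print(text, reason)
```

With a real model, passing the stream returned by send_chat_completion_request to collect_stream should work the same way, under the attribute assumption noted above.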

Full example: streaming. For async iteration, FastAPI integration, and mid-stream error handling, see streaming from Python.

The Runner keeps the model in memory for the process lifetime. Requests can be sent sequentially or from multiple threads, all reusing the loaded weights. To swap models, construct a new Runner; the old one releases GPU memory when it goes out of scope.
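Sending requests from several threads against one Runner might look like the sketch below. ask_all and make_ask are hypothetical helpers, not part of mistralrs; the request itself uses only the calls shown earlier on this page. The mistralrs import is deferred into make_ask so the thread-pool part can be tried with any callable.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def ask_all(ask: Callable[[str], str], prompts: List[str], workers: int = 4) -> List[str]:
    """Run ask(prompt) for each prompt on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ask, prompts))


def make_ask(runner, max_tokens: int = 128) -> Callable[[str], str]:
    """Bind one shared Runner into an ask(prompt) function."""
    from mistralrs import ChatCompletionRequest  # imported lazily, only when used

    def ask(prompt: str) -> str:
        response = runner.send_chat_completion_request(
            ChatCompletionRequest(
                model="default",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=max_tokens,
            )
        )
        return response.choices[0].message.content

    return ask
```

Given an existing runner, ask_all(make_ask(runner), ["What is a borrow?", "What is a lifetime?"]) would return the two replies in order. The worker count here is arbitrary; a sensible value depends on the model and hardware.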

Chat history is not tracked. Each call to send_chat_completion_request is independent; multi-turn conversation means assembling the messages list yourself, appending each new user question and prior assistant reply.
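One way to assemble the history is to keep a plain list and append to it around each call. The sketch below is illustrative only: chat_turn and runner_reply are hypothetical helpers, not part of mistralrs, and the request uses the same calls as the first example on this page.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def chat_turn(
    reply: Callable[[List[Message]], str],
    history: List[Message],
    user_text: str,
) -> str:
    """Append the user message, get a reply for the whole history, record it."""
    history.append({"role": "user", "content": user_text})
    answer = reply(history)
    history.append({"role": "assistant", "content": answer})
    return answer


def runner_reply(runner, max_tokens: int = 256) -> Callable[[List[Message]], str]:
    """Adapt a Runner to the reply(messages) signature used by chat_turn."""
    from mistralrs import ChatCompletionRequest  # imported lazily, only when used

    def reply(messages: List[Message]) -> str:
        response = runner.send_chat_completion_request(
            ChatCompletionRequest(
                model="default",
                messages=list(messages),  # send a copy of the full history
                max_tokens=max_tokens,
            )
        )
        return response.choices[0].message.content

    return reply
```

With a loaded runner, start from history = [] and call chat_turn(runner_reply(runner), history, "...") once per user message. Because every request carries the whole list, long conversations grow the prompt with each turn.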

The full Python surface (embeddings, speech, image generation, multimodal requests) is documented in the Python reference.