Stream tokens from Python

Streaming displays output as it is generated rather than after the full response completes. The Python SDK exposes streaming as a plain iterator that can be used from both synchronous and asynchronous code.

The simplest pattern is a for loop:

from mistralrs import Runner, Which, ChatCompletionRequest

runner = Runner(Which.Plain(model_id="Qwen/Qwen3-4B"), in_situ_quant="4")
stream = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="Qwen/Qwen3-4B",
        messages=[{"role": "user", "content": "Explain prime numbers."}],
        max_tokens=256,
        stream=True,
    )
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()

Each chunk’s delta.content is a string (one step of new output) or None. None appears at stream start (role set, no text yet) and end (finish reason emitted).
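
When the full response is needed as one string, the None handling described above can be folded into a small helper. The following is a minimal sketch, not part of the SDK: `collect_text` and `make_chunk` are hypothetical names, and the chunks are stand-in objects that only mimic the `chunk.choices[0].delta.content` shape so the example runs without loading a model.

```python
from types import SimpleNamespace
from typing import Iterable, Optional


def make_chunk(content: Optional[str]) -> SimpleNamespace:
    """Hypothetical stand-in that mimics chunk.choices[0].delta.content."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def collect_text(chunks: Iterable) -> str:
    """Join every non-empty delta into the full response text."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # skips None (stream start/end) and empty strings
            parts.append(delta)
    return "".join(parts)


# Simulated stream: None at the start and end, text in between.
fake_stream = [
    make_chunk(None),
    make_chunk("Prime "),
    make_chunk("numbers"),
    make_chunk(None),
]
full_text = collect_text(fake_stream)
print(full_text)
```

With a real stream, passing the iterator returned by send_chat_completion_request to the helper should work the same way, assuming the chunk shape shown earlier.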

The SDK does not expose a native async iterator. Wrap the synchronous iterator in an executor:

import asyncio
from mistralrs import Runner, Which, ChatCompletionRequest

runner = Runner(Which.Plain(model_id="Qwen/Qwen3-4B"), in_situ_quant="4")

async def stream_response(prompt: str):
    stream = runner.send_chat_completion_request(
        ChatCompletionRequest(
            model="Qwen/Qwen3-4B",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
    )
    # get_running_loop() is the preferred call inside a coroutine.
    loop = asyncio.get_running_loop()
    while True:
        # next(stream, None) returns None once the iterator is exhausted.
        chunk = await loop.run_in_executor(None, next, stream, None)
        if chunk is None:
            break
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta

Consume with async for:

async def main():
    async for delta in stream_response("Write a haiku."):
        print(delta, end="", flush=True)

asyncio.run(main())
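
The executor pattern above is not specific to chat chunks, so it can be factored into a reusable wrapper for any blocking iterator. This is a sketch under assumptions: `iterate_in_executor` is a hypothetical helper name, and it uses a private sentinel object instead of None so that a legitimate None item cannot end the stream early.

```python
import asyncio
from typing import AsyncIterator, Iterator, TypeVar

T = TypeVar("T")
_DONE = object()  # unique sentinel marking exhaustion of the iterator


async def iterate_in_executor(sync_iter: Iterator[T]) -> AsyncIterator[T]:
    """Yield items from a blocking iterator without blocking the event loop."""
    loop = asyncio.get_running_loop()
    while True:
        # Each next() call runs in the default thread pool executor.
        item = await loop.run_in_executor(None, next, sync_iter, _DONE)
        if item is _DONE:
            break
        yield item


async def demo() -> list:
    # A plain list iterator stands in for the SDK stream here.
    return [item async for item in iterate_in_executor(iter(["a", None, "b"]))]


result = asyncio.run(demo())
print(result)
```

Each item still costs one hop to the thread pool, so this keeps the event loop responsive but does not make generation itself any faster.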

For FastAPI, the same pattern works as a response generator:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from mistralrs import Runner, Which, ChatCompletionRequest

app = FastAPI()
runner = Runner(Which.Plain(model_id="Qwen/Qwen3-4B"), in_situ_quant="4")

@app.get("/stream")
async def stream(prompt: str):
    def generate():
        s = runner.send_chat_completion_request(
            ChatCompletionRequest(
                model="Qwen/Qwen3-4B",
                messages=[{"role": "user", "content": prompt}],
                stream=True,
            )
        )
        for chunk in s:
            delta = chunk.choices[0].delta.content
            if delta:
                yield delta

    # The generator yields raw text, not SSE-framed events, so it is served as plain text.
    return StreamingResponse(generate(), media_type="text/plain")
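
If the client expects Server-Sent Events (the text/event-stream media type), each delta needs to be framed as an event rather than sent as raw text. The sketch below shows one way to do that framing. `sse_event` and `sse_frames` are hypothetical helpers, not part of mistralrs or FastAPI, and the closing `[DONE]` marker is an assumed convention borrowed from OpenAI-style streams.

```python
from typing import Iterable, Iterator


def sse_event(data: str) -> str:
    """Frame one payload as a Server-Sent Events message."""
    # Every payload line gets its own "data:" field; a blank line ends the event.
    lines = data.split("\n")
    return "".join(f"data: {line}\n" for line in lines) + "\n"


def sse_frames(deltas: Iterable[str]) -> Iterator[str]:
    """Wrap each text delta in an SSE frame, then emit an end marker."""
    for delta in deltas:
        yield sse_event(delta)
    yield sse_event("[DONE]")  # assumed end-of-stream marker


frames = list(sse_frames(["Hello", " world\n"]))
print(frames)
```

A generator of deltas wrapped this way could be passed to StreamingResponse with media_type="text/event-stream". Splitting on newlines matters because a delta that contains a line break would otherwise terminate the event early.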

For production, run mistralrs as an HTTP server (see Tutorial 2) and call it with the OpenAI Python client rather than loading the model in the web app process. The HTTP server’s streaming is more robust under load.

Streaming can fail mid-response: out of memory, generation failure, client disconnect. Wrap the loop:

import sys

try:
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
except Exception as e:
    print(f"\n\nStream ended: {e}", file=sys.stderr)

The engine flushes generated content before raising. Partial output is preserved.
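
To make use of that partial output programmatically, the loop can accumulate deltas and return them together with any error. This is a minimal sketch: `consume_keeping_partial` is a hypothetical helper, and `failing_stream` is a simulated stream that raises after two chunks so the example runs without a model.

```python
from types import SimpleNamespace
from typing import Iterable, Optional, Tuple


def consume_keeping_partial(chunks: Iterable) -> Tuple[str, Optional[Exception]]:
    """Return the text received so far plus the error, if the stream failed."""
    parts = []
    error = None
    try:
        for chunk in chunks:
            delta = chunk.choices[0].delta.content
            if delta:
                parts.append(delta)
    except Exception as exc:  # deltas collected before the failure are kept
        error = exc
    return "".join(parts), error


def failing_stream():
    """Simulated stream (not the real SDK) that fails after two chunks."""
    for text in ("Hello", ", wor"):
        delta = SimpleNamespace(content=text)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])
    raise RuntimeError("simulated failure")


text, error = consume_keeping_partial(failing_stream())
print(repr(text), repr(error))
```

Returning the error instead of re-raising lets the caller decide whether to show the partial text, retry, or report the failure.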