Vision model walkthroughs

All supported vision families use the same OpenAI-style multimodal message shape. The differences are model IDs, supported modalities, and a few model-specific flags.

CLI:

mistralrs run -m Qwen/Qwen3-VL-4B-Instruct --quant 4 --image photo.jpg -i "Describe this image"
mistralrs serve -m Qwen/Qwen3-VL-4B-Instruct --quant 4 -p 1234

HTTP:

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:1234/v1/")

completion = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "file:///absolute/path/photo.jpg"}},
                {"type": "text", "text": "What is this?"},
            ],
        }
    ],
    max_tokens=256,
)
print(completion.choices[0].message.content)

Python SDK:

from mistralrs import ChatCompletionRequest, MultimodalArchitecture, Runner, Which

runner = Runner(
    which=Which.MultimodalPlain(
        model_id="Qwen/Qwen3-VL-4B-Instruct",
        arch=MultimodalArchitecture.Qwen3VL,
    ),
    in_situ_quant="4",
)

response = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": "file:///absolute/path/photo.jpg"}},
                    {"type": "text", "text": "What is this?"},
                ],
            }
        ],
        max_tokens=256,
    )
)
print(response.choices[0].message.content)

Use file:// URLs for local files, https:// for remote files, and data:image/...;base64,... for inline images.
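The local and inline URL forms can be produced from a path with the Python standard library alone. The sketch below is a minimal illustration: the helper names `to_file_url`, `bytes_to_data_url`, `to_data_url`, and `image_part` are hypothetical and not part of mistral.rs, and the MIME type is guessed from the file extension rather than from the file contents.

```python
import base64
import mimetypes
from pathlib import Path


def to_file_url(path: str) -> str:
    """Return a file:// URL for a local path (made absolute first)."""
    return Path(path).resolve().as_uri()


def bytes_to_data_url(data: bytes, mime: str) -> str:
    """Inline raw image bytes as a data:<mime>;base64,<payload> URL."""
    payload = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{payload}"


def to_data_url(path: str) -> str:
    """Read a local image and inline it; the MIME type comes from the extension."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {path!r}")
    return bytes_to_data_url(Path(path).read_bytes(), mime)


def image_part(url: str) -> dict:
    """Wrap a file://, https://, or data: URL in an image_url content part."""
    return {"type": "image_url", "image_url": {"url": url}}
```

Any of the returned URLs can replace the `file:///absolute/path/photo.jpg` value in the requests above. Inline data URLs make the request body larger, so they are mostly useful when the server cannot read the client's filesystem.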

| Family | Example model | Python architecture | Modalities | Notes |
| --- | --- | --- | --- | --- |
| Gemma 3 | google/gemma-3-12b-it | MultimodalArchitecture.Gemma3 | image | 128k context vision-language family. |
| Gemma 3n | google/gemma-3n-E4B-it | MultimodalArchitecture.Gemma3n | image, audio, video | MatFormer slices can trade quality for memory. |
| Gemma 4 | google/gemma-4-E4B-it | MultimodalArchitecture.Gemma4 | image, audio, video | Supports strict tool grammar and mixed media in one message. |
| Idefics 2 | HuggingFaceM4/idefics2-8b | MultimodalArchitecture.Idefics2 | image | Older but useful image-text family. |
| Idefics 3 / SmolVLM | HuggingFaceM4/Idefics3-8B-Llama3 | MultimodalArchitecture.Idefics3 | image | SmolVLM follows the same loader path. |
| LLaVA / LLaVA Next | llava-hf/llava-v1.6-mistral-7b-hf | MultimodalArchitecture.LLaVANext | image | Vicuna-backed checkpoints need the Vicuna chat template. |
| Llama 3.2 Vision | meta-llama/Llama-3.2-11B-Vision-Instruct | MultimodalArchitecture.VLlama | image | Device mapping applies to the text backbone. |
| Llama 4 | meta-llama/Llama-4-Scout-17B-16E-Instruct | MultimodalArchitecture.Llama4 | image | Sparse multimodal model with tool calling and web-search support. |
| MiniCPM-O 2.6 | openbmb/MiniCPM-o-2_6 | MultimodalArchitecture.MiniCpmO | image, audio | Check the supported models reference when modality support matters. |
| Mistral Small 3 | mistralai/Mistral-Small-3.2-24B-Instruct-2506 | MultimodalArchitecture.Mistral3 | image | Tool calling requires the provided Mistral Small tool-call template. |
| Phi 3.5 Vision | microsoft/Phi-3.5-vision-instruct | MultimodalArchitecture.Phi3V | image | Best with one image; multiple images are resized together. |
| Phi 4 Multimodal | microsoft/Phi-4-multimodal-instruct | MultimodalArchitecture.Phi4MM | image, audio | Audio and image can be sent in the same message. |
| Qwen2-VL / Qwen2.5-VL | Qwen/Qwen2-VL-7B-Instruct | MultimodalArchitecture.Qwen2VL / MultimodalArchitecture.Qwen2_5VL | image, video | Good baseline Qwen vision family. |
| Qwen3-VL | Qwen/Qwen3-VL-4B-Instruct | MultimodalArchitecture.Qwen3VL / MultimodalArchitecture.Qwen3VLMoE | image, video | Dense and MoE variants. MoE variants support MoQE. |
| Qwen3.5 | Qwen/Qwen3.5-27B | MultimodalArchitecture.Qwen3_5 / MultimodalArchitecture.Qwen3_5Moe | image | Dense and MoE variants. MoE variants support MoQE. |

Use --video on the CLI or a video_url content part over HTTP:

mistralrs run -m google/gemma-4-E4B-it --quant 8 --video clip.mp4 -i "Summarize this clip."

{
  "role": "user",
  "content": [
    {"type": "video_url", "video_url": {"url": "file:///absolute/path/clip.mp4"}},
    {"type": "text", "text": "What happens in this video?"}
  ]
}

FFmpeg requirements, supported containers, and platform install commands are documented in "Set up video input". Per-request frame-sampling controls are not currently exposed.

Gemma 4, Gemma 3n, Phi 4 Multimodal, MiniCPM-O, and Voxtral can accept audio content parts when supported by the selected checkpoint:

{
  "role": "user",
  "content": [
    {"type": "audio_url", "audio_url": {"url": "file:///absolute/path/audio.wav"}},
    {"type": "image_url", "image_url": {"url": "file:///absolute/path/photo.jpg"}},
    {"type": "text", "text": "Describe what you hear and see."}
  ]
}

WAV, MP3, FLAC, and OGG are decoded natively. Other formats must first be converted with FFmpeg; see "Set up video input" for FFmpeg installation.
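The image, video, and audio content parts on this page share one shape: a `type` key and a nested object of the same name holding the URL. The sketch below builds such messages and flags audio files that would need FFmpeg conversion first. The helper names are hypothetical, and the format check looks only at the file extension, which is an assumption made for illustration and not how mistral.rs detects formats.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Audio formats the text above lists as natively decoded.
NATIVE_AUDIO_SUFFIXES = {".wav", ".mp3", ".flac", ".ogg"}
# Content part types used in the examples on this page.
PART_KINDS = {"image_url", "video_url", "audio_url"}


def media_part(kind: str, url: str) -> dict:
    """Build one media content part, e.g. media_part("video_url", "file:///...")."""
    if kind not in PART_KINDS:
        raise ValueError(f"unknown content part type: {kind!r}")
    return {"type": kind, kind: {"url": url}}


def needs_ffmpeg_conversion(audio_url: str) -> bool:
    """Extension-based guess; only meaningful for file:// and https:// URLs."""
    suffix = PurePosixPath(urlparse(audio_url).path).suffix.lower()
    return suffix not in NATIVE_AUDIO_SUFFIXES


def user_message(text: str, *parts: dict) -> dict:
    """Assemble a user message with media parts first and the text prompt last."""
    return {"role": "user", "content": [*parts, {"type": "text", "text": text}]}
```

For example, `user_message("Describe what you hear and see.", media_part("audio_url", ...), media_part("image_url", ...))` reproduces the mixed audio and image message above. Whether a checkpoint accepts a given part type still depends on the modalities listed for its family.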

Gemma 3n supports dynamic model slicing. Use this when you want one checkpoint to cover several memory and latency budgets:

mistralrs run -m google/gemma-3n-E4B-it \
--matformer-config-path matformer_configs/gemma3n.csv \
--matformer-slice-name "Config for E2.49B (block-level)"

The same fields exist in the Python selector as matformer_config_path and matformer_slice_name. Without a slice, the default configuration loads.

The bundled matformer_configs/gemma3n.csv includes the full E4B configuration, the official E2B slice, and intermediate E1.96B-E3.79B slices. Use the full configuration for quality and smaller slices for constrained devices.
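In the Python SDK, slice selection might look like the sketch below. It assumes that `matformer_config_path` and `matformer_slice_name` are keyword arguments of `Which.MultimodalPlain`, as the field names above suggest. The helper names `gemma3n_selector_kwargs` and `build_gemma3n_runner` are hypothetical. The `mistralrs` import is deferred so the arguments can be inspected without the package installed.

```python
from typing import Optional


def gemma3n_selector_kwargs(
    slice_name: Optional[str] = None,
    config_path: str = "matformer_configs/gemma3n.csv",
) -> dict:
    """Collect selector arguments; with no slice, only the model ID is set."""
    kwargs = {"model_id": "google/gemma-3n-E4B-it"}
    if slice_name is not None:
        kwargs["matformer_config_path"] = config_path
        kwargs["matformer_slice_name"] = slice_name
    return kwargs


def build_gemma3n_runner(slice_name: Optional[str] = None):
    """Create a Runner for Gemma 3n, optionally restricted to one MatFormer slice."""
    # Deferred import: requires the mistralrs Python package.
    from mistralrs import MultimodalArchitecture, Runner, Which

    return Runner(
        which=Which.MultimodalPlain(
            arch=MultimodalArchitecture.Gemma3n,
            **gemma3n_selector_kwargs(slice_name),
        ),
    )
```

Calling `build_gemma3n_runner("Config for E2.49B (block-level)")` would mirror the CLI command above, while `build_gemma3n_runner()` loads the default configuration.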

Mistral Small 3 checkpoints can do tool calling, but some model repos do not ship the correct chat template. Use the bundled template when you need tools:

mistralrs serve -p 1234 --quant 4 \
--jinja-explicit chat_templates/mistral_small_tool_call.jinja \
-m mistralai/Mistral-Small-3.2-24B-Instruct-2506
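With the server started as above, a tool-calling request could be sent through the OpenAI client roughly as follows. This is a sketch under assumptions: the `get_weather` tool is hypothetical, the request uses the standard OpenAI `tools` schema, and the base URL matches the port in the command above.

```python
def weather_tool() -> dict:
    """A hypothetical function tool described in the OpenAI tools schema."""
    return {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }


def ask_with_tools(question: str, base_url: str = "http://localhost:1234/v1/"):
    """Send one chat request with the tool attached and return any tool calls."""
    # Deferred import: requires the openai package and a running server.
    from openai import OpenAI

    client = OpenAI(api_key="EMPTY", base_url=base_url)
    completion = client.chat.completions.create(
        model="default",
        messages=[{"role": "user", "content": question}],
        tools=[weather_tool()],
        tool_choice="auto",
        max_tokens=256,
    )
    # tool_calls is None when the model answers in plain text instead.
    return completion.choices[0].message.tool_calls
```

If the model requests a tool, each returned entry carries the function name and its JSON-encoded arguments, which the caller runs and sends back in a follow-up `tool` message.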

Mistral-backed LLaVA checkpoints usually work with the default template. Vicuna-backed checkpoints need the Vicuna template:

mistralrs run -m llava-hf/llava-v1.6-vicuna-7b-hf \
--quant 4 \
-c ./chat_templates/vicuna.json \
--image photo.jpg \
-i "Describe this image"

For most multimodal models, the text backbone contains most parameters. Device mapping and topology mainly apply to that text portion; the vision, audio, or video encoder stays on its supported device path.

For MoE Qwen3-VL and Qwen3.5 variants, combine ISQ with MoQE when expert memory dominates. The command uses --isq rather than --quant because MoQE is an explicit runtime ISQ layout:

mistralrs run -m Qwen/Qwen3-VL-235B-A22B-Instruct \
--isq 4 \
--isq-organization moqe \
--image photo.jpg \
-i "Describe this image"

The same setting is organization=IsqOrganization.MoQE in Which.MultimodalPlain(...).

Long-form SDK examples live in the repository so they can stay checked against the current APIs: