The multimodal pipeline
Multimodal requests carry non-text content parts (image_url, audio, video). Each part goes through a per-modality preprocessing and encoder path before the tokens reach the transformer.
Request shape
Section titled “Request shape”Content parts are ordered in the request body. The engine preserves that order when it interleaves media tokens with text tokens, so text surrounding an image appears on either side of the image tokens in the final token sequence. The transformer sees one uniform token stream.
Image path
Section titled “Image path”- Decode. The URL form (
file://,http(s)://,data:image/...;base64,) is resolved to a pixel buffer. HTTP URLs trigger a fetch. - Preprocess. Model-specific: resize to the vision encoder’s expected resolution, normalize per-channel, tensorize.
- Encode. The vision encoder produces a sequence of patch embeddings.
- Project. A learned projection maps patch embeddings to the transformer’s hidden dimension.
- Place. The projected tokens are inserted at the position corresponding to the content part in the user’s request.
Multiple images in one request are encoded as a batch.
Video path
Section titled “Video path”Video is decoded to frames before model preprocessing. Each selected frame then flows through the image path.
Supported containers and FFmpeg installation are covered in Set up video input.
Audio path
Section titled “Audio path”- Decode. The audio file is decoded to PCM at the model’s expected sample rate. FFmpeg handles non-native formats.
- Feature extraction. Mel-spectrogram or similar.
- Encode. A model-specific audio encoder produces a sequence of vectors.
- Project and place. As with images.
Encoder cache
Section titled “Encoder cache”Encoder outputs are cached by content hash and modality. When the same image, video, or audio clip appears in a later request, or in a later turn of the same session, the encoder pass is skipped and the cached tokens are reused.
The modality is part of the key: identical bytes processed as an image versus as a video frame can yield different token counts, and the cache keeps them separate.
The cache is LRU with a fixed capacity per model. Hit and miss counters are exposed for observability.
See also
Section titled “See also”- Guide: work with vision and video input.
- Setup: set up video input.
- Reference: supported models.