Use embedding models
Embedding models map text to dense vectors for semantic search, reranking, clustering, and downstream retrieval. mistral.rs serves embeddings through the standard OpenAI POST /v1/embeddings endpoint, so any tool that already targets that endpoint (LangChain, LlamaIndex, vector stores) works unchanged.
Loading an embedding model
Two regularly tested options:
- google/embeddinggemma-300m: 300M parameters, 768-dim vectors. Good general-purpose default.
- Qwen/Qwen3-Embedding-0.6B: 600M parameters, 1024-dim vectors.

```bash
mistralrs serve -m google/embeddinggemma-300m
```
Use Qwen/Qwen3-Embedding-0.6B the same way:
```bash
mistralrs serve -m Qwen/Qwen3-Embedding-0.6B
```
Requesting an embedding
```bash
curl http://localhost:1234/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "input": "The cat sat on the mat."
  }'
```
The response includes the vector in embedding:
```json
{
  "object": "list",
  "data": [{
    "object": "embedding",
    "index": 0,
    "embedding": [0.123, -0.456, 0.789, ...]
  }],
  "model": "default",
  "usage": {"prompt_tokens": 7, "total_tokens": 7}
}
```
From Python:
```python
from mistralrs import EmbeddingArchitecture, EmbeddingRequest, Runner, Which

runner = Runner(
    which=Which.Embedding(
        model_id="google/embeddinggemma-300m",
        arch=EmbeddingArchitecture.EmbeddingGemma,
    )
)

embeddings = runner.send_embedding_request(
    EmbeddingRequest(
        input=[
            "task: search result | query: What is graphene?",
            "task: search result | query: What is an apple?",
        ],
        truncate_sequence=True,
    )
)
print(len(embeddings), len(embeddings[0]))
```
For Qwen3-Embedding, set arch=EmbeddingArchitecture.Qwen3Embedding and drop the EmbeddingGemma prompt prefixes.
From Rust:
```rust
use mistralrs::{EmbeddingModelBuilder, EmbeddingRequest};

// Run inside an async function that returns a Result, since the calls use `.await?`.
let model = EmbeddingModelBuilder::new("google/embeddinggemma-300m")
    .build()
    .await?;

let embeddings = model
    .generate_embeddings(
        EmbeddingRequest::builder()
            .add_prompt("task: search result | query: What is graphene?"),
    )
    .await?;
println!("{:?}", embeddings.first());
```
Batching
Pass a list of strings to embed many at once:
```bash
curl http://localhost:1234/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "input": [
      "The cat sat on the mat.",
      "A dog chased a squirrel.",
      "A raven croaked from the fencepost."
    ]
  }'
```
The data array has one entry per input, in input order.
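The batch request can also be sketched in Python with only the standard library. The helper names here (`build_payload`, `vectors_in_input_order`, `embed_batch`) are hypothetical, and the URL assumes the server from the curl examples is listening on localhost:1234. Sorting by `index` is a defensive step on the client side, not something the endpoint requires.

```python
import json
import urllib.request

# Assumption: server address taken from the curl examples in this guide.
EMBEDDINGS_URL = "http://localhost:1234/v1/embeddings"


def build_payload(inputs, model="default"):
    """Build the JSON body for a batch embedding request."""
    return {"model": model, "input": list(inputs)}


def vectors_in_input_order(response):
    """Return one vector per input, ordered by each entry's `index` field."""
    entries = sorted(response["data"], key=lambda entry: entry["index"])
    return [entry["embedding"] for entry in entries]


def embed_batch(inputs, url=EMBEDDINGS_URL):
    """POST a batch of strings and return their vectors (needs a running server)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(inputs)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return vectors_in_input_order(json.load(reply))
```

With a server running, `embed_batch(["The cat sat on the mat.", "A dog chased a squirrel."])` would return two vectors, the first belonging to the first string.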
Request fields
Beyond model and input:
- encoding_format: "float" (default) returns float arrays. "base64" returns each vector as base64-encoded little-endian f32 bytes. This matches OpenAI's compact form.
- dimensions: not supported. Passing it returns a validation error rather than silently truncating.
- truncate_sequence: mistral.rs extension. true truncates inputs that exceed the model's context length instead of erroring.
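Decoding the "base64" form on the client takes only a few lines. The sketch below assumes that in base64 mode the `embedding` field holds a single base64 string, as in OpenAI's API; `decode_base64_embedding` is a hypothetical helper name, and the sample values are made up rather than model output.

```python
import base64
import struct


def decode_base64_embedding(encoded):
    """Decode base64 text holding little-endian f32 values into a list of floats."""
    raw = base64.b64decode(encoded)
    if len(raw) % 4 != 0:
        raise ValueError("byte length is not a multiple of 4")
    count = len(raw) // 4
    # "<" selects little-endian byte order, "f" a 4-byte float.
    return list(struct.unpack(f"<{count}f", raw))


# Round trip with made-up values that are exactly representable in f32.
sample = base64.b64encode(struct.pack("<3f", 0.5, -1.25, 2.0)).decode("ascii")
print(decode_base64_embedding(sample))  # [0.5, -1.25, 2.0]
```

Base64 costs roughly 5.3 characters per dimension regardless of the value, which is usually shorter than the decimal text of a float array.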
Normalization
Whether vectors are L2-normalized is model-dependent: mistral.rs applies the modules the model ships in modules.json. EmbeddingGemma includes a Normalize module, so its output is already unit-norm; Qwen3-Embedding does not, so its vectors are returned unnormalized. If you compute cosine similarity as a raw dot product and your model does not normalize, normalize on the client first.
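To normalize in Python:
```python
import numpy as np

# `response` is the parsed JSON body returned by /v1/embeddings.
v = np.array(response["data"][0]["embedding"])
v_normalized = v / np.linalg.norm(v)
```
Many vector stores offer a cosine metric that accounts for vector length, such as pgvector's cosine distance operator. With a plain inner-product index, such as FAISS IndexFlatIP, normalize the vectors yourself before adding them.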
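To see why normalization matters for dot-product scoring, here is a small dependency-free sketch. The function names are hypothetical and the vectors are toy values, not model output: they point in the same direction but differ in length.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity computed as the dot product of unit vectors."""
    return dot(l2_normalize(a), l2_normalize(b))


# Toy vectors: same direction, different lengths.
a = [3.0, 4.0]
b = [6.0, 8.0]
print(dot(a, b))                # 50.0, grows with vector length
print(cosine_similarity(a, b))  # close to 1.0, depends only on direction
```

The raw dot product rewards longer vectors, so ranking by it on unnormalized output can differ from ranking by cosine similarity.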
EmbeddingGemma prompts
EmbeddingGemma works best when the input is prefixed for the task:
| Use case | Prompt form |
|---|---|
| Retrieval query | task: search result \| query: <query> |
| Retrieval document | title: <title or none> \| text: <document> |
| Question answering | task: question answering \| query: <question> |
| Fact verification | task: fact checking \| query: <claim> |
| Classification | task: classification \| query: <text> |
| Clustering | task: clustering \| query: <text> |
| Semantic similarity | task: sentence similarity \| query: <text> |
| Code retrieval | task: code retrieval \| query: <query> |
Example:
```bash
curl http://localhost:1234/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "input": [
      "task: search result | query: What is graphene?",
      "title: none | text: Graphene is a single layer of carbon atoms."
    ]
  }'
```
Qwen3-Embedding does not require these prefixes, but task-specific prefixes can still help keep a retrieval system consistent.
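Building the prefixes in one place helps keep queries and documents consistent. A minimal sketch follows; `format_query` and `format_document` are hypothetical helper names, and the prefix strings follow the prompt table in this section.

```python
def format_query(text, task="search result"):
    """Prefix a query-side input with an EmbeddingGemma task name."""
    return f"task: {task} | query: {text}"


def format_document(text, title=None):
    """Prefix a document-side input; a missing title becomes 'none'."""
    return f"title: {title or 'none'} | text: {text}"


print(format_query("What is graphene?"))
# task: search result | query: What is graphene?
print(format_document("Graphene is a single layer of carbon atoms."))
# title: none | text: Graphene is a single layer of carbon atoms.
```

Other rows of the table map to the `task` argument, for example `format_query(claim, task="fact checking")`.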
Using the vectors
Standard pipeline:
1. Embed a corpus offline and store vectors in a vector database (FAISS, Qdrant, pgvector, Pinecone).
2. At query time, embed the query with the same model.
3. Search the store for nearest neighbors.
4. Optionally rerank top results with a reranker.
5. Feed retrieved documents as language model context.
mistral.rs handles steps 1, 2, and (with a reranker) 4. The rest is the vector store and application logic. The web search guide covers using embeddings to rerank search results within an agent.
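For a small corpus, the nearest-neighbor search can be sketched as a brute-force scan before reaching for a vector database. In this illustration an in-memory dict stands in for the store, `cosine` and `top_k` are hypothetical helper names, and the 3-dimensional vectors are toy values rather than real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, corpus, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine(query_vector, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for stored document embeddings.
corpus = {
    "graphene": [0.9, 0.1, 0.0],
    "apple": [0.0, 0.2, 0.9],
    "carbon": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]  # stand-in for the embedded query
print([doc_id for doc_id, _ in top_k(query, corpus)])  # ['graphene', 'carbon']
```

A scan like this is linear in corpus size, which is the cost that dedicated vector stores avoid with approximate indexes.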