Use embedding models

Embedding models map text to dense vectors for semantic search, reranking, clustering, and downstream retrieval. mistral.rs serves embeddings through an OpenAI-compatible POST /v1/embeddings endpoint, so any tool that already targets that endpoint (LangChain, LlamaIndex, vector stores) works once it is pointed at the server's base URL.

Loading an embedding model

Two regularly tested options:

google/embeddinggemma-300m: 300M parameters, 768-dim vectors. Good general-purpose default.
Qwen/Qwen3-Embedding-0.6B: 600M parameters, 1024-dim vectors.

mistralrs serve -m google/embeddinggemma-300m

Use Qwen/Qwen3-Embedding-0.6B the same way:

mistralrs serve -m Qwen/Qwen3-Embedding-0.6B

Requesting an embedding

curl http://localhost:1234/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "input": "The cat sat on the mat."
  }'

The response returns the vector in the embedding field of each data entry:

{
  "object": "list",
  "data": [{
    "object": "embedding",
    "index": 0,
    "embedding": [0.123, -0.456, 0.789, ...]
  }],
  "model": "default",
  "usage": {"prompt_tokens": 7, "total_tokens": 7}
}
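
The same call can be sketched in Python with only the standard library. This is a minimal illustration, not part of mistral.rs: the helper names build_payload, parse_vectors, and embed are hypothetical, and the base URL is simply the one used in the curl example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234"  # assumed: same address as the curl example


def build_payload(texts, model="default"):
    """Build the JSON body for POST /v1/embeddings."""
    return {"model": model, "input": texts}


def parse_vectors(response):
    """Return the vectors from a response, ordered by their index field."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts, base_url=BASE_URL):
    """Send the request to a running server and return one vector per input."""
    body = json.dumps(build_payload(texts)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return parse_vectors(json.load(reply))
```

With a server running, embed(["The cat sat on the mat."]) would return a list containing one vector.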

The mistralrs Python package can run the same request without going through HTTP:

from mistralrs import EmbeddingArchitecture, EmbeddingRequest, Runner, Which

runner = Runner(
    which=Which.Embedding(
        model_id="google/embeddinggemma-300m",
        arch=EmbeddingArchitecture.EmbeddingGemma,
    )
)

embeddings = runner.send_embedding_request(
    EmbeddingRequest(
        input=[
            "task: search result | query: What is graphene?",
            "task: search result | query: What is an apple?",
        ],
        truncate_sequence=True,
    )
)
print(len(embeddings), len(embeddings[0]))

For Qwen3-Embedding, set arch=EmbeddingArchitecture.Qwen3Embedding and drop the EmbeddingGemma prompt prefixes.

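The Rust SDK exposes the same operation:

use anyhow::Result;
use mistralrs::{EmbeddingModelBuilder, EmbeddingRequest};

// Assumes the tokio and anyhow crates are available alongside mistralrs.
#[tokio::main]
async fn main() -> Result<()> {
    // Load the embedding model.
    let model = EmbeddingModelBuilder::new("google/embeddinggemma-300m")
        .build()
        .await?;

    // Embed one prompt that uses the EmbeddingGemma retrieval-query prefix.
    let embeddings = model
        .generate_embeddings(
            EmbeddingRequest::builder()
                .add_prompt("task: search result | query: What is graphene?"),
        )
        .await?;
    println!("{:?}", embeddings.first());

    Ok(())
}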

Batching

Pass a list of strings to embed many at once:

curl http://localhost:1234/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "input": [
      "The cat sat on the mat.",
      "A dog chased a squirrel.",
      "A raven croaked from the fencepost."
    ]
  }'

The data array has one entry per input in input order.
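
For a corpus too large to send in one request, a client can split it into batches and match the returned vectors back to their inputs. The sketch below is one way to do that; the helper names and the batch size of 32 are arbitrary assumptions, not values recommended by mistral.rs.

```python
def chunk(texts, batch_size=32):
    """Yield successive slices of texts, each at most batch_size long."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def pair_with_inputs(texts, response):
    """Return (text, vector) pairs, matching on each entry's index field."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [(texts[item["index"]], item["embedding"]) for item in items]
```

Each batch produced by chunk is sent as the input list of one request, and pair_with_inputs is applied to that batch and its response.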

Request fields

Beyond model and input:

encoding_format: "float" (default) returns float arrays. "base64" returns each vector as base64-encoded little-endian f32 bytes. This matches OpenAI's compact form.
dimensions: not supported. Passing it returns a validation error rather than silently truncating.
truncate_sequence: mistral.rs extension. true truncates inputs that exceed the model's context length instead of erroring.
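
When encoding_format is "base64", the client has to turn the bytes back into floats. A minimal sketch, assuming the embedding field then holds the base64 string itself (as it does in OpenAI's API); decode_embedding is a hypothetical helper name.

```python
import base64
import struct


def decode_embedding(b64_string):
    """Decode base64-encoded little-endian f32 bytes into a list of floats."""
    raw = base64.b64decode(b64_string)
    count = len(raw) // 4  # each f32 occupies 4 bytes
    return list(struct.unpack(f"<{count}f", raw))
```

The "<" in the format string forces little-endian byte order regardless of the client machine's native order.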

Normalization

Whether vectors are L2-normalized is model-dependent: mistral.rs applies the modules the model ships in modules.json. EmbeddingGemma includes a Normalize module, so its output is already unit-norm; Qwen3-Embedding does not, so its vectors are returned unnormalized. If you compute cosine similarity as a raw dot product and your model does not normalize, normalize on the client first.

To normalize in Python:

import numpy as np

v = np.array(response["data"][0]["embedding"])
v_normalized = v / np.linalg.norm(v)

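Some vector stores make this unnecessary: pgvector's cosine distance operator (<=>) accounts for vector length itself. Others do not: a FAISS inner-product index scores raw dot products, so normalize vectors before adding them if you want cosine similarity.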
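
If it is unclear whether a given model normalizes, checking a returned vector's length settles it. The sketch below uses only the standard library; the function names and the tolerance of 1e-3 are illustrative assumptions. Its cosine_similarity divides by both norms, so it gives the same answer whether or not the model normalized.

```python
import math


def l2_norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def is_unit_norm(vector, tolerance=1e-3):
    """True if the vector's length is within tolerance of 1."""
    return abs(l2_norm(vector) - 1.0) <= tolerance


def cosine_similarity(a, b):
    """Cosine similarity that does not assume normalized inputs."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (l2_norm(a) * l2_norm(b))
```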

EmbeddingGemma prompts

EmbeddingGemma works best when the input is prefixed for the task:

| Use case | Prompt form |
|---|---|
| Retrieval query | task: search result \| query: <query> |
| Retrieval document | title: <title or none> \| text: <document> |
| Question answering | task: question answering \| query: <question> |
| Fact verification | task: fact checking \| query: <claim> |
| Classification | task: classification \| query: <text> |
| Clustering | task: clustering \| query: <text> |
| Semantic similarity | task: sentence similarity \| query: <text> |
| Code retrieval | task: code retrieval \| query: <query> |

Example:

curl http://localhost:1234/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "input": [
      "task: search result | query: What is graphene?",
      "title: none | text: Graphene is a single layer of carbon atoms."
    ]
  }'

Qwen3-Embedding does not require these prefixes, but task-specific prefixes can still help keep a retrieval system consistent.
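
Building the prefixes in one place keeps queries and documents consistent across an application. A small sketch of such helpers follows; the function names and the use_case keys are hypothetical, while the prefix strings are taken from the table above.

```python
# Task names used in the "task: ... | query: ..." prompt form.
QUERY_TASKS = {
    "retrieval": "search result",
    "question_answering": "question answering",
    "fact_verification": "fact checking",
    "classification": "classification",
    "clustering": "clustering",
    "similarity": "sentence similarity",
    "code_retrieval": "code retrieval",
}


def format_query(text, use_case="retrieval"):
    """Prefix a query-side input for the given use case."""
    return f"task: {QUERY_TASKS[use_case]} | query: {text}"


def format_document(text, title=None):
    """Prefix a document-side input; a missing title becomes 'none'."""
    return f"title: {title or 'none'} | text: {text}"
```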

Using the vectors

Standard pipeline:

1. Embed a corpus offline and store vectors in a vector database (FAISS, Qdrant, pgvector, Pinecone).
2. At query time, embed the query with the same model.
3. Search the store for nearest neighbors.
4. Optionally rerank top results with a reranker.
5. Feed retrieved documents as language model context.

mistral.rs handles the embedding work in steps 1 and 2 and, with a reranker, step 4. Storage, search, and the final prompt are left to the vector store and application logic. The web search guide covers using embeddings to rerank search results within an agent.
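
To make the search step concrete, here is a brute-force nearest-neighbor sketch over an in-memory list. It stands in for a real vector store and assumes the vectors are already unit-norm, so the dot product equals cosine similarity; the toy two-dimensional vectors and the function names are illustrative only.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vector, corpus_vectors, k=3):
    """Return (index, score) pairs for the k highest dot products."""
    scored = [(i, dot(query_vector, v)) for i, v in enumerate(corpus_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy unit-norm vectors standing in for real embeddings.
corpus = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query = [0.8, 0.6]
print(top_k(query, corpus, k=2))
```

A linear scan like this is fine for a few thousand vectors; larger corpora are where the approximate indexes in a dedicated vector store pay off.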