
Use embedding models

Embedding models map text to dense vectors for semantic search, reranking, clustering, and downstream retrieval. mistral.rs serves embeddings through the standard OpenAI POST /v1/embeddings endpoint, so any tool that already targets that endpoint (LangChain, LlamaIndex, vector stores) works unchanged.

Two regularly tested options:

  • google/embeddinggemma-300m: 300M parameters, 768-dim vectors. Good general-purpose default.
  • Qwen/Qwen3-Embedding-0.6B: 600M parameters, 1024-dim vectors.

To serve EmbeddingGemma:

mistralrs serve -m google/embeddinggemma-300m

Use Qwen/Qwen3-Embedding-0.6B the same way:

mistralrs serve -m Qwen/Qwen3-Embedding-0.6B

With the server running, request an embedding for a single string:

curl http://localhost:1234/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"input": "The cat sat on the mat."
}'

The response returns the vector in the embedding field of each data entry:

{
"object": "list",
"data": [{
"object": "embedding",
"index": 0,
"embedding": [0.123, -0.456, 0.789, ...]
}],
"model": "default",
"usage": {"prompt_tokens": 7, "total_tokens": 7}
}

Pass a list of strings to embed many at once:

curl http://localhost:1234/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"input": [
"The cat sat on the mat.",
"A dog chased a squirrel.",
"A raven croaked from the fencepost."
]
}'

The data array has one entry per input in input order.
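
The same requests can be issued from Python. The following is a minimal sketch using only the standard library; `BASE_URL` and the helper names (`build_embedding_request`, `extract_vectors`, `embed`) are illustrative and not part of mistral.rs, and the port is assumed to match the curl examples. Sorting by `index` is a defensive choice, since the server already returns entries in input order.

```python
import json
import urllib.request

# Assumption: the server from the curl examples is listening on localhost:1234.
BASE_URL = "http://localhost:1234/v1"


def build_embedding_request(inputs, model="default"):
    """Build (but do not send) a POST request for /v1/embeddings.

    `inputs` may be a single string or a list of strings.
    """
    body = json.dumps({"model": model, "input": inputs}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_vectors(response):
    """Return the vectors from a parsed response, ordered by their index field."""
    entries = sorted(response["data"], key=lambda entry: entry["index"])
    return [entry["embedding"] for entry in entries]


def embed(inputs, model="default"):
    """Send the request and return one vector per data entry."""
    with urllib.request.urlopen(build_embedding_request(inputs, model)) as resp:
        return extract_vectors(json.load(resp))
```

Calling `embed(["The cat sat on the mat.", "A dog chased a squirrel."])` against a running server would return two vectors, one per input.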

Beyond model and input, the endpoint accepts these request fields:

  • encoding_format: "float" (default) returns float arrays. "base64" returns each vector as base64-encoded little-endian f32 bytes. This matches OpenAI's compact form.
  • dimensions: not supported. Passing it returns a validation error rather than silently truncating.
  • truncate_sequence: mistral.rs extension. true truncates inputs that exceed the model's context length instead of erroring.
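
When requesting `"encoding_format": "base64"`, the client has to turn the bytes back into floats. The sketch below decodes base64-encoded little-endian f32 bytes with the standard library. It assumes the base64 string arrives in each entry's `embedding` field, as in OpenAI's API; the function name `decode_base64_embedding` is hypothetical.

```python
import base64
import struct


def decode_base64_embedding(encoded):
    """Decode base64 text holding little-endian f32 values into a list of floats."""
    raw = base64.b64decode(encoded)
    if len(raw) % 4 != 0:
        raise ValueError("byte length is not a multiple of 4 (one f32 = 4 bytes)")
    count = len(raw) // 4
    # "<" forces little-endian regardless of host byte order; "f" is a 32-bit float.
    return list(struct.unpack(f"<{count}f", raw))


# Round trip with made-up values, standing in for a server response.
example = base64.b64encode(struct.pack("<3f", 0.5, -1.0, 2.0)).decode("ascii")
vector = decode_base64_embedding(example)
```

The base64 form is smaller on the wire than a JSON array of decimal floats, which matters mostly for large batches or high-dimensional models.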

Whether vectors are L2-normalized is model-dependent: mistral.rs applies the modules the model ships in modules.json. EmbeddingGemma includes a Normalize module, so its output is already unit-norm; Qwen3-Embedding does not, so its vectors are returned unnormalized. If you compute cosine similarity as a raw dot product and your model does not normalize, normalize on the client first.

To normalize in Python:

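import numpy as np

# response: the parsed JSON body of a /v1/embeddings call (a dict)
v = np.array(response["data"][0]["embedding"], dtype=np.float32)
v_normalized = v / np.linalg.norm(v)  # scale to unit L2 norm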

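Some vector stores make client-side normalization unnecessary: pgvector's cosine distance operator (<=>) divides by the vector norms itself. Others do not: FAISS inner-product indexes only give cosine similarity if you normalize the vectors first (for example with faiss.normalize_L2).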
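
Because normalization depends on the model, a client can check the norm instead of assuming it. The sketch below uses only the standard library. The helper names and the tolerance of 1e-3 are illustrative choices, not values taken from mistral.rs.

```python
import math


def l2_norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def is_unit_norm(v, tol=1e-3):
    """True if the vector is already (approximately) L2-normalized."""
    return abs(l2_norm(v) - 1.0) <= tol


def cosine_similarity(a, b):
    """Cosine similarity that is correct whether or not inputs are normalized."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (l2_norm(a) * l2_norm(b))
```

For unit-norm vectors `cosine_similarity` reduces to the plain dot product, so the division can be skipped once `is_unit_norm` has confirmed the model's output is normalized.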

EmbeddingGemma works best when the input is prefixed for the task:

| Use case | Prompt form |
|---|---|
| Retrieval query | task: search result \| query: <query> |
| Retrieval document | title: <title or none> \| text: <document> |
| Question answering | task: question answering \| query: <question> |
| Fact verification | task: fact checking \| query: <claim> |
| Classification | task: classification \| query: <text> |
| Clustering | task: clustering \| query: <text> |
| Semantic similarity | task: sentence similarity \| query: <text> |
| Code retrieval | task: code retrieval \| query: <query> |

Example:

curl http://localhost:1234/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"input": [
"task: search result | query: What is graphene?",
"title: none | text: Graphene is a single layer of carbon atoms."
]
}'
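
Building these strings by hand invites typos, so a pair of small helpers can keep the prefixes consistent between indexing and querying. This is a sketch of the prompt forms in the table above; the function names `gemma_query` and `gemma_document` are hypothetical.

```python
def gemma_query(text, task="search result"):
    """Format a query-side input, e.g. 'task: search result | query: ...'."""
    return f"task: {task} | query: {text}"


def gemma_document(text, title=None):
    """Format a document-side input; a missing title becomes the literal 'none'."""
    return f"title: {title or 'none'} | text: {text}"


inputs = [
    gemma_query("What is graphene?"),
    gemma_document("Graphene is a single layer of carbon atoms."),
]
```

The resulting `inputs` list holds the same two strings as the curl example and can be sent as the `input` field of a request.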

Qwen3-Embedding does not require these prefixes, but task-specific prefixes can still help keep a retrieval system consistent.

Standard pipeline:

  1. Embed a corpus offline and store vectors in a vector database (FAISS, Qdrant, pgvector, Pinecone).
  2. At query time, embed the query with the same model.
  3. Search the store for nearest neighbors.
  4. Optionally rerank top results with a reranker.
  5. Feed retrieved documents as language model context.

mistral.rs handles the embedding calls in steps 1 and 2 and, with a reranker, step 4. Storage, nearest-neighbor search, and prompt assembly belong to the vector store and application logic. The web search guide covers using embeddings to rerank search results within an agent.
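
To make the division of labor concrete, here is a toy version of steps 1, 3, and 5 with the vector store replaced by a brute-force in-memory index. It is a sketch, not a substitute for FAISS, Qdrant, pgvector, or Pinecone. The class `InMemoryIndex` and the function `build_context` are hypothetical, and the vectors in the usage lines are made up rather than produced by a model.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


class InMemoryIndex:
    """Toy stand-in for a vector store: brute-force cosine search."""

    def __init__(self):
        self.items = []  # list of (document, unit vector) pairs

    def add(self, document, vector):
        # Step 1: store each corpus vector (normalized once, at insert time).
        self.items.append((document, normalize(vector)))

    def search(self, query_vector, k=3):
        # Step 3: score every stored vector and keep the k most similar.
        q = normalize(query_vector)
        scored = [(sum(a * b for a, b in zip(q, v)), doc) for doc, v in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


def build_context(hits):
    """Step 5: join retrieved documents into a block of language model context."""
    return "\n\n".join(doc for _score, doc in hits)


# Made-up 2-dimensional vectors; real ones would come from /v1/embeddings.
index = InMemoryIndex()
index.add("cats", [1.0, 0.0])
index.add("dogs", [0.0, 1.0])
index.add("kittens", [0.9, 0.1])
hits = index.search([1.0, 0.0], k=2)
context = build_context(hits)
```

In a real system the vectors passed to `add` and `search` must come from the same embedding model, and the brute-force scan would be replaced by the store's own nearest-neighbor search.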