EmbeddingGemma

EmbeddingGemma was the first embedding model supported by mistral.rs. This guide walks through serving the model via the OpenAI-compatible HTTP server, running it from Python, and embedding text directly in Rust.

For a catalog of available embedding models and general usage tips, see EMBEDDINGS.md.

Prompt instructions

EmbeddingGemma can generate optimized embeddings for various use cases (such as document retrieval, question answering, and fact verification) or for specific input types (a query or a document) using prompts that are prepended to the input strings; a small prompt-building sketch follows the table below.

  • Query prompts follow the form task: {task description} | query: where the task description varies by the use case, with the default task description being search result.
  • Document-style prompts follow the form title: {title | "none"} | text: where the title is either none (the default) or the actual title of the document. Note that providing a title, if available, will improve model performance for document prompts but may require manual formatting.
Use cases (task type enum), descriptions, and recommended prompts:

  • Retrieval (Query): embeddings optimized for document search or information retrieval. Prompt: task: search result | query: {content}
  • Retrieval (Document): embeddings optimized for document search or information retrieval, on the document side. Prompt: title: {title | "none"} | text: {content}
  • Question Answering: embeddings optimized for answering natural language questions. Prompt: task: question answering | query: {content}
  • Fact Verification: embeddings optimized for verifying factual correctness. Prompt: task: fact checking | query: {content}
  • Classification: embeddings optimized for classifying texts according to preset labels. Prompt: task: classification | query: {content}
  • Clustering: embeddings optimized for clustering texts based on their similarities. Prompt: task: clustering | query: {content}
  • Semantic Similarity: embeddings optimized for assessing text similarity; not intended for retrieval use cases. Prompt: task: sentence similarity | query: {content}
  • Code Retrieval: retrieves a code block based on a natural language query, such as "sort an array" or "reverse a linked list"; embeddings of the code blocks themselves are computed using retrieval_document. Prompt: task: code retrieval | query: {content}
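
For illustration, here is a small Python helper (the function names are hypothetical, not part of any mistral.rs API) that builds these prompt strings before the text is sent to the model:

def query_prompt(content: str, task: str = "search result") -> str:
    # Query-side prompt: "task: {task description} | query: {content}"
    return f"task: {task} | query: {content}"

def document_prompt(content: str, title: str = "none") -> str:
    # Document-side prompt: "title: {title | 'none'} | text: {content}"
    return f"title: {title} | text: {content}"

print(query_prompt("What is graphene?"))
# task: search result | query: What is graphene?
print(document_prompt("Graphene is a single layer of carbon atoms.", title="Graphene"))
# title: Graphene | text: Graphene is a single layer of carbon atoms.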

HTTP server

Launch the server in embedding mode to expose an OpenAI-compatible /v1/embeddings endpoint:

mistralrs serve -p 1234 -m google/embeddinggemma-300m

Once running, call the endpoint with an OpenAI client or raw curl:

curl http://localhost:1234/v1/embeddings \
  -H "Authorization: Bearer EMPTY" \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "input": ["task: search result | query: What is graphene?", "task: search result | query: What is an apple?"]}'

An example with the OpenAI client can be found here.
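
As a minimal sketch, the openai Python package can also be pointed at the local server; the base URL, port, and placeholder API key below mirror the curl call above:

from openai import OpenAI

# The server does not validate the key; "EMPTY" matches the curl example.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")

resp = client.embeddings.create(
    model="default",
    input=[
        "task: search result | query: What is graphene?",
        "task: search result | query: What is an apple?",
    ],
)

# One embedding per input, in the same order.
print(len(resp.data), len(resp.data[0].embedding))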

By default the server registers the model as default. To expose it under a custom name or alongside chat models, run in multi-model mode and assign an identifier in the selector configuration:

{
  "embed-gemma": {
    "Embedding": {
      "model_id": "google/embeddinggemma-300m",
      "arch": "embeddinggemma"
    }
  }
}
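
With that configuration, requests target the custom identifier instead of default. A rough sketch using the requests package (the embed-gemma name assumes the config above, and the response layout follows the OpenAI embeddings schema):

import requests

resp = requests.post(
    "http://localhost:1234/v1/embeddings",
    headers={"Authorization": "Bearer EMPTY"},
    json={
        "model": "embed-gemma",
        "input": ["task: search result | query: What is graphene?"],
    },
)

# OpenAI-style response: "data" is a list of {"object", "index", "embedding"} entries.
vector = resp.json()["data"][0]["embedding"]
print(len(vector))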

See docs/HTTP.md for the full request schema and response layout.

Python SDK

Instantiate Runner with the Which.Embedding selector and request EmbeddingGemma explicitly. The helper method send_embedding_request returns batched embeddings as Python lists.

from mistralrs import EmbeddingArchitecture, EmbeddingRequest, Runner, Which

runner = Runner(
    which=Which.Embedding(
        model_id="google/embeddinggemma-300m",
        arch=EmbeddingArchitecture.EmbeddingGemma,
    )
)

# Inputs use the query prompt format; truncate_sequence allows over-length inputs
# to be truncated to the model's maximum sequence length.
request = EmbeddingRequest(
    input=["task: search result | query: What is graphene?", "task: search result | query: What is an apple?"],
    truncate_sequence=True,
)

# One embedding vector (a list of floats) is returned per input string.
embeddings = runner.send_embedding_request(request)
print(len(embeddings), len(embeddings[0]))
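
The returned vectors can be compared directly. For example, a minimal cosine-similarity check over the two embeddings produced above (plain Python, no extra dependencies):

import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine(embeddings[0], embeddings[1]))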

Refer to this example for a complete runnable script.

Rust SDK

Use the EmbeddingModelBuilder helper from the mistralrs crate to create the model and submit an EmbeddingRequest:

use anyhow::Result;
use mistralrs::{EmbeddingModelBuilder, EmbeddingRequest};

#[tokio::main]
async fn main() -> Result<()> {
    // Load EmbeddingGemma (downloaded from the Hugging Face Hub if needed), with logging enabled.
    let model = EmbeddingModelBuilder::new("google/embeddinggemma-300m")
        .with_logging()
        .build()
        .await?;

    // One embedding vector is returned per prompt added to the request.
    let embeddings = model
        .generate_embeddings(
            EmbeddingRequest::builder()
                .add_prompt("task: search result | query: What is graphene?")
        )
        .await?;

    println!("Returned {} vectors", embeddings.len());
    Ok(())
}

This example lives here and can be run with:

cargo run --package mistralrs --example embedding_gemma