This crate provides an asynchronous API to mistral.rs.
To get started loading a model, check out the following builders:

- TextModelBuilder
- LoraModelBuilder
- XLoraModelBuilder
- GgufModelBuilder
- GgufLoraModelBuilder
- GgufXLoraModelBuilder
- VisionModelBuilder
- AnyMoeModelBuilder
Check out the v0_4_api module for concise documentation of this newer API.
§Example
```rust
use anyhow::Result;
use mistralrs::{
    IsqType, PagedAttentionMetaBuilder, TextMessageRole, TextMessages, TextModelBuilder,
};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct".to_string())
        .with_isq(IsqType::Q8_0)
        .with_logging()
        .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(
            TextMessageRole::System,
            "You are an AI agent with a specialty in programming.",
        )
        .add_message(
            TextMessageRole::User,
            "Hello! How are you? Please write a generic binary search function in Rust.",
        );

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}
```
§Streaming example
```rust
use anyhow::Result;
use mistralrs::{
    IsqType, PagedAttentionMetaBuilder, Response, TextMessageRole, TextMessages, TextModelBuilder,
};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct".to_string())
        .with_isq(IsqType::Q8_0)
        .with_logging()
        .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(
            TextMessageRole::System,
            "You are an AI agent with a specialty in programming.",
        )
        .add_message(
            TextMessageRole::User,
            "Hello! How are you? Please write a generic binary search function in Rust.",
        );

    let mut stream = model.stream_chat_request(messages).await?;
    while let Some(chunk) = stream.next().await {
        if let Response::Chunk(chunk) = chunk {
            print!("{}", chunk.choices[0].delta.content);
        }
        // Handle the error cases.
    }
    Ok(())
}
```
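The loop above only handles the Chunk variant; the comment marks where the error variants belong. As a self-contained toy sketch of that control flow (using a local stand-in enum, not the real mistralrs::Response, which has more variants), accumulating deltas and surfacing errors might look like:

```rust
// Toy stand-in for a streaming response enum. This is NOT the real
// mistralrs::Response; it only mirrors the Chunk/error/done shape.
enum ToyResponse {
    Chunk(String),
    ModelError(String),
    Done,
}

// Drain a finished stream: append each delta, stop on Done,
// and turn a model error into an Err.
fn collect(stream: Vec<ToyResponse>) -> Result<String, String> {
    let mut out = String::new();
    for resp in stream {
        match resp {
            ToyResponse::Chunk(delta) => out.push_str(&delta),
            ToyResponse::ModelError(e) => return Err(e),
            ToyResponse::Done => break,
        }
    }
    Ok(out)
}

fn main() {
    let stream = vec![
        ToyResponse::Chunk("Hello".to_string()),
        ToyResponse::Chunk(", world".to_string()),
        ToyResponse::Done,
    ];
    assert_eq!(collect(stream).unwrap(), "Hello, world");
    println!("ok");
}
```

The same match-on-variant structure applies inside the `while let` loop above, with the real variants substituted in.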
Re-exports§
pub use v0_4_api::*;
Modules§
- This will be the API as of v0.4.0. Other APIs will not be deprecated, but moved into a module such as this one.
Structs§
- Chat completion streaming request chunk.
- An OpenAI compatible chat completion response.
- Template for chat models including bos/eos/unk as well as the chat template.
- Chat completion choice.
- Chat completion streaming chunk choice.
- Completion request choice.
- Chat completion streaming chunk choice.
- Completion request choice.
- An OpenAI compatible completion response.
- Delta in content for streaming response.
- Request to detokenize some text.
- Metadata to initialize the device mapper.
- A loader for a vision (non-quantized) model.
- A builder for a loader for a vision (non-quantized) model.
- Config specific to loading a vision model.
- A loader for a GGML model.
- A builder for a GGML loader.
- Config for a GGML loader.
- Loader for a GGUF model.
- A builder for a GGUF loader.
- Config for a GGUF loader.
- NormalLoader for a Gemma model.
- VisionLoader for an Idefics 2 Vision model.
- VisionLoader for an LLaVA Vision model.
- VisionLoader for an LLaVANext Vision model.
- A device mapper which does device mapping per hidden layer.
- NormalLoader for a Llama model.
- A builder for a loader using the selected model.
- All local paths and metadata necessary to load a model.
- Logprobs per token.
- The MistralRs struct handles sending requests to the engine. It is the core multi-threaded component of mistral.rs, and uses mpsc Sender and Receiver primitives to send and receive requests to the engine.
- The MistralRsBuilder takes the pipeline and a scheduler method and constructs an Engine and a MistralRs instance. The Engine runs on a separate thread, and the MistralRs instance stays on the calling thread.
- A loader for a “normal” (non-quantized) model.
- A builder for a loader for a “normal” (non-quantized) model.
- A normal request to the MistralRs.
- Config specific to loading a normal model.
- Adapter model ordering information.
- All memory counts in MB. Default for block size is 32.
- NormalLoader for a Phi 2 model.
- NormalLoader for a Phi 3 model.
- VisionLoader for a Phi 3 Vision model.
- NormalLoader for a Qwen 2 model.
- A logprob with the top logprobs for this token.
- Chat completion response message.
- Sampling params are used to control sampling.
- Metadata for a speculative pipeline
- A loader for a speculative pipeline using 2 Loaders.
- Speculative decoding pipeline: https://arxiv.org/pdf/2211.17192
- NormalLoader for a Starcoder2 model.
- The core struct for manipulating tensors.
- Request to tokenize some messages or some text.
- Top-n logprobs element
- OpenAI compatible (superset) usage during a request.
- A loader for a vision (non-quantized) model.
- A builder for a loader for a vision (non-quantized) model.
- Config specific to loading a vision model.
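The MistralRs entry above describes an engine that receives requests and returns responses over mpsc channels, with the Engine running on its own thread. A minimal sketch of that request/response pattern using std::sync::mpsc (a toy illustration with a hypothetical ToyRequest type, not mistralrs internals):

```rust
use std::sync::mpsc;
use std::thread;

// Toy request carrying its own response sender, mirroring the pattern
// described for MistralRs above (not the real mistralrs Request type).
struct ToyRequest {
    prompt: String,
    response_tx: mpsc::Sender<String>,
}

// Spawn the "engine" on its own thread; it answers each request over the
// sender embedded in that request, and exits when the channel closes.
fn spawn_engine() -> mpsc::Sender<ToyRequest> {
    let (req_tx, req_rx) = mpsc::channel::<ToyRequest>();
    thread::spawn(move || {
        for req in req_rx {
            let reply = format!("echo: {}", req.prompt);
            // Ignore send errors: the caller may have hung up.
            let _ = req.response_tx.send(reply);
        }
    });
    req_tx
}

fn main() {
    let engine = spawn_engine();

    // The calling thread keeps only a Sender; each request brings its
    // own channel for the reply, so responses never get mixed up.
    let (resp_tx, resp_rx) = mpsc::channel();
    engine
        .send(ToyRequest {
            prompt: "hi".to_string(),
            response_tx: resp_tx,
        })
        .unwrap();

    assert_eq!(resp_rx.recv().unwrap(), "echo: hi");
    println!("ok");
}
```

Embedding a per-request response sender is what lets one engine thread serve many callers concurrently without a shared response queue.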
Enums§
- Control the constraint with llguidance.
- The different types of elements allowed in tensors.
- The scheduler method controls how sequences are scheduled during each step of the engine. For each scheduling step, the scheduler method is applied unless there are only running sequences, only waiting sequences, or none; when it is applied, it allows waiting sequences to run.
- The architecture to load the vision model as.
- Image generation response format
- Category of the model. This can also be used to extract model-category specific tools, such as the vision model prompt prefixer.
- DType for the model.
- The kind of model to build.
- The architecture to load the normal model as.
- A request to the Engine, encapsulating the various parameters as well as the mpsc response Sender used to return the Response.
- Message or messages for a Request.
- The response enum contains 3 types of variants.
- Stop sequences or ids.
- The source of the HF token.
- The architecture to load the vision model as.
Constants§
Statics§
- Engine instructions, per Engine (MistralRs) ID.
- Terminate all sequences on the next scheduling step. Be sure to reset this.
Traits§
- Customizable logits processor.
- The Loader trait abstracts the loading process. The primary entrypoint is the load_model method.
- ModelPaths abstracts the mechanism to get all necessary files for running a model. For example, LocalModelPaths implements ModelPaths when all files are in the local file system.
- Type which can be converted to a DType.
- Prepend a vision tag appropriate for the model to the prompt. Image indexing is assumed to start at
Functions§
- The cross-entropy loss.
- This should be called to initialize the debug flag and logging. This should not be called in mistralrs-core code due to Rust usage.
- Parse ISQ value: one of