This crate provides an asynchronous API to mistral.rs.
To get started loading a model, check out the following builders:
- TextModelBuilder
- LoraModelBuilder
- XLoraModelBuilder
- GgufModelBuilder
- GgufLoraModelBuilder
- GgufXLoraModelBuilder
- VisionModelBuilder
- AnyMoeModelBuilder
Check out the v0_4_api module for concise documentation of this newer API.
§Example
use anyhow::Result;
use mistralrs::{
    IsqType, PagedAttentionMetaBuilder, TextMessageRole, TextMessages, TextModelBuilder,
};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct".to_string())
        .with_isq(IsqType::Q8_0)
        .with_logging()
        .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(
            TextMessageRole::System,
            "You are an AI agent with a specialty in programming.",
        )
        .add_message(
            TextMessageRole::User,
            "Hello! How are you? Please write generic binary search function in Rust.",
        );

    let response = model.send_chat_request(messages).await?;
    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );
    Ok(())
}
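The other builders listed above follow the same pattern. For instance, a quantized GGUF model might be loaded as in the following minimal sketch; the Hugging Face repository and filename here are illustrative assumptions, not pinned artifacts:

use anyhow::Result;
use mistralrs::{GgufModelBuilder, TextMessageRole, TextMessages};

#[tokio::main]
async fn main() -> Result<()> {
    // Assumed GGUF repo and quantized weight file; substitute your own.
    let model = GgufModelBuilder::new(
        "bartowski/Phi-3.5-mini-instruct-GGUF",
        vec!["Phi-3.5-mini-instruct-Q4_K_M.gguf"],
    )
    .with_logging()
    .build()
    .await?;

    let messages = TextMessages::new()
        .add_message(TextMessageRole::User, "Why is the sky blue?");

    let response = model.send_chat_request(messages).await?;
    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    Ok(())
}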
§Streaming example
use anyhow::Result;
use mistralrs::{
    IsqType, PagedAttentionMetaBuilder, Response, TextMessageRole, TextMessages,
    TextModelBuilder,
};
use mistralrs_core::{ChatCompletionChunkResponse, ChunkChoice, Delta};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct".to_string())
        .with_isq(IsqType::Q8_0)
        .with_logging()
        .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(
            TextMessageRole::System,
            "You are an AI agent with a specialty in programming.",
        )
        .add_message(
            TextMessageRole::User,
            "Hello! How are you? Please write generic binary search function in Rust.",
        );

    let mut stream = model.stream_chat_request(messages).await?;
    while let Some(chunk) = stream.next().await {
        if let Response::Chunk(ChatCompletionChunkResponse { choices, .. }) = chunk {
            if let Some(ChunkChoice {
                delta:
                    Delta {
                        content: Some(content),
                        ..
                    },
                ..
            }) = choices.first()
            {
                // Print the streamed token delta, not the literal string "content".
                print!("{content}");
            };
        }
    }
    Ok(())
}
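One practical detail for the streaming loop: print! does not flush stdout, so deltas may appear in bursts rather than token by token. Flushing after each chunk (standard library only) keeps output incremental:

use std::io::{self, Write};

// Inside the while-let loop, after printing the delta:
print!("{content}");
io::stdout().flush()?; // `?` works because main returns anyhow::Result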
Re-exports§
pub use v0_4_api::*;
Modules§
- distributed
- layers
- llguidance
- v0_4_api - This will be the API as of v0.4.0. Other APIs will not be deprecated, but moved into a module such as this one.
Structs§
- AnyMoeConfig
- AnyMoeLoader
- AnyMoePipeline
- ApproximateUserLocation
- CalledFunction
- ChatCompletionChunkResponse - Chat completion streaming request chunk.
- ChatCompletionResponse - An OpenAI compatible chat completion response.
- ChatTemplate - Template for chat models including bos/eos/unk as well as the chat template.
- Choice - Chat completion choice.
- ChunkChoice - Chat completion streaming chunk choice.
- CompletionChoice - Completion request choice.
- CompletionChunkChoice - Chat completion streaming chunk choice.
- CompletionChunkResponse - Completion request choice.
- CompletionResponse - An OpenAI compatible completion response.
- Delta - Delta in content for streaming response.
- DetokenizationRequest - Request to detokenize some text.
- DeviceLayerMapMetadata
- DeviceMapMetadata - Metadata to initialize the device mapper.
- DiffusionGenerationParams
- DiffusionLoader - A loader for a diffusion (non-quantized) model.
- DiffusionLoaderBuilder - A builder for a loader for a diffusion (non-quantized) model.
- DiffusionSpecificConfig - Config specific to loading a diffusion model.
- DrySamplingParams
- Function
- GGMLLoader - A loader for a GGML model.
- GGMLLoaderBuilder - A builder for a GGML loader.
- GGMLSpecificConfig - Config for a GGML loader.
- GGUFLoader - Loader for a GGUF model.
- GGUFLoaderBuilder - A builder for a GGUF loader.
- GGUFSpecificConfig - Config for a GGUF loader.
- GemmaLoader - NormalLoader for a Gemma model.
- Idefics2Loader - VisionLoader for an Idefics 2 Vision model.
- ImageChoice
- ImageGenerationResponse
- LLaVALoader - VisionLoader for an LLaVA Vision model.
- LLaVANextLoader - VisionLoader for an LLaVANext Vision model.
- LayerDeviceMapper - A device mapper which does device mapping per hidden layer.
- LayerTopology
- LlamaLoader - NormalLoader for a Llama model.
- LoaderBuilder - A builder for a loader using the selected model.
- LocalModelPaths - All local paths and metadata necessary to load a model.
- Logprobs - Logprobs per token.
- MemoryUsage
- MistralLoader
- MistralRs - The MistralRs struct handles sending requests to the engine. It is the core multi-threaded component of mistral.rs, and uses mpsc Sender and Receiver primitives to send and receive requests to the engine.
- MistralRsBuilder - The MistralRsBuilder takes the pipeline and a scheduler method and constructs an Engine and a MistralRs instance. The Engine runs on a separate thread, and the MistralRs instance stays on the calling thread.
- MistralRsConfig
- MixtralLoader
- NormalLoader - A loader for a “normal” (non-quantized) model.
- NormalLoaderBuilder - A builder for a loader for a “normal” (non-quantized) model.
- NormalRequest - A normal request to the MistralRs.
- NormalSpecificConfig - Config specific to loading a normal model.
- Ordering - Adapter model ordering information.
- PagedAttentionConfig - All memory counts in MB. Default for block size is 32.
- Phi2Loader - NormalLoader for a Phi 2 model.
- Phi3Loader - NormalLoader for a Phi 3 model.
- Phi3VLoader - VisionLoader for a Phi 3 Vision model.
- Qwen2Loader - NormalLoader for a Qwen 2 model.
- ResponseLogprob - A logprob with the top logprobs for this token.
- ResponseMessage - Chat completion response message.
- SamplingParams - Sampling params are used to control sampling.
- SpeculativeConfig - Metadata for a speculative pipeline.
- SpeculativeLoader - A loader for a speculative pipeline using 2 Loaders.
- SpeculativePipeline - Speculative decoding pipeline: https://arxiv.org/pdf/2211.17192
- Starcoder2Loader - NormalLoader for a Starcoder2 model.
- Tensor - The core struct for manipulating tensors.
- TokenizationRequest - Request to tokenize some messages or some text.
- Tool
- ToolCallResponse
- TopLogprob - Top-n logprobs element.
- Topology
- Usage - OpenAI compatible (superset) usage during a request.
- VisionLoader - A loader for a vision (non-quantized) model.
- VisionLoaderBuilder - A builder for a loader for a vision (non-quantized) model.
- VisionSpecificConfig - Config specific to loading a vision model.
- WebSearchOptions
Enums§
- AnyMoeExpertType
- AutoDeviceMapParams
- BertEmbeddingModel - Embedding model used for ranking web search results internally.
- Constraint - Control the constraint with llguidance.
- DType - The different types of elements allowed in tensors.
- DefaultSchedulerMethod - The scheduler method controls how sequences are scheduled during each step of the engine. For each scheduling step, the scheduler method is consulted unless the sequences are all running, all waiting, or absent; when consulted, it decides which waiting sequences are allowed to run.
- Device
- DeviceMapSetting
- DiffusionLoaderType - The architecture to load the diffusion model as.
- EngineInstruction
- GGUFArchitecture
- ImageGenerationResponseFormat - Image generation response format.
- IsqOrganization
- IsqType
- MemoryGpuConfig
- MistralRsError
- ModelCategory - Category of the model. This can also be used to extract model-category specific tools, such as the vision model prompt prefixer.
- ModelDType - DType for the model.
- ModelKind - The kind of model to build.
- ModelSelected
- NormalLoaderType - The architecture to load the normal model as.
- Request - A request to the Engine, encapsulating the various parameters as well as the mpsc response Sender used to return the Response.
- RequestMessage - Message or messages for a Request.
- Response - The response enum contains 3 types of variants.
- ResponseErr
- ResponseOk
- SchedulerConfig
- StopTokens - Stop sequences or ids.
- TokenSource - The source of the HF token.
- ToolCallType
- ToolChoice
- ToolType
- VisionLoaderType - The architecture to load the vision model as.
- WebSearchUserLocation
Constants§
Statics§
- ENGINE_INSTRUCTIONS - Engine instructions, per Engine (MistralRs) ID.
- GLOBAL_HF_CACHE
- TERMINATE_ALL_NEXT_STEP - Terminate all sequences on the next scheduling step. Be sure to reset this.
Traits§
- CustomLogitsProcessor - Customizable logits processor.
- Loader - The Loader trait abstracts the loading process. The primary entrypoint is the load_model method.
- ModelPaths - ModelPaths abstracts the mechanism to get all necessary files for running a model. For example LocalModelPaths implements ModelPaths when all files are in the local file system.
- Pipeline
- TryIntoDType - Type which can be converted to a DType.
- VisionPromptPrefixer - Prepend a vision tag appropriate for the model to the prompt. Image indexing is assumed to start at 0.
Functions§
- cross_entropy_loss - The cross-entropy loss.
- get_auto_device_map_params
- get_model_dtype
- get_tgt_non_granular_index
- get_toml_selected_model_device_map_params
- get_toml_selected_model_dtype
- initialize_logging - This should be called to initialize the debug flag and logging. This should not be called in mistralrs-core code due to Rust usage.
- paged_attn_supported - true if built with CUDA (requires Unix) / Metal.
- parse_isq_value - Parse ISQ value.
- using_flash_attn - true if built with the flash-attn or flash-attn-v3 features, false otherwise.
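As a closing illustration, the capability probes above can be checked at startup. A minimal sketch, assuming both functions take no arguments and return bool as their descriptions indicate:

use mistralrs::{paged_attn_supported, using_flash_attn};

fn main() {
    // Report which acceleration paths this build was compiled with.
    println!("PagedAttention supported: {}", paged_attn_supported());
    println!("FlashAttention enabled: {}", using_flash_attn());
}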