This crate provides an asynchronous API to mistral.rs.
To get started loading a model, check out the following builders:

- TextModelBuilder
- LoraModelBuilder
- XLoraModelBuilder
- GgufModelBuilder
- GgufLoraModelBuilder
- GgufXLoraModelBuilder
- VisionModelBuilder
- AnyMoeModelBuilder
Check out the v0_4_api module for concise documentation of this newer API.
§Example
```rust
use anyhow::Result;
use mistralrs::{
    IsqType, PagedAttentionMetaBuilder, TextMessageRole, TextMessages, TextModelBuilder,
};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct".to_string())
        .with_isq(IsqType::Q8_0)
        .with_logging()
        .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(
            TextMessageRole::System,
            "You are an AI agent with a specialty in programming.",
        )
        .add_message(
            TextMessageRole::User,
            "Hello! How are you? Please write a generic binary search function in Rust.",
        );

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}
```
§Streaming example
```rust
use anyhow::Result;
use mistralrs::{
    IsqType, PagedAttentionMetaBuilder, Response, TextMessageRole, TextMessages, TextModelBuilder,
};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct".to_string())
        .with_isq(IsqType::Q8_0)
        .with_logging()
        .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(
            TextMessageRole::System,
            "You are an AI agent with a specialty in programming.",
        )
        .add_message(
            TextMessageRole::User,
            "Hello! How are you? Please write a generic binary search function in Rust.",
        );

    let mut stream = model.stream_chat_request(messages).await?;
    while let Some(chunk) = stream.next().await {
        if let Response::Chunk(chunk) = chunk {
            print!("{}", chunk.choices[0].delta.content);
        }
        // Handle the error cases.
    }
    Ok(())
}
```
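The loop above only handles the Chunk variant; the comment marks where the error variants belong. As a self-contained toy sketch of that control flow (using a local stand-in enum, not the real mistralrs::Response, which has more variants), accumulating deltas and surfacing errors might look like:

```rust
// Toy stand-in for a streaming response enum. This is NOT the real
// mistralrs::Response; it only mirrors the Chunk/error/done shape.
enum ToyResponse {
    Chunk(String),
    ModelError(String),
    Done,
}

// Drain a finished stream: append each delta, stop on Done,
// and turn a model error into an Err.
fn collect(stream: Vec<ToyResponse>) -> Result<String, String> {
    let mut out = String::new();
    for resp in stream {
        match resp {
            ToyResponse::Chunk(delta) => out.push_str(&delta),
            ToyResponse::ModelError(e) => return Err(e),
            ToyResponse::Done => break,
        }
    }
    Ok(out)
}

fn main() {
    let stream = vec![
        ToyResponse::Chunk("Hello".to_string()),
        ToyResponse::Chunk(", world".to_string()),
        ToyResponse::Done,
    ];
    assert_eq!(collect(stream).unwrap(), "Hello, world");
    println!("ok");
}
```

The same match-on-variant structure applies inside the `while let` loop above, with the real variants substituted in.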
Re-exports§
pub use v0_4_api::*;
Modules§
- This will be the API as of v0.4.0. Other APIs will not be deprecated, but moved into a module such as this one.
Structs§
- Chat completion streaming request chunk.
- An OpenAI compatible chat completion response.
- Template for chat models including bos/eos/unk as well as the chat template.
- Chat completion choice.
- Chat completion streaming chunk choice.
- Completion request choice.
- Chat completion streaming chunk choice.
- Completion request choice.
- An OpenAI compatible completion response.
- Delta in content for streaming response.
- Request to detokenize some text.
- Metadata to initialize the device mapper.
- A loader for a vision (non-quantized) model.
- A builder for a loader for a vision (non-quantized) model.
- Config specific to loading a vision model.
- A loader for a GGML model.
- A builder for a GGML loader.
- Config for a GGML loader.
- Loader for a GGUF model.
- A builder for a GGUF loader.
- Config for a GGUF loader.
- NormalLoader for a Gemma model.
- VisionLoader for an Idefics 2 Vision model.
- VisionLoader for an LLaVA Vision model.
- VisionLoader for an LLaVANext Vision model.
- A device mapper which does device mapping per hidden layer.
- NormalLoader for a Llama model.
- A builder for a loader using the selected model.
- All local paths and metadata necessary to load a model.
- Logprobs per token.
- The MistralRs struct handles sending requests to the engine. It is the core multi-threaded component of mistral.rs, and uses mpsc Sender and Receiver primitives to send and receive requests to the engine.
- The MistralRsBuilder takes the pipeline and a scheduler method and constructs an Engine and a MistralRs instance. The Engine runs on a separate thread, and the MistralRs instance stays on the calling thread.
- A loader for a “normal” (non-quantized) model.
- A builder for a loader for a “normal” (non-quantized) model.
- A normal request to the MistralRs.
- Config specific to loading a normal model.
- Adapter model ordering information.
- All memory counts in MB. Default for block size is 32.
- NormalLoader for a Phi 2 model.
- NormalLoader for a Phi 3 model.
- VisionLoader for a Phi 3 Vision model.
- NormalLoader for a Qwen 2 model.
- A logprob with the top logprobs for this token.
- Chat completion response message.
- Sampling params are used to control sampling.
- Metadata for a speculative pipeline
- A loader for a speculative pipeline using 2 Loaders.
- Speculative decoding pipeline: https://arxiv.org/pdf/2211.17192
- NormalLoader for a Starcoder2 model.
- The core struct for manipulating tensors.
- Request to tokenize some messages or some text.
- Top-n logprobs element
- OpenAI compatible (superset) usage during a request.
- A loader for a vision (non-quantized) model.
- A builder for a loader for a vision (non-quantized) model.
- Config specific to loading a vision model.
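The MistralRs entry above describes an engine that receives requests and returns responses over mpsc channels, with the Engine running on its own thread. A minimal sketch of that request/response pattern using std::sync::mpsc (a toy illustration with a hypothetical ToyRequest type, not mistralrs internals):

```rust
use std::sync::mpsc;
use std::thread;

// Toy request carrying its own response sender, mirroring the pattern
// described for MistralRs above (not the real mistralrs Request type).
struct ToyRequest {
    prompt: String,
    response_tx: mpsc::Sender<String>,
}

// Spawn the "engine" on its own thread; it answers each request over the
// sender embedded in that request, and exits when the channel closes.
fn spawn_engine() -> mpsc::Sender<ToyRequest> {
    let (req_tx, req_rx) = mpsc::channel::<ToyRequest>();
    thread::spawn(move || {
        for req in req_rx {
            let reply = format!("echo: {}", req.prompt);
            // Ignore send errors: the caller may have hung up.
            let _ = req.response_tx.send(reply);
        }
    });
    req_tx
}

fn main() {
    let engine = spawn_engine();

    // The calling thread keeps only a Sender; each request brings its
    // own channel for the reply, so responses never get mixed up.
    let (resp_tx, resp_rx) = mpsc::channel();
    engine
        .send(ToyRequest {
            prompt: "hi".to_string(),
            response_tx: resp_tx,
        })
        .unwrap();

    assert_eq!(resp_rx.recv().unwrap(), "echo: hi");
    println!("ok");
}
```

Embedding a per-request response sender is what lets one engine thread serve many callers concurrently without a shared response queue.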
Enums§
- Control the constraint with llguidance.
- The different types of elements allowed in tensors.
- The scheduler method controls how sequences are scheduled during each step of the engine. For each scheduling step, the scheduler method is applied unless there are only running sequences, only waiting sequences, or none; when it is applied, it allows waiting sequences to run.
- The architecture to load the vision model as.
- Image generation response format
- Category of the model. This can also be used to extract model-category specific tools, such as the vision model prompt prefixer.
- DType for the model.
- The kind of model to build.
- The architecture to load the normal model as.
- A request to the Engine, encapsulating the various parameters as well as the mpsc response Sender used to return the Response.
- Message or messages for a Request.
- The response enum contains 3 types of variants.
- Stop sequences or ids.
- The source of the HF token.
- The architecture to load the vision model as.
Constants§
Statics§
- Engine instructions, per Engine (MistralRs) ID.
- Terminate all sequences on the next scheduling step. Be sure to reset this.
Traits§
- Customizable logits processor.
- The Loader trait abstracts the loading process. The primary entrypoint is the load_model method.
- ModelPaths abstracts the mechanism to get all necessary files for running a model. For example, LocalModelPaths implements ModelPaths when all files are in the local file system.
- Type which can be converted to a DType.
- Prepend a vision tag appropriate for the model to the prompt. Image indexing is assumed to start at
Functions§
- The cross-entropy loss.
- This should be called to initialize the debug flag and logging. This should not be called in mistralrs-core code due to Rust usage.
- Parse ISQ value: one of