# Matformer (Matryoshka Transformer) Support
Matformer allows you to dynamically resize transformer models at runtime, trading compute/memory for quality. This enables deploying the same model across devices with different resource constraints - from edge devices to powerful GPUs.
## Quick Start

### Command Line

```bash
# Run Gemma 3n with the E2.49B configuration (2.49B params instead of 3.98B)
mistralrs run -m google/gemma-3n-E4B-it \
  --matformer-config-path matformer_configs/gemma3n.csv \
  --matformer-slice-name "Config for E2.49B (block-level)"
```
### Python

```python
from mistralrs import Runner, Which, VisionArchitecture

runner = Runner(
    which=Which.VisionPlain(
        model_id="google/gemma-3n-E4B-it",
        arch=VisionArchitecture.Gemma3n,
        matformer_config_path="matformer_configs/gemma3n.csv",
        matformer_slice_name="Config for E2.49B (block-level)",
    ),
)
```
### Rust

```rust
use mistralrs::VisionModelBuilder;
use std::path::PathBuf;

let model = VisionModelBuilder::new("google/gemma-3n-E4B-it")
    .with_matformer_config_path(PathBuf::from("matformer_configs/gemma3n.csv"))
    .with_matformer_slice_name("Config for E2.49B (block-level)".to_string())
    .build()
    .await?;
```
## How It Works
Matformer models are pre-trained with a special architecture that allows certain layers to be skipped at inference time while maintaining reasonable quality. When you select a “slice”:
- Layer Skipping: Specified layers are completely removed from computation
- FFN Resizing: Feed-forward network dimensions can be adjusted per layer
- Automatic Remapping: Remaining layers are renumbered sequentially
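To make the skipping and renumbering concrete, here is a minimal stand-alone sketch of the bookkeeping (an illustration, not the mistral.rs implementation): a hypothetical skip list is applied to a 35-layer stack and the surviving layers are renumbered sequentially.

```rust
// Illustrative sketch, not mistral.rs internals: turn a "Layers Skipped" list
// into the kept layers and their new sequential indices.
fn remap_layers(total_layers: usize, skipped: &[usize]) -> Vec<(usize, usize)> {
    (0..total_layers)
        .filter(|i| !skipped.contains(i))
        .enumerate() // position in the reduced stack
        .map(|(new_idx, orig_idx)| (orig_idx, new_idx))
        .collect()
}

fn main() {
    // Hypothetical slice: skip original layers 20..=24 of a 35-layer model.
    let skipped: Vec<usize> = (20..=24).collect();
    let kept = remap_layers(35, &skipped);
    assert_eq!(kept.len(), 30);     // 5 layers removed
    assert_eq!(kept[20], (25, 20)); // original layer 25 becomes layer 20
    println!("{kept:?}");
}
```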
For example, the Gemma 3n E2.49B (block-level) slice:
- Keeps all 35 layers (no layer skipping)
- Uses mixed FFN dimensions: 8192 for layers 0-19, 16384 for layers 20-24, 8192 for layers 25-34
- Cuts parameters from 3.98B to 2.49B (~37% reduction)
- Maintains ~87% of the full model’s quality
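As a worked check of those numbers, the per-layer FFN dimension list for this slice can be assembled and counted as follows (the flat `Vec` representation is an illustrative assumption, not how mistral.rs stores it):

```rust
// Sketch: build the mixed FFN hidden-dim list for the E2.49B (block-level) slice
// described above and confirm that it covers all 35 layers.
fn main() {
    let ffn_dims: Vec<usize> =
        [vec![8192; 20], vec![16384; 5], vec![8192; 10]].concat(); // layers 0-19, 20-24, 25-34
    assert_eq!(ffn_dims.len(), 35); // no layers skipped in this slice
}
```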
## Configuration Files

Matformer configurations are CSV files with these columns:

```csv
name,# Layers,# Effective Params (B),MMLU PT accuracy,FFN Hidden Dims,Layers Skipped
Main model,35,3.98,62.30%,"[16384, 16384, ...]",
Config for E2.49B (block-level),35,2.49,54.50%,"[8192, 8192, ..., 16384, 16384, ..., 8192, 8192, ...]",
```
- `name`: Slice identifier used in `matformer_slice_name`
- `# Layers`: Number of active layers after skipping
- `# Effective Params (B)`: Approximate parameter count in billions
- `MMLU PT accuracy`: Benchmark score (informational)
- `FFN Hidden Dims`: List of FFN dimensions for each layer
- `Layers Skipped`: Which layers to remove (0-indexed)
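For illustration, a slice row can be located by its `name` column roughly like this; the sketch assumes the `csv` crate is available and is not the loader mistral.rs actually uses:

```rust
use std::error::Error;

// Sketch: find a slice row by its `name` column (assumes the `csv` crate).
fn find_slice(path: &str, slice_name: &str) -> Result<Option<csv::StringRecord>, Box<dyn Error>> {
    let mut reader = csv::Reader::from_path(path)?;
    for record in reader.records() {
        let record = record?;
        if record.get(0) == Some(slice_name) {
            return Ok(Some(record));
        }
    }
    Ok(None)
}

fn main() -> Result<(), Box<dyn Error>> {
    let row = find_slice("matformer_configs/gemma3n.csv", "Config for E2.49B (block-level)")?;
    println!("{row:?}");
    Ok(())
}
```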
## Supported Models

Currently supported:
- Gemma 3n (`google/gemma-3n-E4B-it`) - Multimodal model with vision and audio
See matformer_configs/ for available configurations.
## Performance Guide

### Memory Usage
Memory scales approximately with parameter count:
- Full model (3.98B): ~8GB VRAM
- E2.49B slice: ~5GB VRAM
- E2B slice (1.91B): ~4GB VRAM
- Smaller slices: Proportionally less
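These figures line up with a simple back-of-the-envelope estimate: weight memory is roughly the parameter count times ~2 bytes per parameter for bf16 weights, with activations and the KV cache adding more on top. A quick illustrative check (the constant is an assumption, not a measurement from mistral.rs):

```rust
// Back-of-the-envelope weight-memory estimate: params (B) * bytes per param.
// The ~2 bytes/param figure assumes bf16 weights and ignores runtime overhead.
fn approx_weight_gb(params_billion: f64, bytes_per_param: f64) -> f64 {
    params_billion * bytes_per_param
}

fn main() {
    for (name, params) in [("Full model (3.98B)", 3.98), ("E2.49B", 2.49), ("E2B (1.91B)", 1.91)] {
        println!("{name}: ~{:.1} GB of weights", approx_weight_gb(params, 2.0));
    }
}
```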
### Inference Speed
Speed improvement is roughly linear with layer count:
- 30 layers vs 35 layers = ~14% faster
- 20 layers vs 35 layers = ~43% faster
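Those percentages follow from treating per-token compute as proportional to the number of active layers, which is only an approximation:

```rust
// Approximate per-token compute saving when layers are removed.
fn saving_pct(kept_layers: f64, total_layers: f64) -> f64 {
    (1.0 - kept_layers / total_layers) * 100.0
}

fn main() {
    println!("30/35 layers: ~{:.0}% less compute", saving_pct(30.0, 35.0)); // ~14%
    println!("20/35 layers: ~{:.0}% less compute", saving_pct(20.0, 35.0)); // ~43%
}
```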
### Quality Trade-offs

Example accuracy on the MMLU benchmark (relative drop from the full model in parentheses):
- Full model: 62.3%
- E2.98B: 59.5% (-4.5%)
- E2.49B: 54.5% (-12.5%)
- E2B: 50.9% (-18.3%)
Choose based on your requirements:
- Maximum quality: Use full model (omit matformer args)
- Balanced: E2.49B to E2.98B configurations (block-level configs recommended)
- Resource-constrained: E2B configuration (1.91B params)
- Extreme efficiency: E1.96B configuration
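Slice selection can also be scripted. The sketch below picks the largest slice whose estimated weight footprint fits a VRAM budget; the slice list, the names other than the E2.49B one, and the ~2 GB-per-billion-parameters heuristic are illustrative assumptions, not part of mistral.rs:

```rust
// Sketch: pick the largest slice whose estimated weight memory fits the budget.
fn main() {
    let slices = [
        ("Main model", 3.98_f64),
        ("Config for E2.98B (block-level)", 2.98),
        ("Config for E2.49B (block-level)", 2.49),
        ("Config for E2B", 1.91),
    ];
    let budget_gb = 6.0;
    let pick = slices
        .iter()
        .filter(|(_, params_b)| params_b * 2.0 <= budget_gb) // ~2 GB per billion params
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
    println!("{pick:?}"); // with a 6 GB budget this selects the E2.98B slice
}
```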
## Advanced Usage

### With Quantization
Combine Matformer with ISQ for maximum efficiency:
```python
runner = Runner(
    which=Which.VisionPlain(
        model_id="google/gemma-3n-E4B-it",
        arch=VisionArchitecture.Gemma3n,
        matformer_config_path="matformer_configs/gemma3n.csv",
        matformer_slice_name="Config for E2.49B (block-level)",
    ),
    in_situ_quant="Q4K",  # 4-bit quantization
)
```
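The Rust API can be combined the same way. The following is a sketch that assumes `VisionModelBuilder::with_isq` and an `IsqType::Q4K` variant; check the current mistralrs API docs for the exact names:

```rust
use mistralrs::{IsqType, VisionModelBuilder};
use std::path::PathBuf;

// Assumes with_isq and IsqType::Q4K exist in the version you are using;
// verify against the mistralrs API docs.
let model = VisionModelBuilder::new("google/gemma-3n-E4B-it")
    .with_matformer_config_path(PathBuf::from("matformer_configs/gemma3n.csv"))
    .with_matformer_slice_name("Config for E2.49B (block-level)".to_string())
    .with_isq(IsqType::Q4K)
    .build()
    .await?;
```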
### With Device Mapping
Matformer works seamlessly with automatic device mapping:
```rust
use mistralrs::{VisionModelBuilder, DeviceMapSetting, AutoDeviceMapParams};
use std::path::PathBuf;

let model = VisionModelBuilder::new("google/gemma-3n-E4B-it")
    .with_matformer_config_path(PathBuf::from("matformer_configs/gemma3n.csv"))
    .with_matformer_slice_name("Config for E2.49B (block-level)".to_string())
    .with_device_mapping(DeviceMapSetting::Auto(
        AutoDeviceMapParams::default_vision()
    ))
    .build()
    .await?;
```
Only active layers are loaded to GPU, saving memory.
## Creating Custom Configurations
To create your own Matformer configuration:
1. Start with the full model as a baseline
2. Identify skippable layers:
   - Middle layers (10-30) are often good candidates
   - Avoid early layers (feature extraction) and late layers (final representations)
   - Never skip special layers (KV-sharing, attention patterns)
3. Test quality degradation at each configuration
4. Create a CSV file with your configurations
Example minimal configuration:

```csv
name,# Layers,# Effective Params (B),FFN Hidden Dims,Layers Skipped
Tiny,15,0.8,"[4096, 4096, ...]","[5,6,7,10,11,12,15,16,17,20,21,22,25,26,27,30,31,32,33,34]"
```
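Hand-written rows are easy to get out of sync, so it helps to sanity-check the bookkeeping. A small illustrative check for the hypothetical Tiny row above (35 total layers assumed, as in Gemma 3n):

```rust
// Sanity-check a hand-written slice: active layers must equal total minus skipped,
// no skipped index may repeat or exceed the layer count. Illustrative only.
fn main() {
    let total_layers = 35usize;  // Gemma 3n layer count
    let claimed_active = 15usize; // "# Layers" column of the Tiny row
    let skipped: [usize; 20] = [
        5, 6, 7, 10, 11, 12, 15, 16, 17, 20, 21, 22, 25, 26, 27, 30, 31, 32, 33, 34,
    ];

    assert!(skipped.iter().all(|&l| l < total_layers));
    let unique: std::collections::HashSet<_> = skipped.iter().collect();
    assert_eq!(unique.len(), skipped.len()); // no duplicate layer indices
    assert_eq!(total_layers - skipped.len(), claimed_active);
    println!("Tiny config is internally consistent");
}
```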
## API Reference

### Command Line Arguments

- `--matformer-config-path PATH`: Path to CSV configuration file
- `--matformer-slice-name NAME`: Exact name of slice from CSV
### Python Parameters

```python
Which.VisionPlain(
    model_id: str,
    arch: VisionArchitecture,
    matformer_config_path: str = None,  # Path to CSV
    matformer_slice_name: str = None,   # Slice name
    # ... other parameters
)
```
### Rust Methods

```rust
// For VisionModelBuilder
.with_matformer_config_path(path: PathBuf)
.with_matformer_slice_name(name: String)

// For TextModelBuilder (when supported)
.with_matformer_config_path(path: PathBuf)
.with_matformer_slice_name(name: String)
```
## Troubleshooting

### Common Issues
“Matformer slice ‘X’ not found”
- Check slice name matches exactly (case-sensitive)
- Verify CSV file path is correct
“Layers X and Y are reserved and cannot be skipped”
- Some models have special layers that must not be skipped
- Try different layer combinations
Memory not reduced as expected
- Ensure you’re using the slice (check logs)
- Skipped layers still need to be loaded initially
- Consider combining with quantization
### Debugging

Enable logging to see Matformer details:

```bash
RUST_LOG=mistralrs_core=info mistralrs ...
```
This shows:
- Configuration file loaded
- Selected slice details
- Layers being skipped
- Final layer count
## Future Plans
- Support for more model architectures
- Dynamic slice switching during runtime
- Automatic slice selection based on available resources
- Fine-tuning tools for creating new Matformer models