TensorRT Model Optimizer - Product Roadmap
TensorRT Model Optimizer's goal is to provide a unified API library that enables developers to easily achieve state-of-the-art model optimizations and the best inference speed-ups. Model Optimizer will continuously enhance its existing features and introduce new cutting-edge techniques to stay at the forefront of AI model optimization.
In striving for this, our roadmap and development follow these product strategies:
Provide a one-stop shop for SOTA optimization methods (quantization, distillation, sparsity, pruning, speculation, etc.) with easy-to-use APIs that let developers chain different methods with reproducibility.
Provide transparency and extensibility, making it easy for developers and researchers to innovate and contribute.
Provide the best easy-to-use recipes in the ecosystem through software-hardware co-design on NVIDIA platforms. Since Model-Optimizer's launch, we’ve been delivering 50% to ~5x speedups on top of existing runtime and compiler optimizations on NVIDIA GPUs, with minimal impact on model accuracy (Latest News).
Tightly integrate into the Deep Learning inference and training ecosystem, beyond NVIDIA’s in-house stacks. Offer many-to-many optimizations by supporting popular frameworks like vLLM, SGLang, TensorRT-LLM, TensorRT, NeMo, and Hugging Face.
In the following sections, we outline our key investment areas and upcoming features. All are subject to change, and we’ll update this doc regularly. Our goal in sharing this roadmap is to increase visibility into Model-Optimizer's direction and upcoming features. We welcome any questions and feedback in this thread, and feature requests in GitHub Issues 😊.
Overview:
Upcoming releases
We'll do our best to provide visibility into our upcoming releases. Details are subject to change and this table is not comprehensive.
| | ModelOpt v0.27 (Apr 2025) | ModelOpt v0.29 (May 2025) | ModelOpt v0.31 (June 2025) | ModelOpt v0.33 (July 2025) |
| --- | --- | --- | --- | --- |
| Goals | FP4 mixed precision for improved accuracy<br>DeepSeek-R1 speculation for ~2x perf improvement<br>Improved fake-quantization performance | Expand published FP4 ready-to-use checkpoints<br>FP4 support for the ONNX deployment path<br>Continue FP4 adoption with vLLM/SGLang | FP4 mixed-precision QServe/AWQ for improved accuracy<br>Improved speculative decoding training flow<br>Expanded diffusion optimizations<br>Improved AutoQuant customizability<br>Extended FP4 model support | Improved distillation training flow<br>Improved diffusion cache usability<br>Expanded ONNX quantization support<br>Improved AutoQuant customizability<br>Extended FP4 model support |

Additional goals planned across these releases:
* PTQ/QAT FP8 blockwise
* FP4 KV-cache quantization
* SVDQuant
* EAGLE-3 support
* Improved speculative decoding training flow (NeMo)
* SageAttention support
* INT8 SmoothQuant
* ONNX mixed precision
* Additional OSS documentation
* More frequent OSS codebase updates
* Diffusion cache API

Planned model and deployment support across these releases:
* Llama 4 Scout FP8/FP4
* Llama 4 Maverick FP8/FP4
* Llama-Nemotron Super/Ultra FP8
* Mistral Large FP4
* ONNX Llama 3 FP4
* ONNX Qwen2.5 FP4
* Llama 3 FP4 support in SGLang
* Llama 4 Scout/Maverick FP4 support in SGLang
* QwQ-32B FP4 support in vLLM and SGLang
* Qwen 2.5 FP4 support in SGLang
* Mixtral FP4 support in vLLM and SGLang
1. FP4 inference on NVIDIA Blackwell
Advanced FP4 recipes and quantization techniques (PTQ, AutoQuantize) with continued accuracy and performance improvements.
2. Model optimization techniques
Advanced PTQ methods, QAT with distillation, attention sparsity, token-efficient pruning/distillation, general rotation/smoothing.
Easier E2E speculation/distillation/QAT.
Improved caching techniques.
Pre-optimized checkpoints hosted on Hugging Face.
3. Developer Productivity
Open sourced for everyone with improved extensibility, transparency, debuggability and accessibility.
Docs and code improvements to help the community expand on their own use cases, experiment, and introduce custom optimizations.
Ready-to-deploy optimized checkpoints, both for ease of use and for resource-limited developers.
4. Choice of Deployment
Expanding support for TensorRT-LLM, vLLM, and SGLang.
In-framework deployment for quick prototyping.
5. Expand Support Matrix
Expand data-type availability.
Continuously expand our model support based on community interests.
Continued Automotive and Windows support.
Details:
1. FP4 inference on NVIDIA Blackwell
The NVIDIA Blackwell platform powers a new era of computing with FP4 AI inference capabilities. Model-Optimizer has provided initial FP4 recipes and quantization techniques and will continue to improve FP4 with advanced techniques:
For the majority of developers, Model-Optimizer offers Post-Training Quantization (PTQ), both weight-and-activation and weight-only, and our proprietary AutoQuantize for FP4 inference. AutoQuantize automates per-layer quantization format selection to minimize model accuracy loss.
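As a rough illustration of this workflow, the sketch below applies ModelOpt PTQ to a Hugging Face model. The model ID and calibration prompts are placeholders, and the NVFP4 config name is an assumption that may vary across ModelOpt versions; treat it as an outline rather than a verbatim recipe.

```python
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(model_id).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def forward_loop(m):
    # Run a small calibration set so ModelOpt can collect activation
    # statistics for the chosen quantization format.
    for prompt in ["The capital of France is", "Quantization reduces memory by"]:
        inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
        m(**inputs)

# Weight-and-activation FP4 PTQ; NVFP4_DEFAULT_CFG is an assumed config name.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

# Alternatively, mtq.auto_quantize searches a per-layer mix of formats
# under an accuracy/compression constraint instead of a single config.
```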
2. Model optimization techniques
2.1 Model compression algorithms
Model-Optimizer collaborates with NVIDIA and external research labs to continuously develop and integrate state-of-the-art techniques into our library for faster inference. Our recent focus areas include:
AutoQuantize improvements (e.g., support for more fine-grained format selection and more weight/activation format combinations)
New token-efficient pruning and distillation methods
Infrastructure to support general rotation and smoothing (a generic smoothing sketch follows this list)
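For context on what "smoothing" refers to in the last bullet, here is a generic SmoothQuant-style sketch. It illustrates the technique only and is not Model-Optimizer's internal implementation; the helper name and calibration inputs are hypothetical.

```python
import torch

@torch.no_grad()
def smooth_linear_pair(act_scales: torch.Tensor,   # per-channel |x| max feeding `nxt`, from calibration
                       prev: torch.nn.Linear,      # producer layer whose outputs feed `nxt`
                       nxt: torch.nn.Linear,       # consumer layer to be quantized
                       alpha: float = 0.5) -> None:
    # Balance activation vs. weight magnitudes per input channel of `nxt`.
    w_scales = nxt.weight.abs().amax(dim=0).clamp(min=1e-5)
    s = (act_scales.clamp(min=1e-5) ** alpha) / (w_scales ** (1 - alpha))

    # Feeding x/s into (W * s) is mathematically identical to feeding x into W,
    # so fold 1/s into the producer and s into the consumer. Activation outliers
    # shrink, making both tensors easier to quantize.
    prev.weight.div_(s.view(-1, 1))
    if prev.bias is not None:
        prev.bias.div_(s)
    nxt.weight.mul_(s.view(1, -1))
```

The `alpha` knob trades off how much of the outlier burden is migrated from activations into weights.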
2.2 Optimized techniques for LLM and VLM
Model-Optimizer works with TensorRT-LLM, vLLM, and SGLang to streamline optimized model deployment, including an expanding focus on model optimizations that require finetuning. To streamline this experience, Model-Optimizer is working with Hugging Face, NVIDIA NeMo, and Megatron-LM to deliver exceptional E2E solutions for these optimizations. Our focus areas include:
(Speculation) Integrated draft models: Medusa, ReDrafter, MTP, and EAGLE.
(Speculation/Distillation) Standalone draft model training through pruning and knowledge distillation.
(Distillation) Standalone model shrinking/compression through pruning and knowledge distillation (e.g., Llama-3.2 1B/3B); a generic distillation-loss sketch follows this list.
(Quantization) Quantization aware training with support of FP8 and FP4.
Out-of-the-box deployment with trtllm-serve, NVIDIA NIM, and vLLM serve.
Hosting pre-optimized checkpoints for popular models such as DeepSeek-R1, Llama-3.1, Llama-3.3, and the Nemotron family in the Hugging Face Model-Optimizer collection.
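To illustrate the knowledge-distillation item above, here is a generic training step where a smaller (e.g., pruned) student mimics a frozen teacher. It is a sketch of the standard technique, not Model-Optimizer's distillation API, and it assumes Hugging Face-style causal-LM models that return `.logits` and `.loss` when labels are provided.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, batch, optimizer,
                      temperature: float = 2.0, alpha: float = 0.5):
    teacher.eval()
    student.train()
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits

    out = student(**batch)  # batch includes `labels`, so out.loss is the task loss
    student_logits = out.logits

    # Soft-target KL divergence between teacher and student distributions.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    loss = alpha * kd_loss + (1 - alpha) * out.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```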
2.3 Optimized techniques for diffusers
Model-Optimizer will continue to accelerate image generation inference by investing in these areas:
Quantization: Expand model support for INT8/FP8/FP4 PTQ and QAT (e.g., the FLUX model series); a rough PTQ sketch for a diffusion backbone follows this list.
Caching: Adding more training-free and lightweight finetuning-based caching techniques with user-friendly APIs. (Previous work: Cache Diffusion).
Improve ease of use of the deployment pipelines, including adding multi-GPU support.
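As a rough sketch of diffusion-backbone quantization with ModelOpt PTQ: the pipeline ID, calibration prompts, and FP8 config name below are illustrative assumptions, and only the denoising transformer is quantized while the VAE and text encoders stay in higher precision.

```python
import torch
import modelopt.torch.quantization as mtq
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

def forward_loop(backbone):
    # Run a few denoising passes through the full pipeline so activation
    # ranges of the (already-attached) backbone can be calibrated.
    for prompt in ["a photo of a cat", "a watercolor landscape"]:
        pipe(prompt, num_inference_steps=4)

pipe.transformer = mtq.quantize(pipe.transformer, mtq.FP8_DEFAULT_CFG, forward_loop)
```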
3. Developer Productivity
3.1 Open-sourcing
To provide extensibility and transparency for everyone, Model-Optimizer is now open source! Paired with continued documentation and code additions that improve extensibility and usability, Model-Optimizer will keep a strong focus on enabling the community to extend and contribute to it for their own use cases. This lets developers, for example, experiment with custom calibration algorithms or contribute the latest techniques. Users can also self-serve to add model support or non-standard data types, and benefit from improved debuggability and accessibility.
3.2 Ready-to-deploy optimized checkpoints
For developers who have limited GPU resources to optimize large models, or who prefer to skip the optimization steps, we currently offer quantized checkpoints of popular models in the Hugging Face Model Optimizer collection. Developers can deploy these optimized checkpoints directly on TensorRT-LLM, vLLM, and SGLang (depending on the checkpoint). We have published FP8/FP4/Medusa checkpoints for the Llama family and an FP4 checkpoint for DeepSeek-R1, and we are working to expand to optimized FLUX and other diffusion checkpoints, Medusa-trained checkpoints, EAGLE-trained checkpoints, and more in the near future.
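As an illustrative sketch, a pre-quantized checkpoint from the collection can be served through vLLM's Python API. The repo ID below is an example/assumed name for one of the published FP8 Llama checkpoints, and the `quantization="modelopt"` setting assumes a recent vLLM version with ModelOpt checkpoint support; substitute the checkpoint you actually use.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Llama-3.1-8B-Instruct-FP8",  # example/assumed repo ID
    quantization="modelopt",                   # load ModelOpt-quantized weights
)
outputs = llm.generate(
    ["Summarize the benefits of FP8 inference in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```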
4. Choice of Deployment
4.1 Popular Community Frameworks
To offer greater flexibility, we’ve been investing in supporting popular inference and serving frameworks like vLLM and SGLang, in addition to seamless integration with the NVIDIA AI software ecosystem. We currently provide an initial workflow for vLLM deployment and an example for deploying the Unified Hugging Face Checkpoint, with more model support planned.
4.2 In-Framework Deployment
We have enabled and released a path for deployment within native PyTorch. This decouples model build/compile from runtime and offers several benefits:
When optimizing inference performance or exploring new model compression techniques, Model-Optimizer users can quickly prototype in the PyTorch runtime and native PyTorch APIs to evaluate performance gains. Once satisfied, they can transition to the TensorRT-LLM runtime as the final step to maximize performance.
For models not yet supported by TensorRT-LLM, or applications that do not need ultra-fast inference speeds, users can get out-of-the-box performance improvements within native PyTorch.
Developers can utilize AutoDeploy or Real Quantization for these in-framework deployments.
5. Expand Support Matrix
5.1 Data types
Alongside our existing supported dtypes, we’ve recently added MXFP4 support and will soon expand to emerging popular dtypes like FP6 and sub-4-bit. Our focus is to further speed up GenAI inference with the least possible impact on model fidelity.
5.2 Model Support
We strive to streamline our techniques to minimize the time from a new model or feature to an optimized model, giving our community the shortest possible path to deployment. We’ll continue to expand LLM/diffusion model support, invest more in LLMs with multi-modality (vision, video, audio, image generation, and action), and continuously grow our model support based on community interests.
5.3 Platform & Other Support
Model-Optimizer's explicit quantization will be part of the upcoming NVIDIA DriveOS releases. We recently added an e2e BEVFormer INT8 example in NVIDIA DL4AGX, with more model support coming soon for Automotive customers. Model-Optimizer also has planned support for ONNX FP4 for DRIVE Thor.
In Q4 2024, Model-Optimizer added formal support for Windows (see Model-Optimizer-Windows), targeting Windows RTX PC systems with tight integration into the Windows ecosystem, including torch.onnx.export, HuggingFace-Optimum, GenAI, and Olive. It currently supports quantization formats such as INT4 AWQ, INT8, and FP8, and we’ll expand to more techniques suitable for Windows.