[RFC] TensorRT Model Optimizer - Product Roadmap #146

omrialmog opened this issue Mar 6, 2025

TensorRT Model Optimizer - Product Roadmap

TensorRT Model Optimizer aims to provide a unified API library that enables developers to easily achieve state-of-the-art model optimizations and the best inference speed-ups. Model Optimizer will continue to enhance its existing features and introduce new cutting-edge techniques to stay at the forefront of AI model optimization.

In striving for this, our roadmap and development follow these product strategies:

  1. Provide a one-stop shop for SOTA optimization methods (quantization, distillation, sparsity, pruning, speculation, etc.) with easy-to-use APIs that let developers chain different methods with reproducibility.
  2. Provide transparency and extensibility, making it easy for developers and researchers to innovate and contribute.
  3. Provide the best easy-to-use recipes in the ecosystem through software-hardware co-design on NVIDIA platforms. Since Model-Optimizer's launch, we’ve been delivering 50% to ~5x speedup on top of existing runtime and compiler optimizations on NVIDIA GPUs with minimal impact on model accuracy (Latest News).
  4. Tightly integrate into the Deep Learning inference and training ecosystem, beyond NVIDIA’s in-house stacks. Offer many-to-many optimizations by supporting popular frameworks like vLLM, SGLang, TensorRT-LLM, TensorRT, NeMo, and Hugging Face.

In the following sections, we outline our key investment areas and upcoming features. All of this is subject to change, and we'll update this document regularly. We share this roadmap to increase visibility into Model-Optimizer's direction and upcoming features. We welcome questions and feedback in this thread, and feature requests in GitHub Issues 😊.

Overview:

Upcoming releases

We'll do our best to provide visibility into our upcoming releases. Details are subject to change, and the list below is not comprehensive.

Planned releases: ModelOpt v0.27 (Apr 2025), ModelOpt v0.29 (May 2025), ModelOpt v0.31 (June 2025), ModelOpt v0.33 (July 2025).

Goals across these releases:
  • FP4 mixed precision for improved accuracy
  • DeepSeek-R1 speculation for ~2x perf improvement
  • Improved fake-quantization performance
  • Expand the set of published FP4 ready-to-use checkpoints
  • FP4 support for the ONNX deployment path
  • Continued FP4 adoption with vLLM/SGLang
  • FP4 mixed precision QServe/AWQ for improved accuracy
  • Improved speculative decoding training flow
  • Expanded diffusion optimizations
  • Improved AutoQuant customizability
  • Extended FP4 model support
  • Improved distillation training flow
  • Improved diffusion cache usability
  • Expanded ONNX quantization support

Feature Improvements:
  • PTQ/QAT FP4/FP8 W4A8 recipe
  • PTQ/QAT FP8 blockwise
  • FP4 KV cache quantization
  • SVDQuant
  • PTQ/QAT FP4 ONNX deployment
  • EAGLE-3 support
  • PTQ/QAT FP4/FP8 mixed precision QServe/AWQ
  • Improved speculative decoding training flow (NeMo)
  • Sage Attention support
  • INT8 SmoothQuant
  • Improved distillation training flow (NeMo)
  • ONNX mixed precision

Developer Productivity:
  • MXFP8 ONNX support example
  • Additional OSS documentation
  • More frequent OSS codebase updates
  • AutoQuant custom recipes
  • Standalone AutoQuant sensitivity analysis
  • Diffusion cache API

Model Support:
  • DeepSeek-R1 spec decoding
  • QwQ-32B FP4
  • Llama 4 Scout FP8/FP4
  • Llama 4 Maverick FP8/FP4
  • Llama-Nemotron Super/Ultra FP8
  • Mistral Small FP4
  • Mistral Large FP4
  • ONNX Llama 3 FP4
  • Pixtral Large FP4
  • ONNX Qwen2.5 FP4

Platform Support & Ecosystem:
  • INT8/FP8 support for Windows
  • Llama 3 FP4 support in vLLM and SGLang
  • Llama 4 Scout/Maverick FP4 support in vLLM and SGLang
  • QwQ-32B FP4 support in vLLM and SGLang
  • Qwen 2.5 FP4 support in vLLM and SGLang
  • Mixtral FP4 support in vLLM and SGLang

1. FP4 inference on NVIDIA Blackwell

2. Model optimization techniques

  • Advanced PTQ methods, QAT with distillation, Attention Sparsity, token-efficient pruning/distillation, general rotation/smoothing.
  • Easier E2E Speculation/Distillation/QAT.
  • Improve caching techniques.
  • Host pre-optimized checkpoints on HuggingFace.

3. Developer Productivity

  • Open sourced for everyone, with improved extensibility, transparency, debuggability, and accessibility.
  • Docs and code improvements to help the community expand on their use cases, experiment, and introduce custom optimizations.
  • Ready-to-deploy optimized checkpoints, both for ease of use and for resource-limited developers.

4. Choice of Deployment

  • Expanding support for TRT-LLM, vLLM, and SGLang.
  • In-framework deployment for quick prototyping.

5. Expand Support Matrix

  • Expand data-type availability.
  • Continuously expand our model support based on community interests.
  • Continued automotive and Windows support.

Details:

1. FP4 inference on NVIDIA Blackwell

The NVIDIA Blackwell platform powers a new era of computing with FP4 AI inference capabilities. Model-Optimizer has provided initial FP4 recipes and quantization techniques and will continue to improve FP4 support with advanced techniques:

  1. For the majority of developers, Model-Optimizer offers Post-Training Quantization (PTQ), covering both weight-and-activation and weight-only quantization, and our proprietary AutoQuantize for FP4 inference. AutoQuantize automatically selects per-layer quantization formats to minimize model accuracy loss (a sketch of this flow follows below).
  2. For developers who require lossless FP4 quantization, Model-Optimizer offers Quantization-Aware Training (QAT), which makes the neural network more resilient to quantization. Model-Optimizer QAT already works with NVIDIA Megatron, NVIDIA NeMo, native PyTorch training, and the Hugging Face Trainer.
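
As a rough illustration of the PTQ flow, the sketch below calibrates and quantizes a Hugging Face model with Model-Optimizer's PyTorch quantization API. The model name, calibration text, and the config constant are placeholders (FP8 is shown; the exact name of an NVFP4/FP4 preset varies by ModelOpt release), so treat this as a minimal sketch rather than a canonical recipe.

```python
# Minimal PTQ sketch with TensorRT Model Optimizer; model, config, and calibration data are illustrative.
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="cuda")

calib_texts = ["TensorRT Model Optimizer quantizes models for faster inference."] * 32  # replace with real data

def forward_loop(m):
    # Run representative data through the model so the inserted quantizers can calibrate.
    for text in calib_texts:
        m(**tokenizer(text, return_tensors="pt").to(m.device))

# Pick a quantization config: FP8_DEFAULT_CFG is a common starting point; an NVFP4/FP4
# preset is available in Blackwell-era releases (exact constant name depends on the version).
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```

AutoQuantize layers a per-layer format search on top of this flow (via mtq.auto_quantize), and QAT simply continues training the model returned by mtq.quantize with your existing training loop.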

2. Model optimization techniques

2.1 Model compression algorithms

Model-Optimizer collaborates with NVIDIA and external research labs to continuously develop and integrate state-of-the-art techniques into our library for faster inference. Our recent focus areas include:

  • Advanced PTQ methods (e.g., SVDQuant, QuaRot, SpinQuant)
  • QAT with distillation, a proven path for FP4 inference
  • Attention sparsity (e.g., SnapKV, DuoAttention)
  • AutoQuantize improvements (e.g., supporting more fine-grained format selection and various weight and activation combinations)
  • New token-efficient pruning and distillation methods
  • Infrastructure to support general rotation and smoothing

2.2 Optimized techniques for LLM and VLM

Model-Optimizer works with TensorRT-LLM, vLLM, and SGLang to streamline optimized model deployment, with an expanding focus on model optimizations that require finetuning. To keep the experience streamlined, Model-Optimizer is working with Hugging Face, NVIDIA NeMo, and Megatron-LM to deliver an exceptional end-to-end solution for these optimizations. Our focus areas include:

  • (Speculation) Integrated draft models: Medusa, ReDrafter, MTP, and EAGLE.
  • (Speculation/Distillation) Standalone draft model training through pruning and knowledge distillation.
  • (Distillation) Standalone model shrinking/compression through pruning and knowledge distillation (e.g., Llama 3.2 1B/3B).
  • (Quantization) Quantization-aware training with support for FP8 and FP4 (see the sketch after this list).
  • Out-of-the-box deployment with trtllm-serve, NVIDIA NIM, and vLLM serve.
  • Hosting pre-optimized checkpoints for popular models such as DeepSeek-R1, Llama-3.1, Llama-3.3 and Nemotron family on Hugging Face Model-Optimizer collection.
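
To make the QAT bullet concrete, here is a minimal sketch of the "quantize, then keep training" pattern with the Hugging Face Trainer. The model id, dataset, FP8 config choice, and hyperparameters are placeholders, not a tuned recipe.

```python
# Minimal QAT sketch: insert/calibrate quantizers via PTQ, then finetune with the quantizers in place.
# Model id, dataset, config choice, and hyperparameters are illustrative placeholders.
import modelopt.torch.quantization as mtq
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "meta-llama/Llama-3.2-1B"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="cuda")

texts = ["Model Optimizer makes quantization-aware training straightforward."] * 64  # replace with real data
train_ds = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128), remove_columns=["text"])

# Step 1: PTQ-style calibration inserts fake-quantization ops (same pattern as the PTQ sketch in section 1).
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG,
                     lambda m: [m(**tokenizer(t, return_tensors="pt").to(m.device)) for t in texts[:8]])

# Step 2: keep training; gradients flow through the fake-quant ops, recovering accuracy lost to quantization.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qat-out", num_train_epochs=1,
                           per_device_train_batch_size=4, learning_rate=1e-5),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```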

2.3 Optimized techniques for diffusers

Model-Optimizer will continue to accelerate image generation inference by investing in these areas:

  • Quantization: Expand model support for INT8/FP8/FP4 PTQ and QAT, e.g., the FLUX model series (a sketch follows after this list).
  • Caching: Add more training-free and lightweight finetuning-based caching techniques with user-friendly APIs (previous work: Cache Diffusion).
  • Improve ease of use of the deployment pipelines, including adding multi-GPU support.
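
As a concrete example of the quantization bullet above, a diffusion backbone can be quantized with the same PyTorch PTQ API. The pipeline checkpoint, prompts, and FP8 config below are placeholders, and the calibration settings are intentionally tiny.

```python
# Illustrative sketch: FP8 PTQ of a diffusion backbone with Model-Optimizer.
import torch
import modelopt.torch.quantization as mtq
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16).to("cuda")  # placeholder checkpoint
backbone = pipe.transformer  # the denoiser is what gets quantized (pipe.unet for UNet-based models)

calib_prompts = ["a photo of an astronaut riding a horse"] * 8  # replace with representative prompts

def forward_loop(_):
    # Calibrate by running a few denoising steps through the full pipeline.
    for prompt in calib_prompts:
        pipe(prompt, num_inference_steps=4, guidance_scale=3.5)

backbone = mtq.quantize(backbone, mtq.FP8_DEFAULT_CFG, forward_loop)
# The quantized backbone stays inside the pipeline, so pipe(...) now runs with quantized compute simulated;
# export and deployment then follow the ONNX/TensorRT paths described elsewhere in this roadmap.
```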

3. Developer Productivity

3.1 Open-sourcing

To provide extensibility and transparency for everyone, Model-Optimizer is now open source! Paired with continued documentation and code additions to improve extensibility and usability, Model-Optimizer will keep a strong focus on enabling our community to extend it and contribute for their own use cases. This enables developers, for example, to experiment with custom calibration algorithms or contribute to the latest techniques. Users can also self-serve to add model support or non-standard data types, and benefit from improved debuggability and accessibility.

3.2 Ready-to-deploy optimized checkpoints

For developers who have limited GPU resources for optimizing large models, or who prefer to skip the optimization steps, we currently offer quantized checkpoints of popular models in the Hugging Face Model Optimizer collection. Developers can deploy these optimized checkpoints directly on TensorRT-LLM, vLLM, and SGLang (depending on the checkpoint). We have published FP8/FP4/Medusa checkpoints for the Llama model family and an FP4 checkpoint for DeepSeek-R1, and we are working to expand to optimized FLUX and other diffusion checkpoints, Medusa-trained checkpoints, EAGLE-trained checkpoints, and more.
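
For example, a pre-quantized checkpoint from the collection can be loaded with vLLM roughly as follows. The repository id is a placeholder (check the Hugging Face collection for actual names), and depending on the checkpoint the quantization method may be picked up automatically from its config.

```python
# Illustrative: serve a pre-quantized ModelOpt checkpoint from the Hugging Face collection with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Llama-3.1-8B-Instruct-FP8",  # placeholder pre-quantized checkpoint
    quantization="modelopt",                   # vLLM's ModelOpt backend; often inferred from the checkpoint
)
outputs = llm.generate(
    ["Summarize what post-training quantization does in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```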

4. Choice of Deployment

4.1 Popular Community Frameworks

To offer greater flexibility, we’ve been investing in supporting popular inference and serving frameworks like vLLM and SGLang, in addition to seamless integration with the NVIDIA AI software ecosystem. We currently provide an initial workflow for vLLM deployment and an example of deploying the unified Hugging Face checkpoint, with more model support planned.

4.2 In-Framework Deployment

We have enabled and released a path for deployment within native PyTorch. This decouples model build/compile from runtime and offers several benefits:

  1. When optimizing inference performance or exploring new model compression techniques, Model-Optimizer users can quickly prototype in the PyTorch runtime and native PyTorch APIs to evaluate performance gains. Once satisfied, they can transition to the TensorRT-LLM runtime as the final step to maximize performance.
  2. For models not yet supported by TensorRT-LLM or applications that do not need ultra-fast inference speeds, users can get out-of-the-box performance improvements within native PyTorch.

Developers can utilize AutoDeploy or Real Quantization for these in-framework deployments.
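
A minimal sketch of that prototyping loop, assuming the PTQ flow from section 1: quantize, evaluate directly in PyTorch, then export for deployment. The export helper shown (export_hf_checkpoint from modelopt.torch.export) reflects current releases, and its exact name and signature may differ by version.

```python
# Illustrative in-framework flow: quantize, evaluate in native PyTorch, then export a deployable checkpoint.
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint  # unified HF checkpoint export (may vary by version)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="cuda")

def forward_loop(m):
    # Tiny calibration pass; replace with representative data.
    m(**tokenizer("calibration text", return_tensors="pt").to(m.device))

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# 1) Prototype in native PyTorch: the quantized model is still a regular nn.Module,
#    so accuracy and latency can be checked with the usual generate()/eval harness.
ids = tokenizer("Explain speculative decoding briefly.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**ids, max_new_tokens=64)[0]))

# 2) When satisfied, export a unified checkpoint for TensorRT-LLM / vLLM / SGLang deployment.
export_hf_checkpoint(model, export_dir="exported-fp8-ckpt")
```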

5. Expand Support Matrix

5.1 Data types

Alongside our existing supported dtypes, we’ve recently added MXFP4 support and will soon expand to emerging popular dtypes like FP6 and sub-4-bit. Our focus is to further speed up GenAI inference with the least possible impact on model fidelity.

5.2 Model Support

We strive to streamline our techniques so the time from a new model or feature to an optimized model is as short as possible, giving our community the fastest path to deployment. We’ll continue to expand LLM/diffusion model support, invest more in LLMs with multi-modality (vision, video, audio, image generation, and action), and continuously expand our model support based on community interests.

5.3 Platform & Other Support

Model-Optimizer's explicit quantization will be part of the upcoming NVIDIA DriveOS releases. We recently added an end-to-end BEVFormer INT8 example in NVIDIA DL4AGX, with more model support coming soon for automotive customers. Model-Optimizer also has planned support for ONNX FP4 on DRIVE Thor.

In Q4 2024, Model-Optimizer added formal support for Windows (see Model-Optimizer-Windows), targeting Windows RTX PC systems with tight integration with the Windows ecosystem, such as torch.onnx.export, HuggingFace-Optimum, GenAI, and Olive. It currently supports quantization methods such as INT4 AWQ, INT8, and FP8, and we’ll expand to more techniques suitable for Windows.
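
As a rough sketch of the ONNX PTQ path used on Windows, the snippet below calls the ONNX quantization entry point in modelopt.onnx.quantization. The argument names and supported modes are assumptions to be checked against the ModelOpt-Windows documentation for your version; the file paths are placeholders.

```python
# Hypothetical sketch of ONNX PTQ with ModelOpt-Windows; argument names are assumptions,
# verify against the modelopt.onnx.quantization documentation for your installed version.
import numpy as np
from modelopt.onnx.quantization import quantize

calibration_data = np.load("calib_inputs.npy")  # representative inputs captured for the exported model

quantize(
    onnx_path="model.onnx",          # model exported via torch.onnx.export or Optimum
    quantize_mode="int8",            # e.g. "int8" or "fp8"; INT4 AWQ has its own flow
    calibration_data=calibration_data,
    output_path="model.quant.onnx",
)
```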
