[feat] [AutoDeploy] Llama-4 Support #4163

lucaslie · 2025-05-08T18:33:54Z

Description

Updated/New Factory to differentiate between the different AutoModel APIs in HF
Adding support for Llama-4 Scout
Llama-4 Maverick Support
Updated Readme
Updated unit+integration tests

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Copilot

Pull Request Overview

This PR adds support for Llama-4 within AutoDeploy by introducing a new BMM sharding transformation, updating model factory configurations, and refining related logging and flashinfer attention buffer sizes. Key changes include:

Introducing the bmm_shard transformation and its associated unit tests.
Updating logging levels in node_utils and modifying model factory values in several modules.
Adjusting input processing in the demo interface and updating configuration defaults for examples.
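For readers unfamiliar with the idea behind sharding a batched matrix multiplication, the following is a minimal pure-Python sketch. It assumes the split is made along the batch dimension and that the batch divides evenly across ranks; it simulates the ranks in a loop rather than using a real process group. The names `matmul`, `bmm`, and `sharded_bmm` are hypothetical and are not the PR's `bmm_shard` implementation.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain 2-D matrix product, applied to one batch element."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def bmm(a: List[Matrix], b: List[Matrix]) -> List[Matrix]:
    """Batched matmul: one independent product per batch index."""
    return [matmul(x, y) for x, y in zip(a, b)]


def sharded_bmm(a: List[Matrix], b: List[Matrix], world_size: int) -> List[Matrix]:
    """Simulate each rank computing its contiguous batch slice, then gather."""
    batch = len(a)
    # Assumption of this sketch: the batch divides evenly across ranks.
    assert batch % world_size == 0, "batch must be divisible by world_size"
    per_rank = batch // world_size
    gathered: List[Matrix] = []
    for rank in range(world_size):
        start, end = rank * per_rank, (rank + 1) * per_rank
        # Work that a single rank would do locally on its slice.
        gathered.extend(bmm(a[start:end], b[start:end]))
    return gathered


# Four batch elements: a[i] is (i + 1) times the 2x2 identity matrix.
a = [[[float(i + 1), 0.0], [0.0, float(i + 1)]] for i in range(4)]
b = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(4)]
assert sharded_bmm(a, b, world_size=2) == bmm(a, b)
```

Because each batch element is an independent product, concatenating the per-rank results in rank order reproduces the unsharded output.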

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_bmm_sharding.py	New unit tests for the BMM sharding transformation.
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py	Downgraded logging of residual node warnings from warning to debug.
tensorrt_llm/_torch/auto_deploy/transformations/transform.py	Added call to bmm_shard as part of the transformation pipeline.
tensorrt_llm/_torch/auto_deploy/transformations/library/sharding.py	Introduced the bmm_shard function for sharding batched matrix multiplications.
tensorrt_llm/_torch/auto_deploy/shim/interface.py	Updated default model_factory parameter for AutoDeployConfig.
tensorrt_llm/_torch/auto_deploy/shim/demollm.py	Changed call to create_input_processor to pass None instead of model.
tensorrt_llm/_torch/auto_deploy/models/hf.py	Modified factory registration and heuristics for disabling use_cache.
tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py	Increased workspace buffer allocation from 128MB to 320MB with updated notes.
examples/auto_deploy/simple_config.py, .vscode/settings.json, .vscode/launch.json	Updated configuration defaults and VSCode settings for the new changes.

Comments suppressed due to low confidence (3)

tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_bmm_sharding.py:68

Consider adding tests for the scenario where the BMM batch size is not evenly divisible by world_size to ensure the transformation correctly handles uneven distributions.

run_test(
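One possible way to reason about the uneven case raised above is to compute per-rank slice bounds explicitly. The sketch below is an assumption about how such a split could be defined (earlier ranks take one extra element); `split_bounds` is a hypothetical helper, not a function from this PR, and the PR may handle or reject uneven batches differently.

```python
from typing import List, Tuple


def split_bounds(batch_size: int, world_size: int) -> List[Tuple[int, int]]:
    """Return (start, end) per rank; the first `batch_size % world_size`
    ranks receive one extra element so the whole batch is covered."""
    base, extra = divmod(batch_size, world_size)
    bounds = []
    start = 0
    for rank in range(world_size):
        size = base + (1 if rank < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


# A batch of 10 over 4 ranks: two ranks get 3 elements, two get 2.
print(split_bounds(10, 4))
```

A test for the uneven scenario could check that the bounds are contiguous, cover the full batch, and that ranks with an empty slice (for example a batch of 2 over 4 ranks) are handled deliberately.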

tensorrt_llm/_torch/auto_deploy/utils/node_utils.py:297

[nitpick] Verify that downgrading the log level from warning to debug does not mask potentially critical issues when residual nodes have more than two users.

ad_logger.debug(f"Unexpected # of users for residuals: {res_nodes_more_users}")
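The concern about masking can be illustrated with Python's standard `logging` module. This is a generic sketch, not the behavior of `ad_logger` itself, whose configuration is not shown here; the logger name and the `emitted` helper are hypothetical.

```python
import io
import logging


def emitted(level: int) -> str:
    """Log one DEBUG and one WARNING record at the given logger level
    and return whatever actually reached the handler."""
    stream = io.StringIO()
    logger = logging.getLogger(f"demo_residuals_{level}")  # hypothetical name
    logger.setLevel(level)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    logger.debug("Unexpected # of users for residuals: ...")
    logger.warning("a message that is still logged at WARNING")
    logger.removeHandler(handler)
    return stream.getvalue()


# With the logger at WARNING, the debug-level message is dropped.
assert "Unexpected" not in emitted(logging.WARNING)
# With the logger at DEBUG, it is emitted again.
assert "Unexpected" in emitted(logging.DEBUG)
```

In other words, after the downgrade the message is only visible when the logger runs at debug verbosity, which is the trade-off the comment asks the author to confirm.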

tensorrt_llm/_torch/auto_deploy/shim/demollm.py:375

[nitpick] Confirm that 'create_input_processor' gracefully handles a None value for the model parameter; if not, consider adding explicit handling or documentation for this case.

self.input_processor = create_input_processor(None, self.tokenizer)

lucaslie · 2025-05-12T18:55:03Z

tensorrt_llm/_torch/auto_deploy/transformations/library/sharding.py

+
+    assert isinstance(gm, GraphModule), "Expecting GraphModule"
+
+    def handle_tensor(


@meenchen, here is an example of bmm sharding where either tensor can be a weight or activation tensor

lucaslie · 2025-05-13T02:43:38Z

TODO: only claim torch-compile in the support matrix, not torch-opt

lucaslie · 2025-05-14T16:27:23Z

/bot run --disable-fail-fast --extra-stage "DGX_H100-4_GPUs-PyTorch-[Post-Merge]"

tensorrt-cicd · 2025-05-14T16:33:09Z

PR_Github #5199 [ run ] triggered by Bot

tensorrt-cicd · 2025-05-14T22:34:55Z

PR_Github #5199 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #3795 completed with status: 'FAILURE'

lucaslie · 2025-05-15T02:10:37Z

/bot run --disable-fail-fast --extra-stage "DGX_H100-4_GPUs-PyTorch-[Post-Merge]"

tensorrt-cicd · 2025-05-15T02:23:07Z

PR_Github #5242 [ run ] triggered by Bot

tensorrt-cicd · 2025-05-15T06:10:47Z

PR_Github #5242 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #3831 completed with status: 'FAILURE'

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

lucaslie · 2025-05-23T03:43:21Z

will be merged via https://github.com/nv-auto-deploy/TensorRT-LLM/pull/32

lucaslie requested review from suyoggupta, sugunav14 and Fridah-nv May 8, 2025 18:33

lucaslie self-assigned this May 8, 2025

lucaslie requested a review from Copilot May 8, 2025 18:41

Copilot AI reviewed May 8, 2025

nvkgoyal reviewed May 9, 2025

examples/auto_deploy/simple_config.py (outdated, resolved)

nvkgoyal reviewed May 9, 2025

tensorrt_llm/_torch/auto_deploy/transformations/library/sharding.py (outdated, resolved)

nvkgoyal reviewed May 9, 2025

tensorrt_llm/_torch/auto_deploy/transformations/library/sharding.py (resolved)

lucaslie force-pushed the ll/llama4 branch from c291c23 to 8bedb75 Compare May 9, 2025 18:35

lucaslie commented May 12, 2025

lucaslie force-pushed the ll/llama4 branch from 1ecdc42 to 510adaa Compare May 12, 2025 20:54

lucaslie force-pushed the ll/llama4 branch 2 times, most recently from 23401ac to b4306e0 Compare May 14, 2025 16:19

lucaslie mentioned this pull request May 14, 2025

vibe-coded BMM sharding nv-auto-deploy/TensorRT-LLM#1

Closed

lucaslie force-pushed the ll/llama4 branch from 0717090 to bee7063 Compare May 14, 2025 16:26

lucaslie added the AutoDeploy label May 14, 2025

lucaslie added 6 commits May 15, 2025 15:24

[AutoDeploy] bmm sharder

ea3a759

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

[AutoDeploy] more robust handling of attention interface and input nodes

3768616

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

[AutoDeploy] eager pattern matcher new pattern

7d86c7a

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

[AutoDeploy] HF factory improvements

6e2f387

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

[AutoDeploy] configurable cache resize

80fd2f2

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

[AutoDeploy] proper process group clean up

93d2916

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

[AutoDeploy] Llama-4 support

e9cb68a

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

lucaslie force-pushed the ll/llama4 branch from 9183617 to e9cb68a Compare May 15, 2025 22:25

lucaslie closed this May 23, 2025

lucaslie mentioned this pull request May 23, 2025

[AutoDeploy] Llama-4 support + AutoModelForImageTextToText nv-auto-deploy/TensorRT-LLM#32

Merged
