
Extend the Llama-Nemotron-Nano-8B perf-integration-tests (cpp) #4195


Merged

Conversation

venkywonka
Collaborator

@venkywonka venkywonka commented May 9, 2025

Description

Expand release-perf-regression-testing coverage of llama_v3.1_nemotron_nano_8b to map to the NIM benchmarking configs, for the cpp backend only (the PyT backend appears to have issues that need further debugging).


What’s inside

Cross product of the following:

- model: llama_v3.1_nemotron_nano_8b
- runtime: bench
- backend: [trt]
- con: [1, 250] # concurrency
- input_output_len:
  - [5000, 500]
  - [500, 2000]
  - [1000, 1000]
  - [20000, 2000]
- quant: [none, fp8]
- dtype: [bfloat16]
- maxbs: [64]

Total: 16 newly added cpp-backend perf tests in tests/integration/test_lists/qa/trt_llm_release_perf_test.yml, replacing the previous cpp-backend perf tests while preserving the pyt-backend perf tests; the sketch below enumerates the combinations.
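
For illustration, a minimal sketch of how the cross product above expands into the 16 cases. The printed test IDs are hypothetical and do not reproduce the exact entry syntax used in trt_llm_release_perf_test.yml:

```python
# Illustrative only: enumerate the cross product of the perf-test parameters above.
# The test-ID format printed here is hypothetical, not the exact YAML entry syntax.
from itertools import product

model = "llama_v3.1_nemotron_nano_8b"
concurrencies = [1, 250]
input_output_lens = [(5000, 500), (500, 2000), (1000, 1000), (20000, 2000)]
quants = ["none", "fp8"]
dtype, maxbs, backend = "bfloat16", 64, "trt"

cases = list(product(concurrencies, input_output_lens, quants))
for con, (isl, osl), quant in cases:
    print(f"{model}-bench-{backend}-{dtype}-maxbs:{maxbs}-"
          f"input_output_len:{isl},{osl}-quant:{quant}-con:{con}")

print(len(cases))  # 2 concurrencies * 4 length pairs * 2 quants = 16
```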

📊 Performance Benchmark Summary (subset)

As a sanity check, below is a perf overview of a subset (8 of the 32 cases) that spans the TRT flow on 1x H100.

| Input/Output Length | Concurrency | Precision | Request Throughput (req/s) | Output TPS (tokens/s) | P50 Latency (ms) |
|---|---|---|---|---|---|
| 5000 / 500 | 1 | BF16 | 0.2312 | 115.60 | 4328.22 |
| 5000 / 500 | 1 | FP8 | 0.4163 | 208.13 | 2397.91 |
| 500 / 2000 | 1 | BF16 | 0.0616 | 123.26 | 16203.74 |
| 500 / 2000 | 1 | FP8 | 0.1051 | 210.18 | 9516.47 |
| 5000 / 500 | 250 | BF16 | 2.0712 | 1035.62 | 117121.77 |
| 5000 / 500 | 250 | FP8 | 3.6120 | 1805.98 | 67227.43 |
| 500 / 2000 | 250 | BF16 | 0.8633 | 1726.57 | 278001.38 |
| 500 / 2000 | 250 | FP8 | 1.4663 | 2932.65 | 163683.41 |
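
As a quick consistency check on the table above, at steady state the output token throughput should roughly equal request throughput times output length. A small sketch verifying this for the reported rows (numbers copied from the table):

```python
# Sanity check: output TPS ≈ request throughput (req/s) * output length (tokens/req).
rows = [
    # (output_len, req_per_s, output_tps)
    (500, 0.2312, 115.60),
    (500, 0.4163, 208.13),
    (2000, 0.0616, 123.26),
    (2000, 0.1051, 210.18),
    (500, 2.0712, 1035.62),
    (500, 3.6120, 1805.98),
    (2000, 0.8633, 1726.57),
    (2000, 1.4663, 2932.65),
]
for osl, req_s, tps in rows:
    assert abs(req_s * osl - tps) / tps < 0.01  # each row agrees within 1%
print("all rows consistent")
```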

🧵 Observations

  • FP8 consistently improves both request throughput and token throughput across all configurations, with up to 2× higher output TPS in high concurrency scenarios.
  • P50 latency drops significantly with FP8, especially under low concurrency, showing strong inference-time gains.
  • Concurrency scaling: Increasing concurrency to 250 leads to significantly higher overall throughput, though with expected increases in tail latencies.

📂 Dataset Details

Each configuration used synthetic datasets generated with consistent parameters (512 sequences per run) via the TensorRT-LLM/benchmarks/cpp/prepare_dataset.py script.
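
For reference, a minimal sketch of what generating such a synthetic dataset could look like. The record schema ("input_ids", "output_len") and field names below are assumptions for illustration only and do not reproduce the exact format emitted by prepare_dataset.py:

```python
# Illustrative sketch only: build 512 synthetic requests with fixed input/output lengths.
# The record schema ("input_ids", "output_len") is an assumption, not the exact
# format produced by benchmarks/cpp/prepare_dataset.py.
import json
import random

def make_synthetic_dataset(num_requests=512, input_len=5000, output_len=500,
                           vocab_size=128000, seed=0):
    rng = random.Random(seed)
    return [
        {
            "input_ids": [rng.randrange(vocab_size) for _ in range(input_len)],
            "output_len": output_len,
        }
        for _ in range(num_requests)
    ]

if __name__ == "__main__":
    dataset = make_synthetic_dataset()
    with open("synthetic_5000_500.json", "w") as f:
        json.dump(dataset, f)
```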

@venkywonka venkywonka requested a review from Copilot May 9, 2025 17:02

@Copilot Copilot AI left a comment


Pull Request Overview

This PR extends the performance integration tests for llama_v3.1_nemotron_nano_8b so that the test cases reflect the updated NIM benchmarking configurations.

  • Removed previous torch and TRT backend tests
  • Added 32 new test scenarios covering both C++ (cpp) and Python (pyt) backends with various precision, batch, and sequence length configurations

@venkywonka venkywonka marked this pull request as ready for review May 9, 2025 17:06
@venkywonka venkywonka changed the title from Extend the Llama-Nemotron-Nano-8B perf-integration-tests to Extend the Llama-Nemotron-Nano-8B perf-integration-tests (cpp) May 12, 2025
@venkywonka
Collaborator Author

I'm seeing issues locally with the above configurations when using the pyt backend, so I'm limiting this PR to the cpp backend only. I don't want to check in tests that break.

@venkywonka venkywonka force-pushed the user/venky/ll-nemo-nano-perf-test-ext branch from 42fdc1f to cd3cbdc Compare May 12, 2025 16:08
@venkywonka
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #4892 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #4892 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #3544 completed with status: 'FAILURE'

@venkywonka
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #4901 [ run ] triggered by Bot

@venkywonka
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #4904 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #4901 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #4904 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #3556 completed with status: 'FAILURE'

@venkywonka venkywonka force-pushed the user/venky/ll-nemo-nano-perf-test-ext branch from cd3cbdc to 04514c0 Compare May 13, 2025 01:50
@venkywonka venkywonka requested review from schetlur-nv and ruodil May 13, 2025 16:16
@venkywonka
Collaborator Author

@LarryXFly @kaiyux can this be merged?

@venkywonka venkywonka requested a review from kaiyux May 14, 2025 14:02
@venkywonka venkywonka force-pushed the user/venky/ll-nemo-nano-perf-test-ext branch 2 times, most recently from 79c8384 to a860905 Compare May 15, 2025 13:09
@venkywonka
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #5348 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #5348 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #3901 completed with status: 'FAILURE'

@venkywonka venkywonka force-pushed the user/venky/ll-nemo-nano-perf-test-ext branch from a860905 to 0aac105 Compare May 15, 2025 20:58
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
- When validating the PyTorch tests with the isl/osl/conc/quant settings (the same ones used for the cpp backend), I'm seeing hangs that need further debugging.
- Therefore, to avoid blocking this PR, I'm removing them.
- Seeing

Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
@venkywonka venkywonka force-pushed the user/venky/ll-nemo-nano-perf-test-ext branch from 0aac105 to 74ff4d8 Compare May 16, 2025 19:10
@venkywonka
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #5531 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #5531 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4032 completed with status: 'FAILURE'

@venkywonka
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #5545 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #5545 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4046 completed with status: 'FAILURE'

@chzblych
Collaborator

/bot skip --comment "Irrelevant TOT failures"

@tensorrt-cicd
Collaborator

PR_Github #5573 [ skip ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #5573 [ skip ] completed with state SUCCESS
Skipping testing for commit 74ff4d8

@chzblych chzblych merged commit fb663b6 into NVIDIA:main May 17, 2025
3 checks passed