[Frontend][TPU] Enforce user input key args to reduce chance of large performance degradation #17145

Open · wants to merge 1 commit into main from user-arg

Conversation

@Chenyaaang (Contributor) commented on Apr 24, 2025

Add arg check for vllm serve subcommand to check user input --max-num-batched-tokens, --max-num-seqs and --max-model-len
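For example, on TPU the serve command would then need to pass all three flags explicitly. An illustrative invocation, reusing the values from the Helm chart change in this PR (the values themselves are workload-dependent, not recommendations):

```bash
vllm serve /data/ --served-model-name opt-125m --dtype bfloat16 \
    --max-num-batched-tokens 2048 --max-num-seqs 16 --max-model-len 2048
```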


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the frontend label Apr 24, 2025
@Chenyaaang Chenyaaang force-pushed the user-arg branch 2 times, most recently from ab5055d to 1b5b176 Compare April 24, 2025 23:53
@mergify mergify bot added the documentation label Apr 24, 2025
@Chenyaaang Chenyaaang force-pushed the user-arg branch 2 times, most recently from 2406788 to c444c96 Compare April 25, 2025 20:16
@Chenyaaang Chenyaaang changed the title [Frontend] Enforce user input key args to reduce chance of large performance degradation [Frontend][TPU] Enforce user input key args to reduce chance of large performance degradation Apr 25, 2025
…_len to reduce chance of perf degradation

Signed-off-by: Chenyaaang <chenyangli@google.com>
@@ -8,7 +8,7 @@ image:
# -- Image tag
tag: "latest"
# -- Container launch command
command: ["vllm", "serve", "/data/", "--served-model-name", "opt-125m", "--dtype", "bfloat16", "--host", "0.0.0.0", "--port", "8000"]
command: ["vllm", "serve", "/data/", "--served-model-name", "opt-125m", "--dtype", "bfloat16", "--host", "0.0.0.0", "--port", "8000", "--max-num-batched-tokens", "2048", "--max-num-seqs", "16", "--max-model-len", "2048"]
Collaborator

Is it used by TPU?

Contributor Author (@Chenyaaang)

Thanks for the comment. It's run by lint on CPU; I modified it before I added the platform check. I'll remove it.

# Ensure that --max-num-batched-tokens, --max-num-seqs, --max-model-len
# are passed within command on TPU.
from vllm.platforms import current_platform
if current_platform.is_tpu():
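The excerpt above is truncated. A minimal sketch of how such a frontend-level check could look (a hypothetical helper, not necessarily the exact code in this PR; it assumes the parsed argparse namespace uses the usual underscore attribute names):

```python
import argparse

from vllm.platforms import current_platform


def _check_tpu_required_args(args: argparse.Namespace) -> None:
    # Hypothetical helper: on TPU, require the three tuning flags explicitly
    # instead of silently falling back to generic defaults.
    if not current_platform.is_tpu():
        return
    required = {
        "--max-num-batched-tokens": args.max_num_batched_tokens,
        "--max-num-seqs": args.max_num_seqs,
        "--max-model-len": args.max_model_len,
    }
    missing = [flag for flag, value in required.items() if value is None]
    if missing:
        raise ValueError(
            f"On TPU, please set {', '.join(missing)} explicitly; unsuitable "
            "defaults can cause large performance degradation.")
```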
Collaborator

We cannot put it in platform/tpu.py because the args are no longer None there (the defaults have already been filled in by then). cc @NickLucche @alexm-redhat @mgoin

@@ -289,6 +289,19 @@ def validate_parsed_serve_args(args: argparse.Namespace):
        raise TypeError("Error: --enable-reasoning requires "
                        "--reasoning-parser")

    # Ensure that --max-num-batched-tokens, --max-num-seqs, --max-model-len
Collaborator
Could you add some description about why we want to make sure these arguments are passed?

@mgoin (Member) left a comment

I don't understand how requiring these arguments to be set by the user reduces the chance for performance degradation. The user still needs to become an expert to figure out the right values in this case.

I think we should instead focus on improving the default parameters given knowledge about the hardware and model. See this section for an example where we increase the default values when deploying on hardware with more memory available:

vllm/vllm/engine/arg_utils.py, lines 1609 to 1622 (at 52b4f4a):

if device_memory >= 70 * GiB_bytes:
    # For GPUs like H100 and MI300x, use larger default values.
    default_max_num_batched_tokens = {
        UsageContext.LLM_CLASS: 16384,
        UsageContext.OPENAI_API_SERVER: 8192,
    }
    default_max_num_seqs = 1024
else:
    # TODO(woosuk): Tune the default values for other hardware.
    default_max_num_batched_tokens = {
        UsageContext.LLM_CLASS: 8192,
        UsageContext.OPENAI_API_SERVER: 2048,
    }
    default_max_num_seqs = 256

I propose we start a section for TPU here.
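A hedged sketch of what such a TPU section could look like, following the structure of the block quoted above (the TPU numbers below are placeholders to be replaced by benchmark-derived values, not settled defaults):

```python
from vllm.platforms import current_platform

if current_platform.is_tpu():
    # Placeholder TPU defaults: smaller values keep XLA compilation time
    # manageable; the real numbers would come from benchmarking.
    default_max_num_batched_tokens = {
        UsageContext.LLM_CLASS: 4096,
        UsageContext.OPENAI_API_SERVER: 2048,
    }
    default_max_num_seqs = 256
elif device_memory >= 70 * GiB_bytes:
    # For GPUs like H100 and MI300x, use larger default values.
    default_max_num_batched_tokens = {
        UsageContext.LLM_CLASS: 16384,
        UsageContext.OPENAI_API_SERVER: 8192,
    }
    default_max_num_seqs = 1024
else:
    default_max_num_batched_tokens = {
        UsageContext.LLM_CLASS: 8192,
        UsageContext.OPENAI_API_SERVER: 2048,
    }
    default_max_num_seqs = 256
```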

@yaochengji (Collaborator)

> I don't understand how requiring these arguments to be set by the user reduces the chance for performance degradation.

I remember our idea was that the default values of --max-num-batched-tokens and --max-num-seqs might be too large for TPU, which hurts compilation time. And --max-model-len is used in kernel tuning, so it should be set to reflect the actual workload. cc @yarongmu-google @bythew3i

> I think we should instead focus on improving the default parameters given knowledge about the hardware and model.

Great suggestion! My rough thought is to set max-num-batched-tokens based on roofline analysis, which requires knowledge of the hardware FLOPS and HBM bandwidth. max-num-seqs should be set large enough that KV-cache utilization can get close to 100% (assuming we have a loose TPOT restriction); that requires the model weight size, the HBM capacity, and most-model-len (most-model-len reflects the context length of most requests).
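A back-of-the-envelope version of that roofline idea, as a sketch under stated assumptions (bf16 weights, decode-dominated steps, hypothetical helper name, illustrative hardware numbers):

```python
def estimate_max_num_batched_tokens(peak_flops: float,
                                    hbm_bandwidth: float,
                                    pad_to_multiple: int = 256) -> int:
    # Roofline reasoning: per step, decode compute scales with the number of
    # batched tokens (~2 FLOPs per parameter per token), while weight traffic
    # (~2 bytes per parameter in bf16) is paid once. The step becomes
    # compute-bound roughly when the token count reaches the hardware
    # ops:byte ratio, so that ratio is a natural default.
    critical_tokens = peak_flops / hbm_bandwidth
    # Round up to a multiple that plays nicely with padded/precompiled shapes.
    return int(-(-critical_tokens // pad_to_multiple) * pad_to_multiple)


# Illustrative TPU-v5e-like numbers (not official specs):
# ~197 TFLOP/s bf16, ~819 GB/s HBM bandwidth -> ~240 tokens -> rounded to 256.
print(estimate_max_num_batched_tokens(197e12, 819e9))
```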

@NickLucche (Contributor) left a comment

I agree with Michael here: I think we should focus on providing "good" defaults while leaving the user the flexibility to adjust them based on their use case, instead of trying to predict it.

I think we can start by implementing some of Chengji's ideas by detecting the TPU version in arg_utils.py. We use tpu-info in CI, but we can likely grab that info at a lower level.

IMO, I wouldn't stray too far from the logic that is already present, or at least we should strive for some level of uniformity.
I.e., I feel the addition of something like most-model-len should trigger a broader discussion. Hence I would stick with benchmark-derived defaults to start and align with GPUs.

@bvrockwell (Contributor)

Thanks @Chenyaaang, is there a less intrusive way to abstract the class so it can be extended for TPU only?

I think adding hardware-specific branching logic like this is undesirable.

@yaochengji

@bvrockwell (Contributor)

Also, requiring max-model-len is reasonable for TPU (assuming we can extend the class in a non-intrusive way), but I'm wondering if there are general assumptions we can make about the other two, given that max-model-len is provided?

A few options, with pros and cons:

  1. Estimate them at launch:
  • Use a general heuristic to estimate pretty good values for --max-num-batched-tokens and --max-num-seqs (given that max-model-len is required).
  • Throw a warning saying this is suboptimal (but might be okay), and direct the user in the warning log to tune them if they want to improve performance.
  • We could use metadata from the HF config file to approximate pretty good values together with max-model-len (a rough sketch of such a heuristic follows after this list).
  2. Check a few good options during warm-up/pre-compilation: As another option, or in addition to 1 above, we could run a few checks during warm-up, right? The obvious con is that this extends compilation time. Maybe we enable it with an environment variable (e.g. TPU_WARM_UP_TUNING_TRIES=3, defaulting to 1), something like that.
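For option 1, a rough sketch of what such a heuristic could look like using HF-config metadata (hypothetical helper; a GQA-style KV layout and bf16 KV cache are assumed, and all numbers in the example are illustrative):

```python
def estimate_max_num_seqs(hbm_capacity_bytes: int,
                          weight_bytes: int,
                          max_model_len: int,
                          num_layers: int,
                          num_kv_heads: int,
                          head_dim: int,
                          kv_dtype_bytes: int = 2) -> int:
    # Per-token KV footprint: K and V tensors for every layer.
    kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * kv_dtype_bytes
    # Memory left for KV cache after loading the weights.
    free_bytes = hbm_capacity_bytes - weight_bytes
    # Pessimistically assume every sequence grows to max_model_len.
    return max(1, free_bytes // (max_model_len * kv_bytes_per_token))


# Example: 16 GiB HBM, 2.5 GiB of weights, 2048 context, 24 layers,
# 8 KV heads, head_dim 128, bf16 KV cache -> 72 sequences.
print(estimate_max_num_seqs(16 * 2**30, int(2.5 * 2**30), 2048, 24, 8, 128))
```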

@Chenyaaang (Contributor Author)

Thanks for all the comments. The initial request was brought up by @bvrockwell, and the goal is to make sure customers are aware of the args they are using, so there's no unintended perf degradation. I understand that my current approach is intrusive and also makes it hard for customers to come up with good numbers.

I'll put the new implementation in a separate PR, but before closing this one I want to list the modifications I'll make and get your approval before implementing them. @bvrockwell @mgoin @NickLucche @yaochengji

  1. If --max-model-len is not passed, the default behavior is to derive it from the HF config; add a log.warning recommending that the user set this arg explicitly (see the sketch after this list).
  2. Add TPU-specific default values for max-num-batched-tokens in arg_utils.py. This is mainly done by roofline estimation; we want to make full use of the compute resources.
  3. max-num-seqs can be inferred from the KV cache; it can later be overridden after initialize_kv_cache.
  4. Items 2 and 3 will be implemented in separate PRs.
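A small sketch of item 1 (hypothetical helper and logger names; max_position_embeddings is one common HF-config field the length could be derived from):

```python
import logging

logger = logging.getLogger(__name__)


def resolve_max_model_len(user_max_model_len, hf_config):
    # Hypothetical item-1 behavior: fall back to the HF config, but warn so
    # the user knows an explicit --max-model-len is recommended on TPU.
    if user_max_model_len is not None:
        return user_max_model_len
    derived = getattr(hf_config, "max_position_embeddings", None)
    logger.warning(
        "--max-model-len was not set; deriving it from the HF config (%s). "
        "On TPU this value feeds kernel tuning, so setting it to your real "
        "workload length is recommended.", derived)
    return derived
```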

@yaochengji (Collaborator)

Thanks @Chenyaaang,

I suggest we implement 1 & 2 first.

For 3, it also depends on the average actual model length, and the vLLM server doesn't have enough knowledge of that when the server starts.

Labels: documentation, frontend
6 participants