[Frontend][TPU] Enforce user input key args to reduce chance of large performance degradation #17145

Open · wants to merge 1 commit into main from user-arg

Conversation

@Chenyaaang (Contributor) commented on Apr 24, 2025

Add arg check for vllm serve subcommand to check user input --max-num-batched-tokens, --max-num-seqs and --max-model-len
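For example, on TPU the serve command would then need to pass all three flags explicitly. An illustrative invocation, reusing the values from the Helm chart change in this PR (the values themselves are workload-dependent, not recommendations):

```bash
vllm serve /data/ --served-model-name opt-125m --dtype bfloat16 \
    --max-num-batched-tokens 2048 --max-num-seqs 16 --max-model-len 2048
```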


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the frontend label Apr 24, 2025
@Chenyaaang Chenyaaang force-pushed the user-arg branch 2 times, most recently from ab5055d to 1b5b176 Compare April 24, 2025 23:53
@mergify mergify bot added the documentation label Apr 24, 2025
@Chenyaaang Chenyaaang force-pushed the user-arg branch 2 times, most recently from 2406788 to c444c96 Compare April 25, 2025 20:16
@Chenyaaang Chenyaaang changed the title [Frontend] Enforce user input key args to reduce chance of large performance degradation [Frontend][TPU] Enforce user input key args to reduce chance of large performance degradation Apr 25, 2025
…_len to reduce chance of perf degradation

Signed-off-by: Chenyaaang <chenyangli@google.com>
@@ -8,7 +8,7 @@ image:
# -- Image tag
tag: "latest"
# -- Container launch command
command: ["vllm", "serve", "/data/", "--served-model-name", "opt-125m", "--dtype", "bfloat16", "--host", "0.0.0.0", "--port", "8000"]
command: ["vllm", "serve", "/data/", "--served-model-name", "opt-125m", "--dtype", "bfloat16", "--host", "0.0.0.0", "--port", "8000", "--max-num-batched-tokens", "2048", "--max-num-seqs", "16", "--max-model-len", "2048"]
Collaborator

Is it used by TPU?

Contributor Author (@Chenyaaang)

Thanks for the comment. It's run by lint on CPU; I modified it before I added the platform check. I'll remove it.

# Ensure that --max-num-batched-tokens, --max-num-seqs, --max-model-len
# are passed within command on TPU.
from vllm.platforms import current_platform
if current_platform.is_tpu():
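The excerpt above is truncated. A minimal sketch of how such a frontend-level check could look (a hypothetical helper, not necessarily the exact code in this PR; it assumes the parsed argparse namespace uses the usual underscore attribute names):

```python
import argparse

from vllm.platforms import current_platform


def _check_tpu_required_args(args: argparse.Namespace) -> None:
    # Hypothetical helper: on TPU, require the three tuning flags explicitly
    # instead of silently falling back to generic defaults.
    if not current_platform.is_tpu():
        return
    required = {
        "--max-num-batched-tokens": args.max_num_batched_tokens,
        "--max-num-seqs": args.max_num_seqs,
        "--max-model-len": args.max_model_len,
    }
    missing = [flag for flag, value in required.items() if value is None]
    if missing:
        raise ValueError(
            f"On TPU, please set {', '.join(missing)} explicitly; unsuitable "
            "defaults can cause large performance degradation.")
```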
Collaborator

We cannot put it in platform/tpu.py because the args are no longer None there (the defaults have already been filled in by then). cc @NickLucche @alexm-redhat @mgoin

@@ -289,6 +289,19 @@ def validate_parsed_serve_args(args: argparse.Namespace):
        raise TypeError("Error: --enable-reasoning requires "
                        "--reasoning-parser")

    # Ensure that --max-num-batched-tokens, --max-num-seqs, --max-model-len
Collaborator
Could you add some description about why we want to make sure these arguments are passed?

@mgoin (Member) left a comment

I don't understand how requiring these arguments to be set by the user reduces the chance for performance degradation. The user still needs to become an expert to figure out the right values in this case.

I think we should instead focus on improving the default parameters given knowledge about the hardware and model. See this section for an example where we increase the default values when deploying on hardware with more memory available:

vllm/vllm/engine/arg_utils.py, lines 1609 to 1622 (at 52b4f4a):

if device_memory >= 70 * GiB_bytes:
    # For GPUs like H100 and MI300x, use larger default values.
    default_max_num_batched_tokens = {
        UsageContext.LLM_CLASS: 16384,
        UsageContext.OPENAI_API_SERVER: 8192,
    }
    default_max_num_seqs = 1024
else:
    # TODO(woosuk): Tune the default values for other hardware.
    default_max_num_batched_tokens = {
        UsageContext.LLM_CLASS: 8192,
        UsageContext.OPENAI_API_SERVER: 2048,
    }
    default_max_num_seqs = 256

I propose we start a section for TPU here.
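A hedged sketch of what such a TPU section could look like, following the structure of the block quoted above (the TPU numbers below are placeholders to be replaced by benchmark-derived values, not settled defaults):

```python
from vllm.platforms import current_platform

if current_platform.is_tpu():
    # Placeholder TPU defaults: smaller values keep XLA compilation time
    # manageable; the real numbers would come from benchmarking.
    default_max_num_batched_tokens = {
        UsageContext.LLM_CLASS: 4096,
        UsageContext.OPENAI_API_SERVER: 2048,
    }
    default_max_num_seqs = 256
elif device_memory >= 70 * GiB_bytes:
    # For GPUs like H100 and MI300x, use larger default values.
    default_max_num_batched_tokens = {
        UsageContext.LLM_CLASS: 16384,
        UsageContext.OPENAI_API_SERVER: 8192,
    }
    default_max_num_seqs = 1024
else:
    default_max_num_batched_tokens = {
        UsageContext.LLM_CLASS: 8192,
        UsageContext.OPENAI_API_SERVER: 2048,
    }
    default_max_num_seqs = 256
```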

@yaochengji (Collaborator)

> I don't understand how requiring these arguments to be set by the user reduces the chance for performance degradation.

I remember our idea was that the default values of --max-num-batched-tokens and --max-num-seqs might be too large for TPU, which hurts compilation time. And --max-model-len is used in kernel tuning, so it should be set to reflect the actual workload. cc @yarongmu-google @bythew3i

> I think we should instead focus on improving the default parameters given knowledge about the hardware and model.

Great suggestion! My rough thought is to set max-num-batched-tokens based on roofline analysis, which requires knowledge of the hardware FLOPS and HBM bandwidth. max-num-seqs should be set large enough that KV-cache utilization can get close to 100% (assuming we have a loose TPOT restriction); that requires the model weight size, the HBM capacity, and most-model-len (most-model-len reflects the context length of most requests).
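A back-of-the-envelope version of that roofline idea, as a sketch under stated assumptions (bf16 weights, decode-dominated steps, hypothetical helper name, illustrative hardware numbers):

```python
def estimate_max_num_batched_tokens(peak_flops: float,
                                    hbm_bandwidth: float,
                                    pad_to_multiple: int = 256) -> int:
    # Roofline reasoning: per step, decode compute scales with the number of
    # batched tokens (~2 FLOPs per parameter per token), while weight traffic
    # (~2 bytes per parameter in bf16) is paid once. The step becomes
    # compute-bound roughly when the token count reaches the hardware
    # ops:byte ratio, so that ratio is a natural default.
    critical_tokens = peak_flops / hbm_bandwidth
    # Round up to a multiple that plays nicely with padded/precompiled shapes.
    return int(-(-critical_tokens // pad_to_multiple) * pad_to_multiple)


# Illustrative TPU-v5e-like numbers (not official specs):
# ~197 TFLOP/s bf16, ~819 GB/s HBM bandwidth -> ~240 tokens -> rounded to 256.
print(estimate_max_num_batched_tokens(197e12, 819e9))
```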

@NickLucche (Contributor) left a comment

I agree with Michael here: I think we should focus on providing "good" defaults while leaving the user the flexibility to adjust them based on their use case, instead of trying to predict it.

I think we can start by implementing some of Chengji's ideas by detecting the TPU version in arg_utils.py. We use tpu-info in CI, but we can likely grab that info at a lower level.

IMO, I wouldn't stray too far from the logic that is already present, or at least we should strive for some level of uniformity.
I.e., I feel the addition of something like most-model-len should trigger a broader discussion. Hence I would stick with benchmark-derived defaults to start and align with GPUs.

@bvrockwell (Contributor)

Thanks @Chenyaaang, is there a less intrusive way to abstract the class so it can be extended for TPU only?

I think adding hardware-specific branching logic like this is undesirable.

@yaochengji

@bvrockwell (Contributor)

Also, requiring max-model-len is reasonable for TPU (assuming we can extend the class in a non-intrusive way), but I'm wondering if there are general assumptions we can make about the other two, given that max-model-len is provided?

A few options, with pros and cons:

  1. Estimate them at launch:
  • Use a general heuristic to estimate pretty good values for --max-num-batched-tokens and --max-num-seqs (given that max-model-len is required).
  • Throw a warning saying this is suboptimal (but might be okay), and direct the user in the warning log to tune them if they want to improve performance.
  • We could use metadata from the HF config file to approximate pretty good values together with max-model-len (a rough sketch of such a heuristic follows after this list).
  2. Check a few good options during warm-up/pre-compilation: As another option, or in addition to 1 above, we could run a few checks during warm-up, right? The obvious con is that this extends compilation time. Maybe we enable it with an environment variable (e.g. TPU_WARM_UP_TUNING_TRIES=3, defaulting to 1), something like that.
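For option 1, a rough sketch of what such a heuristic could look like using HF-config metadata (hypothetical helper; a GQA-style KV layout and bf16 KV cache are assumed, and all numbers in the example are illustrative):

```python
def estimate_max_num_seqs(hbm_capacity_bytes: int,
                          weight_bytes: int,
                          max_model_len: int,
                          num_layers: int,
                          num_kv_heads: int,
                          head_dim: int,
                          kv_dtype_bytes: int = 2) -> int:
    # Per-token KV footprint: K and V tensors for every layer.
    kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * kv_dtype_bytes
    # Memory left for KV cache after loading the weights.
    free_bytes = hbm_capacity_bytes - weight_bytes
    # Pessimistically assume every sequence grows to max_model_len.
    return max(1, free_bytes // (max_model_len * kv_bytes_per_token))


# Example: 16 GiB HBM, 2.5 GiB of weights, 2048 context, 24 layers,
# 8 KV heads, head_dim 128, bf16 KV cache -> 72 sequences.
print(estimate_max_num_seqs(16 * 2**30, int(2.5 * 2**30), 2048, 24, 8, 128))
```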

@Chenyaaang (Contributor Author)

Thanks for all the comments. The initial request was brought up by @bvrockwell, and the goal is to make sure customers are aware of the args they are using, so there's no unintended perf degradation. I understand that my current approach is intrusive and also makes it hard for customers to come up with good numbers.

I'll put the new implementation in a separate PR, but before closing this one I want to list the modifications I'll make and get your approval before implementing them. @bvrockwell @mgoin @NickLucche @yaochengji

  1. If --max-model-len is not passed, the default behavior is to derive it from the HF config; add a log.warning recommending that the user set this arg explicitly (see the sketch after this list).
  2. Add TPU-specific default values for max-num-batched-tokens in arg_utils.py. This is mainly done by roofline estimation; we want to make full use of the compute resources.
  3. max-num-seqs can be inferred from the KV cache; it can later be overridden after initialize_kv_cache.
  4. Items 2 and 3 will be implemented in separate PRs.
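A small sketch of item 1 (hypothetical helper and logger names; max_position_embeddings is one common HF-config field the length could be derived from):

```python
import logging

logger = logging.getLogger(__name__)


def resolve_max_model_len(user_max_model_len, hf_config):
    # Hypothetical item-1 behavior: fall back to the HF config, but warn so
    # the user knows an explicit --max-model-len is recommended on TPU.
    if user_max_model_len is not None:
        return user_max_model_len
    derived = getattr(hf_config, "max_position_embeddings", None)
    logger.warning(
        "--max-model-len was not set; deriving it from the HF config (%s). "
        "On TPU this value feeds kernel tuning, so setting it to your real "
        "workload length is recommended.", derived)
    return derived
```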

@yaochengji (Collaborator)

Thanks @Chenyaaang,

I suggest we implement 1 & 2 first.

For 3, it also depends on the average actual model length, and the vLLM server doesn't have enough knowledge of that when the server starts.

Labels: documentation, frontend
6 participants