[Frontend][TPU] Enforce user input key args to reduce chance of large performance degradation #17145
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Force-pushed from ab5055d to 1b5b176
Force-pushed from 2406788 to c444c96
…_len to reduce chance of perf degradation
Signed-off-by: Chenyaaang <chenyangli@google.com>
@@ -8,7 +8,7 @@ image:
   # -- Image tag
   tag: "latest"
   # -- Container launch command
-  command: ["vllm", "serve", "/data/", "--served-model-name", "opt-125m", "--dtype", "bfloat16", "--host", "0.0.0.0", "--port", "8000"]
+  command: ["vllm", "serve", "/data/", "--served-model-name", "opt-125m", "--dtype", "bfloat16", "--host", "0.0.0.0", "--port", "8000", "--max-num-batched-tokens", "2048", "--max-num-seqs", "16", "--max-model-len", "2048"]
Is it used by TPU?
Thanks for the comment. It's run by lint on CPU; I modified it before I added the platform check, so I'll remove it.
# Ensure that --max-num-batched-tokens, --max-num-seqs, --max-model-len
# are passed within command on TPU.
from vllm.platforms import current_platform
if current_platform.is_tpu():
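For context, a minimal sketch of how this check could be completed inside validate_parsed_serve_args. The helper name and error wording are illustrative only (not the PR's actual code), and it assumes the argparse dests mirror the flag names with underscores:

from argparse import Namespace

def _validate_tpu_args(args: Namespace) -> None:
    # Hypothetical helper: require the three args to be set explicitly on TPU.
    required = ("max_num_batched_tokens", "max_num_seqs", "max_model_len")
    missing = [name for name in required if getattr(args, name, None) is None]
    if missing:
        flags = ", ".join("--" + name.replace("_", "-") for name in missing)
        raise ValueError(
            "The following arguments must be set explicitly on TPU to avoid "
            f"unintended performance degradation: {flags}")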
We cannot put it in platform/tpu.py because the args are not None there. cc @NickLucche @alexm-redhat @mgoin
@@ -289,6 +289,19 @@ def validate_parsed_serve_args(args: argparse.Namespace):
        raise TypeError("Error: --enable-reasoning requires "
                        "--reasoning-parser")

    # Ensure that --max-num-batched-tokens, --max-num-seqs, --max-model-len
Could you add some description about why we want to make sure these arguments are passed?
I don't understand how requiring these arguments to be set by the user reduces the chance for performance degradation. The user still needs to become an expert to figure out the right values in this case.
I think we should instead focus on improving the default parameters given knowledge about the hardware and model. See this section for an example of how we increase the default values when deploying on hardware with more memory available:
Lines 1609 to 1622 in 52b4f4a
if device_memory >= 70 * GiB_bytes:
    # For GPUs like H100 and MI300x, use larger default values.
    default_max_num_batched_tokens = {
        UsageContext.LLM_CLASS: 16384,
        UsageContext.OPENAI_API_SERVER: 8192,
    }
    default_max_num_seqs = 1024
else:
    # TODO(woosuk): Tune the default values for other hardware.
    default_max_num_batched_tokens = {
        UsageContext.LLM_CLASS: 8192,
        UsageContext.OPENAI_API_SERVER: 2048,
    }
    default_max_num_seqs = 256
I propose we start a section for TPU here.
I remember our idea was the default value of
Great suggestion! My rough thought is to set max-num-batched-tokens based on roofline analysis, which requires information about the hardware's FLOPS and HBM bandwidth.
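For illustration, a back-of-the-envelope version of that roofline estimate. The helper below and the TPU v5e figures in the example (roughly 197 bf16 TFLOP/s and 819 GB/s of HBM bandwidth) are my own illustrative assumptions, not values taken from this PR:

def compute_bound_batch_tokens(peak_flops: float,
                               hbm_bandwidth: float,
                               bytes_per_param: float = 2.0) -> int:
    # A decode step is roughly weight-bound: it streams every weight from HBM
    # once (N * bytes_per_param bytes) and does about 2 FLOPs per weight per
    # token in the batch (2 * N * B FLOPs). Setting the two times equal gives
    # the batch-token count at which the step becomes compute-bound.
    return int(peak_flops / hbm_bandwidth * (bytes_per_param / 2))

# Example with publicly quoted TPU v5e figures (bf16 weights).
print(compute_bound_batch_tokens(197e12, 819e9))  # -> 240

A default in this neighborhood, rounded to a convenient power of two and then validated by benchmarks, would be one way to seed a per-generation table.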
I agree with Michael here: I think we should focus on providing "good" defaults while leaving the user the flexibility to adjust them based on their use case, instead of trying to predict it.
I think we can start by implementing some of Chengji's ideas by detecting the TPU version in arg_utils.py. We use tpu-info on CI, but we can likely grab that info at a lower level.
IMO, I wouldn't stray too far from the logic that is already present, or at least we should strive for some level of uniformity. I.e., I feel the addition of something like most-model-len should trigger a broader discussion. Hence I would stick to benchmark-derived defaults to start and align with GPUs.
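To make that concrete, here is one possible shape for a benchmark-derived, per-generation defaults table in arg_utils.py. The detection step is deliberately left abstract, and every value below is a placeholder I made up rather than a number from this PR or from vLLM:

from vllm.usage.usage_lib import UsageContext

# Placeholder defaults keyed by TPU generation, mirroring the existing
# device_memory branch for GPUs:
# tpu_version -> (default_max_num_batched_tokens, default_max_num_seqs)
TPU_DEFAULTS = {
    "v5e": ({UsageContext.LLM_CLASS: 8192,
             UsageContext.OPENAI_API_SERVER: 2048}, 256),
    "v6e": ({UsageContext.LLM_CLASS: 16384,
             UsageContext.OPENAI_API_SERVER: 8192}, 512),
}

def tpu_defaults(tpu_version: str):
    # Fall back to the most conservative entry when the generation is unknown
    # (e.g. a chip released after this table was written).
    return TPU_DEFAULTS.get(tpu_version, TPU_DEFAULTS["v5e"])

How the version is detected (tpu-info on CI versus something lower level, as mentioned above) is orthogonal to the table itself.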
Thanks @Chenyaaang, is there a less intrusive way to abstract the class to extend for TPU only? I think adding hardware-specific branching logic like this is undesirable.
Also, requiring max-model-len is reasonable for TPU (assuming we can extend the class in a non-intrusive way), but I'm wondering if there are general assumptions we can make about the other two, given that max-model-len is provided. A few options with pros and cons:
Thanks for all the comments. The initial request was brought up by @bvrockwell, and the goal is to make sure customers are aware of the args they are using, so there's no unintended perf degradation. I understand my current approach is intrusive and also makes it hard for customers to input good numbers. I'll put the new implementation in a separate PR, but before closing this PR, I want to list out the modifications I'll make and get your approval before I implement them. @bvrockwell @mgoin @NickLucche @yaochengji
Thanks @Chenyaaang, I suggest we implement 1 & 2 first. For 3, it also depends on the average actual model length, and the vLLM server doesn't have enough knowledge of this when the server starts.
Add an arg check for the vllm serve subcommand to check the user input --max-num-batched-tokens, --max-num-seqs and --max-model-len.