[V1] Add request-level, per-step acceptance counts tracking for spec dec. #16367
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
vllm/v1/core/sched/scheduler.py (Outdated)

    for i in range(len(generated_token_ids)):
        if request.spec_token_acceptance_counts is not None:
nit: the if condition can be hoisted outside the for loop to avoid repeated checks
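A sketch of the suggested refactor (paraphrasing the scheduler snippet above, not the exact upstream code):

```python
# Hoist the None check out of the loop so it runs once per request
# instead of once per generated token.
if request.spec_token_acceptance_counts is not None:
    for i in range(len(generated_token_ids)):
        request.spec_token_acceptance_counts[i] += 1
```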
    stop_reason: Union[int, str, None],
    spec_token_acceptance_counts: Optional[list[int]]
can we please add the dim of this list: [bs] or [num_spec_tokens + 1]?
ah sure. it's [num_spec_tokens + 1]. will add it to the annotation above!
vllm/v1/core/sched/scheduler.py (Outdated)

@@ -590,6 +590,10 @@ def update_from_output(
                num_draft_tokens=len(scheduled_spec_token_ids),
                num_accepted_tokens=len(generated_token_ids) - 1)

            for i in range(len(generated_token_ids)):
                if request.spec_token_acceptance_counts is not None:
                    request.spec_token_acceptance_counts[i] += 1
I think a couple more metrics need to be added here:
- some kind of `num_spec_proposals`, which can be used to divide the accepted counts to get the accepted count per proposal. Probably `request.spec_token_acceptance_counts[0]` would mean the same thing.
- `request.num_accepted_tokens` like this, which can be used to find the overall Accepted Length per request. This is the longest prefix with accepted tokens at each step.
- yea, `request.spec_token_acceptance_counts[0]` should be the same thing.
- didn't quite get the point. so `request.spec_token_acceptance_counts` tracks the accepted token counts at each speculation position. `sum(request.spec_token_acceptance_counts)` will be the total number of tokens accepted so far, and `sum(request.spec_token_acceptance_counts) / request.spec_token_acceptance_counts[0]` will be the acceptance length, or mean number of tokens generated per forward pass. if this doesn't address your concern, can you clarify more?
It makes sense. I missed that it is using `for i in range(len(generated_token_ids)):`, so all the positions are accepted.
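A standalone sketch of those formulas with made-up numbers (not tied to a real Request object), to make the bookkeeping concrete:

```python
# spec_token_acceptance_counts has length num_spec_tokens + 1; index 0 counts
# forward passes (every pass emits the verified/bonus token), index k > 0
# counts tokens accepted at speculation position k.
spec_token_acceptance_counts = [10, 8, 5, 1]  # made-up values

total_tokens_generated = sum(spec_token_acceptance_counts)        # 24
num_forward_passes = spec_token_acceptance_counts[0]              # 10
acceptance_length = total_tokens_generated / num_forward_passes   # 2.4 tokens per pass
```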
examples/offline_inference/eagle.py (Outdated)

    print("-" * 50)
    print(f"mean acceptance length: \
        {sum(acceptance_counts) / acceptance_counts[0]:.2f}")
This computes the Accept Rate per spec head/step but does not compute the Accept Length as done here and discussed in this design doc. In V1, we want to change the definition of AL back to what academia uses, where if we propose T1, T2, T3, T4 and T3 gets rejected, then we say AL is 2 instead of 3 (V0).
This does compute acceptance length I believe : )
For example, let's say `acceptance_counts` is [10, 8, 5, 1]. Then this means there were 10 forward passes (proposal + verification); 8, 5, and 1 tokens were accepted at speculation steps 1, 2, and 3 respectively; and a total of `sum([10, 8, 5, 1])` = 24 tokens were generated. So the mean acceptance length is simply 24 / 10 = 2.4.
This `acceptance_counts` also helps understand the marginal gain from each additional speculation step.
That's a good explanation. And to check our understanding ... acceptance rate would be the proportion of draft tokens that were accepted, so:
acceptance_rate = (8 + 5 + 1) / (10 * 3)
There were 10 steps, and 3 draft tokens were proposed per step, but only 14 of those were accepted. Right?
yes the formula seems to align with the definition, though I feel like the per-step acceptance rate is probably more helpful ([8 / 10, 5 / 10, 1 / 10]).
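A small sketch of both views with the same made-up numbers, since the two rates answer slightly different questions:

```python
acceptance_counts = [10, 8, 5, 1]              # made-up example from the thread
num_drafts = acceptance_counts[0]              # 10 proposal+verification steps
num_spec_tokens = len(acceptance_counts) - 1   # 3 draft tokens proposed per step

# Per-step acceptance rate: fraction of steps where position k was accepted.
per_step_rate = [c / num_drafts for c in acceptance_counts[1:]]   # [0.8, 0.5, 0.1]

# Overall acceptance rate: accepted draft tokens / proposed draft tokens.
overall_rate = sum(acceptance_counts[1:]) / (num_drafts * num_spec_tokens)  # 14 / 30 ≈ 0.47
```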
Main thing is to try to use `SpecDecodingStats` and do the aggregation on the frontend side. Thanks!
vllm/v1/engine/__init__.py (Outdated)

@@ -101,6 +102,7 @@ class EngineCoreOutput(
    finish_reason: Optional[FinishReason] = None
    stop_reason: Union[int, str, None] = None
    events: Optional[list[EngineCoreEvent]] = None
    spec_token_acceptance_counts: Optional[list[int]] = None
Ok, so we're sending the per-request running count from the frontend to the engine, aggregating the per-request acceptance counts for the step, and sending back an updated running count.
In V1, we've been following a pattern of avoiding aggregating metrics in the engine core, so if you look at e.g. `LoggingStatLogger`, we aggregate `num_generation_tokens` per step there - the aggregation isn't happening in the engine:

    def _track_iteration_stats(self, iteration_stats: IterationStats):
        # Save tracked stats for token counters.
        self.num_prompt_tokens.append(iteration_stats.num_prompt_tokens)
        self.num_generation_tokens.append(
            iteration_stats.num_generation_tokens)

In the case of Prometheus, we're using its Counter abstraction to aggregate.
For per-request counts like this, we're doing this sort of aggregation:

    for finished_request in iteration_stats.finished_requests:
        ...
        self.histogram_num_prompt_tokens_request.observe(
            finished_request.num_prompt_tokens)

So, generally speaking, I'm expecting that we would send per-request acceptance count updates from the engine core via `SchedulerStats.SpecDecodingStats`.
Could you give that a go and see how it works out?
Also, as part of adding this, I would also like us to figure out how to expose it as a Prometheus metric 👍
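For illustration, a minimal sketch of what that frontend-side aggregation could look like; the class and field names are assumptions for this sketch, not vLLM's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class SpecDecodingStats:
    # Per-step payload sent from the engine core (field names illustrative).
    num_drafts: int = 0
    num_accepted_tokens_per_pos: list[int] = field(default_factory=list)

class SpecDecodingLogger:
    """Frontend-side aggregation, mirroring the LoggingStatLogger pattern."""

    def __init__(self, num_spec_tokens: int):
        self.num_drafts = 0
        self.num_accepted_tokens_per_pos = [0] * num_spec_tokens

    def observe(self, stats: SpecDecodingStats) -> None:
        # All aggregation happens here, outside the engine core.
        self.num_drafts += stats.num_drafts
        for pos, count in enumerate(stats.num_accepted_tokens_per_pos):
            self.num_accepted_tokens_per_pos[pos] += count

    def log(self) -> None:
        if self.num_drafts:
            rates = [c / self.num_drafts
                     for c in self.num_accepted_tokens_per_pos]
            print(f"per-position acceptance rate: {rates}")
```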
really appreciate the primer on the design principles! I will give it a try. Regarding the Prometheus metric, are you suggesting we add it in this PR, or simply some discussion?
I'd add it in the PR so that we avoid collecting a metric - or aggregating it in a certain way - that would make it difficult to expose in an appropriate way in Prometheus.
examples/offline_inference/eagle.py
Outdated
sampling_params = SamplingParams(temperature=args.temp, max_tokens=256) | ||
|
||
outputs = llm.generate(prompt_token_ids=prompt_ids, | ||
sampling_params=sampling_params) |
I'm guessing there is no functional change intended with the above delta? It is hard as a reviewer to verify that, though.
Best practice is to limit a PR to a single logical change - unless I'm missing something, the above changes are unrelated to acceptance counts, so just remove them from this PR?
I think the main branch of the author was outdated and the Git UI is comparing the author's local main branch with the author's feature branch. Most of the changes highlighted in this file in this PR were made in an older PR: #16100
@ekagra-ranjan Thanks for the explanation : ) yea I think I messed it up when applying the stashed changes. the core changes are from the previous commit you linked. @markmc There are indeed some necessary changes towards the end on retrieving the spec token acceptance counts for v1. I will try to avoid such incidents in the future.
Ah, thanks @ekagra-ranjan. Yes, @luyuzhe111 you'll want to rebase and resolve the conflicts:
$ git fetch origin
$ git checkout luyuzhe111/spec_dec_stats
$ git rebase origin/main
Auto-merging examples/offline_inference/eagle.py
CONFLICT (content): Merge conflict in examples/offline_inference/eagle.py
Auto-merging vllm/v1/core/sched/scheduler.py
Auto-merging vllm/v1/engine/__init__.py
Auto-merging vllm/v1/engine/processor.py
Auto-merging vllm/v1/request.py
CONFLICT (content): Merge conflict in vllm/v1/request.py
error: could not apply 104dba492... add request level, per-step acceptance counts tracking for spec dec
...
Force-pushed from 104dba4 to d21afbf (Signed-off-by: Bryan Lu <yuzhelu@amazon.com>)
Hello Mark @markmc, I made an attempt to refine the design and address your comments. Overall, I made the following changes:
I did not take Prometheus into account since I'm not too confident about the best design practice. I'm also hoping to get this PR merged soon so that people can begin debugging more easily. Could you help me a bit if you think it's best we also add it as part of this PR? Thanks, and I would appreciate your review!
    for output in outputs:
        for step, count in enumerate(
                output.spec_token_acceptance_counts[0]):
            acceptance_counts[step] += count
There are a few things to note here:
1. The `RequestOutput.metrics` API in general is not implemented in V1 (and maybe no longer works in V0). So far, I would say there is no decision whether to retain this API in the long term.
2. With Prometheus, I would say we will expose the "num_accepted_tokens per draft position" metric as an aggregate across all requests. See this design doc.
3. A viable replacement for the `RequestOutput.metrics` API might be to expose the in-memory Prometheus metrics via an API. In which case, we would only be exposing via the API the same metrics that are exposed via Prometheus, so I don't want to introduce a metric that will not be available in Prometheus.
4. What's actually being printed below is the mean acceptance rate, which can be more easily calculated as `num_accepted_tokens_total / num_drafts`, which is much more in line with what we will expose through Prometheus.
@markmc regarding 1, 2, 3, they all make sense.
regarding 4, it's indeed the same, but we also have the option of looking at the acceptance rate at each position (should just be num_accepted_tokens per draft position) since we have kept track of the acceptance counts. I only computed the AL for simplicity since we are not doing finer-grained analysis yet.
also a quick clarification question: are you suggesting that we actually don't need this per-request `spec_token_acceptance_counts` thing once Prometheus is ready? My understanding is that Prometheus does not track request-level stats. But if it does, then we probably don't need this PR, since I also feel like `output.spec_token_acceptance_counts` is a bit ugly.
In the design doc, I'm proposing adding:
- `num_drafts`, so mean acceptance length can be calculated
- `num_accepted_tokens_per_pos`, to track the acceptance rate per position

But aggregating across all requests - you would not be able to look at the acceptance counts of individual requests.
However, note that Prometheus does allow you to slice according to time intervals - so with a long-running vLLM instance, you can look at these aggregated-across-all-requests metrics over whatever time period you choose.
How would you imagine using per-request stats?
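As an illustration of how such aggregated metrics might be exported, here is a hedged sketch using the prometheus_client library; the metric names, labels, and helper function are hypothetical, not vLLM's actual metrics:

```python
from prometheus_client import Counter

# Illustrative metric names; vLLM's actual metric names may differ.
counter_num_drafts = Counter(
    "vllm_spec_decode_num_drafts", "Number of draft (proposal) steps.")
counter_accepted_per_pos = Counter(
    "vllm_spec_decode_num_accepted_tokens",
    "Accepted draft tokens, by speculation position.",
    labelnames=["position"])

def record_step(accepted_lengths: list[int], num_spec_tokens: int) -> None:
    """accepted_lengths[i] = number of draft tokens accepted for request i this step."""
    for accepted in accepted_lengths:
        counter_num_drafts.inc()
        for pos in range(min(accepted, num_spec_tokens)):
            counter_accepted_per_pos.labels(position=str(pos)).inc()
```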
In addition to debugging, another use case for per-request stats is fine-grained benchmarking. For example, I might want to understand the acceptance length for different tasks (summarization, coding), and `RequestOutput.metrics` used to be very helpful for that.
Per-request stats might also help identify outliers, for example, requests with really poor acceptance counts, to shed light on how to further improve the speculator.
Sorry if my latest comment is a bit frustrating - I spent some time looking at this today through the lens of adding new Prometheus metrics:
Will dig into this more tomorrow 👍
@markmc Oh no, not at all! I was just trying to say I want to avoid spending too much time getting it to work to the extent of blocking others, since I'm not an expert. Really appreciate your help and guidance!
Oh, I also meant to say ... we should be a bit cautious about getting new APIs right because they're difficult/disruptive to remove later. Temporarily printing metrics as debug logging might be a quicker path to support folks getting V1 spec decoding working.
@markmc Yep, totally understood. This PR is indeed quite disruptive (I don't personally like it) and I should have taken a more top-down approach (starting from a design doc like you did)! Meanwhile, I kinda disagree that printing metrics is the most helpful way for debugging, since that's how we ended up with uncaught bugs in v0, in my opinion. Once the printed metrics look somehow "good enough", it's hard to say whether the model is drafting as accurately as it should. Personally I prefer the ability to look at acceptance counts for individual requests for head-to-head comparison. But I guess this code can reside somewhere and does not have to be merged into the main codebase? Or maybe Prometheus can already do the job. I could not clearly tell, so I left a question above. Thanks again for your time and discussion!
Yep, I just mentioned it as a potential stop-gap solution if metrics take e.g. a week to merge.
Could you lay out exactly how this would work to help me understand? An example with fake data for like 10 requests, and what a head-to-head comparison would look like? Thanks!
See #16665 for logging mean acceptance length and per-position num_accepted_tokens, e.g. (that example is with ngram, not EAGLE).
Note this currently only works with the online serving API server, not the offline inference API in your example.
@markmc Hi Mark, the mean acceptance length & per-position num_accepted_tokens are super helpful. I will take a look at your PR first, since those metrics need to be there for the online server anyway!
@markmc Yea, what I meant by head-to-head comparison is, say, using the same request to get acceptance counts from both vLLM and the EAGLE repo to compare how the discrepancy accumulates across step positions. Not sure about others, but that was my plan for debugging further.
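To illustrate (with entirely made-up numbers), a head-to-head comparison over a few requests might look like this:

```python
# Hypothetical per-request acceptance counts for the same prompts run through
# vLLM and the reference EAGLE repo (fake data, num_spec_tokens = 3).
vllm_counts = {
    "req-0": [12, 9, 6, 2],
    "req-1": [20, 15, 9, 4],
    "req-2": [8, 5, 2, 0],
}
eagle_counts = {
    "req-0": [12, 10, 7, 3],
    "req-1": [20, 16, 11, 5],
    "req-2": [8, 6, 3, 1],
}

for req_id in vllm_counts:
    v, e = vllm_counts[req_id], eagle_counts[req_id]
    al_v = sum(v) / v[0]
    al_e = sum(e) / e[0]
    per_pos_gap = [ev - vv for vv, ev in zip(v[1:], e[1:])]
    print(f"{req_id}: AL vLLM={al_v:.2f} vs EAGLE={al_e:.2f}, "
          f"per-position accepted-count gap={per_pos_gap}")
```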
This pull request has merge conflicts that must be resolved before it can be merged.
@luyuzhe111 @markmc What's the status of this PR? Is there any blocker?
@WoosukKwon My understanding is that it's functionally ready but could benefit from a better design. Previously, I added this feature in v0 through the ...
@WoosukKwon @luyuzhe111 take a look at #17010 for how we can use the aggregated metrics from Prometheus for offline inferencing too.
@markmc it's great that we can use Prometheus for this purpose! what do you think of the use cases for per-request stats I mentioned above, though? To me, in LLM serving, stats aggregation over requests just seems more natural than aggregation over time. I think there will be many use cases where we end up wanting per-request stats (not just spec acceptance rate), since we care about user satisfaction at the request level; in contrast, per-time-period stats tell very little in this regard. A few request-level stats off the top of my mind that could be helpful: TTFT, max ITL, cache hit rate, etc. please correct me if I'm wrong. cc @WoosukKwon
There's a lot to unpack here, and it deserves a thoughtful discussion ... in the context of input from users with a very concrete use case (IMO). For our use case right now (developers implementing EAGLE, comparing acceptance rates between implementations), aggregating across requests is obviously fine (since that's what our script does).
AIUI, in V1 we have learned some lessons from V0 and we are cautious about carrying features over from V0, examining each feature in the context of a use case (do we really need this feature?) and trying hard for a cleaner design (with an eye to longer-term maintainability).
In terms of use cases, I think what you're describing is the possibility that some external implementations will want to aggregate per-request metrics across some other dimension than a time interval. But the set of possible metrics that such an implementation might be interested in is very large. And it's quite possible folks might want per-iteration metrics (not per-request) so e.g. you can examine the effect of batch sizes. An abstract discussion of what might be useful leads us to add significant runtime and maintenance overhead, IMO.
In terms of a clean design - it looks to me like the current ... So I see two quite distinct questions:
Hope that makes sense.
Hi @markmc, thanks for your detailed reply and explanation! I think your proposal makes a lot of sense. @WoosukKwon maybe we should prioritize getting this PR reviewed? Once it's merged, we can close this PR.
This PR adds request-level, per-step acceptance counts tracking for spec dec, as suggested in #15151 (comment), for testing the correctness of EAGLE implementations #15901.
Since this is my first shot at the V1 engine, I am pretty sure there are better designs, so feedback is highly appreciated.
Would appreciate your review @WoosukKwon @LiuXiaoxuanPKU @markmc. @ekagra-ranjan I saw you opened a similar PR earlier today that checks the overall accepted tokens aggregated from all positions and requests, so I will bug you with a review too : )