
[V1] Add request-level, per-step acceptance counts tracking for spec dec. #16367


Open · wants to merge 5 commits into main

Conversation

@luyuzhe111 (Contributor) commented Apr 9, 2025

This PR adds request-level, per-step acceptance counts tracking for spec dec, as suggested in #15151 (comment), to help test the correctness of EAGLE implementations (#15901).

Since this is my first shot at the V1 engine, I am pretty sure there are better designs. So feedback is highly appreciated.

Would appreciate your review @WoosukKwon @LiuXiaoxuanPKU @markmc. @ekagra-ranjan I saw you opened a similar PR earlier today that checks the overall accepted token counts aggregated across all positions and requests. So I will bug you for a review too : )

github-actions bot commented Apr 9, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify bot added the documentation and v1 labels on Apr 9, 2025
mergify bot commented Apr 9, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @luyuzhe111.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Comment on lines 593 to 605
for i in range(len(generated_token_ids)):
    if request.spec_token_acceptance_counts is not None:
@ekagra-ranjan (Contributor), Apr 10, 2025:

nit: the if condition can be hoisted outside the for loop to avoid repeated checks (see the sketch below)
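
A minimal sketch of what that hoisting would look like, reusing the names from the diff above (editor's illustration, not code from the PR):

# Check the Optional once, then update every accepted position.
if request.spec_token_acceptance_counts is not None:
    for i in range(len(generated_token_ids)):
        request.spec_token_acceptance_counts[i] += 1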

stop_reason: Union[int, str, None],
spec_token_acceptance_counts: Optional[list[int]]
@ekagra-ranjan (Contributor), Apr 10, 2025:

can we pls add the dim of this list: [bs] or [num_spec_tokens + 1]?

@luyuzhe111 (Contributor, Author):

ah sure. it's [num_spec_tokens + 1]. will add it to the annotation above!

@@ -590,6 +590,10 @@ def update_from_output(
num_draft_tokens=len(scheduled_spec_token_ids),
num_accepted_tokens=len(generated_token_ids) - 1)

for i in range(len(generated_token_ids)):
    if request.spec_token_acceptance_counts is not None:
        request.spec_token_acceptance_counts[i] += 1
@ekagra-ranjan (Contributor), Apr 10, 2025:

I think a couple more metrics need to be added here:

  1. some kind of num_spec_proposals which can be used to divide the accepted counts to get accepted count per proposal. Probably request.spec_token_acceptance_counts[0] would mean the same thing
  2. request.num_accepted_tokens like this which can be used to find the overall Accepted Length per request. This is the longest prefix with accepted tokens at each step

@luyuzhe111 (Contributor, Author):

  1. yea, request.spec_token_acceptance_counts[0] should be the same thing.
  2. didn't quite get the point. request.spec_token_acceptance_counts tracks the accepted token counts at each speculation position; sum(request.spec_token_acceptance_counts) is the total number of tokens accepted so far, and sum(request.spec_token_acceptance_counts) / request.spec_token_acceptance_counts[0] is the acceptance length, i.e. the mean number of tokens generated per forward pass. if this doesn't address your concern, can you clarify more?

@ekagra-ranjan (Contributor):

It makes sense. I missed that it is using for i in range(len(generated_token_ids)): so all the positions are accepted.


print("-" * 50)
print(f"mean acceptance length: \
{sum(acceptance_counts) / acceptance_counts[0]:.2f}")
@ekagra-ranjan (Contributor), Apr 10, 2025:

This computes the Accept Rate per spec head/step but does not compute the Accept Length as done here and discussed in this design doc. In V1, we want to change the definition of AL back to what academia uses, where if we propose T1, T2, T3, T4 and T3 gets rejected, then we say AL is 2 instead of 3 (V0).

@luyuzhe111 (Contributor, Author):

This does compute acceptance length I believe : )

For example, let's say acceptance_counts is [10, 8, 5, 1]. This means there were 10 forward passes (proposal + verification); 8, 5, and 1 tokens were accepted at speculation steps 1, 2, and 3 respectively; and a total of sum([10, 8, 5, 1]) = 24 tokens were generated. So the mean acceptance length is simply 24 / 10 = 2.4.

This acceptance_counts also helps understand the marginal gain from each additional speculation step.
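
To make the arithmetic concrete, here is a small self-contained illustration with the same example numbers (editor's sketch; acceptance_counts is assumed to be indexed by speculation step, with index 0 counting one token per forward pass):

acceptance_counts = [10, 8, 5, 1]

num_forward_passes = acceptance_counts[0]    # 10 proposal + verification steps
total_tokens = sum(acceptance_counts)        # 24 tokens generated in total
mean_acceptance_length = total_tokens / num_forward_passes   # 2.4

# Per-step acceptance rates show the marginal gain of each extra draft position.
per_step_rates = [c / num_forward_passes for c in acceptance_counts[1:]]
print(mean_acceptance_length, per_step_rates)  # 2.4 [0.8, 0.5, 0.1]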

@markmc (Member):

That's a good explanation. And to check our understanding ... acceptance rate would be the proportion of draft tokens that were accepted, so:

acceptance_rate = (8 + 5 + 1) / (10 * 3)

There were 10 steps, and 3 draft tokens were proposed per step, but only 14 of those were accepted. Right?

@luyuzhe111 (Contributor, Author), Apr 10, 2025:

yes the formula seems to align with the definition, tho i feel like the per step acceptance rate is probably more helpful ([8 / 10, 5 / 10, 1/10]).

@markmc (Member) left a comment:

Main thing is to try to use SpecDecodingStats and do the aggregation on the frontend side. Thanks!



@@ -101,6 +102,7 @@ class EngineCoreOutput(
    finish_reason: Optional[FinishReason] = None
    stop_reason: Union[int, str, None] = None
    events: Optional[list[EngineCoreEvent]] = None
    spec_token_acceptance_counts: Optional[list[int]] = None
@markmc (Member):

Ok, so we're sending the per-request running count from the frontend to the engine, aggregating the per-request acceptance counts for the step, and sending back an updated running count

In V1, we've been following a pattern of avoiding aggregating metrics in the engine core, so if you look at e.g. LoggingStatLogger, we aggregate num_generation_tokens per step there - the aggregation isn't happening in the engine:

    def _track_iteration_stats(self, iteration_stats: IterationStats):
        # Save tracked stats for token counters.
        self.num_prompt_tokens.append(iteration_stats.num_prompt_tokens)
        self.num_generation_tokens.append(
            iteration_stats.num_generation_tokens)

In the case of Prometheus, we're using its Counter abstraction to aggregate

For per-request counts like this, we're doing this sort of aggregation:

        for finished_request in iteration_stats.finished_requests:
            ...
            self.histogram_num_prompt_tokens_request.observe(
                finished_request.num_prompt_tokens)

So, generally speaking, I'm expecting that we would send per-request acceptance count updates from the engine core via SchedulerStats.SpecDecodingStats.

Could you give that a go and see how it works out?

Also, as part of adding this, I would also like us to figure out how to expose it as a Prometheus metric 👍
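
A rough sketch of the pattern being suggested, as I read it: the engine core emits per-step deltas via SpecDecodingStats and a frontend logger accumulates them per request. The field and class names below (per_request_accepted, FrontendSpecDecodingLogger) are illustrative assumptions, not vLLM's actual API:

from dataclasses import dataclass, field


@dataclass
class SpecDecodingStats:
    # Hypothetical field: req_id -> tokens accepted at each draft position
    # during this scheduler step (a delta, not a running total).
    per_request_accepted: dict[str, list[int]] = field(default_factory=dict)


class FrontendSpecDecodingLogger:
    """Aggregates per-request acceptance counts outside the engine core."""

    def __init__(self, num_spec_tokens: int):
        self.num_spec_tokens = num_spec_tokens
        self.acceptance_counts: dict[str, list[int]] = {}

    def track(self, stats: SpecDecodingStats) -> None:
        for req_id, per_pos in stats.per_request_accepted.items():
            totals = self.acceptance_counts.setdefault(
                req_id, [0] * (self.num_spec_tokens + 1))
            for pos, count in enumerate(per_pos):
                totals[pos] += count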

@luyuzhe111 (Contributor, Author):

really appreciate the primer on the design principles! I will give it a try. Regarding the Prometheus metric, are you suggesting adding it in this PR, or just discussing it for now?

@markmc (Member):

I'd add it in the PR so that we avoid collecting a metric - or aggregating it in a certain way - that would make it difficult to expose appropriately in Prometheus.

sampling_params = SamplingParams(temperature=args.temp, max_tokens=256)

outputs = llm.generate(prompt_token_ids=prompt_ids,
                       sampling_params=sampling_params)
@markmc (Member):

I'm guessing there is no functional change intended with the above delta? It is hard as a reviewer to verify that though

Best practice is to try to limit a PR to a single logical change - unless I'm missing something, the above changes are unrelated to acceptance counts, so just remove them from this PR?

@ekagra-ranjan (Contributor), Apr 10, 2025:

I think the author's main branch was outdated and the Git UI is comparing the author's local main branch with the author's feature branch. Most of the changes highlighted in this file in this PR were made in an older PR: #16100

@luyuzhe111 (Contributor, Author):

@ekagra-ranjan Thanks for the explanation : ) yea I think I messed it up when applying the stashed changes. the core changes are from the previous commit you linked. @markmc There are indeed some necessary changes towards the end on retrieving the spec token acceptance counts for v1. I will try to avoid such incidents in the future.

@markmc (Member):

Ah, thanks @ekagra-ranjan. Yes, @luyuzhe111 you'll want to rebase and resolve the conflicts:

$ git fetch origin
$ git checkout luyuzhe111/spec_dec_stats
$ git rebase origin/main
Auto-merging examples/offline_inference/eagle.py
CONFLICT (content): Merge conflict in examples/offline_inference/eagle.py
Auto-merging vllm/v1/core/sched/scheduler.py
Auto-merging vllm/v1/engine/__init__.py
Auto-merging vllm/v1/engine/processor.py
Auto-merging vllm/v1/request.py
CONFLICT (content): Merge conflict in vllm/v1/request.py
error: could not apply 104dba492... add request level, per-step acceptance counts tracking for spec dec
...

Signed-off-by: Bryan Lu <yuzhelu@amazon.com>
Signed-off-by: Bryan Lu <yuzhelu@amazon.com>
Signed-off-by: Bryan Lu <yuzhelu@amazon.com>
Signed-off-by: Bryan Lu <yuzhelu@amazon.com>
@luyuzhe111 (Contributor, Author):

Hello Mark @markmc,

I made an attempt to refine the design and address your comments. Overall, I made the following changes:

  1. Avoided aggregating metrics in the engine core by using SchedulerStats.SpecDecodingStats to keep track of the acceptance counts and accumulating them in the output processing stage.
  2. Since we want request-level stats, I added an attribute called per_request_stats to SpecDecodingStats, whose keys are request IDs and whose values are the numbers of accepted tokens.
  3. Minimized passing the acceptance counts list around by registering it as an attribute of RequestState.

I did not take into account Prometheus since I'm not too confident about the best design practice. I'm also hoping to get this PR merged soon so that people can begin debugging more easily. Could you help me a bit if you think it's best we also add it as part of this PR?

Thanks and would appreciate your review!

cc @LiuXiaoxuanPKU

for output in outputs:
    for step, count in enumerate(
            output.spec_token_acceptance_counts[0]):
        acceptance_counts[step] += count
@markmc (Member):

There's a few things to note here:

  1. The RequestOutput.metrics API in general is not implemented in V1 (and maybe no longer works in V0). So far, I would say there is no decision on whether to retain this API in the long term.
  2. With Prometheus, I would say we will expose the "num_accepted_tokens per draft position" metric as an aggregate across all requests. See this design doc.
  3. A viable replacement for the RequestOutput.metrics API might be to expose the in-memory Prometheus metrics via an API. In that case, we would only be exposing via the API the same metrics that are exposed via Prometheus, so I don't want to introduce a metric that will not be available in Prometheus.
  4. What's actually being printed below is the mean acceptance rate, which can be more easily calculated as num_accepted_tokens_total / num_drafts - much more in line with what we will expose through Prometheus.

@luyuzhe111 (Contributor, Author):

@markmc regarding 1, 2, 3, they all make sense.

regarding 4, it's indeed the same, but we also have the option of looking at the acceptance rate at each position (which should just be num_accepted_tokens per draft position) since we have kept track of the acceptance counts. I only computed the AL for simplicity since we are not doing finer-grained analysis yet.

also a quick clarification question: are you suggesting that we actually don't need this per-request spec_token_acceptance_counts thing once Prometheus is ready? My understanding is that Prometheus does not track request-level stats. But if it does, then we probably don't need this PR, since I also feel like output.spec_token_acceptance_counts is a bit ugly.

@markmc (Member):

In the design doc, I'm proposing adding:

  1. num_drafts so mean acceptance length can be calculated
  2. num_accepted_tokens_per_pos to track the acceptance rate per position

But aggregating across all requests - you would not be able to look at the acceptance counts of individual requests

However, note that Prometheus does allow you to slice according to time intervals - so with a long-running vLLM instance, you can look at these aggregated-across-all-requests metrics over whatever time period you choose
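
For a concrete picture of such cross-request aggregation, here is a small prometheus_client sketch (editor's illustration; the metric and label names are assumptions, not vLLM's actual metric names):

from prometheus_client import Counter

# Aggregated across all requests; sliceable by time interval in Prometheus.
num_drafts = Counter(
    "spec_decode_num_drafts", "Number of draft proposals verified")
num_accepted_tokens = Counter(
    "spec_decode_num_accepted_tokens",
    "Accepted draft tokens, labelled by draft position",
    ["position"])


def record_step(accepted_per_pos: list[int]) -> None:
    # accepted_per_pos[i] is 1 if the draft token at position i was accepted.
    num_drafts.inc()
    for pos, accepted in enumerate(accepted_per_pos):
        num_accepted_tokens.labels(position=str(pos)).inc(accepted)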

How would you imagine using per-request stats?

@luyuzhe111 (Contributor, Author):

In addition to debugging, another use case for per-request stats is fine-grained benchmarking. For example, I might want to understand the acceptance length for different tasks (summarization, coding), and RequestOutput.metrics used to be very helpful for that.

Per-request stats might also help identify outliers, for example, requests with really poor acceptance counts, to shed light on how to further improve the speculator.

@markmc (Member) commented Apr 14, 2025

I did not take into account Prometheus since I'm not too confident about the best design practice. I'm also hoping to get this PR merged soon so that people can begin debugging more easily.

Sorry if my latest comment is a bit frustrating - I spent some time looking at this today through the lens of adding new Prometheus metrics:

  • num_drafts - a count of the number of draft proposals made, useful for calculating mean acceptance length
  • num_accepted_tokens_per_pos - a count per draft position of the number of accepted tokens, useful for translating into a probability that a given token position will be accepted

Will dig into this more tomorrow 👍

@luyuzhe111 requested a review from markmc, April 14, 2025 17:20
@luyuzhe111 (Contributor, Author):

@markmc Oh no, not at all! I was just trying to say that I want to avoid spending too much time getting it to work, to the extent of blocking others, since I'm not an expert. Really appreciate your help and guidance!

@markmc (Member) commented Apr 14, 2025

Oh, I also meant to say ... we should be a bit cautious about getting new APIs right because they're difficult/disruptive to remove later. Temporarily printing metrics as debug logging might be a quicker path to support folks getting V1 spec decoding working.

@luyuzhe111 (Contributor, Author) commented Apr 14, 2025

Oh, I also meant to say ... we should be a bit cautious about getting new APIs right because they're difficult/disruptive to remove later. Temporarily printing metrics as debug logging might be a quicker path to support folks getting V1 spec decoding working.

@markmc Yep totally understood. This PR is indeed quite disruptive (I don't personally like it) and I should have taken a more top-down approach (starting from a design doc like you did)!

Meanwhile, I kinda disagree that printing metrics is the most helpful way for debugging, since that's how we ended up with uncaught bugs in v0, in my opinion. Once the printed metrics look somehow "good enough", it's hard to say whether the model is drafting as accurately as it should. Personally I prefer the ability to look at acceptance counts for individual requests for head-to-head comparison. But I guess this code can reside somewhere and does not have to be merged into the main codebase? Or maybe Prometheus can already do the job - I could not clearly tell, so I left a question above.

Thanks again for your time and discussion!

@markmc (Member) commented Apr 15, 2025

Meanwhile, I kinda disagree that printing metrics is the most helpful way for debugging

Yep, I just mentioned it as a potential stop-gap solution if metrics take e.g. a week to merge

I prefer the ability to look at acceptance counts for individual requests for head-to-head comparison.

Could you lay out exactly how this would work to help me understand? An example with fake data for like 10 requests, and what a head-to-head comparison would look like? Thanks!

@markmc (Member) commented Apr 15, 2025

See #16665 for logging mean acceptance length and per-position num_accepted_tokens e.g.

INFO 04-15 10:05:05 [metrics.py:82] SpecDecoding metrics: Draft acceptance rate: 48.1%, Mean acceptance length: 2.40, Accepted: 3323 tokens, Drafted: 6915 tokens, Per-position acceptance probabilities: 0.712, 0.531, 0.452, 0.377, 0.331

(that example is with ngram, not EAGLE)

Note this currently only works with the online serving API server, not the offline inference API in your example

@luyuzhe111 (Contributor, Author):

@markmc Hi Mark, the mean acceptance length & per-position num_accepted_tokens are super helpful. I will take a look at your PR first, since those metrics need to be there for the online server anyway!

@luyuzhe111 (Contributor, Author) commented Apr 15, 2025

Could you lay out exactly how this would work to help me understand? An example with fake data for like 10 requests, and what a head-to-head comparison would look like? Thanks!

@markmc Yea, what I meant by head-to-head comparison is to use the same request to get acceptance counts from both vLLM and the EAGLE repo, to compare how the discrepancy accumulates across step positions. Not sure about others, but that was my plan for debugging further.

mergify bot commented Apr 15, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @luyuzhe111.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@WoosukKwon (Collaborator):

@luyuzhe111 @markmc What's the status of this PR? Is there any blocker?

@luyuzhe111 (Contributor, Author):

@WoosukKwon My understanding is that it's functionally ready but could benefit from a better design. Previously, I added this feature in v0 through the RequestMetrics API, which is currently not used in v1. If you guys have a specific design in mind, I'm happy to implement it.

@markmc (Member) commented Apr 22, 2025

@WoosukKwon @luyuzhe111 take a look at #17010 for how we can use the aggregated metrics from Prometheus for offline inferencing too

@luyuzhe111 (Contributor, Author):

take a look at #17010 for how we can use the aggregated metrics from Prometheus for offline inferencing too

@markmc it's great that we can use Prometheus for this purpose! what do you think of the use cases for per-request stats I mentioned above, though? To me, in LLM serving, stats aggregation over requests just seems more natural than aggregation over time. I think there will be many use cases where we end up wanting per-request stats (not just spec acceptance rate), since we care about user satisfaction on a request level; in contrast, per-time-period stats tell very little in this regard. A few request-level stats off the top of my head that could be helpful: TTFT, max ITL, cache hit rate, etc. please correct me if i'm wrong. cc @WoosukKwon

@markmc (Member) commented Apr 25, 2025

take a look at #17010 for how we can use the aggregated metrics from Prometheus for offline inferencing too

@markmc it's great that we can use Prometheus for this purpose! what do you think of the use cases for per-request stats I mentioned above, though? To me, in LLM serving, stats aggregation over requests just seems more natural than aggregation over time. I think there will be many use cases where we end up wanting per-request stats (not just spec acceptance rate), since we care about user satisfaction on a request level; in contrast, per-time-period stats tell very little in this regard. A few request-level stats off the top of my head that could be helpful: TTFT, max ITL, cache hit rate, etc. please correct me if i'm wrong. cc @WoosukKwon

There's a lot to unpack here, and it deserves a thoughtful discussion ... in the context of input from users with a very concrete use case (IMO). For our use case right now (developers implementing EAGLE, comparing acceptance rates between implementations), aggregating across requests is obviously fine (since that's what our script does).

AIUI, in V1 we have learned some lessons from V0 and we are cautious about carrying features over from V0, examining each feature in the context of a use case (do we really need this feature?) and trying hard for a cleaner design (with an eye to longer-term maintainability).

In terms of use cases, I think what you're describing is the possibility that some external implementations will want to aggregate per-request metrics across some dimension other than a time interval. But the set of possible metrics that such an implementation might be interested in is very large. And it's quite possible folks might want per-iteration metrics (not per-request) so that e.g. you can examine the effect of batch sizes. An abstract discussion of what might be useful leads us to add significant runtime and maintenance overhead IMO.

In terms of a clean design - it looks to me like the current RequestMetrics being part of the public API emerged in quite an ad-hoc way from #2316 where they were clearly intended to be aggregated before consumption (see #1870). My thought is that collecting and exposing data about individual requests feels more like "tracing" than "metrics" - see my note in the v1 metrics design doc on OpenTelemetry, but I'm no expert on tracing use cases or OpenTelemetry.

So I see two quite distinct questions:

  1. For our immediate use case of aggregated spec decoding metrics when using the offline inferencing API, I propose exposing the same aggregated metrics we already expose in the online case via Prometheus
  2. For use cases that might demand metrics on individual requests without any cross-request aggregation, we should take the time to more deeply understand the use cases and also implementation options like using OpenTelemetry.

Hope that makes sense.

@luyuzhe111 (Contributor, Author):

Hi @markmc, thanks for your detailed reply and explanation! I think your proposal makes a lot of sense. @WoosukKwon maybe we should prioritize getting this PR reviewed? Once it's merged, we can close this PR.

Labels: documentation, needs-rebase, speculative-decoding, v1
Projects: None yet

Development: Successfully merging this pull request may close these issues.

4 participants