[ROCm][FP8][Kernel] FP8 quantization fused into Custom Paged Attention #17139
Conversation
…d output FP8 tensor Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
…uantizing in the flash attention kernel for V1 Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2 nits, and could we add this case to tests?
csrc/rocm/attention.cu
Outdated
// NOTE: fp8_out_scale is optional.
const float* fp8_out_scale_ptr =
    fp8_out_scale
        ? reinterpret_cast<const float*>(fp8_out_scale.value().data_ptr())
Nit: static cast?
const float* fp8_out_scale_ptr =
    fp8_out_scale
        ? reinterpret_cast<const float*>(fp8_out_scale.value().data_ptr())
        : nullptr;
OUTT* out_ptr = reinterpret_cast<OUTT*>(out.data_ptr());
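For reference, a minimal sketch of what the suggested change could look like; Tensor::data_ptr() returns a void*, so a static_cast is sufficient here and avoids reinterpreting an already-typed pointer:

```cpp
// Sketch of the reviewer's suggestion, not the PR's final code: data_ptr()
// yields a void*, so static_cast is enough to obtain a typed pointer.
const float* fp8_out_scale_ptr =
    fp8_out_scale
        ? static_cast<const float*>(fp8_out_scale.value().data_ptr())
        : nullptr;
```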
Should the OUTT type be fp8 if scale is given? Is that captured automatically? Maybe we could assert this somewhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, should tmp_output be the same type as output? So if output is fp8, is tmp_output also fp8?
Should the OUTT type be fp8 if scale is given? Is that captured automatically? Maybe we could assert this somewhere
This is ensured at https://github.com/vllm-project/vllm/pull/17139/files#diff-79b8261aa73f07cc7450e48c8e14150576656f19ccfb42ba972860092c1f5949R1779-R1786
Also, should tmp_output be the same type as output? So if output is fp8, is tmp_output also fp8?
No, it should be the same type as query; it is used in the internal calculations.
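For readers who don't have the diff open, a rough sketch of the kind of host-side guard the linked lines perform; the function name and exact checks below are illustrative, not the PR's actual code:

```cpp
// Hypothetical sketch of the dtype contract discussed above (names are
// placeholders; the real guard lives in csrc/rocm/attention.cu).
#include <optional>
#include <torch/all.h>

void check_cpa_output_dtypes(const torch::Tensor& query,
                             const torch::Tensor& out,
                             const torch::Tensor& tmp_out,
                             const std::optional<torch::Tensor>& fp8_out_scale) {
  if (fp8_out_scale.has_value()) {
    // With a scale present, the final output buffer must already be fp8.
    TORCH_CHECK(out.scalar_type() == at::ScalarType::Float8_e4m3fnuz ||
                    out.scalar_type() == at::ScalarType::Float8_e4m3fn,
                "fp8_out_scale was given but out is not an fp8 tensor");
  } else {
    TORCH_CHECK(out.scalar_type() == query.scalar_type(),
                "out must have the query dtype when no output scale is given");
  }
  // tmp_out holds intermediate accumulation results, so it keeps the query
  // dtype regardless of whether the final output is quantized to fp8.
  TORCH_CHECK(tmp_out.scalar_type() == query.scalar_type(),
              "tmp_out must have the query dtype");
}
```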
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
commit 9f733ff
Author: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Date: Fri Apr 25 22:10:58 2025 +0000

    Using static cast

    Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

commit 2d7dba5
Author: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Date: Thu Apr 24 21:37:16 2025 +0000

    An option to apply fp8 output scale in ROCm custom paged attention and output FP8 tensor

    Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

Signed-off-by: Luka Govedič <lgovedic@redhat.com>
csrc/rocm/attention.cu
Outdated
@@ -1238,6 +1240,8 @@ __launch_bounds__(NUM_THREADS) void paged_attention_ll4mi_QKV_mfma4_kernel(

  // final write to tmp_out after vout accumulation
  if (warpid == 0) {
    const float out_scale =
wondering where out_scale is used here?
It is actually used in the reduction kernel launched after either of the attention kernels.
The dereferencing here is indeed not needed, but it'll get optimized out. I'll make a note to clean it up
Could you just remove it in this PR?
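As background for this thread, a rough sketch of how a reduce kernel can consume the scale on its final store when the output type is fp8; the helper name, the fnuz type, and the divide-by-scale convention are assumptions, not the PR's actual code (the kernel may equally multiply by a precomputed reciprocal):

```cpp
#include <c10/util/Float8_e4m3fnuz.h>
#include <type_traits>

// Illustrative only: apply the output quantization scale while narrowing the
// float accumulator to the output type OUTT in the reduction kernel.
template <typename OUTT>
__device__ __forceinline__ OUTT write_scaled(float acc, float out_scale) {
  if constexpr (std::is_same_v<OUTT, c10::Float8_e4m3fnuz>) {
    // Quantize: divide by the tensor-wide scale before storing 8 bits.
    return c10::Float8_e4m3fnuz(acc / out_scale);
  } else {
    // No-scale path: plain narrowing conversion to the output type.
    return static_cast<OUTT>(acc);
  }
}
```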
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
vllm-project#17139) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
vllm-project#17139) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
vllm-project#17139) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
An option to apply fp8 output scale in ROCm custom paged attention and output FP8 tensor
If a non-None scale tensor is passed to the kernel, the output tensor is expected to be of the current_platform.fp8_dtype() type (float8_fnuz or float8_fn), and the scale is applied before each value is stored into the 8-bit type.
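A caller-side sketch of that contract, assuming the C++ extension is used directly and the platform fp8 format is float8_e4m3fnuz (as on MI300-class ROCm devices); the function below is illustrative, not vLLM's actual binding:

```cpp
#include <torch/all.h>

// Illustrative only: when an fp8 output scale is supplied, the caller
// allocates the output as an fp8 tensor up front and the kernel quantizes
// into it; without a scale the output keeps the query dtype as before.
torch::Tensor alloc_attention_output(const torch::Tensor& query,
                                     bool has_fp8_out_scale) {
  if (has_fp8_out_scale) {
    return torch::empty_like(
        query, query.options().dtype(at::ScalarType::Float8_e4m3fnuz));
  }
  return torch::empty_like(query);
}
```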