FA3 Decode Perf - Use single mma warp group for decode batches #63

LucasWilkinson · 2025-04-18T13:49:51Z

Boost decode performance by using only one MMA warp group, which reduces wasted compute (i.e. use kBlockM == 64 instead of 128).
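To make the wasted-compute argument concrete, here is a rough sketch of how much of the M dimension of a tile holds real work. It assumes (this is not stated in the PR) that each query row occupies one row of a kBlockM-row tile and that a decode batch contributes only a few such rows; `tile_utilization` and `rows` are hypothetical names used for illustration only.

```cpp
#include <cassert>

// Hypothetical helper: fraction of the tile rows that hold real query rows.
// `rows` is the number of query rows mapped onto the M dimension (an
// assumption for illustration; for decode this number is small).
double tile_utilization(int rows, int block_m) {
    assert(rows > 0 && block_m > 0);
    // Number of tiles needed to cover `rows`, rounded up.
    const int tiles = (rows + block_m - 1) / block_m;
    return static_cast<double>(rows) / (tiles * block_m);
}
```

Under these assumptions, 8 useful rows fill 6.25% of a 128-row tile but 12.5% of a 64-row tile, so halving kBlockM halves the padded rows that are computed and thrown away.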

Batch Size == 1 Decode Perf

Main

                           FlashAttn vs FlashInfer Timing (ms)                           
┏━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┳━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ Type   ┃ Name        ┃ S     ┃ P  ┃ FlashAttn (ms) ┃ FlashInfer (ms) ┃ (FA-FI)/FI (%) ┃
┡━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ decode │ qwen32b-tp2 │ 1024  │ 1  │ 0.0148         │ 0.0123          │ 20.30          │
│ decode │ qwen32b-tp2 │ 2048  │ 1  │ 0.0156         │ 0.0142          │ 9.39           │
│ decode │ qwen32b-tp2 │ 4096  │ 1  │ 0.0165         │ 0.0159          │ 3.78           │
│ decode │ qwen32b-tp2 │ 8192  │ 1  │ 0.0191         │ 0.0188          │ 1.36           │
│ decode │ qwen32b-tp2 │ 16384 │ 1  │ 0.0263         │ 0.0251          │ 4.81           │
│ decode │ qwen32b-tp2 │ 1024  │ 16 │ 0.0153         │ 0.0129          │ 18.36          │
│ decode │ qwen32b-tp2 │ 2048  │ 16 │ 0.0155         │ 0.0141          │ 9.29           │
│ decode │ qwen32b-tp2 │ 4096  │ 16 │ 0.0165         │ 0.0158          │ 4.39           │
│ decode │ qwen32b-tp2 │ 8192  │ 16 │ 0.0191         │ 0.0187          │ 1.77           │
│ decode │ qwen32b-tp2 │ 16384 │ 16 │ 0.0257         │ 0.0245          │ 5.06           │
│ decode │ qwen32b-tp1 │ 1024  │ 1  │ 0.0146         │ 0.0140          │ 3.61           │
│ decode │ qwen32b-tp1 │ 2048  │ 1  │ 0.0159         │ 0.0157          │ 1.37           │
│ decode │ qwen32b-tp1 │ 4096  │ 1  │ 0.0186         │ 0.0184          │ 0.97           │
│ decode │ qwen32b-tp1 │ 8192  │ 1  │ 0.0249         │ 0.0245          │ 1.45           │
│ decode │ qwen32b-tp1 │ 16384 │ 1  │ 0.0368         │ 0.0361          │ 1.91           │
│ decode │ qwen32b-tp1 │ 1024  │ 16 │ 0.0144         │ 0.0138          │ 4.57           │
│ decode │ qwen32b-tp1 │ 2048  │ 16 │ 0.0157         │ 0.0155          │ 1.28           │
│ decode │ qwen32b-tp1 │ 4096  │ 16 │ 0.0185         │ 0.0183          │ 1.36           │
│ decode │ qwen32b-tp1 │ 8192  │ 16 │ 0.0247         │ 0.0241          │ 2.43           │
│ decode │ qwen32b-tp1 │ 16384 │ 16 │ 0.0363         │ 0.0352          │ 3.20           │
│ decode │ llama8b-tp1 │ 1024  │ 1  │ 0.0145         │ 0.0139          │ 3.92           │
│ decode │ llama8b-tp1 │ 2048  │ 1  │ 0.0158         │ 0.0156          │ 0.70           │
│ decode │ llama8b-tp1 │ 4096  │ 1  │ 0.0185         │ 0.0184          │ 0.90           │
│ decode │ llama8b-tp1 │ 8192  │ 1  │ 0.0247         │ 0.0245          │ 0.85           │
│ decode │ llama8b-tp1 │ 16384 │ 1  │ 0.0365         │ 0.0361          │ 1.09           │
│ decode │ llama8b-tp1 │ 1024  │ 16 │ 0.0147         │ 0.0139          │ 5.99           │
│ decode │ llama8b-tp1 │ 2048  │ 16 │ 0.0158         │ 0.0155          │ 1.55           │
│ decode │ llama8b-tp1 │ 4096  │ 16 │ 0.0185         │ 0.0183          │ 1.49           │
│ decode │ llama8b-tp1 │ 8192  │ 16 │ 0.0243         │ 0.0239          │ 1.69           │
│ decode │ llama8b-tp1 │ 16384 │ 16 │ 0.0361         │ 0.0351          │ 2.93           │
└────────┴─────────────┴───────┴────┴────────────────┴─────────────────┴────────────────┘

PR

                           FlashAttn vs FlashInfer Timing (ms)                           
┏━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┳━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ Type   ┃ Name        ┃ S     ┃ P  ┃ FlashAttn (ms) ┃ FlashInfer (ms) ┃ (FA-FI)/FI (%) ┃
┡━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ decode │ qwen32b-tp2 │ 1024  │ 1  │ 0.0128         │ 0.0124          │ 2.97           │
│ decode │ qwen32b-tp2 │ 2048  │ 1  │ 0.0149         │ 0.0143          │ 4.45           │
│ decode │ qwen32b-tp2 │ 4096  │ 1  │ 0.0159         │ 0.0159          │ -0.04          │
│ decode │ qwen32b-tp2 │ 8192  │ 1  │ 0.0183         │ 0.0190          │ -3.30          │
│ decode │ qwen32b-tp2 │ 16384 │ 1  │ 0.0251         │ 0.0254          │ -1.21          │
│ decode │ qwen32b-tp2 │ 1024  │ 16 │ 0.0133         │ 0.0129          │ 2.73           │
│ decode │ qwen32b-tp2 │ 2048  │ 16 │ 0.0149         │ 0.0142          │ 4.94           │
│ decode │ qwen32b-tp2 │ 4096  │ 16 │ 0.0159         │ 0.0158          │ 0.22           │
│ decode │ qwen32b-tp2 │ 8192  │ 16 │ 0.0183         │ 0.0188          │ -2.79          │
│ decode │ qwen32b-tp2 │ 16384 │ 16 │ 0.0249         │ 0.0246          │ 1.41           │
│ decode │ qwen32b-tp1 │ 1024  │ 1  │ 0.0134         │ 0.0140          │ -4.86          │
│ decode │ qwen32b-tp1 │ 2048  │ 1  │ 0.0151         │ 0.0157          │ -4.15          │
│ decode │ qwen32b-tp1 │ 4096  │ 1  │ 0.0178         │ 0.0185          │ -3.95          │
│ decode │ qwen32b-tp1 │ 8192  │ 1  │ 0.0238         │ 0.0248          │ -4.07          │
│ decode │ qwen32b-tp1 │ 16384 │ 1  │ 0.0349         │ 0.0361          │ -3.37          │
│ decode │ qwen32b-tp1 │ 1024  │ 16 │ 0.0133         │ 0.0139          │ -4.34          │
│ decode │ qwen32b-tp1 │ 2048  │ 16 │ 0.0150         │ 0.0156          │ -3.80          │
│ decode │ qwen32b-tp1 │ 4096  │ 16 │ 0.0177         │ 0.0184          │ -3.73          │
│ decode │ qwen32b-tp1 │ 8192  │ 16 │ 0.0234         │ 0.0241          │ -2.88          │
│ decode │ qwen32b-tp1 │ 16384 │ 16 │ 0.0349         │ 0.0353          │ -1.09          │
│ decode │ llama8b-tp1 │ 1024  │ 1  │ 0.0134         │ 0.0140          │ -4.23          │
│ decode │ llama8b-tp1 │ 2048  │ 1  │ 0.0150         │ 0.0157          │ -4.49          │
│ decode │ llama8b-tp1 │ 4096  │ 1  │ 0.0176         │ 0.0185          │ -4.61          │
│ decode │ llama8b-tp1 │ 8192  │ 1  │ 0.0236         │ 0.0246          │ -4.38          │
│ decode │ llama8b-tp1 │ 16384 │ 1  │ 0.0349         │ 0.0360          │ -3.15          │
│ decode │ llama8b-tp1 │ 1024  │ 16 │ 0.0132         │ 0.0139          │ -5.06          │
│ decode │ llama8b-tp1 │ 2048  │ 16 │ 0.0150         │ 0.0156          │ -4.07          │
│ decode │ llama8b-tp1 │ 4096  │ 16 │ 0.0177         │ 0.0184          │ -3.80          │
│ decode │ llama8b-tp1 │ 8192  │ 16 │ 0.0235         │ 0.0241          │ -2.48          │
│ decode │ llama8b-tp1 │ 16384 │ 16 │ 0.0348         │ 0.0350          │ -0.60          │
└────────┴─────────────┴───────┴────┴────────────────┴─────────────────┴────────────────┘
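For readers checking the last column, the header formula (FA-FI)/FI can be sketched as below; `rel_diff_percent` is a hypothetical name. Negative values mean FlashAttn is faster than FlashInfer.

```cpp
#include <cassert>

// Relative difference of FlashAttn vs FlashInfer timings, in percent.
// Negative result: FlashAttn is faster. Positive result: FlashInfer is faster.
double rel_diff_percent(double fa_ms, double fi_ms) {
    assert(fi_ms > 0.0);
    return (fa_ms - fi_ms) / fi_ms * 100.0;
}
```

Recomputing from the rounded timings shown in the tables will not reproduce the printed percentages exactly (for example, 0.0128 and 0.0124 give roughly 3.2 rather than 2.97), presumably because the percentages were computed from unrounded measurements.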

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

tlrmchlsmth · 2025-04-21T22:31:12Z

hopper/flash_api.cpp

-    int const seqlen_k = !is_varlen_k ? (!paged_KV ? k.size(1) : max_num_pages_per_seq * page_size) : max_seqlen_k_.value();
+    int const seqlen_k = !max_seqlen_k_.has_value() ? (!paged_KV ? k.size(1) : max_num_pages_per_seq * page_size) : max_seqlen_k_.value();


what's up with this change?

When seqused_k is used instead of cu_seqlens_k (which is what paged KV caches require), is_varlen_k is false, but we frequently still have max_seqlen_k. Using it here prevents us from overestimating the number of splits. max_seqlen_k is also what the AOT scheduler uses, so this resolves the mismatch between the two, meaning we end up picking a more efficient combine kernel (tighter num_split bound). I think this line is actually worth upstreaming, good catch!
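The effect of the changed condition can be sketched in isolation. The struct below is a simplified, hypothetical stand-in for the values read in hopper/flash_api.cpp (it is not the real function signature), and the numbers in the usage note are made up for illustration.

```cpp
#include <cassert>
#include <optional>

// Hypothetical, simplified stand-in for the inputs used in flash_api.cpp.
struct KvShape {
    bool is_varlen_k;                 // true only when cu_seqlens_k is passed
    bool paged_kv;                    // true when a page table is used
    int k_size_1;                     // k.size(1) for a non-paged KV cache
    int max_num_pages_per_seq;
    int page_size;
    std::optional<int> max_seqlen_k;  // may be set even when is_varlen_k is false
};

// Upper bound derived from the cache layout alone.
int layout_bound(const KvShape& s) {
    return !s.paged_kv ? s.k_size_1 : s.max_num_pages_per_seq * s.page_size;
}

// Before: max_seqlen_k is only consulted when cu_seqlens_k is given.
int seqlen_k_before(const KvShape& s) {
    return !s.is_varlen_k ? layout_bound(s) : s.max_seqlen_k.value();
}

// After: max_seqlen_k is used whenever the caller provided it.
int seqlen_k_after(const KvShape& s) {
    return !s.max_seqlen_k.has_value() ? layout_bound(s) : s.max_seqlen_k.value();
}
```

For a paged cache with room for 1024 pages of 16 tokens per sequence, seqused_k in use (so is_varlen_k is false) and max_seqlen_k set to 1024, the old expression yields 16384 while the new one yields 1024, which is the tighter bound the reply refers to.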

single wg for decode

e93779c

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

LucasWilkinson mentioned this pull request Apr 18, 2025

[Attention] FA3 decode perf improvement - single mma warp group support for head dim 128 vllm-project/vllm#16864

Merged

LucasWilkinson added 3 commits April 18, 2025 20:42

make sure we determine splits with the right params

8b6d7eb

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

unify heuristic

e46f094

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

don't over estimate splits

f4fc71e

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

LucasWilkinson changed the title ~~[WIP] Use single mma warp group for decode batches~~ Decode Perf - Use single mma warp group for decode batches Apr 21, 2025

LucasWilkinson marked this pull request as ready for review April 21, 2025 22:14

LucasWilkinson changed the title ~~Decode Perf - Use single mma warp group for decode batches~~ FA3 Decode Perf - Use single mma warp group for decode batches Apr 21, 2025

tlrmchlsmth reviewed Apr 21, 2025


tlrmchlsmth approved these changes Apr 21, 2025


tlrmchlsmth merged commit 13d0fd9 into main Apr 21, 2025
1 check passed
