
CUDA: Prefer vector flash decoding kernel for Gemma models #12738


Merged
merged 2 commits into ggml-org:master on Apr 3, 2025

Conversation

@gaugarg-nv (Contributor) commented Apr 3, 2025

The vector flash decoding kernel was not being picked for models with head dimension 256; the Gemma models are in this category. Removing this limit improves end-to-end performance by up to 12% in generation-phase throughput for Gemma models.
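For context, ggml's CUDA flash-attention path dispatches between kernel variants based on properties of the query tensor. Below is a minimal sketch of the kind of head-size gate this PR relaxes; the names (`prefer_vec_kernel`, `head_size`, `n_query_rows`) and the exact thresholds are illustrative assumptions, not the actual identifiers or conditions in the ggml CUDA sources.

```cpp
#include <cstdint>

// Illustrative sketch only: the real dispatch logic lives in ggml's CUDA
// flash-attention code and uses different names and conditions.
static bool prefer_vec_kernel(int64_t head_size, int64_t n_query_rows) {
    // The vector kernel targets the generation (decode) phase, where the
    // query tensor has very few rows. Threshold is hypothetical.
    const bool small_batch = n_query_rows <= 8;

    // Old gate (sketch): head sizes above 128 fell through to other
    // kernels, so head dimension 256 (Gemma) never took the vector path.
    //   const bool head_ok = head_size <= 128;

    // New gate (sketch): also accept 256, for which vector kernel
    // instantiations exist.
    const bool head_ok = head_size == 64 || head_size == 128 ||
                         head_size == 256;

    return small_batch && head_ok;
}
```

With a gate like this, a Gemma decode call (head dimension 256, single query row) selects the vector kernel instead of falling back to a slower path, which is consistent with the generation-phase gains in the table below.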

Performance:

RTX 4090, CUDA 12.8, master vs. PR (ISL = input sequence length, OSL = output sequence length):

| Model | ISL | OSL | Master: gen phase tok/sec | PR: gen phase tok/sec | Speed-up |
| --- | --- | --- | --- | --- | --- |
| gemma3 1B Q4_K - Medium | 10 | 200 | 318.6111 | 333.6684 | 1.047259 |
| gemma3 1B Q4_K - Medium | 100 | 200 | 309.473 | 328.0762 | 1.060113 |
| gemma3 1B Q4_K - Medium | 1000 | 200 | 284.6962 | 319.3516 | 1.121728 |
| gemma3 1B Q4_K - Medium | 10000 | 200 | 183.7296 | 206.1121 | 1.121823 |
| gemma3 4B Q4_K - Medium | 10 | 200 | 175.7797 | 184.4036 | 1.049061 |
| gemma3 4B Q4_K - Medium | 100 | 200 | 174.9861 | 181.8483 | 1.039215 |
| gemma3 4B Q4_K - Medium | 1000 | 200 | 165.9151 | 175.7443 | 1.059242 |
| gemma3 4B Q4_K - Medium | 10000 | 200 | 120.0141 | 126.6009 | 1.054884 |
| gemma3 12B Q4_K - Medium | 10 | 200 | 83.11534 | 85.4468 | 1.028051 |
| gemma3 12B Q4_K - Medium | 100 | 200 | 82.62634 | 84.6703 | 1.024737 |
| gemma3 12B Q4_K - Medium | 1000 | 200 | 80.07223 | 81.96644 | 1.023656 |
| gemma3 12B Q4_K - Medium | 10000 | 200 | 56.99771 | 59.67587 | 1.046987 |

Vector flash decoding kernel was not being picked for models with head dimension 256. Gemma models are in this category.
Removing this limit improves e2e performance by up to 12% in gen phase throughput for Gemma models.
@github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Apr 3, 2025
@JohannesGaessler (Collaborator) left a comment


Thank you, I probably forgot to adapt the logic at some point.

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
@JohannesGaessler merged commit c262bed into ggml-org:master on Apr 3, 2025
48 checks passed