
CUDA: Prefer vector flash decoding kernel for Gemma models #12738


Merged
merged 2 commits into ggml-org:master on Apr 3, 2025

Conversation

@gaugarg-nv (Contributor) commented Apr 3, 2025

The vector flash decoding kernel was not being picked for models with head dimension 256; the Gemma models are in this category. Removing this limit improves end-to-end performance by up to 12% in generation-phase throughput for Gemma models.
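For context, ggml's CUDA flash-attention path dispatches between kernel variants based on properties of the query tensor. Below is a minimal sketch of the kind of head-size gate this PR relaxes; the names (`prefer_vec_kernel`, `head_size`, `n_query_rows`) and the exact thresholds are illustrative assumptions, not the actual identifiers or conditions in the ggml CUDA sources.

```cpp
#include <cstdint>

// Illustrative sketch only: the real dispatch logic lives in ggml's CUDA
// flash-attention code and uses different names and conditions.
static bool prefer_vec_kernel(int64_t head_size, int64_t n_query_rows) {
    // The vector kernel targets the generation (decode) phase, where the
    // query tensor has very few rows. Threshold is hypothetical.
    const bool small_batch = n_query_rows <= 8;

    // Old gate (sketch): head sizes above 128 fell through to other
    // kernels, so head dimension 256 (Gemma) never took the vector path.
    //   const bool head_ok = head_size <= 128;

    // New gate (sketch): also accept 256, for which vector kernel
    // instantiations exist.
    const bool head_ok = head_size == 64 || head_size == 128 ||
                         head_size == 256;

    return small_batch && head_ok;
}
```

With a gate like this, a Gemma decode call (head dimension 256, single query row) selects the vector kernel instead of falling back to a slower path, which is consistent with the generation-phase gains in the table below.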

Performance:

RTX 4090, CUDA 12.8, master vs. PR (ISL = input sequence length, OSL = output sequence length):

| Model | ISL | OSL | Master: gen phase tok/sec | PR: gen phase tok/sec | Speed-up |
| --- | --- | --- | --- | --- | --- |
| gemma3 1B Q4_K - Medium | 10 | 200 | 318.6111 | 333.6684 | 1.047259 |
| gemma3 1B Q4_K - Medium | 100 | 200 | 309.473 | 328.0762 | 1.060113 |
| gemma3 1B Q4_K - Medium | 1000 | 200 | 284.6962 | 319.3516 | 1.121728 |
| gemma3 1B Q4_K - Medium | 10000 | 200 | 183.7296 | 206.1121 | 1.121823 |
| gemma3 4B Q4_K - Medium | 10 | 200 | 175.7797 | 184.4036 | 1.049061 |
| gemma3 4B Q4_K - Medium | 100 | 200 | 174.9861 | 181.8483 | 1.039215 |
| gemma3 4B Q4_K - Medium | 1000 | 200 | 165.9151 | 175.7443 | 1.059242 |
| gemma3 4B Q4_K - Medium | 10000 | 200 | 120.0141 | 126.6009 | 1.054884 |
| gemma3 12B Q4_K - Medium | 10 | 200 | 83.11534 | 85.4468 | 1.028051 |
| gemma3 12B Q4_K - Medium | 100 | 200 | 82.62634 | 84.6703 | 1.024737 |
| gemma3 12B Q4_K - Medium | 1000 | 200 | 80.07223 | 81.96644 | 1.023656 |
| gemma3 12B Q4_K - Medium | 10000 | 200 | 56.99771 | 59.67587 | 1.046987 |

Vector flash decoding kernel was not being picked for models with head dimension 256. Gemma models are in this category.
Removing this limit improves e2e performance by up to 12% in gen phase throughput for Gemma models.
@github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Apr 3, 2025
@JohannesGaessler (Collaborator) left a comment


Thank you, I probably forgot to adapt the logic at some point.

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
@JohannesGaessler merged commit c262bed into ggml-org:master on Apr 3, 2025
48 checks passed