
Misc. bug: Large performance regression since version b4365 #10977


Closed
GlasslessPizza opened this issue Dec 25, 2024 · 3 comments

Comments

@GlasslessPizza

Name and Version

b4365 onward

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-server

Problem description & steps to reproduce

I'm observing a slowdown between b4363 and b4365 that persists to this day. I tried two models:

https://huggingface.co/bartowski/gemma-2-27b-it-GGUF/blob/main/gemma-2-27b-it-Q5_K_L.gguf
https://huggingface.co/tensorblock/Qwen2.5-32B-Instruct-abliterated-GGUF/blob/main/Qwen2.5-32B-Instruct-abliterated-Q5_K_M.gguf

Results:

      |   qwen   |   gemma
-----------------------------
b4363 | 31.7 t/s | 36.1 t/s
b4365 | 24.5 t/s | 22.7 t/s
-----------------------------
      |  -23%    |  -37%
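The percentage drops in the last row of the table can be reproduced with a quick calculation from the measured throughputs:

```python
# Throughput before (b4363) and after (b4365) the regression, in tokens/s,
# taken from the table above.
before = {"qwen": 31.7, "gemma": 36.1}
after = {"qwen": 24.5, "gemma": 22.7}

for model in before:
    drop = (after[model] - before[model]) / before[model] * 100
    print(f"{model}: {drop:.0f}%")  # qwen: -23%, gemma: -37%
```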

Command used:

.\llama-server.exe --model <model> --ctx-size 8192 --threads 10 --no-mmap --mlock --n-gpu-layers 999 --log-disable --flash-attn --cache-type-k q8_0 --cache-type-v q8_0

Windows 10

First Bad Commit

between b4363 and b4365

Relevant log output

No response

@slaren
Member

slaren commented Dec 25, 2024

How are you measuring the performance? What queries are you performing? The only relevant commit I see in that range is #10783; if you are requesting token probabilities, the change in performance may be expected.

@GlasslessPizza
Author

How are you measuring the performance? What queries are you performing? The only relevant commit I see in that range is #10783; if you are requesting token probabilities, the change in performance may be expected.

The query is a basic Q&A task in mikupad; I'm using its token speed counter to measure. Now that you mention it, I know that mikupad does request token probabilities internally, since some features such as "show on hover" use them, but I personally keep that set to "hide" as I don't need them.
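One way to test slaren's theory outside of mikupad is to send the same completion request with and without token probabilities and compare throughput. This is a minimal sketch, assuming the server started with the command above is listening on the default `localhost:8080`, that its `/completion` endpoint accepts an `n_probs` field (0 disables per-token probabilities), and that the response contains a `tokens_predicted` count:

```python
import json
import time
import urllib.request


def build_payload(prompt: str, n_probs: int) -> dict:
    """Completion request body; n_probs=0 asks for no per-token probabilities."""
    return {"prompt": prompt, "n_predict": 128, "n_probs": n_probs}


def tokens_per_second(url: str, payload: dict) -> float:
    """POST a completion request and compute generation throughput."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.time()
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    elapsed = time.time() - start
    return body["tokens_predicted"] / elapsed


# Usage (requires a running llama-server):
#   url = "http://localhost:8080/completion"
#   for n_probs in (0, 10):
#       payload = build_payload("Explain KV cache quantization.", n_probs)
#       print(n_probs, tokens_per_second(url, payload))
```

If the `n_probs=10` run is noticeably slower on b4365 but not on b4363, that would point at the probabilities path from #10783.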

Contributor

github-actions bot commented Feb 8, 2025

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Feb 8, 2025