
DeepSeek FA support (CPU only) #200

Merged · ikawrakow merged 3 commits into main on Feb 11, 2025

Conversation

ikawrakow (Owner)

This PR adds FA support for models where the K and V head sizes differ, such as DeepSeek-R1 and DeepSeek-Lite. It only works with the standard attention mechanism; I have yet to look into FA combined with MLA.

We get a nice speedup for PP that increases with context length, but TG is not faster. I want to play with it some more, but I'm throwing it out there in case someone wants to try it. It definitely allows longer contexts to be processed, as -ctk q8_0 -ctv q8_0 seems perfectly adequate.
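For anyone who wants to try this from the C API rather than the command-line flags, below is a minimal sketch of enabling FA together with a Q8_0 K and V cache, i.e. the equivalent of -fa -ctk q8_0 -ctv q8_0. The function and field names follow the mainline llama.cpp API that this fork inherits, and the model file name is a placeholder; treat this as an assumption, not code from this PR.

```cpp
// Minimal sketch: enable flash attention with a Q8_0 KV cache via the C API.
// Equivalent to running with `-fa -ctk q8_0 -ctv q8_0`. API names are assumed
// to match mainline llama.cpp; the GGUF path is a placeholder.
#include "llama.h"

int main() {
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("deepseek-lite-iq4_xs.gguf", mparams);
    if (!model) return 1;

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx      = 16384;            // long contexts are the point of this PR
    cparams.flash_attn = true;             // -fa
    cparams.type_k     = GGML_TYPE_Q8_0;   // -ctk q8_0
    cparams.type_v     = GGML_TYPE_Q8_0;   // -ctv q8_0 (a quantized V cache needs FA)

    llama_context * ctx = llama_new_context_with_model(model, cparams);
    if (!ctx) { llama_free_model(model); return 1; }

    // ... prompt processing / generation as usual ...

    llama_free(ctx);
    llama_free_model(model);
    return 0;
}
```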

This is relevant for DeepSeek models. At this point ggml CPU FA works. Now I need to go and change iqk FA to make it work with Dk != Dv. To keep compilation time from exploding, only Dk = 192, Dv = 128 are enabled for now (DeepSeek).
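To make the Dk != Dv point concrete, here is a small, self-contained illustration of single-query attention with the online-softmax (flash-attention) recurrence, where the Q·K dot products run over Dk = 192 while the output accumulator has size Dv = 128. This is only a scalar sketch of the idea, not the vectorized iqk kernel from this PR.

```cpp
// Single-query flash-attention-style sketch with different K and V head sizes,
// e.g. Dk = 192, Dv = 128 as in DeepSeek. Illustration only.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// q:   Dk floats           (one query)
// K:   n_kv x Dk floats    (row-major)
// V:   n_kv x Dv floats    (row-major)
// out: Dv floats
void attn_one_head(const float * q, const float * K, const float * V,
                   float * out, int n_kv, int Dk, int Dv) {
    const float scale = 1.0f / std::sqrt((float)Dk);
    float M = -INFINITY;               // running max of the logits
    float S = 0.0f;                    // running sum of exp(logit - M)
    std::vector<float> acc(Dv, 0.0f);  // running weighted sum of V rows (size Dv, not Dk!)

    for (int j = 0; j < n_kv; ++j) {
        // dot product over the K head size (Dk)
        float logit = 0.0f;
        for (int d = 0; d < Dk; ++d) logit += q[d] * K[j*Dk + d];
        logit *= scale;

        // online softmax update: rescale the previous accumulator if the max grows
        const float M_new = std::max(M, logit);
        const float c = std::exp(M - M_new);   // 0 on the first step (exp(-inf) == 0)
        const float p = std::exp(logit - M_new);
        for (int d = 0; d < Dv; ++d) acc[d] = c*acc[d] + p*V[j*Dv + d];
        S = c*S + p;
        M = M_new;
    }
    for (int d = 0; d < Dv; ++d) out[d] = acc[d] / S;
}

int main() {
    const int Dk = 192, Dv = 128, n_kv = 4;
    std::vector<float> q(Dk), K(n_kv*Dk), V(n_kv*Dv), out(Dv);
    for (int d = 0; d < Dk; ++d) q[d] = 0.01f * d;
    for (size_t i = 0; i < K.size(); ++i) K[i] = 0.001f * (float)(i % 97);
    for (size_t i = 0; i < V.size(); ++i) V[i] = 0.002f * (float)(i % 89);
    attn_one_head(q.data(), K.data(), V.data(), out.data(), n_kv, Dk, Dv);
    printf("out[0] = %f\n", out[0]);
    return 0;
}
```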
@saood06 mentioned this pull request on Feb 10, 2025
ikawrakow (Owner, Author)

So, I did get some minor FA speed improvements for TG, but I don't see what else could be done, so I'll merge it.

Here is a performance comparison between baseline (Q8_0 K cache, no FA, no MLA), MLA (Q8_0 K cache), and FA (Q8_0 K and V cache) for DeepSeek-Lite running on a Ryzen-7950X CPU. Both graphs show MLA and FA performance as a ratio to baseline.

The first graph shows prompt processing speed. FA gives a ~40% performance boost over baseline at 16k tokens, while MLA is 2X slower than baseline and 2.8X slower than FA at 16k tokens.

[Figure ds2_pp: PP speed of MLA and FA relative to baseline vs. prompt length]

The second graph shows token generation speed (TG-64) after a prompt of a given length (i.e., TG speed as a function of the number of tokens in the KV cache). FA does give some gains for very long prompts (~10% at 16k tokens), but far less than MLA, which is 1.57X faster than baseline and 1.43X faster than FA at 16k tokens.

[Figure ds2_tg: TG-64 speed of MLA and FA relative to baseline vs. tokens in the KV cache]

ikawrakow (Owner, Author) commented Feb 11, 2025

Recently I read somewhere that for the "common enterprise workflow" (whatever that means) the number of generated tokens is typically only about 10% of the prompt tokens. I don't know if that is true, but for the sake of argument let's assume for a moment that it is. In that case, the best way to measure overall model performance is llama-bench -pg Npp,Ntg, where Npp is the number of prompt tokens and Ntg = 0.1*Npp is the number of generated tokens.

The following graph shows PG performance as a function of prompt length. The black symbols are mainline llama.cpp build b9ab0a4d (4687) (the most current version as of today), the red symbols are baseline ik_llama.cpp (no FA, no MLA), the green symbols are MLA, and the blue symbols are FA from this PR. The model is DeepSeek-Lite quantized with IQ4_XS. All use Q8_0 for the K cache; FA uses Q8_0 for the V cache as well. All runs are on a Ryzen-7950X CPU.

If we buy the claim that Ntg ~ 0.1*Npp is the "typical enterprise workflow", then there is no benefit from MLA over baseline, while FA is ~26% better for long prompts. Mainline llama.cpp is, as usual, slower: 1.45X slower for short prompts, increasing to 1.7X slower for prompts with 16k tokens.

[Figure ds2_pg: PG (Ntg = 0.1*Npp) performance vs. prompt length for mainline llama.cpp, baseline, MLA, and FA]
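To see why MLA's TG advantage washes out under this assumption, here is a back-of-the-envelope sketch of how a combined PG speed follows from separate PP and TG speeds. The speed numbers below are made-up placeholders, not measurements from this PR.

```cpp
// Back-of-the-envelope estimate of combined PG throughput from separate
// PP and TG speeds, assuming Ntg = 0.1 * Npp ("typical enterprise workflow").
// The pp/tg speeds passed in main() are hypothetical placeholders.
#include <cstdio>

double pg_tps(double n_pp, double n_tg, double pp_tps, double tg_tps) {
    const double t_total = n_pp / pp_tps + n_tg / tg_tps;  // total seconds
    return (n_pp + n_tg) / t_total;                        // combined tokens/second
}

int main() {
    const double n_pp = 16384, n_tg = 0.1 * n_pp;
    // Hypothetical speeds: one config is much faster on PP, the other somewhat faster on TG.
    printf("fast-PP config: %.1f t/s\n", pg_tps(n_pp, n_tg, /*pp=*/140.0, /*tg=*/11.0));
    printf("fast-TG config: %.1f t/s\n", pg_tps(n_pp, n_tg, /*pp=*/ 50.0, /*tg=*/14.0));
    // With only 10% of the tokens generated, prompt processing is a large share
    // of t_total, so a big PP advantage outweighs a modest TG advantage.
    return 0;
}
```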

ikawrakow merged commit 3c98bfb into main on Feb 11, 2025