
DeepSeek FA support (CPU only) #200

Merged · ikawrakow merged 3 commits into main on Feb 11, 2025

Conversation

ikawrakow (Owner)

This PR adds FA support for models where the K and V head sizes differ, such as DeepSeek-R1 and DeepSeek-Lite. It only works with the standard attention mechanism; I have yet to look into FA combined with MLA.

We get a nice speedup for PP that increases with context length, but TG is not faster. I want to play with it some more, but I'm throwing it out there in case someone wants to try it. It definitely allows longer contexts to be processed, as -ctk q8_0 -ctv q8_0 seems perfectly adequate.
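For anyone who wants to try this from the C API rather than the command-line flags, below is a minimal sketch of enabling FA together with a Q8_0 K and V cache, i.e. the equivalent of -fa -ctk q8_0 -ctv q8_0. The function and field names follow the mainline llama.cpp API that this fork inherits, and the model file name is a placeholder; treat this as an assumption, not code from this PR.

```cpp
// Minimal sketch: enable flash attention with a Q8_0 KV cache via the C API.
// Equivalent to running with `-fa -ctk q8_0 -ctv q8_0`. API names are assumed
// to match mainline llama.cpp; the GGUF path is a placeholder.
#include "llama.h"

int main() {
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("deepseek-lite-iq4_xs.gguf", mparams);
    if (!model) return 1;

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx      = 16384;            // long contexts are the point of this PR
    cparams.flash_attn = true;             // -fa
    cparams.type_k     = GGML_TYPE_Q8_0;   // -ctk q8_0
    cparams.type_v     = GGML_TYPE_Q8_0;   // -ctv q8_0 (a quantized V cache needs FA)

    llama_context * ctx = llama_new_context_with_model(model, cparams);
    if (!ctx) { llama_free_model(model); return 1; }

    // ... prompt processing / generation as usual ...

    llama_free(ctx);
    llama_free_model(model);
    return 0;
}
```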

This is relevant for DeepSeek models. At this point ggml CPU FA works. Now I need to go and change iqk FA to make it work with Dk != Dv. To keep compilation time from exploding, only Dk = 192, Dv = 128 are enabled for now (DeepSeek).
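To make the Dk != Dv point concrete, here is a small, self-contained illustration of single-query attention with the online-softmax (flash-attention) recurrence, where the Q·K dot products run over Dk = 192 while the output accumulator has size Dv = 128. This is only a scalar sketch of the idea, not the vectorized iqk kernel from this PR.

```cpp
// Single-query flash-attention-style sketch with different K and V head sizes,
// e.g. Dk = 192, Dv = 128 as in DeepSeek. Illustration only.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// q:   Dk floats           (one query)
// K:   n_kv x Dk floats    (row-major)
// V:   n_kv x Dv floats    (row-major)
// out: Dv floats
void attn_one_head(const float * q, const float * K, const float * V,
                   float * out, int n_kv, int Dk, int Dv) {
    const float scale = 1.0f / std::sqrt((float)Dk);
    float M = -INFINITY;               // running max of the logits
    float S = 0.0f;                    // running sum of exp(logit - M)
    std::vector<float> acc(Dv, 0.0f);  // running weighted sum of V rows (size Dv, not Dk!)

    for (int j = 0; j < n_kv; ++j) {
        // dot product over the K head size (Dk)
        float logit = 0.0f;
        for (int d = 0; d < Dk; ++d) logit += q[d] * K[j*Dk + d];
        logit *= scale;

        // online softmax update: rescale the previous accumulator if the max grows
        const float M_new = std::max(M, logit);
        const float c = std::exp(M - M_new);   // 0 on the first step (exp(-inf) == 0)
        const float p = std::exp(logit - M_new);
        for (int d = 0; d < Dv; ++d) acc[d] = c*acc[d] + p*V[j*Dv + d];
        S = c*S + p;
        M = M_new;
    }
    for (int d = 0; d < Dv; ++d) out[d] = acc[d] / S;
}

int main() {
    const int Dk = 192, Dv = 128, n_kv = 4;
    std::vector<float> q(Dk), K(n_kv*Dk), V(n_kv*Dv), out(Dv);
    for (int d = 0; d < Dk; ++d) q[d] = 0.01f * d;
    for (size_t i = 0; i < K.size(); ++i) K[i] = 0.001f * (float)(i % 97);
    for (size_t i = 0; i < V.size(); ++i) V[i] = 0.002f * (float)(i % 89);
    attn_one_head(q.data(), K.data(), V.data(), out.data(), n_kv, Dk, Dv);
    printf("out[0] = %f\n", out[0]);
    return 0;
}
```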
@saood06 mentioned this pull request on Feb 10, 2025
ikawrakow (Owner, Author)

So, I did get some minor FA speed improvements for TG, but I don't see what else could be done, so I'll merge it.

Here is a performance comparison between baseline (Q8_0 K cache, no FA, no MLA), MLA (Q8_0 K cache), and FA (Q8_0 K and V cache) for DeepSeek-Lite running on a Ryzen-7950X CPU. Both graphs show MLA and FA performance as a ratio to baseline.

The first graph shows prompt processing speed. FA gives a ~40% performance boost over baseline at 16k tokens, while MLA is 2X slower than baseline and 2.8X slower than FA at 16k tokens.

[Figure ds2_pp: PP speed of MLA and FA relative to baseline vs. prompt length]

The second graph shows token generation speed (TG-64) after a prompt of a given length (i.e., TG speed as a function of the number of tokens in the KV cache). FA does give some gains for very long prompts (~10% at 16k tokens), but far less than MLA, which is 1.57X faster than baseline and 1.43X faster than FA at 16k tokens.

[Figure ds2_tg: TG-64 speed of MLA and FA relative to baseline vs. tokens in the KV cache]

ikawrakow (Owner, Author) commented Feb 11, 2025

Recently I read somewhere that for the "common enterprise workflow" (whatever that means) the number of generated tokens is typically only about 10% of the prompt tokens. I don't know if that is true, but for the sake of argument let's assume for a moment that it is. In that case, the best way to measure overall model performance is llama-bench -pg Npp,Ntg, where Npp is the number of prompt tokens and Ntg = 0.1*Npp is the number of generated tokens.

The following graph shows PG performance as a function of prompt length. The black symbols are mainline llama.cpp build b9ab0a4d (4687) (the most current version as of today), the red symbols are baseline ik_llama.cpp (no FA, no MLA), the green symbols are MLA, and the blue symbols are FA from this PR. The model is DeepSeek-Lite quantized with IQ4_XS. All use Q8_0 for the K cache; FA uses Q8_0 for the V cache as well. All runs are on a Ryzen-7950X CPU.

If we buy the claim that Ntg ~ 0.1*Npp is the "typical enterprise workflow", then there is no benefit from MLA over baseline, while FA is ~26% better for long prompts. Mainline llama.cpp is, as usual, slower: 1.45X slower for short prompts, increasing to 1.7X slower for prompts with 16k tokens.

[Figure ds2_pg: PG (Ntg = 0.1*Npp) performance vs. prompt length for mainline llama.cpp, baseline, MLA, and FA]
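To see why MLA's TG advantage washes out under this assumption, here is a back-of-the-envelope sketch of how a combined PG speed follows from separate PP and TG speeds. The speed numbers below are made-up placeholders, not measurements from this PR.

```cpp
// Back-of-the-envelope estimate of combined PG throughput from separate
// PP and TG speeds, assuming Ntg = 0.1 * Npp ("typical enterprise workflow").
// The pp/tg speeds passed in main() are hypothetical placeholders.
#include <cstdio>

double pg_tps(double n_pp, double n_tg, double pp_tps, double tg_tps) {
    const double t_total = n_pp / pp_tps + n_tg / tg_tps;  // total seconds
    return (n_pp + n_tg) / t_total;                        // combined tokens/second
}

int main() {
    const double n_pp = 16384, n_tg = 0.1 * n_pp;
    // Hypothetical speeds: one config is much faster on PP, the other somewhat faster on TG.
    printf("fast-PP config: %.1f t/s\n", pg_tps(n_pp, n_tg, /*pp=*/140.0, /*tg=*/11.0));
    printf("fast-TG config: %.1f t/s\n", pg_tps(n_pp, n_tg, /*pp=*/ 50.0, /*tg=*/14.0));
    // With only 10% of the tokens generated, prompt processing is a large share
    // of t_total, so a big PP advantage outweighs a modest TG advantage.
    return 0;
}
```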

ikawrakow merged commit 3c98bfb into main on Feb 11, 2025