vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl #11166
Conversation
Shaders are based on cpy.cu. For #11127.

This supports the same set of quants to be converted from f32 as CUDA. It looks like CUDA also supports OP_CPY for Q8_0 to F32, and for any quant to itself. I don't know whether those are required, but they wouldn't be hard to add if so.

I haven't done any perf testing of these. CUDA is also using one thread per CTA, which sounds kind of slow, but maybe it's not a perf-critical operation. In fact, the only testing I've done is test-backend-ops. I'll try to pull this into stable-diffusion.cpp to test.

CC @stduhpf
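For reference, the destination types listed above amount to a check like the following when the backend decides whether it can take a CPY node (illustrative only: the function name is mine, and the real logic lives in the backend's supports_op path):

```c
#include <stdbool.h>
#include "ggml.h"

// Destination types this PR's f32 -> quant copy shaders cover
static bool vk_cpy_f32_to_quant_supported(enum ggml_type dst_type) {
    switch (dst_type) {
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1:
        case GGML_TYPE_Q5_0:
        case GGML_TYPE_Q5_1:
        case GGML_TYPE_Q8_0:
        case GGML_TYPE_IQ4_NL:
            return true;
        default:
            return false;
    }
}
```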
stable_diffusion.cpp "works" with this change, but loading the models is way slower and seems to be doing the quantization on the CPU. @stduhpf any idea?
Thanks a lot! It seems to work very well for supported types.
Are you talking about quantization of the LoRA or of the base model? Because if you're talking about the base model, I think this is the expected behaviour even without these changes, though it would be nice to quantize on the GPU now that the shaders exist for it. It's even single-threaded, so it takes forever with larger models. I didn't notice any slowness loading the LoRA, and it looks like it was using the GPU as expected.
Yeah, I think it was the base model. Thanks for clarifying.
In llama.cpp, F32 -> Quant is needed for KV quantization, and Quant -> F32 conversion is used for context shifts when quantizing the K cache.
We also have GET_ROWS supporting dequant. Does context shifting use CPY or GET_ROWS?
It uses CPY.
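To make the connection concrete, here is a minimal sketch of how a quantized K-cache store reduces to this op (function and variable names are illustrative, not llama.cpp's actual ones): the new keys are copied into a 1D view of the quantized cache tensor, so the backend's f32 -> quant CPY kernel is what performs the quantization.

```c
#include "ggml.h"

// Sketch only: store new F32 keys into a quantized K cache by building
// a ggml_cpy node into a view of the cache tensor.
static void store_k_sketch(struct ggml_context * ctx, struct ggml_cgraph * gf,
                           struct ggml_tensor * k_cur,   // new keys, F32
                           struct ggml_tensor * k_cache, // e.g. GGML_TYPE_Q8_0
                           int64_t n_embd_k, int64_t n_tokens, int64_t kv_head) {
    // 1D view of the cache slots this batch writes to, offset in bytes
    struct ggml_tensor * k_view = ggml_view_1d(ctx, k_cache,
            n_tokens*n_embd_k,
            ggml_row_size(k_cache->type, n_embd_k)*kv_head);

    // the copy node is where the f32 -> quant conversion happens
    ggml_build_forward_expand(gf, ggml_cpy(ctx, k_cur, k_view));
}
```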
I started implementing q->f32, noticed there weren't backend tests, and when I added them they errored out because ggml_compute_forward_dup doesn't support q->f32. Am I missing something?
I've gone ahead and implemented the missing ggml-cpu function.
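For anyone following along, the gist of the q->f32 direction on the CPU side can be sketched like this (simplified to the 2D, row-contiguous case; the real code in ggml_compute_forward_dup also handles strides and threading):

```c
#include "ggml.h"

// Sketch: dequantize a quantized tensor into F32 row by row using the
// source type's to_float helper from the ggml type traits.
static void dup_q_to_f32_sketch(const struct ggml_tensor * src, struct ggml_tensor * dst) {
    const struct ggml_type_traits * traits = ggml_get_type_traits(src->type);

    for (int64_t ir = 0; ir < src->ne[1]; ++ir) {
        const void * src_row = (const char *) src->data + ir*src->nb[1];
        float      * dst_row = (float *)((char *) dst->data + ir*dst->nb[1]);
        traits->to_float(src_row, dst_row, src->ne[0]); // one row of ne[0] values
    }
}
```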
The CPU changes look good.
Looks good, and this enables K-cache quantization for Vulkan, very nice. It enables V-cache quantization as well, but only with flash attention.
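As a usage note, this is the kind of invocation the change unlocks on the Vulkan backend (flags as in llama-cli; treat the exact model and values as placeholders):

```sh
# quantized K cache; quantizing the V cache additionally needs -fa
./llama-cli -m model.gguf -ngl 99 -fa \
    --cache-type-k q8_0 --cache-type-v q8_0 -p "Hello"
```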
vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl (ggml-org#11166)
* vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl (shaders are based on cpy.cu)
* vulkan: support copy from q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl to f32
* ggml: copy q->f32 assumes some contiguity in the destination
There's a small problem I noticed regarding the CPY op for q4_1 (and q4_0?). It doesn't have a noticeable effect when running inference, but when running test-backend-ops, the CPY tests for these types fail on my machine.
That kind of issue can happen due to small numerical differences between implementations and devices, in combination with the randomly-generated test data. See for example #11972
Randomness is a possibility, but I haven't seen it happen for these tests myself. I wonder if AMD could be using a different rounding mode?
Yes, it's probably a rounding error. I wonder if it also happens with the other quants, but the error is just too small to exceed the threshold.
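For context, test-backend-ops scores each op by comparing against the CPU backend with a normalized mean squared error, conceptually like the helper below (a simplified restatement, not the exact test code). A handful of values rounding differently between devices can push this metric just past the threshold for some quant types and not others.

```c
#include <stddef.h>

// NMSE = sum((out - ref)^2) / sum(ref^2), accumulated in double
static double nmse(const float * ref, const float * out, size_t n) {
    double sum_err = 0.0;
    double sum_ref = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double d = (double) out[i] - (double) ref[i];
        sum_err += d*d;
        sum_ref += (double) ref[i]*(double) ref[i];
    }
    return sum_err/sum_ref;
}
```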
Can you try adding the RTE (round-to-nearest-even) rounding mode for 16-bit floats to the shader, to see if it fixes the failure?
Ok, I managed to get it to compile (it requires GL_EXT_spirv_intrinsics). It does fix the failure in test-backend-ops.
Cool. Do you want to make the full fix (IIRC we need to compile two separate versions)? If not, I can try it soon.
I'm not sure what you mean by that.
The shader source needs to have the RTE execution mode guarded behind a macro, and we need to build two variants of the shader: one with the rounding mode forced and one without, chosen at runtime depending on what the device supports.

```glsl
#if RTE16
#extension GL_EXT_spirv_intrinsics : enable
spirv_execution_mode(capabilities = [4467], 4462, 16); // RoundingModeRTE, 16 bits
#endif // RTE16
```

Something like this?
For the shader, yes, but the other two parts are still necessary because not all implementations necessarily support these rounding modes.
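One of those parts is the device capability check. A sketch of what it could look like, using the standard Vulkan float-controls query (the struct and field names are from the Vulkan spec; how ggml-vulkan actually wires this up may differ):

```c
#include <stdbool.h>
#include <vulkan/vulkan.h>

// Query whether the device can force round-to-nearest-even (RTE) for
// 16-bit floats; only then is the RTE16 shader variant selected.
static bool device_supports_rte16(VkPhysicalDevice dev) {
    VkPhysicalDeviceFloatControlsProperties float_controls = {
        .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FLOAT_CONTROLS_PROPERTIES,
    };
    VkPhysicalDeviceProperties2 props2 = {
        .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2,
        .pNext = &float_controls,
    };
    vkGetPhysicalDeviceProperties2(dev, &props2);
    return float_controls.shaderRoundingModeRTEFloat16 == VK_TRUE;
}
```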