vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl #11166
Conversation
Shaders are based on cpy.cu. For #11127.

This supports the same set of quants to be converted from f32 as CUDA. It looks like CUDA also supports OP_CPY for Q8_0 to F32, and for any quant to itself. I don't know whether those are required, but they wouldn't be hard to add if so.

I haven't done any perf testing of these. CUDA is also using one thread per CTA, which sounds kind of slow, but maybe it's not a perf-critical operation. In fact, the only testing I've done is test-backend-ops. I'll try to pull this into stable-diffusion.cpp to test.

CC @stduhpf
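For reference, the destination types listed above amount to a check like the following when the backend decides whether it can take a CPY node (illustrative only: the function name is mine, and the real logic lives in the backend's supports_op path):

```c
#include <stdbool.h>
#include "ggml.h"

// Destination types this PR's f32 -> quant copy shaders cover
static bool vk_cpy_f32_to_quant_supported(enum ggml_type dst_type) {
    switch (dst_type) {
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1:
        case GGML_TYPE_Q5_0:
        case GGML_TYPE_Q5_1:
        case GGML_TYPE_Q8_0:
        case GGML_TYPE_IQ4_NL:
            return true;
        default:
            return false;
    }
}
```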
stable_diffusion.cpp "works" with this change, but loading the models is way slower and seems to be doing the quantization on the CPU. @stduhpf any idea?
Thanks a lot! It seems to work very well for supported types.
Are you talking about quantization of the LoRA or of the base model? Because if you're talking about the base model, I think this is the expected behaviour even without these changes, though it would be nice to quantize on the GPU now that the shaders exist for it. It's even single-threaded, so it takes forever with larger models. I didn't notice any slowness loading the LoRA, and it looks like it was using the GPU as expected.
Yeah, I think it was the base model. Thanks for clarifying.
In llama.cpp, F32 -> Quant is needed for KV quantization, and Quant -> F32 conversion is used for context shifts when quantizing the K cache.
We also have GET_ROWS supporting dequant. Does context shifting use CPY or GET_ROWS?
It uses CPY.
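To make the connection concrete, here is a minimal sketch of how a quantized K-cache store reduces to this op (function and variable names are illustrative, not llama.cpp's actual ones): the new keys are copied into a 1D view of the quantized cache tensor, so the backend's f32 -> quant CPY kernel is what performs the quantization.

```c
#include "ggml.h"

// Sketch only: store new F32 keys into a quantized K cache by building
// a ggml_cpy node into a view of the cache tensor.
static void store_k_sketch(struct ggml_context * ctx, struct ggml_cgraph * gf,
                           struct ggml_tensor * k_cur,   // new keys, F32
                           struct ggml_tensor * k_cache, // e.g. GGML_TYPE_Q8_0
                           int64_t n_embd_k, int64_t n_tokens, int64_t kv_head) {
    // 1D view of the cache slots this batch writes to, offset in bytes
    struct ggml_tensor * k_view = ggml_view_1d(ctx, k_cache,
            n_tokens*n_embd_k,
            ggml_row_size(k_cache->type, n_embd_k)*kv_head);

    // the copy node is where the f32 -> quant conversion happens
    ggml_build_forward_expand(gf, ggml_cpy(ctx, k_cur, k_view));
}
```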
I started implementing q->f32, noticed there weren't backend tests, and when I added them they errored out because ggml_compute_forward_dup doesn't support q->f32. Am I missing something?
I've gone ahead and implemented the missing ggml-cpu function.
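For anyone following along, the gist of the q->f32 direction on the CPU side can be sketched like this (simplified to the 2D, row-contiguous case; the real code in ggml_compute_forward_dup also handles strides and threading):

```c
#include "ggml.h"

// Sketch: dequantize a quantized tensor into F32 row by row using the
// source type's to_float helper from the ggml type traits.
static void dup_q_to_f32_sketch(const struct ggml_tensor * src, struct ggml_tensor * dst) {
    const struct ggml_type_traits * traits = ggml_get_type_traits(src->type);

    for (int64_t ir = 0; ir < src->ne[1]; ++ir) {
        const void * src_row = (const char *) src->data + ir*src->nb[1];
        float      * dst_row = (float *)((char *) dst->data + ir*dst->nb[1]);
        traits->to_float(src_row, dst_row, src->ne[0]); // one row of ne[0] values
    }
}
```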
The CPU changes look good.
Looks good, and this enables K-cache quantization for Vulkan, very nice. It enables V-cache quantization as well, but only with flash attention.
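As a usage note, this is the kind of invocation the change unlocks on the Vulkan backend (flags as in llama-cli; treat the exact model and values as placeholders):

```sh
# quantized K cache; quantizing the V cache additionally needs -fa
./llama-cli -m model.gguf -ngl 99 -fa \
    --cache-type-k q8_0 --cache-type-v q8_0 -p "Hello"
```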
vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl (ggml-org#11166)
* vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl (shaders are based on cpy.cu)
* vulkan: support copy from q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl to f32
* ggml: copy q->f32 assumes some contiguity in the destination
There's a small problem I noticed regarding the CPY op for q4_1 (and q4_0?). It doesn't have a noticeable effect when running inference, but when running test-backend-ops, the CPY tests for these types fail on my machine.
That kind of issue can happen due to small numerical differences between implementations and devices, in combination with the randomly-generated test data. See for example #11972
Randomness is a possibility, but I haven't seen it happen for these tests myself. I wonder if AMD could be using a different rounding mode?
Yes, it's probably a rounding error. I wonder if it also happens with the other quants, but the error is just too small to exceed the threshold.
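For context, test-backend-ops scores each op by comparing against the CPU backend with a normalized mean squared error, conceptually like the helper below (a simplified restatement, not the exact test code). A handful of values rounding differently between devices can push this metric just past the threshold for some quant types and not others.

```c
#include <stddef.h>

// NMSE = sum((out - ref)^2) / sum(ref^2), accumulated in double
static double nmse(const float * ref, const float * out, size_t n) {
    double sum_err = 0.0;
    double sum_ref = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double d = (double) out[i] - (double) ref[i];
        sum_err += d*d;
        sum_ref += (double) ref[i]*(double) ref[i];
    }
    return sum_err/sum_ref;
}
```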
Can you try adding the RTE (round-to-nearest-even) rounding mode for 16-bit floats to the shader, to see if it fixes the failure?
Ok, I managed to get it to compile (it requires GL_EXT_spirv_intrinsics). It does fix the failure in test-backend-ops.
Cool. Do you want to make the full fix (IIRC we need to compile two separate versions)? If not, I can try it soon.
I'm not sure what you mean by that.
The shader source needs to have the RTE execution mode guarded behind a macro, and we need to build two variants of the shader: one with the rounding mode forced and one without, chosen at runtime depending on what the device supports.

```glsl
#if RTE16
#extension GL_EXT_spirv_intrinsics : enable
spirv_execution_mode(capabilities = [4467], 4462, 16); // RoundingModeRTE, 16 bits
#endif // RTE16
```

Something like this?
For the shader, yes, but the other two parts are still necessary because not all implementations necessarily support these rounding modes.
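One of those parts is the device capability check. A sketch of what it could look like, using the standard Vulkan float-controls query (the struct and field names are from the Vulkan spec; how ggml-vulkan actually wires this up may differ):

```c
#include <stdbool.h>
#include <vulkan/vulkan.h>

// Query whether the device can force round-to-nearest-even (RTE) for
// 16-bit floats; only then is the RTE16 shader variant selected.
static bool device_supports_rte16(VkPhysicalDevice dev) {
    VkPhysicalDeviceFloatControlsProperties float_controls = {
        .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FLOAT_CONTROLS_PROPERTIES,
    };
    VkPhysicalDeviceProperties2 props2 = {
        .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2,
        .pNext = &float_controls,
    };
    vkGetPhysicalDeviceProperties2(dev, &props2);
    return float_controls.shaderRoundingModeRTEFloat16 == VK_TRUE;
}
```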