[SYCL] Optimize mul_mat for Q4_0 on Intel GPU #12035
Conversation
@NeoZhangJianyu I was finishing a review and just clicked the merge button. I disagree with some of your approaches for the backend. Please at least answer the comments I am leaving here.
// int nsm; // number of streaming multiprocessors
// size_t smpb; // max. shared memory per block
Since there is no use for this right now, it could be good to start removing these as well.
arch == syclex::architecture::intel_gpu_arl_u ||
arch == syclex::architecture::intel_gpu_arl_s ||
arch == syclex::architecture::intel_gpu_arl_h ||
arch == syclex::architecture::intel_gpu_bmg_g21 ||
I have access to a BMG gpu, I'll reply later with perf numbers, since I guess you'd want to add them to the README.
@Alcpz - did you manage to get BMG GPU perf numbers?
const dfloat d = (const dfloat)*((const sycl::half*)d_ptr+ib);

const int vui = *((const uint8_t *)qs+iqs);
The overall contribution is great. I was doing similar work for the Q4_K quantization, and this is quite helpful.
#ifdef GGML_SYCL_F16
    // v = v - {8.0f, 8.0f};
    // v = v * {d, d};
    v.s0() = (v.s0() - 8.0f) * d;
    v.s1() = (v.s1() - 8.0f) * d;

#else
    v.x() = (v.x() - 8.0f) * d;
    v.y() = (v.y() - 8.0f) * d;
#endif // GGML_SYCL_F16
A lot of the new code has the same functionality, with very minor differences in how data is accessed. I worry about the combinatorial explosion of duplicated code needed to maintain support for both non-reordered and reordered quants. Long term, I think it's best for the backend to avoid going in this direction.
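To make the concern concrete, here is a minimal sketch of one possible direction, assuming a small layout-traits type per storage format. The names q4_0_contiguous / q4_0_reordered and the exact reordered layout (all nibbles grouped first, then all scales) are assumptions on my part, not code from this PR:

#include <sycl/sycl.hpp>
#include <cstdint>

// Hypothetical layout traits: each layout only knows where the scale and the
// packed nibbles of block `ib` live; the dequantization math is written once.
// 18 bytes = sizeof(block_q4_0): a half scale followed by QK4_0/2 = 16 packed bytes.
struct q4_0_contiguous {
    float scale(const void * vx, int ib) const {
        const auto * blk = static_cast<const uint8_t *>(vx) + ib * 18;
        return static_cast<float>(*reinterpret_cast<const sycl::half *>(blk));
    }
    uint8_t quant(const void * vx, int ib, int iqs) const {
        const auto * blk = static_cast<const uint8_t *>(vx) + ib * 18;
        return blk[2 + iqs];
    }
};

struct q4_0_reordered {
    // assumed reordered storage: all nibbles grouped first, then all scales
    int nblocks;
    float scale(const void * vx, int ib) const {
        const auto * d = reinterpret_cast<const sycl::half *>(
            static_cast<const uint8_t *>(vx) + nblocks * 16);
        return static_cast<float>(d[ib]);
    }
    uint8_t quant(const void * vx, int ib, int iqs) const {
        return static_cast<const uint8_t *>(vx)[ib * 16 + iqs];
    }
};

// one dequantization routine shared by both layouts
template <typename Layout>
void dequantize_q4_0(const void * vx, int ib, int iqs, const Layout & layout,
                     float & v0, float & v1) {
    const float d   = layout.scale(vx, ib);
    const int   vui = layout.quant(vx, ib, iqs);
    v0 = ((vui & 0xF) - 8.0f) * d;
    v1 = ((vui >>  4) - 8.0f) * d;
}

With something like this, adding a new layout means adding one traits struct rather than duplicating every kernel body.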
const int tid = item_ct1.get_local_id(2);

const int ncols_left = ncols % (QK4_0*WARP_SIZE);
I see multiple references to QK4_0 directly in here instead of using qk. Is this intended?
My understanding is that you chose the block size to distribute work nicely between the threads and then process the non-aligned columns later, but I was expecting it to be based on the templated qk, and not on a specific quantization.
I've seen that below you only have this enabled for Q4_0, but I still think this could be generalized.
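To illustrate what I mean, a hedged sketch only (not the PR's code): the block size comes from the template parameter qk, so the tail handling is not tied to QK4_0. The dequant_dot callable and the block_size constant below are assumptions for the sketch:

#include <sycl/sycl.hpp>

// Illustrative skeleton: `dequant_dot` is an assumed callable returning the
// dot-product contribution of one qk-wide block of row `row` with vector y.
template <int qk, typename dequant_dot_t>
float dmmv_row_partial(const void * vx, const float * y, int row, int ncols,
                       sycl::nd_item<3> item, dequant_dot_t dequant_dot) {
    constexpr int block_size = 32;              // work-items per row (assumption)
    const int tid = item.get_local_id(2);

    // columns that fill whole qk * block_size tiles
    const int cols_aligned = ncols - ncols % (qk * block_size);

    float partial = 0.0f;
    for (int col = tid * qk; col < cols_aligned; col += qk * block_size) {
        partial += dequant_dot(vx, row, col, y);
    }
    // tail: the remaining ncols % (qk * block_size) columns, same per-block code
    for (int col = cols_aligned + tid * qk; col < ncols; col += qk * block_size) {
        partial += dequant_dot(vx, row, col, y);
    }
    return partial;  // caller reduces across the sub-group and stores dst[row]
}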
@@ -3570,6 +3389,7 @@ static void ggml_sycl_mul_mat(ggml_backend_sycl_context & ctx, const ggml_tensor
        ggml_sycl_mul_mat_batched_sycl(ctx, src0, src1, dst);
    } else if (use_dequantize_mul_mat_vec) {
        ggml_sycl_op_mul_mat(ctx, src0, src1, dst, ggml_sycl_op_dequantize_mul_mat_vec, false);
        // save_tensor_txt("1/dst_1.txt", (float*) dst->data, src0->ne[1], sizeof(float), ctx.stream());
Suggested change (remove the leftover debug line):
-// save_tensor_txt("1/dst_1.txt", (float*) dst->data, src0->ne[1], sizeof(float), ctx.stream());
@@ -4251,10 +4071,72 @@ catch (sycl::exception const &exc) {
    std::exit(1);
}

void reorder_qw(char *data_device, const int ncols, const int nrows,
                size_t size, size_t offset, dpct::queue_ptr stream) {
    auto tmp_buf = sycl::malloc_shared<char>(size, *stream);
tmp_buf doesn't seem to be used on the host side. Unless I am missing something, this can be a device memory pointer.
GGML_ASSERT((size % sizeof(block_q4_0) == 0));
GGML_ASSERT((offset % sizeof(block_q4_0) == 0));
Suggested change:
-GGML_ASSERT((size % sizeof(block_q4_0) == 0));
-GGML_ASSERT((offset % sizeof(block_q4_0) == 0));
+GGML_ASSERT(size % sizeof(block_q4_0) == 0);
+GGML_ASSERT(offset % sizeof(block_q4_0) == 0);
These would also probably read better at the beginning of the function; there is no point in allocating device memory and copying data if an error is found.
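Taking this together with the earlier note about tmp_buf, a rough sketch of what I have in mind (not the PR's exact code; the simplified signature and the reordered layout, all nibbles first and then all scales, are my reading of the PR):

#include <sycl/sycl.hpp>
#include <cstdint>
#include <cassert>

// Sketch only: arguments are validated before anything is allocated, and the
// scratch buffer is device memory since tmp_buf is never read on the host.
// 18 bytes = sizeof(block_q4_0).
static void reorder_q4_0_sketch(char * data_device, size_t size, sycl::queue & q) {
    constexpr size_t block_bytes = 18;
    assert(size % block_bytes == 0);            // fail before touching the device

    const size_t nblocks = size / block_bytes;

    // device-only scratch copy of the original interleaved blocks
    char * tmp_buf = sycl::malloc_device<char>(size, q);
    q.memcpy(tmp_buf, data_device, size).wait();

    uint8_t    * qs_out = reinterpret_cast<uint8_t *>(data_device);
    sycl::half * d_out  = reinterpret_cast<sycl::half *>(data_device + nblocks * 16);

    q.parallel_for(sycl::range<1>(nblocks), [=](sycl::id<1> idx) {
        const size_t    i   = idx[0];
        const uint8_t * src = reinterpret_cast<const uint8_t *>(tmp_buf) + i * block_bytes;
        for (int j = 0; j < 16; ++j) {
            qs_out[i * 16 + j] = src[2 + j];                        // packed nibbles
        }
        d_out[i] = *reinterpret_cast<const sycl::half *>(src);      // scale
    }).wait();

    sycl::free(tmp_buf, q);
}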
GGML_ASSERT((size % sizeof(block_q4_0) == 0));
GGML_ASSERT((offset % sizeof(block_q4_0) == 0));
int offset_blks = offset / sizeof(block_q4_0);
auto qs_ptr = (uint8_t*)data_device + offset_blks * QK4_0 / 2;;
Suggested change:
-auto qs_ptr = (uint8_t*)data_device + offset_blks * QK4_0 / 2;;
+auto qs_ptr = (uint8_t*)data_device + offset_blks * QK4_0 / 2;
static ggml_status ggml_backend_sycl_graph_compute(ggml_backend_t backend, ggml_cgraph * cgraph) {
    ggml_backend_sycl_context * sycl_ctx = (ggml_backend_sycl_context *)backend->context;
    ggml_sycl_set_main_device(sycl_ctx->device);

    if (!g_ggml_sycl_disable_optimize) optimize_graph_once(cgraph, sycl_ctx);
According to #8432 (mentioning it because @luoyu-intel was part of that discussion), the suggested approach for supporting different data layouts is to modify both the set- and get-tensor functions of the backend. If we did that, we would not need this data reorganization. Have you looked into this approach to see whether it is possible?
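For illustration, a simplified sketch of that approach (not the real ggml backend interface; the function names and the host-side reorder below are assumptions): the layout change happens when the weight is uploaded, so kernels only ever see one layout and no graph pass or tensor->extra flag is needed.

#include <sycl/sycl.hpp>
#include <vector>
#include <cstdint>
#include <cstring>

// 18 bytes = sizeof(block_q4_0); the reordered layout (nibbles first, then
// scales) is assumed, as above.
static void reorder_q4_0_host(const uint8_t * src, uint8_t * dst, size_t size) {
    const size_t nblocks = size / 18;
    for (size_t i = 0; i < nblocks; ++i) {
        std::memcpy(dst + i * 16, src + i * 18 + 2, 16);           // nibbles first
        std::memcpy(dst + nblocks * 16 + i * 2, src + i * 18, 2);  // scales after
    }
}

// hypothetical set_tensor path for a Q4_0 weight stored in device USM memory;
// get_tensor would apply the inverse conversion after downloading.
static void sycl_set_tensor_q4_0(sycl::queue & q, void * device_ptr,
                                 const void * host_data, size_t size) {
    std::vector<uint8_t> staged(size);
    reorder_q4_0_host(static_cast<const uint8_t *>(host_data), staged.data(), size);
    q.memcpy(device_ptr, staged.data(), size).wait();
}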
@Alcpz I am worried that this PR will break RPC (since it adds tensor->extra in the init tensor function again). I added a review but it was never addressed for some reason.
@qnixsynapse, if you can test whether it works or not, I am happy to revert it if needed.
Unfortunately I do not have multiple PCs with Intel GPUs. But it was previously disabled because of the mentioned reason. Please see #5277 (reply in thread).
I will find out if I can set up something on my side. Thanks for your help!
@qnixsynapse I've confirmed that the RPC server is broken by this build. Thank you for your pointer: as you suspected, it is due to the extra field in the tensors. If I set …

Edit: Added working logs
@Alcpz Thank you so much for taking the initiative to test it.
Yes indeed.
* opt performance by reorder for Intel GPU
* detect hw type and save opt feature, and print opt feature
* correct name
* support optimize graph once when compute graph, record the opt status in tensor->extra, make CI passed
* add env variable GGML_SYCL_DISABLE_OPT for debug
* use syclex::architecture replace the custom hw define, update the guide for GGML_SYCL_DISABLE_OPT
* add performance data
* mv getrows functions to separeted files
* fix global variables
---------
Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>
@NeoZhangJianyu - why do you not use the SYCL joint_matrix extension here?
Joint_matrix is good for matrix * matrix. In this case, it's matrix * vector. You are welcome to try joint_matrix and contribute better code.
Won't joint_matrix do the right thing (which here is: compile down to DPAS/DPASW instructions) even if some matrix dimensions are 1, i.e. vectors?
Alternatively: is there a way to directly write kernels that will use DPAS via SYCL, but not using joint_matrix? Won't that lead to significantly improved performance?
Before drafting the code and showing good performance, we don't know which technique is better.
Optimize MUL_MAT Q4_0 on Intel GPU.
The reorder is executed once, during compute graph execution.
It is shown at startup:
Running with Environment Variables:
It passes the local CI.
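A rough, illustrative sketch of the mechanism (simplified; only GGML_SYCL_DISABLE_OPT, optimize_graph_once, and the use of tensor->extra correspond to this PR, the rest is pseudo-structure):

#include <cstdlib>
#include <string>

// escape hatch for debugging: set GGML_SYCL_DISABLE_OPT to disable the reorder
static bool sycl_opt_disabled() {
    static const bool disabled = [] {
        const char * v = std::getenv("GGML_SYCL_DISABLE_OPT");
        return v != nullptr && std::string(v) != "0";
    }();
    return disabled;
}

struct opt_extra {          // stand-in for what is recorded in tensor->extra
    bool reordered = false;
};

// called from graph compute; each eligible Q4_0 weight is reordered exactly once
template <typename Graph, typename ReorderFn>
void optimize_graph_once_sketch(Graph & graph, ReorderFn reorder_weight) {
    if (sycl_opt_disabled()) {
        return;
    }
    for (auto * tensor : graph.nodes()) {
        auto * extra = static_cast<opt_extra *>(tensor->extra);
        if (extra != nullptr && !extra->reordered) {
            reorder_weight(tensor);          // e.g. the Q4_0 mul_mat reorder
            extra->reordered = true;         // record status so it is not repeated
        }
    }
}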
Here is the performance increase on Intel GPUs (dGPU, and iGPU since MTL):
(Tested with llama-2-7b.Q4_0.gguf)
For iGPUs older than MTL, the optimization does not increase performance, so support for them is skipped for now; this needs more study.
For non-Intel GPUs, the optimization feature detection in the code does not enable this path. It is unknown whether the code works well on non-Intel GPUs; this could be verified.
I hope this optimization is the seed of further optimization on Intel GPU.
This solution is not the best possible solution for Intel GPU; there is still huge potential in Intel GPUs, and more study is needed in the future.
Thanks to @luoyu-intel and @airMeng for contributing to the solution.