Low performance with Sycl Backend #5480


Closed
chsasank opened this issue Feb 13, 2024 · 8 comments

@chsasank

I am working on ollama/ollama#2458 and ran some benchmarks to test the performance. I compiled at commit 3bdc4cd0; the build segfaults on master, as in #5469.
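
For reference, here is a minimal sketch of the SYCL build I used, assuming the recipe from llama.cpp's README-sycl.md of this era (the LLAMA_SYCL flag and compiler names follow those docs and may have changed in later revisions):

```sh
# Minimal SYCL build sketch; flag names follow README-sycl.md of this era
# and may differ on newer revisions.
source /opt/intel/oneapi/setvars.sh   # make icx/icpx and the SYCL runtime visible
mkdir -p build && cd build
cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build . --config Release -j
```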

I benchmarked Mistral 7B int4 on an M2 Air, an Intel 12400, and an Arc 770 16GB, using llama-bench with the Mistral 7B model from here to measure prompt-processing and text-generation throughput in tokens/s. My llama-bench command is:

```sh
./build/bin/llama-bench -m models/mistral-7b-v0.1.Q4_0.gguf -p 128,256,512 -n 128,256,512
```

On M2 Air

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | pp 128 | 144.47 ± 0.22 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | pp 256 | 142.95 ± 1.17 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | pp 512 | 141.36 ± 0.67 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | tg 128 | 20.06 ± 0.66 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | tg 256 | 20.26 ± 0.17 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | tg 512 | 13.96 ± 1.62 |

On Intel 12400 (compiled with SYCL, but with num-gpu-layers (ngl) set to 0, so inference runs entirely on the CPU)

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL | 0 | pp 128 | 18.60 ± 3.07 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL | 0 | pp 256 | 20.82 ± 0.14 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL | 0 | pp 512 | 22.48 ± 0.16 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL | 0 | tg 128 | 10.78 ± 0.02 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL | 0 | tg 256 | 10.76 ± 0.02 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL | 0 | tg 512 | 10.69 ± 0.01 |

On Arc 770

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL | 99 | pp 128 | 407.14 ± 58.05 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL | 99 | pp 256 | 583.57 ± 78.24 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL | 99 | pp 512 | 757.99 ± 1.48 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL | 99 | tg 128 | 24.74 ± 0.27 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL | 99 | tg 256 | 24.65 ± 0.20 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL | 99 | tg 512 | 21.46 ± 2.39 |

The good news is that prompt-processing throughput is fairly high. The bad news is that text-generation throughput on the Arc GPU is very low.

This is much slower than I expected, since the Arc 770 has significantly higher raw FLOPS and memory bandwidth than both the M2 and the 12400. You can see my FLOPS and bandwidth benchmarks here: https://github.com/chsasank/device-benchmarks
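
For anyone reproducing these numbers, it's worth first confirming that the SYCL runtime actually sees the GPU. A minimal check, assuming a standard oneAPI install (sycl-ls ships with the oneAPI Base Toolkit):

```sh
# List the devices visible to the SYCL runtime.
source /opt/intel/oneapi/setvars.sh
sycl-ls
# Expect a Level Zero GPU entry along the lines of:
# [ext_oneapi_level_zero:gpu:0] Intel(R) Arc(TM) A770 Graphics ...
```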

@NeoZhangJianyu
Collaborator

@chsasank
Thank you for your feedback!
Currently, the SYCL backend work is focused on functional issues. We try our best to avoid performance regressions during development, but performance optimization has not started yet.

We encourage all developers to get involved in this effort.

Thank you!

github-actions bot added the stale label on Apr 11, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@jwhitehorn

> @chsasank Thank you for your feedback! Currently, the SYCL backend work is focused on functional issues. We try our best to avoid performance regressions during development, but performance optimization has not started yet. We encourage all developers to get involved in this effort. Thank you!

@NeoZhangJianyu, what is the best way to get involved in this effort?

From the performance testing I've done, the A770 runs about 2 to 3 times slower than my M1 Max. If there's anything I can do directly to help improve the SYCL backend, I'd love to contribute; I just don't know where to begin.

@NeoZhangJianyu
Collaborator

@jwhitehorn
It's great to know you are interested in the SYCL backend.
The M1 is an SoC, while the Arc 770 is a discrete GPU. They are not the same type of device, so comparing their performance directly is not entirely fair.

  1. The current SYCL backend runs SYCL code on the execution units (the Vector Engines, XVE). The Arc 770 also has Matrix Engines (XMX), which the SYCL backend does not use yet.
  2. ESIMD is a low-level programming technology for unlocking Intel GPU performance, and the backend does not use it yet either.

Since neither of these powerful technologies is used by the SYCL backend so far, I think there is huge potential to improve it on Intel GPUs.

For now, I am still working on functionality and bug fixes rather than performance, because I think most users run llama.cpp with a single request at a time, i.e. it serves one client at a time. Human reading speed is limited, so a very fast response (say, under 20 ms/token, i.e. above 50 tokens/s) would not bring much extra value to a single user. In fact, I think the Arc 770's performance is good enough for a single user; that's why performance is not high on my priority list. Of course, I will do some performance tuning step by step in the future.

If you want to work on performance, you could profile the SYCL backend's bottlenecks with Intel VTune (included in the Intel oneAPI Base Toolkit), then optimize the hottest functions first; a sketch of such a run follows below. For LLMs, the bottlenecks are in both compute and memory bandwidth.
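
A minimal sketch of such a profiling run, assuming VTune's command-line interface from the oneAPI Base Toolkit (the result-directory name here is arbitrary):

```sh
# Collect GPU hotspots while llama-bench exercises the SYCL backend.
source /opt/intel/oneapi/setvars.sh
vtune -collect gpu-hotspots -result-dir vtune_sycl \
    -- ./build/bin/llama-bench -m models/mistral-7b-v0.1.Q4_0.gguf -n 128

# Print a summary of the collected result, including the hottest kernels.
vtune -report summary -result-dir vtune_sycl
```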

Thank you!

@ky438

ky438 commented Mar 25, 2025

@NeoZhangJianyu, reading your message (which will soon be a year old now) has made me terribly sad:

  • @jwhitehorn politely asked you for guidance on improving the SYCL backend, and you essentially ignored him: what sort of person capable of working on this kind of performance-sensitive code would need to be told something as basic as "optimize the hottest functions first"?
  • You then continued, "Of course, I will do some performance tuning step by step in the future." Well? Did you ever tackle the very first point you made? Is the XMX still unused? Indeed, the Arc 770 is now obsolete; an entirely new generation of hardware has been on the market for some months now, and does it perform well? No, it does not: it has the same issue @chsasank reported, namely that text-generation speed is insanely slow.

At the very least, if you're not going to fix the massive performance issues, you could add a huge banner somewhere: THE USE OF INTEL ARC PRODUCTS FOR LLAMA.CPP APPLICATIONS IS NOT RECOMMENDED.

@NeoZhangJianyu
Collaborator

@ky438 Who are you? I don't see any profile info on this GitHub account.

Whether or not to use llama.cpp is up to each user.

Please don't spread negativity here.

This is an open-source project. We are private contributors working in our spare time.

@ky438

ky438 commented Mar 26, 2025 via email

@NeoZhangJianyu
Collaborator

No, I don't want to reach you.
If an account is purely anonymous, it won't take responsibility for its words.
