Low performance with SYCL backend #5480
Comments
@chsasank We encourage all developers to engage in this activity. Thank you!
This issue was closed because it has been inactive for 14 days since being marked as stale.
@NeoZhangJianyu, what is the best way to get involved in this effort? From the performance testing I've done, the A770 is running about 2 to 3 times slower than my M1 Max. If there's anything I could do directly to help improve the SYCL backend, I'd love to contribute. I just don't know where to begin.
@jwhitehorn
Neither of these powerful technologies is used in the SYCL backend yet. Right now I'm still working on functionality and bug fixes rather than performance. If you want to do something for performance, you could profile the bottlenecks of the SYCL backend with Intel VTune (included in the Intel oneAPI Base Toolkit), then optimize the hottest functions first. Thank you!
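For anyone who wants to try this, here is a minimal sketch of profiling a SYCL llama-bench run with VTune from the command line. It assumes oneAPI is installed in the default /opt/intel/oneapi location and that llama.cpp was built with the SYCL backend under ./build/bin; the model path and flag values are hypothetical.

```sh
# Minimal sketch: profile a SYCL llama-bench run with Intel VTune.
# Assumes a default oneAPI install and a SYCL build of llama.cpp under
# ./build/bin; the model path below is hypothetical.
source /opt/intel/oneapi/setvars.sh

# Collect GPU hotspots for a short benchmark run (gpu-offload is an
# alternative analysis type if you care about host/device transfer time).
vtune -collect gpu-hotspots -result-dir vtune_sycl -- \
  ./build/bin/llama-bench -m models/mistral-7b-q4_0.gguf -ngl 99 -p 512 -n 128

# Print a summary of where the time went.
vtune -report summary -result-dir vtune_sycl
```

From the summary you can then drill into the hottest SYCL kernels in the VTune GUI to decide which functions to optimize first.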
@NeoZhangJianyu - reading your message (which will soon be a year old now) has made me terribly sad.
At the very least, if you're not going to fix the massive performance issues, you could add a huge banner somewhere: THE USE OF INTEL ARC PRODUCTS FOR LLAMA.CPP APPLICATIONS IS NOT RECOMMENDED.
@ky438 Who are you? I don't see any profile info on this GitHub account. Whether to use llama.cpp or not is up to each user. Please don't spread negativity here. This is an open source project, and we are private contributors working in our spare time.
Arguments like "you don't have to use it" and "we are not paid to build it" haven't stopped many high-quality open source projects from flourishing, including, ironically, much of the software stack upon which SYCL is built, and indeed much of llama.cpp itself.
The performance of llama.cpp on Intel GPUs is _terrible_, and I don't think it's particularly helpful to pretend otherwise, or to ask people not to talk about it.
I don't know how to answer "who are you?" - do you want to reach me by email directly? I don't want to make my email or phone number public due to spam.
No, I don't want to reach you.
I am working on ollama/ollama#2458 and ran some benchmarks to test the performance. I compiled with commit id 3bdc4cd0; the build segfaults with master, as in #5469. I used Mistral 7B int4 on an M2 Air, an Intel 12400, and an Arc 770 16GB, running llama-bench with the Mistral 7B model from here to measure prompt processing and text generation tok/s. My llama-bench command is:
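In representative form it looks like the sketch below; the model path is hypothetical and the -p, -n, and -ngl values are assumptions rather than the exact ones used:

```sh
# Hypothetical model path; -p (prompt tokens), -n (generated tokens) and
# -ngl (layers offloaded to the GPU) values are assumptions for illustration.
./build/bin/llama-bench -m models/mistral-7b-instruct-q4_0.gguf -p 512 -n 128 -ngl 99
```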
On M2 Air
On Intel 12400 (compiled with SYCL, but with num-gpu-layers (ngl) = 0)
On Arc 770
The good news is that prompt processing throughput is reasonably high. The bad news is that text generation on Arc GPUs is very slow.
This is much slower than I expected, since the Arc 770 is significantly faster on paper than both the M2 and the 12400. You can see benchmarks of FLOPS and memory bandwidth here: https://github.com/chsasank/device-benchmarks