Adept Persimmon Models not working with CUDA Acceleration #4038
Are you using a very recent version? There was a Persimmon fix merged just yesterday: #4010
Yes, I was using that very version #4010, on Windows 10. It IS working with …
Wait - maybe I found an error in my pipeline, causing me to check out an old version although I had already pulled the new one.
I just downloaded https://huggingface.co/maddes8cht/adept-persimmon-8b-base-gguf/blob/main/adept-persimmon-8b-base-Q4_K_M.gguf and can reproduce your issue with the latest code, so don't worry about that. I'm looking at this.
Okay, you can reproduce the error with the files I created, but I'm not sure right now whether they were actually created by #4010...
#4010 only changed how the models are evaluated, nothing that would affect converting/producing them.
Okay, then thanks for having a look...
So the problem seems to be that there's no CUDA kernel for ReLU. I tried adding one, but weirdly it still doesn't work. I might be able to submit a pull request to fix this.
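For context, an element-wise ReLU kernel over a contiguous f32 buffer is only a few lines. A minimal sketch in the style of ggml-cuda's other unary ops follows; the names, launch parameters, and wiring here are illustrative assumptions, not the actual code from #4041:

#include <cuda_runtime.h>

// Sketch of a device-side ReLU kernel: one thread handles one element.
static __global__ void relu_f32(const float * x, float * dst, const int k) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= k) {
        return;
    }
    dst[i] = fmaxf(x[i], 0.0f);
}

// Host-side launcher: one thread per element, rounded up to whole blocks.
static void relu_f32_cuda(const float * x, float * dst, const int k, cudaStream_t stream) {
    const int block_size = 256;
    const int num_blocks = (k + block_size - 1) / block_size;
    relu_f32<<<num_blocks, block_size, 0, stream>>>(x, dst, k);
}

The kernel itself is trivial; the harder part (and presumably what still fails) is hooking the new op into the backend's dispatch and offloading logic.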
@maddes8cht Please give #4041 a try. My very first CUDA kernels! (Of course I "wrote" them by cutting and pasting other working ones and changing some simple stuff, but we won't bring that little detail up.) Note you'll only be able to use …
@KerfuffleV2
With 37 layers I get …
And with more than 37 layers it does not work at all.
I don't really know enough myself to fix it, so I guess the answer is it depends on whether someone else helps with figuring out how to offload those last two KV cache layers. I think the problem is that the CUDA …
So for a final solution we would probably have to invite @JohannesGaessler to have a look at it? As an intermediate solution it would be fine to never try to offload more than the aforementioned 37 layers with a Persimmon model. That would still be better than crashing, and should be fine for a merge.
I actually looked into doing that and, as far as I could see, there isn't a way to restrict the … Having to manually set …
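If a hard cap were acceptable as a stopgap, one place it could live is in the calling application rather than inside llama.cpp. A minimal sketch, assuming the llama_model_params / llama_load_model_from_file C API; the 37-layer limit, the helper name, and the idea of a caller-side clamp are assumptions drawn from this discussion, not something llama.cpp provides:

// Hypothetical caller-side workaround: clamp the requested GPU layer count
// before loading a Persimmon model, so we never ask for more than the
// repeating layers + 1 that currently work.
#include "llama.h"
#include <algorithm>

static llama_model * load_persimmon_model(const char * path, int requested_gpu_layers) {
    const int max_offload = 37; // assumed safe upper bound for this model

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = std::min(requested_gpu_layers, max_offload);

    return llama_load_model_from_file(path, mparams);
}

This obviously doesn't help anyone driving the stock CLI tools with --n-gpu-layers directly, so it's only a sketch of the workaround, not a real fix.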
I currently do not have the time to work on llama.cpp. At the earliest I will have more time in January, but even then I have other priorities than this model.
@maddes8cht #4041 will be merged. I was able to make the error message more helpful, but GG said it should crash just so we don't forget about the problem, which makes sense. It's easy to ignore stuff that doesn't cause any pain.
So for now, it seems like you'll be able to offload at most the repeating layers + 1. There may be a way to refactor the Persimmon graph to avoid these problems, but that's not really something I can help with. We should leave this issue open until the problem is fully resolved. Quick edit: A bit off topic, but have you tried with OpenCL? I just get garbage output for anything higher than the repeating layers. So …
This issue was closed because it has been inactive for 14 days since being marked as stale. |
I have successfully GGUF-converted the base and chat variants of the Adept Persimmon models.
But the resulting .gguf models do not work with CUDA acceleration. I need to set
--n-gpu-layers 0
to get these models working. With CUDA layer offloading I get this (after all the llama_model_loader: - tensor .... lines): …
I know that the current Persimmon convert script only operates on the files provided via the link in their GitHub repository, and that this is going to be changed to work with the Hugging Face repos, so this may not be fixed in the current script at all but only in the new one.