Adept Persimmon Models not working with CUDA Acceleration #4038

Closed
maddes8cht opened this issue Nov 11, 2023 · 16 comments
Labels
bug (Something isn't working), stale

Comments

@maddes8cht
Contributor

maddes8cht commented Nov 11, 2023

I have successfully gguf-converted the base and chat variants of the Adept Persimmon models.

But the resulting .gguf models do not work with CUDA acceleration. I need to set
--n-gpu-layers 0 to get these models working.
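
For reference, an invocation roughly along these lines is what I mean by disabling offload (the model path and prompt are just examples):

./main -m models/adept-persimmon-8b-chat-Q4_1.gguf --n-gpu-layers 0 -p "Hello"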

With CUDA layer offloading I get this (after all the llama_model_loader: - tensor ... lines):

llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                               general.name str
llama_model_loader: - kv   2:                   persimmon.context_length u32
llama_model_loader: - kv   3:                 persimmon.embedding_length u32
llama_model_loader: - kv   4:                      persimmon.block_count u32
llama_model_loader: - kv   5:              persimmon.feed_forward_length u32
llama_model_loader: - kv   6:             persimmon.rope.dimension_count u32
llama_model_loader: - kv   7:             persimmon.attention.head_count u32
llama_model_loader: - kv   8:          persimmon.attention.head_count_kv u32
llama_model_loader: - kv   9:                   persimmon.rope.freq_base f32
llama_model_loader: - kv  10:     persimmon.attention.layer_norm_epsilon f32
llama_model_loader: - kv  11:                       tokenizer.ggml.model str
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv  17:               general.quantization_version u32
llama_model_loader: - kv  18:                          general.file_type u32
llama_model_loader: - type  f32:  434 tensors
llama_model_loader: - type q4_1:  145 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 76599/262144 vs 259/262144 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = persimmon
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 262144
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 64
llm_load_print_meta: n_layer          = 36
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 16384
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 25000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = mostly Q4_1
llm_load_print_meta: model params     = 9.40 B
llm_load_print_meta: model size       = 5.67 GiB (5.18 BPW)
llm_load_print_meta: general.name   = persimmon-8b-chat
llm_load_print_meta: BOS token = 71013 '|ENDOFTEXT|'
llm_load_print_meta: EOS token = 71013 '|ENDOFTEXT|'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 71128 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.21 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 4967.56 MB
llm_load_tensors: offloading 36 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 39/39 layers to GPU
llm_load_tensors: VRAM used: 840.03 MB
............................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 25000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 288.00 MB
llama_new_context_with_model: kv self size  =  288.00 MB
llama_build_graph: non-view tensors processed: 1481/1481
llama_new_context_with_model: compute buffer total size = 7.66 MB
llama_new_context_with_model: VRAM scratch buffer: 1.03 MB
llama_new_context_with_model: total VRAM used: 5456.41 MB (model: 5167.38 MB, context: 289.03 MB)
GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:7510: src1->backend == GGML_BACKEND_GPU

I know that the current Persimmon conversion script only operates on the files provided via the link in their GitHub repository, and that it is going to be changed to work with the Hugging Face repos, so this may not be fixed in the current script at all but only in the new one.

@KerfuffleV2
Collaborator

Are you using a very recent version? There was a Persimmon fix merged just yesterday: #4010

@maddes8cht
Contributor Author

maddes8cht commented Nov 11, 2023

Yes, I was using that very version, #4010, on Windows 10.
Earlier versions didn't produce a working model at all.

It IS working with --n-gpu-layers 0, but not with a positive value of --n-gpu-layers.
Maybe someone can check out the models at https://huggingface.co/collections/maddes8cht/adept-persimmon-models-gguf-654f89a18d8c3bf4ddc8e842 on other machines/OSes, but on my Windows machine they only work with --n-gpu-layers 0.
Other models I created do work with layer offloading in CUDA.

@maddes8cht
Contributor Author

Wait, maybe I found an error in my pipeline that caused me to check out an old version even though I had already pulled the new one.
I need to recheck whether things work...

@KerfuffleV2
Collaborator

I just downloaded https://huggingface.co/maddes8cht/adept-persimmon-8b-base-gguf/blob/main/adept-persimmon-8b-base-Q4_K_M.gguf and can reproduce your issue with the latest master.

So don't worry about that. I'm looking at this.

@maddes8cht
Contributor Author

Okay, you can reproduce the error with the files I created, but I'm not sure right now whether they were actually created with #4010.

@KerfuffleV2
Collaborator

#4010 only changed evaluating the models, nothing that would affect converting/producing them.

@maddes8cht
Contributor Author

Okay, then thanks for having a look...

@cebtenzzre added the bug (Something isn't working) label and removed the bug-unconfirmed label on Nov 11, 2023
@KerfuffleV2
Collaborator

So the problem seems to be that there's no CUDA kernel for ReLU. I tried adding one but weirdly it still doesn't work. I might be able to submit a pull to fix this.
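
For context, a freestanding element-wise ReLU kernel is only a few lines. A sketch along these lines (illustrative only; the names, block size, and launcher are made up, and the real ggml-cuda op has to plug into the backend's dispatch and tensor conventions):

// Illustrative sketch of an element-wise ReLU kernel and its launcher.
__global__ void relu_f32(const float * x, float * dst, const int k) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= k) {
        return;
    }
    dst[i] = fmaxf(x[i], 0.0f);
}

// Host-side launch: one thread per element, hypothetical block size of 256.
static void relu_f32_cuda(const float * x, float * dst, const int k, cudaStream_t stream) {
    const int block_size = 256;
    const int num_blocks = (k + block_size - 1) / block_size;
    relu_f32<<<num_blocks, block_size, 0, stream>>>(x, dst, k);
}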

@KerfuffleV2
Collaborator

KerfuffleV2 commented Nov 11, 2023

@maddes8cht Please give #4041 a try. My very first CUDA kernels! (Of course I "wrote" them by cut-and-pasting other working ones and changing some simple stuff but we won't bring that little detail up.) Note you'll only be able to use -ngl 37 and lower for the 8B. The last 2 non-repeating layers still can't be offloaded. (36 vs 37 makes a big speed difference though.)

@maddes8cht
Contributor Author

@KerfuffleV2
I compiled your #4041, and it works as you described:
I can offload 36 layers, and I can offload 37 layers, which makes a big difference in speed.
With 36 layers I get

llama_print_timings:        load time =  114127.27 ms
llama_print_timings:      sample time =      64.78 ms /    65 runs   (    1.00 ms per token,  1003.47 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =   17361.35 ms /    86 runs   (  201.88 ms per token,     4.95 tokens per second)
llama_print_timings:       total time =  105090.92 ms

With 37 I get

llama_print_timings:        load time =  106922.01 ms
llama_print_timings:      sample time =     153.19 ms /   142 runs   (    1.08 ms per token,   926.92 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =   12002.42 ms /   163 runs   (   73.63 ms per token,    13.58 tokens per second)
llama_print_timings:       total time =   32245.73 ms

And with more than 37 layers it does not work at all.
Will this change before merging?

@KerfuffleV2
Collaborator

@maddes8cht

Will this change before merging?

I don't really know enough myself to fix it, so I guess the answer is it depends on whether someone else helps with figuring out how to offload those last two KV cache layers.

I think the problem is that the CUDA CPY op kernel can't handle 4D tensors, but adapting it to do that is currently beyond my ability. It's very possible that once that one is solved there will be another issue to deal with. It seems like Persimmon does some unusual stuff compared to the other models.
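
To illustrate what "handle 4D tensors" means here: an element-wise copy has to unflatten a linear thread index into four coordinates and apply separate byte strides for source and destination. A rough sketch (ggml-style ne/nb naming, f32 only, names made up; the real CPY op supports several types and has contiguity fast paths):

// Illustrative only: copy one element of a possibly non-contiguous 4D f32
// tensor, given shapes ne0..ne3 and byte strides nb0..nb3 / dnb0..dnb3.
__global__ void cpy_f32_4d(const char * src, char * dst,
                           const int ne0, const int ne1, const int ne2, const int ne3,
                           const size_t nb0,  const size_t nb1,  const size_t nb2,  const size_t nb3,
                           const size_t dnb0, const size_t dnb1, const size_t dnb2, const size_t dnb3) {
    const long long i = (long long) blockIdx.x * blockDim.x + threadIdx.x;
    const long long n = (long long) ne0 * ne1 * ne2 * ne3;
    if (i >= n) {
        return;
    }
    // Unflatten the linear index into 4D coordinates (i0 fastest).
    const int i0 =  i % ne0;
    const int i1 = (i / ne0) % ne1;
    const int i2 = (i / ((long long) ne0 * ne1)) % ne2;
    const int i3 =  i / ((long long) ne0 * ne1 * ne2);

    const float * s = (const float *) (src + i0*nb0  + i1*nb1  + i2*nb2  + i3*nb3);
    float       * d = (float       *) (dst + i0*dnb0 + i1*dnb1 + i2*dnb2 + i3*dnb3);
    *d = *s;
}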

@maddes8cht
Contributor Author

maddes8cht commented Nov 12, 2023

So for a final solution we would probably have to invite @JohannesGaessler to have a look at it?

As an intermediate solution it would be fine to never try to offload more than the aforementioned 37 layers with a Persimmon model.
With -ngl 40, it would still only offload 37 layers.
Would that be easier to do?

That would still be better than crashing, and should be fine for a merge.
We would still have a working Persimmon model.
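
Roughly, the idea is a clamp like this (a hypothetical sketch only, not actual llama.cpp loader code; the function name, the warning text, and where it would hook in are all made up):

#include <cstdio>

// Hypothetical: cap the requested layer count for Persimmon at
// "repeating layers + 1" instead of hitting the CUDA assert.
static int persimmon_clamp_gpu_layers(int n_gpu_layers, int n_layer) {
    const int max_offloadable = n_layer + 1; // e.g. 36 + 1 = 37 for the 8B model
    if (n_gpu_layers > max_offloadable) {
        fprintf(stderr, "warning: persimmon: clamping --n-gpu-layers from %d to %d\n",
                n_gpu_layers, max_offloadable);
        return max_offloadable;
    }
    return n_gpu_layers;
}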

@KerfuffleV2
Collaborator

With -ngl 40,it would still only offload 37 layers.

I actually looked into doing that and, as far as I could see, there isn't a way to restrict the -ngl value like that in the loader code. I might have missed something though; I'm really not familiar with that part.

Having to manually set -ngl based on the layers is better than not being able to offload at all, but obviously it's not ideal.

@JohannesGaessler
Collaborator

So for a final solution we would probably have to invite @JohannesGaessler to have a look at it?

I currently do not have the time to work on llama.cpp. At the earliest I will have more time in January but even then I have other priorities than this model.

@KerfuffleV2
Collaborator

KerfuffleV2 commented Nov 13, 2023

@maddes8cht #4041 will be merged. I was able to make the error message more helpful, but GG said it should crash just so we don't forget about the problem, which makes sense. It's easy to ignore stuff that doesn't cause any pain.

llm_load_tensors: using ROCm for GPU acceleration
llm_load_tensors: CUDA backend missing Persimmon CUDA ops, can offload at most 37 layers. See: https://github.com/ggerganov/llama.cpp/issues/4038
error loading model: Persimmon CUDA offload failed
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/blah/adept-persimmon-8b-base-Q4_K_M.gguf'
main: error: unable to load model

So for now, it seems like you'll be able to offload at most the repeating layers + 1. There may be a way to refactor the Persimmon graph to avoid these problems but that's not really something I can help with.

We should leave this issue open until the problem is fully resolved.

quick edit: A bit off topic, but have you tried with OpenCL? I just get garbage output for anything higher than the repeating layers. So -ngl 37 doesn't work, only -ngl 36.

Contributor

github-actions bot commented Apr 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 2, 2024