Adept Persimmon Models not working with CUDA Acceleration #4038

Closed
maddes8cht opened this issue Nov 11, 2023 · 16 comments
Labels
bug (Something isn't working), stale

Comments

@maddes8cht
Contributor

maddes8cht commented Nov 11, 2023

I have successfully gguf-converted the base and chat variants of the Adept Persimmon models.

But the resulting .gguf models do not work with CUDA acceleration. I need to set
--n-gpu-layers 0 to get these models working.
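
For reference, an invocation roughly along these lines is what I mean by disabling offload (the model path and prompt are just examples):

./main -m models/adept-persimmon-8b-chat-Q4_1.gguf --n-gpu-layers 0 -p "Hello"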

With CUDA layer offloading I get this (after all the llama_model_loader: - tensor ... lines):

llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                               general.name str
llama_model_loader: - kv   2:                   persimmon.context_length u32
llama_model_loader: - kv   3:                 persimmon.embedding_length u32
llama_model_loader: - kv   4:                      persimmon.block_count u32
llama_model_loader: - kv   5:              persimmon.feed_forward_length u32
llama_model_loader: - kv   6:             persimmon.rope.dimension_count u32
llama_model_loader: - kv   7:             persimmon.attention.head_count u32
llama_model_loader: - kv   8:          persimmon.attention.head_count_kv u32
llama_model_loader: - kv   9:                   persimmon.rope.freq_base f32
llama_model_loader: - kv  10:     persimmon.attention.layer_norm_epsilon f32
llama_model_loader: - kv  11:                       tokenizer.ggml.model str
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv  17:               general.quantization_version u32
llama_model_loader: - kv  18:                          general.file_type u32
llama_model_loader: - type  f32:  434 tensors
llama_model_loader: - type q4_1:  145 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 76599/262144 vs 259/262144 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = persimmon
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 262144
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 64
llm_load_print_meta: n_layer          = 36
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 16384
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 25000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = mostly Q4_1
llm_load_print_meta: model params     = 9.40 B
llm_load_print_meta: model size       = 5.67 GiB (5.18 BPW)
llm_load_print_meta: general.name   = persimmon-8b-chat
llm_load_print_meta: BOS token = 71013 '|ENDOFTEXT|'
llm_load_print_meta: EOS token = 71013 '|ENDOFTEXT|'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 71128 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.21 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 4967.56 MB
llm_load_tensors: offloading 36 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 39/39 layers to GPU
llm_load_tensors: VRAM used: 840.03 MB
............................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 25000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 288.00 MB
llama_new_context_with_model: kv self size  =  288.00 MB
llama_build_graph: non-view tensors processed: 1481/1481
llama_new_context_with_model: compute buffer total size = 7.66 MB
llama_new_context_with_model: VRAM scratch buffer: 1.03 MB
llama_new_context_with_model: total VRAM used: 5456.41 MB (model: 5167.38 MB, context: 289.03 MB)
GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:7510: src1->backend == GGML_BACKEND_GPU

I know that the current Persimmon conversion script only operates on the files provided via the link in their GitHub repository, and that it is going to be changed to work with the Hugging Face repos, so this may not be fixed in the current script at all but only in the new one.

@KerfuffleV2
Collaborator

Are you using a very recent version? There was a Persimmon fix merged just yesterday: #4010

@maddes8cht
Contributor Author

maddes8cht commented Nov 11, 2023

Yes, I was using that very version, #4010, on Windows 10.
Earlier versions didn't produce a working model at all.

It IS working with --n-gpu-layers 0, but not with a positive value of --n-gpu-layers.
Maybe someone can check out the models at https://huggingface.co/collections/maddes8cht/adept-persimmon-models-gguf-654f89a18d8c3bf4ddc8e842 on other machines/OSes, but on my Windows machine they only work with --n-gpu-layers 0.
Other models I created do work with layer offloading in CUDA.

@maddes8cht
Contributor Author

Wait, maybe I found an error in my pipeline that caused me to check out an old version even though I had already pulled the new one.
I need to recheck whether things work...

@KerfuffleV2
Collaborator

I just downloaded https://huggingface.co/maddes8cht/adept-persimmon-8b-base-gguf/blob/main/adept-persimmon-8b-base-Q4_K_M.gguf and can reproduce your issue with the latest master.

So don't worry about that. I'm looking at this.

@maddes8cht
Contributor Author

Okay, you can reproduce the error with the files I created, but I'm not sure right now whether they were actually created with #4010.

@KerfuffleV2
Collaborator

#4010 only changed evaluating the models, nothing that would affect converting/producing them.

@maddes8cht
Contributor Author

Okay, then thanks for having a look...

@cebtenzzre added the bug (Something isn't working) label and removed the bug-unconfirmed label on Nov 11, 2023
@KerfuffleV2
Collaborator

So the problem seems to be that there's no CUDA kernel for ReLU. I tried adding one but weirdly it still doesn't work. I might be able to submit a pull to fix this.
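
For context, a freestanding element-wise ReLU kernel is only a few lines. A sketch along these lines (illustrative only; the names, block size, and launcher are made up, and the real ggml-cuda op has to plug into the backend's dispatch and tensor conventions):

// Illustrative sketch of an element-wise ReLU kernel and its launcher.
__global__ void relu_f32(const float * x, float * dst, const int k) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= k) {
        return;
    }
    dst[i] = fmaxf(x[i], 0.0f);
}

// Host-side launch: one thread per element, hypothetical block size of 256.
static void relu_f32_cuda(const float * x, float * dst, const int k, cudaStream_t stream) {
    const int block_size = 256;
    const int num_blocks = (k + block_size - 1) / block_size;
    relu_f32<<<num_blocks, block_size, 0, stream>>>(x, dst, k);
}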

@KerfuffleV2
Collaborator

KerfuffleV2 commented Nov 11, 2023

@maddes8cht Please give #4041 a try. My very first CUDA kernels! (Of course I "wrote" them by cut-and-pasting other working ones and changing some simple stuff but we won't bring that little detail up.) Note you'll only be able to use -ngl 37 and lower for the 8B. The last 2 non-repeating layers still can't be offloaded. (36 vs 37 makes a big speed difference though.)

@maddes8cht
Contributor Author

@KerfuffleV2
I compiled your #4041, and it works as you described:
I can offload 36 layers, and I can offload 37 layers, which makes a big difference in speed.
With 36 layers I get

llama_print_timings:        load time =  114127.27 ms
llama_print_timings:      sample time =      64.78 ms /    65 runs   (    1.00 ms per token,  1003.47 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =   17361.35 ms /    86 runs   (  201.88 ms per token,     4.95 tokens per second)
llama_print_timings:       total time =  105090.92 ms

With 37 I get

llama_print_timings:        load time =  106922.01 ms
llama_print_timings:      sample time =     153.19 ms /   142 runs   (    1.08 ms per token,   926.92 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =   12002.42 ms /   163 runs   (   73.63 ms per token,    13.58 tokens per second)
llama_print_timings:       total time =   32245.73 ms

And with more than 37 layers it does not work at all.
Will this change before merging?

@KerfuffleV2
Collaborator

@maddes8cht

Will this change before merging?

I don't really know enough myself to fix it, so I guess the answer is it depends on whether someone else helps with figuring out how to offload those last two KV cache layers.

I think the problem is that the CUDA CPY op kernel can't handle 4D tensors, but adapting it to do that is currently beyond my ability. It's very possible that once that one is solved there will be another issue to deal with. It seems like Persimmon does some unusual stuff compared to the other models.
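
To illustrate what "handle 4D tensors" means here: an element-wise copy has to unflatten a linear thread index into four coordinates and apply separate byte strides for source and destination. A rough sketch (ggml-style ne/nb naming, f32 only, names made up; the real CPY op supports several types and has contiguity fast paths):

// Illustrative only: copy one element of a possibly non-contiguous 4D f32
// tensor, given shapes ne0..ne3 and byte strides nb0..nb3 / dnb0..dnb3.
__global__ void cpy_f32_4d(const char * src, char * dst,
                           const int ne0, const int ne1, const int ne2, const int ne3,
                           const size_t nb0,  const size_t nb1,  const size_t nb2,  const size_t nb3,
                           const size_t dnb0, const size_t dnb1, const size_t dnb2, const size_t dnb3) {
    const long long i = (long long) blockIdx.x * blockDim.x + threadIdx.x;
    const long long n = (long long) ne0 * ne1 * ne2 * ne3;
    if (i >= n) {
        return;
    }
    // Unflatten the linear index into 4D coordinates (i0 fastest).
    const int i0 =  i % ne0;
    const int i1 = (i / ne0) % ne1;
    const int i2 = (i / ((long long) ne0 * ne1)) % ne2;
    const int i3 =  i / ((long long) ne0 * ne1 * ne2);

    const float * s = (const float *) (src + i0*nb0  + i1*nb1  + i2*nb2  + i3*nb3);
    float       * d = (float       *) (dst + i0*dnb0 + i1*dnb1 + i2*dnb2 + i3*dnb3);
    *d = *s;
}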

@maddes8cht
Contributor Author

maddes8cht commented Nov 12, 2023

So for a final solution we would probably have to invite @JohannesGaessler to have a look at it?

As an intermediate solution it would be fine to never try to offload more than the aforementioned 37 layers with a Persimmon model.
With -ngl 40, it would still only offload 37 layers.
Would that be easier to do?

That would still be better than crashing, and should be fine for a merge.
We would still have a working Persimmon model.
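
Roughly, the idea is a clamp like this (a hypothetical sketch only, not actual llama.cpp loader code; the function name, the warning text, and where it would hook in are all made up):

#include <cstdio>

// Hypothetical: cap the requested layer count for Persimmon at
// "repeating layers + 1" instead of hitting the CUDA assert.
static int persimmon_clamp_gpu_layers(int n_gpu_layers, int n_layer) {
    const int max_offloadable = n_layer + 1; // e.g. 36 + 1 = 37 for the 8B model
    if (n_gpu_layers > max_offloadable) {
        fprintf(stderr, "warning: persimmon: clamping --n-gpu-layers from %d to %d\n",
                n_gpu_layers, max_offloadable);
        return max_offloadable;
    }
    return n_gpu_layers;
}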

@KerfuffleV2
Collaborator

With -ngl 40,it would still only offload 37 layers.

I actually looked into doing that and, as far as I could see, there isn't a way to restrict the -ngl value like that in the loader code. I might have missed something though; I'm really not familiar with that part.

Having to manually set -ngl based on the layers is better than not being able to offload at all, but obviously it's not ideal.

@JohannesGaessler
Collaborator

So for a final solution we would probably have to invite @JohannesGaessler to have a look at it?

I currently do not have the time to work on llama.cpp. At the earliest I will have more time in January but even then I have other priorities than this model.

@KerfuffleV2
Collaborator

KerfuffleV2 commented Nov 13, 2023

@maddes8cht #4041 will be merged. I was able to make the error message more helpful, but GG said it should crash just so we don't forget about the problem, which makes sense. It's easy to ignore stuff that doesn't cause any pain.

llm_load_tensors: using ROCm for GPU acceleration
llm_load_tensors: CUDA backend missing Persimmon CUDA ops, can offload at most 37 layers. See: https://github.com/ggerganov/llama.cpp/issues/4038
error loading model: Persimmon CUDA offload failed
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/blah/adept-persimmon-8b-base-Q4_K_M.gguf'
main: error: unable to load model

So for now, it seems like you'll be able to offload at most the repeating layers + 1. There may be a way to refactor the Persimmon graph to avoid these problems but that's not really something I can help with.

We should leave this issue open until the problem is fully resolved.

quick edit: A bit off topic, but have you tried with OpenCL? I just get garbage output for anything higher than the repeating layers. So -ngl 37 doesn't work, only -ngl 36.

Contributor

github-actions bot commented Apr 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 2, 2024