Bug: Adreno740 GPU device can't load model in Android system #8965


Closed
FranzKafkaYu opened this issue Aug 10, 2024 · 4 comments
Labels
bug-unconfirmed, critical severity (used to report critical severity bugs in llama.cpp, e.g. crashing, corruption, data loss), stale

Comments


FranzKafkaYu commented Aug 10, 2024

What happened?

I tried to run llama.cpp on a Samsung Galaxy Tab S9 Ultra running Android 13. I compiled the libraries according to the guide and used them in my APK; when I load the model, it hits a fatal crash.

Name and Version

tag:3400,commit:97bdd26e,support GPU acceleration:true

What operating system are you seeing the problem on?

Other? (Please let us know in description)

Relevant log output

08-10 16:06:07.269 30852 30926 I LLama-android: build info:tag:3400,commit:97bdd26e,support GPU acceleration:true
08-10 16:06:07.334 30852 30926 I LLama-android: llama_model_loader: loaded meta data with 20 key-value pairs and 290 tensors from /data/user/0/com.set.ai/files/ai_model.gguf (version GGUF V3 (latest))
08-10 16:06:07.334 30852 30926 I LLama-android: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
08-10 16:06:07.334 30852 30926 I LLama-android: llama_model_loader: - kv   0:                       general.architecture str              = qwen2
08-10 16:06:07.334 30852 30926 I LLama-android: llama_model_loader: - kv   1:                               general.name str              = seres_model
08-10 16:06:07.334 30852 30926 I LLama-android: llama_model_loader: - kv   2:                          qwen2.block_count u32              = 24
08-10 16:06:07.334 30852 30926 I LLama-android: llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
08-10 16:06:07.334 30852 30926 I LLama-android: llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 896
08-10 16:06:07.334 30852 30926 I LLama-android: llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 4864
08-10 16:06:07.334 30852 30926 I LLama-android: llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 14
08-10 16:06:07.334 30852 30926 I LLama-android: llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 2
08-10 16:06:07.334 30852 30926 I LLama-android: llama_model_loader: - kv   8:                       qwen2.rope.freq_base f32              = 1000000.000000
08-10 16:06:07.334 30852 30926 I LLama-android: llama_model_loader: - kv   9:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
08-10 16:06:07.334 30852 30926 I LLama-android: llama_model_loader: - kv  10:                          general.file_type u32              = 2
08-10 16:06:07.334 30852 30926 I LLama-android: llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
08-10 16:06:07.334 30852 30926 I LLama-android: llama_model_loader: - kv  12:                         tokenizer.ggml.pre str              = qwen2
08-10 16:06:07.362 30852 30926 I LLama-android: llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "", "&", "'", ...
08-10 16:06:07.371 30852 30926 I LLama-android: llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
08-10 16:06:07.402 30852 30926 I LLama-android: llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
08-10 16:06:07.402 30852 30926 I LLama-android: llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 151643
08-10 16:06:07.402 30852 30926 I LLama-android: llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 151645
08-10 16:06:07.402 30852 30926 I LLama-android: llama_model_loader: - kv  18:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
08-10 16:06:07.402 30852 30926 I LLama-android: llama_model_loader: - kv  19:               general.quantization_version u32              = 2
08-10 16:06:07.402 30852 30926 I LLama-android: llama_model_loader: - type  f32:  121 tensors
08-10 16:06:07.402 30852 30926 I LLama-android: llama_model_loader: - type q4_0:  168 tensors
08-10 16:06:07.402 30852 30926 I LLama-android: llama_model_loader: - type q8_0:    1 tensors
08-10 16:06:07.562 30852 30926 I LLama-android: llm_load_vocab: special tokens cache size = 293
08-10 16:06:07.616 30852 30926 I LLama-android: llm_load_vocab: token to piece cache size = 0.9338 MB
08-10 16:06:07.616 30852 30926 I LLama-android: llm_load_print_meta: format           = GGUF V3 (latest)
08-10 16:06:07.616 30852 30926 I LLama-android: llm_load_print_meta: arch             = qwen2
08-10 16:06:07.616 30852 30926 I LLama-android: llm_load_print_meta: vocab type       = BPE
08-10 16:06:07.616 30852 30926 I LLama-android: llm_load_print_meta: n_vocab          = 151936
08-10 16:06:07.616 30852 30926 I LLama-android: llm_load_print_meta: n_merges         = 151387
08-10 16:06:07.616 30852 30926 I LLama-android: llm_load_print_meta: vocab_only       = 0
08-10 16:06:07.616 30852 30926 I LLama-android: llm_load_print_meta: n_ctx_train      = 32768
08-10 16:06:07.616 30852 30926 I LLama-android: llm_load_print_meta: n_embd           = 896
08-10 16:06:07.616 30852 30926 I LLama-android: llm_load_print_meta: n_layer          = 24
08-10 16:06:07.616 30852 30926 I LLama-android: llm_load_print_meta: n_head           = 14
08-10 16:06:07.616 30852 30926 I LLama-android: llm_load_print_meta: n_head_kv        = 2
08-10 16:06:07.616 30852 30926 I LLama-android: llm_load_print_meta: n_rot            = 64
08-10 16:06:07.616 30852 30926 I LLama-android: llm_load_print_meta: n_swa            = 0
08-10 16:06:07.616 30852 30926 I LLama-android: llm_load_print_meta: n_embd_head_k    = 64
08-10 16:06:07.616 30852 30926 I LLama-android: llm_load_print_meta: n_embd_head_v    = 64
08-10 16:06:07.616 30852 30926 I LLama-android: llm_load_print_meta: n_gqa            = 7
08-10 16:06:07.616 30852 30926 I LLama-android: llm_load_print_meta: n_embd_k_gqa     = 128
08-10 16:06:07.616 30852 30926 I LLama-android: llm_load_print_meta: n_embd_v_gqa     = 128
08-10 16:06:07.616 30852 30926 I LLama-android: llm_load_print_meta: f_norm_eps       = 0.0e+00
08-10 16:06:07.616 30852 30926 I LLama-android: llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
08-10 16:06:07.616 30852 30926 I LLama-android: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
08-10 16:06:07.616 30852 30926 I LLama-android: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
08-10 16:06:07.616 30852 30926 I LLama-android: llm_load_print_meta: f_logit_scale    = 0.0e+00
08-10 16:06:07.616 30852 30926 I LLama-android: llm_load_print_meta: n_ff             = 4864
08-10 16:06:07.616 30852 30926 I LLama-android: llm_load_print_meta: n_expert         = 0
08-10 16:06:07.616 30852 30926 I LLama-android: llm_load_print_meta: n_expert_used    = 0
08-10 16:06:07.617 30852 30926 I LLama-android: llm_load_print_meta: causal attn      = 1
08-10 16:06:07.617 30852 30926 I LLama-android: llm_load_print_meta: pooling type     = 0
08-10 16:06:07.617 30852 30926 I LLama-android: llm_load_print_meta: rope type        = 2
08-10 16:06:07.617 30852 30926 I LLama-android: llm_load_print_meta: rope scaling     = linear
08-10 16:06:07.617 30852 30926 I LLama-android: llm_load_print_meta: freq_base_train  = 1000000.0
08-10 16:06:07.617 30852 30926 I LLama-android: llm_load_print_meta: freq_scale_train = 1
08-10 16:06:07.617 30852 30926 I LLama-android: llm_load_print_meta: n_ctx_orig_yarn  = 32768
08-10 16:06:07.617 30852 30926 I LLama-android: llm_load_print_meta: rope_finetuned   = unknown
08-10 16:06:07.617 30852 30926 I LLama-android: llm_load_print_meta: ssm_d_conv       = 0
08-10 16:06:07.617 30852 30926 I LLama-android: llm_load_print_meta: ssm_d_inner      = 0
08-10 16:06:07.617 30852 30926 I LLama-android: llm_load_print_meta: ssm_d_state      = 0
08-10 16:06:07.617 30852 30926 I LLama-android: llm_load_print_meta: ssm_dt_rank      = 0
08-10 16:06:07.618 30852 30926 I LLama-android: llm_load_print_meta: model type       = 1B
08-10 16:06:07.618 30852 30926 I LLama-android: llm_load_print_meta: model ftype      = Q4_0
08-10 16:06:07.618 30852 30926 I LLama-android: llm_load_print_meta: model params     = 494.03 M
08-10 16:06:07.618 30852 30926 I LLama-android: llm_load_print_meta: model size       = 330.17 MiB (5.61 BPW) 
08-10 16:06:07.618 30852 30926 I LLama-android: llm_load_print_meta: general.name     = ai_model
08-10 16:06:07.618 30852 30926 I LLama-android: llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
08-10 16:06:07.618 30852 30926 I LLama-android: llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
08-10 16:06:07.618 30852 30926 I LLama-android: llm_load_print_meta: LF token         = 148848 'ÄĬ'
08-10 16:06:07.618 30852 30926 I LLama-android: llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
08-10 16:06:07.618 30852 30926 I LLama-android: llm_load_print_meta: max token length = 256
08-10 16:06:07.624 30852 30926 D vulkan  : searching for layers in '/data/app/~~OvYsMz18c3DQFfK8i-sPtQ==/com.set.ai-gU7EJsFpEOK5rgbEU08wQw==/lib/arm64'
08-10 16:06:07.624 30852 30926 D vulkan  : searching for layers in '/data/app/~~OvYsMz18c3DQFfK8i-sPtQ==/com.set.ai-gU7EJsFpEOK5rgbEU08wQw==/base.apk!/lib/arm64-v8a'
08-10 16:06:07.627 30852 30926 W Adreno-AppProfiles: Could not find QSPM HAL service. Skipping adreno profile processing.
08-10 16:06:07.627 30852 30926 I AdrenoVK-0: ===== BEGIN DUMP OF OVERRIDDEN SETTINGS =====
08-10 16:06:07.627 30852 30926 I AdrenoVK-0: ===== END DUMP OF OVERRIDDEN SETTINGS =====
08-10 16:06:07.628 30852 30926 I AdrenoVK-0: QUALCOMM build          : d44197479c, I2991b7e11e
08-10 16:06:07.628 30852 30926 I AdrenoVK-0: Build Date              : 05/31/23
08-10 16:06:07.628 30852 30926 I AdrenoVK-0: Shader Compiler Version : E031.41.03.36
08-10 16:06:07.628 30852 30926 I AdrenoVK-0: Local Branch            : 
08-10 16:06:07.628 30852 30926 I AdrenoVK-0: Remote Branch           : 
08-10 16:06:07.628 30852 30926 I AdrenoVK-0: Remote Branch           : 
08-10 16:06:07.628 30852 30926 I AdrenoVK-0: Reconstruct Branch      : 
08-10 16:06:07.628 30852 30926 I AdrenoVK-0: Build Config            : S P 14.1.4 AArch64
08-10 16:06:07.628 30852 30926 I AdrenoVK-0: Driver Path             : /vendor/lib64/hw/vulkan.adreno.so
08-10 16:06:07.628 30852 30926 I AdrenoVK-0: Driver Version          : 0676.32
08-10 16:06:07.628 30852 30926 I AdrenoVK-0: PFP                     : 0x01740158
08-10 16:06:07.628 30852 30926 I AdrenoVK-0: ME                      : 0x00000000
08-10 16:06:07.628 30852 30926 I AdrenoVK-0: Application Name    : ggml-vulkan
08-10 16:06:07.628 30852 30926 I AdrenoVK-0: Application Version : 0x00000001
08-10 16:06:07.628 30852 30926 I AdrenoVK-0: Engine Name         : (null)
08-10 16:06:07.628 30852 30926 I AdrenoVK-0: Engine Version      : 0x00000000
08-10 16:06:07.628 30852 30926 I AdrenoVK-0: Api Version         : 0x00402000
08-10 16:06:09.099 30852 30926 I AdrenoVK-0: Failed to link shaders.
08-10 16:06:09.099 30852 30926 I AdrenoVK-0: Pipeline create failed
08-10 16:06:09.108 30852 30926 E LLama-android: llama_model_load: error loading model: vk::Device::createComputePipeline: ErrorUnknown
08-10 16:06:09.108 30852 30926 E LLama-android: llama_load_model_from_file: failed to load model
08-10 16:06:09.132 30852 30926 E LLama-android: llama_new_context_with_model: model cannot be NULL
08-10 16:06:09.132 30852 30926 F libc    : exiting due to SIG_DFL handler for signal 11, ucontext 0x7317ea5e20
@FranzKafkaYu added the bug-unconfirmed and critical severity labels on Aug 10, 2024
@github-actions bot added the stale label on Sep 10, 2024

FranzKafkaYu commented Sep 13, 2024

Update 2024/09/13: I can finally use the Vulkan backend on Android, with a Mali GPU:

09-13 14:46:42.003 29923   484 I llama-android.cpp: build info:tag:3503,commit:0fbbd884,support GPU acceleration:true
09-13 14:46:42.003 29923   484 I llama-android.cpp: system info:AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
09-13 14:46:42.004 29923   484 I llama-android.cpp: Loading model from /data/user/0/com.example.ai/files/AI.gguf
09-13 14:46:42.061 29923   484 I llama-android.cpp: llama_model_loader: loaded meta data with 32 key-value pairs and 290 tensors from /data/user/0/com.seres.aivoiceassistant/files/seres_model.gguf (version GGUF V3 (latest))
09-13 14:46:42.061 29923   484 I llama-android.cpp: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
09-13 14:46:42.062 29923   484 I llama-android.cpp: llama_model_loader: - kv   0:                       general.architecture str              = qwen2
09-13 14:46:42.062 29923   484 I llama-android.cpp: llama_model_loader: - kv   1:                               general.type str              = model
09-13 14:46:42.062 29923   484 I llama-android.cpp: llama_model_loader: - kv   2:                               general.name str              = Qwen2 0.5B Instruct
09-13 14:46:42.062 29923   484 I llama-android.cpp: llama_model_loader: - kv   3:                           general.finetune str              = Instruct
09-13 14:46:42.062 29923   484 I llama-android.cpp: llama_model_loader: - kv   4:                           general.basename str              = Qwen2
09-13 14:46:42.062 29923   484 I llama-android.cpp: llama_model_loader: - kv   5:                         general.size_label str              = 0.5B
09-13 14:46:42.062 29923   484 I llama-android.cpp: llama_model_loader: - kv   6:                            general.license str              = apache-2.0
09-13 14:46:42.062 29923   484 I llama-android.cpp: llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
09-13 14:46:42.062 29923   484 I llama-android.cpp: llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen2 0.5B
09-13 14:46:42.062 29923   484 I llama-android.cpp: llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
09-13 14:46:42.062 29923   484 I llama-android.cpp: llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2-0.5B
09-13 14:46:42.063 29923   484 I llama-android.cpp: llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["chat", "text-generation"]
09-13 14:46:42.063 29923   484 I llama-android.cpp: llama_model_loader: - kv  12:                          general.languages arr[str,1]       = ["en"]
09-13 14:46:42.063 29923   484 I llama-android.cpp: llama_model_loader: - kv  13:                          qwen2.block_count u32              = 24
09-13 14:46:42.063 29923   484 I llama-android.cpp: llama_model_loader: - kv  14:                       qwen2.context_length u32              = 32768
09-13 14:46:42.063 29923   484 I llama-android.cpp: llama_model_loader: - kv  15:                     qwen2.embedding_length u32              = 896
09-13 14:46:42.063 29923   484 I llama-android.cpp: llama_model_loader: - kv  16:                  qwen2.feed_forward_length u32              = 4864
09-13 14:46:42.063 29923   484 I llama-android.cpp: llama_model_loader: - kv  17:                 qwen2.attention.head_count u32              = 14
09-13 14:46:42.063 29923   484 I llama-android.cpp: llama_model_loader: - kv  18:              qwen2.attention.head_count_kv u32              = 2
09-13 14:46:42.063 29923   484 I llama-android.cpp: llama_model_loader: - kv  19:                       qwen2.rope.freq_base f32              = 1000000.000000
09-13 14:46:42.064 29923   484 I llama-android.cpp: llama_model_loader: - kv  20:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
09-13 14:46:42.064 29923   484 I llama-android.cpp: llama_model_loader: - kv  21:                          general.file_type u32              = 14
09-13 14:46:42.064 29923   484 I llama-android.cpp: llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
09-13 14:46:42.064 29923   484 I llama-android.cpp: llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = qwen2
09-13 14:46:42.115 29923   484 I llama-android.cpp: llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "", "&", "'", ...
09-13 14:46:42.127 29923   484 I llama-android.cpp: llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
09-13 14:46:42.193 29923   484 I llama-android.cpp: llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
09-13 14:46:42.194 29923   484 I llama-android.cpp: llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 151645
09-13 14:46:42.194 29923   484 I llama-android.cpp: llama_model_loader: - kv  28:            tokenizer.ggml.padding_token_id u32              = 151643
09-13 14:46:42.194 29923   484 I llama-android.cpp: llama_model_loader: - kv  29:                tokenizer.ggml.bos_token_id u32              = 151643
09-13 14:46:42.194 29923   484 I llama-android.cpp: llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
09-13 14:46:42.194 29923   484 I llama-android.cpp: llama_model_loader: - kv  31:               general.quantization_version u32              = 2
09-13 14:46:42.194 29923   484 I llama-android.cpp: llama_model_loader: - type  f32:  121 tensors
09-13 14:46:42.194 29923   484 I llama-android.cpp: llama_model_loader: - type q5_0:  140 tensors
09-13 14:46:42.195 29923   484 I llama-android.cpp: llama_model_loader: - type q5_1:    4 tensors
09-13 14:46:42.195 29923   484 I llama-android.cpp: llama_model_loader: - type q8_0:    1 tensors
09-13 14:46:42.195 29923   484 I llama-android.cpp: llama_model_loader: - type q4_K:   21 tensors
09-13 14:46:42.195 29923   484 I llama-android.cpp: llama_model_loader: - type q5_K:    3 tensors
09-13 14:46:42.514 29923   484 I llama-android.cpp: llm_load_vocab: special tokens cache size = 3
09-13 14:46:42.598 29923   484 I llama-android.cpp: llm_load_vocab: token to piece cache size = 0.9308 MB
09-13 14:46:42.599 29923   484 I llama-android.cpp: llm_load_print_meta: format           = GGUF V3 (latest)
09-13 14:46:42.599 29923   484 I llama-android.cpp: llm_load_print_meta: arch             = qwen2
09-13 14:46:42.599 29923   484 I llama-android.cpp: llm_load_print_meta: vocab type       = BPE
09-13 14:46:42.599 29923   484 I llama-android.cpp: llm_load_print_meta: n_vocab          = 151936
09-13 14:46:42.599 29923   484 I llama-android.cpp: llm_load_print_meta: n_merges         = 151387
09-13 14:46:42.599 29923   484 I llama-android.cpp: llm_load_print_meta: vocab_only       = 0
09-13 14:46:42.599 29923   484 I llama-android.cpp: llm_load_print_meta: n_ctx_train      = 32768
09-13 14:46:42.599 29923   484 I llama-android.cpp: llm_load_print_meta: n_embd           = 896
09-13 14:46:42.599 29923   484 I llama-android.cpp: llm_load_print_meta: n_layer          = 24
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: n_head           = 14
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: n_head_kv        = 2
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: n_rot            = 64
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: n_swa            = 0
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: n_embd_head_k    = 64
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: n_embd_head_v    = 64
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: n_gqa            = 7
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: n_embd_k_gqa     = 128
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: n_embd_v_gqa     = 128
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: f_norm_eps       = 0.0e+00
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: f_logit_scale    = 0.0e+00
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: n_ff             = 4864
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: n_expert         = 0
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: n_expert_used    = 0
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: causal attn      = 1
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: pooling type     = 0
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: rope type        = 2
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: rope scaling     = linear
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: freq_base_train  = 1000000.0
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: freq_scale_train = 1
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: n_ctx_orig_yarn  = 32768
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: rope_finetuned   = unknown
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: ssm_d_conv       = 0
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: ssm_d_inner      = 0
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: ssm_d_state      = 0
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: ssm_dt_rank      = 0
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: model type       = 1B
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: model ftype      = Q4_K - Small
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: model params     = 494.03 M
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: model size       = 361.94 MiB (6.15 BPW)
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: general.name     = Qwen2 0.5B Instruct
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: LF token         = 148848 'ÄĬ'
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
09-13 14:46:42.600 29923   484 I llama-android.cpp: llm_load_print_meta: max token length = 256
09-13 14:46:55.911 29923   484 I llama-android.cpp: llm_load_tensors: ggml ctx size =    0.25 MiB
09-13 14:46:55.931 29923   484 I llama-android.cpp: llm_load_tensors: offloading 10 repeating layers to GPU
09-13 14:46:55.931 29923   484 I llama-android.cpp: llm_load_tensors: offloaded 10/25 layers to GPU
09-13 14:46:55.931 29923   484 I llama-android.cpp: llm_load_tensors:        CPU buffer size =   361.94 MiB
09-13 14:46:55.931 29923   484 I llama-android.cpp: llm_load_tensors: Mali-G720-Immortalis MC12 buffer size =    92.67 MiB
09-13 14:46:55.931 29923   484 I llama-android.cpp: .
[... repeated "." progress log lines trimmed ...]
09-13 14:46:56.079 29923   484 I llama-android.cpp:
09-13 14:46:56.129 29923   484 I llama-android.cpp: Using 6 threads
09-13 14:46:56.129 29923   484 I llama-android.cpp: llama_new_context_with_model: n_ctx      = 2048
09-13 14:46:56.129 29923   484 I llama-android.cpp: llama_new_context_with_model: n_batch    = 2048
09-13 14:46:56.129 29923   484 I llama-android.cpp: llama_new_context_with_model: n_ubatch   = 512
09-13 14:46:56.129 29923   484 I llama-android.cpp: llama_new_context_with_model: flash_attn = 0
09-13 14:46:56.129 29923   484 I llama-android.cpp: llama_new_context_with_model: freq_base  = 1000000.0
09-13 14:46:56.129 29923   484 I llama-android.cpp: llama_new_context_with_model: freq_scale = 1
09-13 14:46:56.131 29923   484 I llama-android.cpp: llama_kv_cache_init: Vulkan_Host KV buffer size =    14.00 MiB
09-13 14:46:56.135 29923   484 I llama-android.cpp: llama_kv_cache_init: Mali-G720-Immortalis MC12 KV buffer size =    10.00 MiB
09-13 14:46:56.135 29923   484 I llama-android.cpp: llama_new_context_with_model: KV self size  =   24.00 MiB, K (f16):   12.00 MiB, V (f16):   12.00 MiB
09-13 14:46:56.136 29923   484 I llama-android.cpp: llama_new_context_with_model: Vulkan_Host  output buffer size =     0.58 MiB
09-13 14:46:56.191 29923   484 I llama-android.cpp: llama_new_context_with_model: Mali-G720-Immortalis MC12 compute buffer size =   436.44 MiB
09-13 14:46:56.191 29923   484 I llama-android.cpp: llama_new_context_with_model: Vulkan_Host compute buffer size =     5.76 MiB
09-13 14:46:56.191 29923   484 I llama-android.cpp: llama_new_context_with_model: graph nodes  = 846
09-13 14:46:56.191 29923   484 I llama-android.cpp: llama_new_context_with_model: graph splits = 200  

BUT the performance is terrible, and I don't know why. Other projects such as MediaPipe and MLC-LLM can run on this GPU and work perfectly, while llama.cpp performs poorly in this situation.

@ggerganov (Member) commented:

More than half the model is running on the CPU:

09-13 14:46:55.931 29923   484 I llama-android.cpp: llm_load_tensors: offloading 10 repeating layers to GPU
09-13 14:46:55.931 29923   484 I llama-android.cpp: llm_load_tensors: offloaded 10/25 layers to GPU

Try to offload all layers to the GPU
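The log also gives a quick way to sanity-check whether a full offload fits in GPU memory: 10 repeating layers consumed 92.67 MiB of Mali buffer, and the model has 24 repeating layers (`qwen2.block_count`). A back-of-envelope estimate, ignoring the non-repeating output tensors:

```python
# Rough full-offload estimate from the figures in the log above.
offloaded_layers = 10
gpu_buffer_mib = 92.67          # Mali-G720-Immortalis MC12 buffer size for 10 layers
total_repeating_layers = 24     # qwen2.block_count

per_layer_mib = gpu_buffer_mib / offloaded_layers
full_offload_mib = per_layer_mib * total_repeating_layers
print(f"~{full_offload_mib:.1f} MiB for all {total_repeating_layers} repeating layers")
# → ~222.4 MiB for all 24 repeating layers
```

In llama.cpp, a full offload is requested by setting `n_gpu_layers` to at least the layer count (25 here, to include the output layer), i.e. `-ngl 99` with the CLI tools.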

@FranzKafkaYu (Author) replied:

More than half the model is running on the CPU:

09-13 14:46:55.931 29923   484 I llama-android.cpp: llm_load_tensors: offloading 10 repeating layers to GPU
09-13 14:46:55.931 29923   484 I llama-android.cpp: llm_load_tensors: offloaded 10/25 layers to GPU

Try to offload all layers to the GPU

Thank you, I tried offloading all layers to the GPU:

09-13 15:41:08.289 31938  5505 I llama-android.cpp: build info:tag:3503,commit:0fbbd884,support GPU acceleration:true
09-13 15:41:08.289 31938  5505 I llama-android.cpp: system info:AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
09-13 15:41:08.290 31938  5505 I llama-android.cpp: Loading model from /data/user/0/com.example.ai/files/Al.gguf
09-13 15:41:08.347 31938  5505 I llama-android.cpp: llama_model_loader: loaded meta data with 32 key-value pairs and 290 tensors from /data/user/0/com.seres.aivoiceassistant/files/seres_model.gguf (version GGUF V3 (latest))
09-13 15:41:08.347 31938  5505 I llama-android.cpp: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
09-13 15:41:08.347 31938  5505 I llama-android.cpp: llama_model_loader: - kv   0:                       general.architecture str              = qwen2
09-13 15:41:08.348 31938  5505 I llama-android.cpp: llama_model_loader: - kv   1:                               general.type str              = model
09-13 15:41:08.348 31938  5505 I llama-android.cpp: llama_model_loader: - kv   2:                               general.name str              = Qwen2 0.5B Instruct
09-13 15:41:08.348 31938  5505 I llama-android.cpp: llama_model_loader: - kv   3:                           general.finetune str              = Instruct
09-13 15:41:08.348 31938  5505 I llama-android.cpp: llama_model_loader: - kv   4:                           general.basename str              = Qwen2
09-13 15:41:08.348 31938  5505 I llama-android.cpp: llama_model_loader: - kv   5:                         general.size_label str              = 0.5B
09-13 15:41:08.348 31938  5505 I llama-android.cpp: llama_model_loader: - kv   6:                            general.license str              = apache-2.0
09-13 15:41:08.348 31938  5505 I llama-android.cpp: llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
09-13 15:41:08.348 31938  5505 I llama-android.cpp: llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen2 0.5B
09-13 15:41:08.348 31938  5505 I llama-android.cpp: llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
09-13 15:41:08.348 31938  5505 I llama-android.cpp: llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2-0.5B
09-13 15:41:08.348 31938  5505 I llama-android.cpp: llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["chat", "text-generation"]
09-13 15:41:08.348 31938  5505 I llama-android.cpp: llama_model_loader: - kv  12:                          general.languages arr[str,1]       = ["en"]
09-13 15:41:08.348 31938  5505 I llama-android.cpp: llama_model_loader: - kv  13:                          qwen2.block_count u32              = 24
09-13 15:41:08.348 31938  5505 I llama-android.cpp: llama_model_loader: - kv  14:                       qwen2.context_length u32              = 32768
09-13 15:41:08.348 31938  5505 I llama-android.cpp: llama_model_loader: - kv  15:                     qwen2.embedding_length u32              = 896
09-13 15:41:08.348 31938  5505 I llama-android.cpp: llama_model_loader: - kv  16:                  qwen2.feed_forward_length u32              = 4864
09-13 15:41:08.348 31938  5505 I llama-android.cpp: llama_model_loader: - kv  17:                 qwen2.attention.head_count u32              = 14
09-13 15:41:08.348 31938  5505 I llama-android.cpp: llama_model_loader: - kv  18:              qwen2.attention.head_count_kv u32              = 2
09-13 15:41:08.348 31938  5505 I llama-android.cpp: llama_model_loader: - kv  19:                       qwen2.rope.freq_base f32              = 1000000.000000
09-13 15:41:08.348 31938  5505 I llama-android.cpp: llama_model_loader: - kv  20:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
09-13 15:41:08.348 31938  5505 I llama-android.cpp: llama_model_loader: - kv  21:                          general.file_type u32              = 14
09-13 15:41:08.348 31938  5505 I llama-android.cpp: llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
09-13 15:41:08.348 31938  5505 I llama-android.cpp: llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = qwen2
09-13 15:41:08.398 31938  5505 I llama-android.cpp: llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "", "&", "'", ...
09-13 15:41:08.411 31938  5505 I llama-android.cpp: llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
09-13 15:41:08.460 31938  5505 I llama-android.cpp: llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
09-13 15:41:08.460 31938  5505 I llama-android.cpp: llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 151645
09-13 15:41:08.460 31938  5505 I llama-android.cpp: llama_model_loader: - kv  28:            tokenizer.ggml.padding_token_id u32              = 151643
09-13 15:41:08.460 31938  5505 I llama-android.cpp: llama_model_loader: - kv  29:                tokenizer.ggml.bos_token_id u32              = 151643
09-13 15:41:08.460 31938  5505 I llama-android.cpp: llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
09-13 15:41:08.460 31938  5505 I llama-android.cpp: llama_model_loader: - kv  31:               general.quantization_version u32              = 2
09-13 15:41:08.460 31938  5505 I llama-android.cpp: llama_model_loader: - type  f32:  121 tensors
09-13 15:41:08.460 31938  5505 I llama-android.cpp: llama_model_loader: - type q5_0:  140 tensors
09-13 15:41:08.460 31938  5505 I llama-android.cpp: llama_model_loader: - type q5_1:    4 tensors
09-13 15:41:08.461 31938  5505 I llama-android.cpp: llama_model_loader: - type q8_0:    1 tensors
09-13 15:41:08.461 31938  5505 I llama-android.cpp: llama_model_loader: - type q4_K:   21 tensors
09-13 15:41:08.461 31938  5505 I llama-android.cpp: llama_model_loader: - type q5_K:    3 tensors
09-13 15:41:08.791 31938  5505 I llama-android.cpp: llm_load_vocab: special tokens cache size = 3
09-13 15:41:08.869 31938  5505 I llama-android.cpp: llm_load_vocab: token to piece cache size = 0.9308 MB
09-13 15:41:08.869 31938  5505 I llama-android.cpp: llm_load_print_meta: format           = GGUF V3 (latest)
09-13 15:41:08.869 31938  5505 I llama-android.cpp: llm_load_print_meta: arch             = qwen2
09-13 15:41:08.869 31938  5505 I llama-android.cpp: llm_load_print_meta: vocab type       = BPE
09-13 15:41:08.869 31938  5505 I llama-android.cpp: llm_load_print_meta: n_vocab          = 151936
09-13 15:41:08.869 31938  5505 I llama-android.cpp: llm_load_print_meta: n_merges         = 151387
09-13 15:41:08.869 31938  5505 I llama-android.cpp: llm_load_print_meta: vocab_only       = 0
09-13 15:41:08.869 31938  5505 I llama-android.cpp: llm_load_print_meta: n_ctx_train      = 32768
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: n_embd           = 896
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: n_layer          = 24
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: n_head           = 14
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: n_head_kv        = 2
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: n_rot            = 64
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: n_swa            = 0
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: n_embd_head_k    = 64
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: n_embd_head_v    = 64
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: n_gqa            = 7
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: n_embd_k_gqa     = 128
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: n_embd_v_gqa     = 128
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: f_norm_eps       = 0.0e+00
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: f_logit_scale    = 0.0e+00
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: n_ff             = 4864
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: n_expert         = 0
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: n_expert_used    = 0
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: causal attn      = 1
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: pooling type     = 0
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: rope type        = 2
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: rope scaling     = linear
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: freq_base_train  = 1000000.0
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: freq_scale_train = 1
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: n_ctx_orig_yarn  = 32768
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: rope_finetuned   = unknown
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: ssm_d_conv       = 0
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: ssm_d_inner      = 0
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: ssm_d_state      = 0
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: ssm_dt_rank      = 0
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: model type       = 1B
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: model ftype      = Q4_K - Small
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: model params     = 494.03 M
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: model size       = 361.94 MiB (6.15 BPW)
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: general.name     = Qwen2 0.5B Instruct
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: LF token         = 148848 'ÄĬ'
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
09-13 15:41:08.870 31938  5505 I llama-android.cpp: llm_load_print_meta: max token length = 256
09-13 15:41:14.658 31938  5505 I llama-android.cpp: llm_load_tensors: ggml ctx size =    0.25 MiB
09-13 15:41:14.701 31938  5505 I llama-android.cpp: llm_load_tensors: offloading 24 repeating layers to GPU
09-13 15:41:14.701 31938  5505 I llama-android.cpp: llm_load_tensors: offloading non-repeating layers to GPU
09-13 15:41:14.701 31938  5505 I llama-android.cpp: llm_load_tensors: offloaded 25/25 layers to GPU
09-13 15:41:14.701 31938  5505 I llama-android.cpp: llm_load_tensors:        CPU buffer size =   137.94 MiB
09-13 15:41:14.701 31938  5505 I llama-android.cpp: llm_load_tensors: Mali-G720-Immortalis MC12 buffer size =   361.94 MiB
09-13 15:41:14.701 31938  5505 I llama-android.cpp: .
    [~46 repeated progress-dot log lines omitted; loading finished at 15:41:15.116]
09-13 15:41:15.131 31938  5505 I llama-android.cpp: Using 6 threads
09-13 15:41:15.131 31938  5505 I llama-android.cpp: llama_new_context_with_model: n_ctx      = 2048
09-13 15:41:15.131 31938  5505 I llama-android.cpp: llama_new_context_with_model: n_batch    = 2048
09-13 15:41:15.131 31938  5505 I llama-android.cpp: llama_new_context_with_model: n_ubatch   = 512
09-13 15:41:15.131 31938  5505 I llama-android.cpp: llama_new_context_with_model: flash_attn = 0
09-13 15:41:15.131 31938  5505 I llama-android.cpp: llama_new_context_with_model: freq_base  = 1000000.0
09-13 15:41:15.131 31938  5505 I llama-android.cpp: llama_new_context_with_model: freq_scale = 1
09-13 15:41:15.139 31938  5505 I llama-android.cpp: llama_kv_cache_init: Mali-G720-Immortalis MC12 KV buffer size =    24.00 MiB
09-13 15:41:15.139 31938  5505 I llama-android.cpp: llama_new_context_with_model: KV self size  =   24.00 MiB, K (f16):   12.00 MiB, V (f16):   12.00 MiB
09-13 15:41:15.139 31938  5505 I llama-android.cpp: llama_new_context_with_model: Vulkan_Host  output buffer size =     0.58 MiB
09-13 15:41:15.179 31938  5505 I llama-android.cpp: llama_new_context_with_model: Mali-G720-Immortalis MC12 compute buffer size =   298.50 MiB
09-13 15:41:15.179 31938  5505 I llama-android.cpp: llama_new_context_with_model: Vulkan_Host compute buffer size =     5.76 MiB
09-13 15:41:15.179 31938  5505 I llama-android.cpp: llama_new_context_with_model: graph nodes  = 846
09-13 15:41:15.179 31938  5505 I llama-android.cpp: llama_new_context_with_model: graph splits = 2  
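As a sanity check, the buffer sizes in this log are internally consistent with the hyperparameters printed above (n_ctx = 2048, n_layer = 24, n_embd_k_gqa = n_embd_v_gqa = 128, f16 KV cache). A quick back-of-the-envelope verification, assuming the standard size formulas rather than quoting llama.cpp's code:

```python
# KV cache: n_ctx * n_layer * n_embd_{k,v}_gqa elements, 2 bytes each (f16)
n_ctx, n_layer, n_embd_kv_gqa, f16_bytes = 2048, 24, 128, 2

k_mib = n_ctx * n_layer * n_embd_kv_gqa * f16_bytes / 2**20
print(f"K cache: {k_mib:.2f} MiB")  # 12.00 MiB, matching the log (V is the same)

# Bits per weight: model size over parameter count
size_mib, n_params = 361.94, 494.03e6
bpw = size_mib * 2**20 * 8 / n_params
print(f"BPW: {bpw:.2f}")            # ~6.15, matching "6.15 BPW"
```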

And here is the test result:

 generated 57 tokens in 18514 ms  

With pure CPU it takes only 1700 ms.
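Converting those timings into throughput makes the gap concrete (plain arithmetic on the numbers above; `tokens_per_second` is just an illustrative helper):

```python
def tokens_per_second(n_tokens: int, elapsed_ms: float) -> float:
    """Throughput from a token count and an elapsed time in milliseconds."""
    return n_tokens / (elapsed_ms / 1000.0)

gpu_tps = tokens_per_second(57, 18514)  # ~3.1 tok/s on the Mali GPU
cpu_tps = tokens_per_second(57, 1700)   # ~33.5 tok/s on pure CPU
print(f"GPU: {gpu_tps:.1f} tok/s, CPU: {cpu_tps:.1f} tok/s, "
      f"CPU is {cpu_tps / gpu_tps:.1f}x faster")
```

So the GPU path here is roughly an order of magnitude slower than the CPU path, which is itself a strong hint that the Vulkan backend is misbehaving on this device.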

I also found that it sometimes crashes:

 Abort message: 'terminating with uncaught exception of type vk::DeviceLostError: vk::Device::waitForFences: ErrorDeviceLost'  

The crash happens in ggml_backend_sched_graph_compute_async.

@github-actions github-actions bot removed the stale label Sep 14, 2024
@github-actions github-actions bot added the stale label Oct 14, 2024
Contributor

This issue was closed because it has been inactive for 14 days since being marked as stale.
