
Metal prompt processing / inference intermittently spins but doesn't produce output #2678

Closed
ProjectAtlantis-dev opened this issue Aug 20, 2023 · 9 comments
Labels
bug Something isn't working high priority Very important issue

Comments

@ProjectAtlantis-dev

ProjectAtlantis-dev commented Aug 20, 2023

The outward symptom is that prompt processing / inference spins up the GPU and churns through a ton of busy work, but no tokens ever come out (at least no printables; I have seen a long string of \x1C before it stops responding entirely). It doesn't truly "hang" forever, since generation eventually stops on its own. It may happen immediately during initial prompt processing or mid-chat. However, once things go sour, it does not appear to recover with further input.

Under the hood, I see GPU usage spike but no tokens get produced. ggml_metal_graph_compute() starts encoding a ton of "stuff" (the queue is flooded with nodes to process, far more than appropriate), but ggml_metal_get_tensor() never extracts anything meaningful. I would guess that something in the context is getting trashed. Unfortunately, setting threads to 1 does not avoid it. Moreover, it seems that ALL threads in the pool suddenly get very busy, not just one.
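
For anyone who wants to watch this happen, a minimal diagnostic sketch (not a fix, and not code from the tree) is to log the graph size at the top of ggml_metal_graph_compute() in ggml-metal.m and compare a healthy run against a wedged one; n_nodes / n_leafs are the fields of struct ggml_cgraph in ggml.h:

#include <stdio.h>
#include "ggml.h"

// Diagnostic sketch only: print how many graph nodes are about to be encoded.
// In the failure mode described above the node count balloons far beyond what
// a single decode step should need.
static void debug_log_graph_size(const struct ggml_cgraph * gf) {
    fprintf(stderr, "ggml_metal_graph_compute: encoding %d nodes, %d leafs\n",
            gf->n_nodes, gf->n_leafs);
}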

UPDATE: Temp fix #2686 doesn't appear to solve the issue; it just reduces thread churn.

For me this bug shows up most obviously after the Aug 16 commit bf83bff (see discussion), since 3ebb009 seems quite solid.

Note that matrix multiplication was moved to Metal/GPU at the beginning of August as a way to speed up prompt processing, but Metal was then slower with llama2 (GQA), so a custom matrix-multiplication solution was developed. I'm way out of my depth here and probably not accurately describing the intent of these PRs:
#2615

Prompt processing
#2428

I am using a 64 GB M1 with a longer prompt (about 400 tokens); the file I used to test with is upstage-llama-2-70b-instruct-v2.ggmlv3.q5_K_M.bin. I am not using MPS.

@ProjectAtlantis-dev ProjectAtlantis-dev changed the title Metal prompt processing spins but doesn't produce output on bigger Macbooks Metal prompt processing sometimes spins but doesn't produce output on bigger Macbooks Aug 20, 2023
@ProjectAtlantis-dev ProjectAtlantis-dev changed the title Metal prompt processing sometimes spins but doesn't produce output on bigger Macbooks Metal prompt processing sometimes spins but doesn't produce output on bigger Macbooks (related to 2615 and 2428) Aug 20, 2023
ProjectAtlantis-dev referenced this issue Aug 20, 2023
* metal: matrix-matrix multiplication kernel

This commit removes MPS and uses custom matrix-matrix multiplication
kernels for all quantization types. This commit also adds grouped-query
attention to support llama2 70B.

* metal: fix performance degradation from gqa

Integers are slow on the GPU, and 64-bit divides are extremely slow.
In the context of GQA, we introduce a 64-bit divide that cannot be
optimized out by the compiler, which results in a decrease of ~8% in
inference performance. This commit fixes that issue by calculating a
part of the offset with a 32-bit divide. Naturally, this limits the
size of a single matrix to ~4GB. However, this limitation should
suffice for the near future.

* metal: fix bugs for GQA and perplexity test.

I mixed up ne02 and nb02 in previous commit.
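
As a rough illustration of the idea in that commit message (this is not the actual Metal kernel, and the helper below is made up for the sketch): the GQA broadcast derives the KV-head index from the query-head index with a divide, and keeping that divide in 32-bit integer arithmetic is what avoids the very slow 64-bit GPU divide. In the real kernel part of the resulting offset also stays in 32 bits, which is where the ~4 GB per-matrix cap mentioned above comes from.

#include <stdint.h>

// Illustrative sketch of the GQA offset calculation described above.
static uint64_t gqa_src_offset(uint32_t i12, uint32_t ne02, uint32_t ne12, uint64_t nb02) {
    const uint32_t r   = ne12 / ne02;   // query heads per KV head
    const uint32_t i02 = i12 / r;       // 32-bit divide instead of a 64-bit one
    return (uint64_t) i02 * nb02;       // byte offset of the matching KV head
}
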
@ProjectAtlantis-dev ProjectAtlantis-dev changed the title Metal prompt processing sometimes spins but doesn't produce output on bigger Macbooks (related to 2615 and 2428) Metal prompt processing / inference breaks (spins but doesn't produce output) on bigger Macbooks (related to 2615 and 2428) Aug 20, 2023
@ProjectAtlantis-dev ProjectAtlantis-dev changed the title Metal prompt processing / inference breaks (spins but doesn't produce output) on bigger Macbooks (related to 2615 and 2428) Metal prompt processing / inference intermittently breaks (spins but doesn't produce output) on bigger Macbooks (related to 2615 and 2428) Aug 20, 2023
@ProjectAtlantis-dev
Author

ProjectAtlantis-dev commented Aug 22, 2023

Okay, this is bizarre. I am looking at b5ffb28 more closely again: perplexity is stable, but it nonetheless seems to abort inference prematurely, and after just a few interactions it gets a tad dull / falls quickly into loops, so I think something else is going on. In many cases I get a string of non-printables (such as \x1C).

I went back to 3ebb009 and things are working MUCH, much better.

I've noticed that censored models will sometimes just produce blanks when probed for something inappropriate (instead of getting into a fight), and I hope that isn't what's going on here.

@ggerganov ggerganov added bug Something isn't working high priority Very important issue labels Aug 22, 2023
@ProjectAtlantis-dev ProjectAtlantis-dev changed the title Metal prompt processing / inference intermittently breaks (spins but doesn't produce output) on bigger Macbooks (related to 2615 and 2428) Metal prompt processing / inference intermittently spins but doesn't produce output Aug 23, 2023
@yagil

yagil commented Aug 23, 2023

Can confirm I also see this "hang" behavior with a192860 (current master) after a few back and forths.

Command: ./main -m ../../openorca-platypus2-13b.ggmlv3.q4_0.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f ../../prompts/chat-with-bob.txt -ngl 1

Model converted from ggml using convert-llama-ggmlv3-to-gguf.py

M1 MBP 16GB.

@ggerganov
Member

Weird - it was supposed to be fixed by recent commits. I've done quite a lot of testing with vanilla LLaMA and Falcon models and haven't observed this issue.

@yagil

yagil commented Aug 23, 2023

Any diagnostics output / logs I could grab to help debug?

@BBC-Esq

BBC-Esq commented Aug 23, 2023

Has anyone checked out this repo? It might have some solution: https://github.com/jankais3r/LLaMA_MPS. I don't program much, so I figured I'd just throw it out there for the experts!

@ProjectAtlantis-dev
Author

ProjectAtlantis-dev commented Aug 24, 2023

Here's an excerpt of what gets generated (Audreyana is the bot):

'Chad: (looks out the window)\n' +
'Audreyana: 10 kr change. (hands him the coin)\n' +
'Chad: okay thanks! (puts the change into the tip jar)\n' +
'Audreyana: thank you! (smiles) So, what brings you to Greenland?\n' +
'Chad: Business\n' +
'Audreyana: \x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\x1C\n' +
'Chad: crap\n' +
'Audreyana: ',

After this point, Audreyana stops generating anything at all. At the low level, the queue is flooded with jobs: a ton of random node ranges being processed, far more than usual. Whatever is deciding to stuff the queue with these requests seems to be the culprit. It could be a self-feeding problem: llama hits a token it does not like and spits out a bunch of control characters (as above), which then pollute the context for subsequent rounds. I'm not sure whether the control characters are an artifact of the C++ processing, whether there is garbage in llama2 itself, or whether some additional Unicode scrubbing is needed.
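
If the scrubbing route is the answer, a minimal sketch (not code from llama.cpp or any existing PR) would be to strip control characters from each detokenized piece before it is appended back into the chat context, so a burst of \x1C cannot keep poisoning subsequent rounds:

#include <stddef.h>
#include <string.h>

// Hedged sketch: scrub ASCII control bytes (other than '\n' and '\t') from a
// detokenized piece in place and return the new length.
static size_t scrub_control_chars(char * s, size_t n) {
    size_t w = 0;
    for (size_t r = 0; r < n; ++r) {
        const unsigned char c = (unsigned char) s[r];
        if ((c >= 0x20 && c != 0x7f) || c == '\n' || c == '\t') {
            s[w++] = s[r];   // keep printable bytes and common whitespace
        }                    // drop \x1C and the rest of the C0 control range
    }
    if (w < n) {
        memset(s + w, 0, n - w);  // zero the tail so the buffer stays clean
    }
    return w;
}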

@ProjectAtlantis-dev
Author

ProjectAtlantis-dev commented Aug 24, 2023

Well, I am trying dadbed9 and have not run into any issues so far. I put in a check for character code 28 just in case: #2758.

@ggerganov
Member

Can confirm I also see this "hang" behavior with a192860 (current master) after a few back and forths.

Command: ./main -m ../../openorca-platypus2-13b.ggmlv3.q4_0.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f ../../prompts/chat-with-bob.txt -ngl 1

Model converted from ggml using convert-llama-ggmlv3-to-gguf.py

M1 MBP 16GB.

This model runs OK on M2 Ultra with the latest master:

 $ ▶ ./main -m ./models/openorca-platypus2-13b.ggmlv3.q4_0.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt -ngl 1
main: build = 1044 (44d5462)
main: seed  = 1692855885
llama_model_loader: loaded meta data with 18 key-value pairs and 363 tensors from ./models/openorca-platypus2-13b.ggmlv3.q4_0.gguf (version GGUF V1 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_0     [  5120, 32002,     1,     1 ]
llama_model_loader: - tensor    1:               output_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor    2:                    output.weight q4_0     [  5120, 32002,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    5:              blk.0.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    6:         blk.0.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    7:           blk.0.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor    8:            blk.0.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    9:            blk.0.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   10:              blk.0.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   11:            blk.0.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   12:              blk.1.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   13:              blk.1.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   14:              blk.1.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   15:         blk.1.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   16:           blk.1.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   17:            blk.1.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   18:            blk.1.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   19:              blk.1.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   20:            blk.1.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   21:              blk.2.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   22:              blk.2.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   23:              blk.2.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   24:         blk.2.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   25:           blk.2.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   26:            blk.2.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   27:            blk.2.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   28:              blk.2.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   29:            blk.2.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   30:              blk.3.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   31:              blk.3.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   32:              blk.3.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   33:         blk.3.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   34:           blk.3.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   35:            blk.3.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   36:            blk.3.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   37:              blk.3.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   38:            blk.3.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   39:              blk.4.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   40:              blk.4.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   41:              blk.4.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   42:         blk.4.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   43:           blk.4.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   44:            blk.4.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   45:            blk.4.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   46:              blk.4.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   47:            blk.4.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   48:              blk.5.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   49:              blk.5.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   50:              blk.5.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   51:         blk.5.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   52:           blk.5.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   53:            blk.5.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   54:            blk.5.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   55:              blk.5.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   56:            blk.5.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   57:              blk.6.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   58:              blk.6.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   59:              blk.6.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   60:         blk.6.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   61:           blk.6.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   62:            blk.6.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   63:            blk.6.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   64:              blk.6.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   65:            blk.6.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   66:              blk.7.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   67:              blk.7.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   68:              blk.7.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   69:         blk.7.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   70:           blk.7.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   71:            blk.7.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   72:            blk.7.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   73:              blk.7.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   74:            blk.7.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   75:              blk.8.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   76:              blk.8.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   77:              blk.8.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   78:         blk.8.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   79:           blk.8.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   80:            blk.8.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   81:            blk.8.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   82:              blk.8.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   83:            blk.8.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   84:              blk.9.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   85:              blk.9.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   86:              blk.9.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   87:         blk.9.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   88:           blk.9.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   89:            blk.9.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   90:            blk.9.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   91:              blk.9.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   92:            blk.9.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   93:             blk.10.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   94:             blk.10.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   95:             blk.10.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   96:        blk.10.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   97:          blk.10.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   98:           blk.10.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   99:           blk.10.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  100:             blk.10.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  101:           blk.10.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  102:             blk.11.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  103:             blk.11.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  104:             blk.11.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  105:        blk.11.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  106:          blk.11.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  107:           blk.11.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  108:           blk.11.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  109:             blk.11.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  110:           blk.11.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  111:             blk.12.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  112:             blk.12.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  113:             blk.12.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  114:        blk.12.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  115:          blk.12.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  116:           blk.12.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  117:           blk.12.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  118:             blk.12.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  119:           blk.12.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  120:             blk.13.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  121:             blk.13.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  122:             blk.13.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  123:        blk.13.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  124:          blk.13.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  125:           blk.13.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  126:           blk.13.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  127:             blk.13.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  128:           blk.13.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  129:             blk.14.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  130:             blk.14.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  131:             blk.14.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  132:        blk.14.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  133:          blk.14.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  134:           blk.14.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  135:           blk.14.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  136:             blk.14.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  137:           blk.14.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  138:             blk.15.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  139:             blk.15.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  140:             blk.15.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  141:        blk.15.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  142:          blk.15.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  143:           blk.15.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  144:           blk.15.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  145:             blk.15.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  146:           blk.15.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  147:             blk.16.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  148:             blk.16.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  149:             blk.16.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  150:        blk.16.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  151:          blk.16.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  152:           blk.16.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  153:           blk.16.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  154:             blk.16.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  155:           blk.16.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  156:             blk.17.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  157:             blk.17.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  158:             blk.17.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  159:        blk.17.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  160:          blk.17.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  161:           blk.17.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  162:           blk.17.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  163:             blk.17.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  164:           blk.17.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  165:             blk.18.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  166:             blk.18.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  167:             blk.18.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  168:        blk.18.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  169:          blk.18.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  170:           blk.18.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  171:           blk.18.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  172:             blk.18.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  173:           blk.18.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  174:             blk.19.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  175:             blk.19.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  176:             blk.19.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  177:        blk.19.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  178:          blk.19.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  179:           blk.19.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  180:           blk.19.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  181:             blk.19.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  182:           blk.19.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  183:             blk.20.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  184:             blk.20.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  185:             blk.20.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  186:        blk.20.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  187:          blk.20.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  188:           blk.20.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  189:           blk.20.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  190:             blk.20.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  191:           blk.20.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  192:             blk.21.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  193:             blk.21.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  194:             blk.21.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  195:        blk.21.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  196:          blk.21.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  197:           blk.21.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  198:           blk.21.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  199:             blk.21.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  200:           blk.21.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  201:             blk.22.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  202:             blk.22.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  203:             blk.22.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  204:        blk.22.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  205:          blk.22.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  206:           blk.22.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  207:           blk.22.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  208:             blk.22.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  209:           blk.22.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  210:             blk.23.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  211:             blk.23.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  212:             blk.23.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  213:        blk.23.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  214:          blk.23.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  215:           blk.23.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  216:           blk.23.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  217:             blk.23.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  218:           blk.23.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  219:             blk.24.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  220:             blk.24.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  221:             blk.24.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  222:        blk.24.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  223:          blk.24.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  224:           blk.24.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  225:           blk.24.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  226:             blk.24.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  227:           blk.24.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  228:             blk.25.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  229:             blk.25.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  230:             blk.25.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  231:        blk.25.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  232:          blk.25.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  233:           blk.25.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  234:           blk.25.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  235:             blk.25.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  236:           blk.25.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  237:             blk.26.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  238:             blk.26.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  239:             blk.26.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  240:        blk.26.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  241:          blk.26.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  242:           blk.26.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  243:           blk.26.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  244:             blk.26.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  245:           blk.26.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  246:             blk.27.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  247:             blk.27.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  248:             blk.27.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  249:        blk.27.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  250:          blk.27.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  251:           blk.27.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  252:           blk.27.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  253:             blk.27.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  254:           blk.27.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  255:             blk.28.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  256:             blk.28.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  257:             blk.28.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  258:        blk.28.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  259:          blk.28.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  260:           blk.28.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  261:           blk.28.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  262:             blk.28.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  263:           blk.28.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  264:             blk.29.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  265:             blk.29.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  266:             blk.29.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  267:        blk.29.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  268:          blk.29.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  269:           blk.29.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  270:           blk.29.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  271:             blk.29.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  272:           blk.29.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  273:             blk.30.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  274:             blk.30.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  275:             blk.30.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  276:        blk.30.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  277:          blk.30.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  278:           blk.30.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  279:           blk.30.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  280:             blk.30.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  281:           blk.30.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  282:             blk.31.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  283:             blk.31.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  284:             blk.31.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  285:        blk.31.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  286:          blk.31.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  287:           blk.31.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  288:           blk.31.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  289:             blk.31.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  290:           blk.31.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  291:             blk.32.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  292:             blk.32.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  293:             blk.32.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  294:        blk.32.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  295:          blk.32.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  296:           blk.32.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  297:           blk.32.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  298:             blk.32.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  299:           blk.32.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  300:             blk.33.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  301:             blk.33.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  302:             blk.33.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  303:        blk.33.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  304:          blk.33.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  305:           blk.33.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  306:           blk.33.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  307:             blk.33.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  308:           blk.33.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  309:             blk.34.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  310:             blk.34.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  311:             blk.34.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  312:        blk.34.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  313:          blk.34.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  314:           blk.34.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  315:           blk.34.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  316:             blk.34.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  317:           blk.34.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  318:             blk.35.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  319:             blk.35.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  320:             blk.35.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  321:        blk.35.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  322:          blk.35.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  323:           blk.35.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  324:           blk.35.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  325:             blk.35.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  326:           blk.35.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  327:             blk.36.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  328:             blk.36.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  329:             blk.36.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  330:        blk.36.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  331:          blk.36.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  332:           blk.36.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  333:           blk.36.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  334:             blk.36.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  335:           blk.36.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  336:             blk.37.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  337:             blk.37.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  338:             blk.37.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  339:        blk.37.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  340:          blk.37.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  341:           blk.37.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  342:           blk.37.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  343:             blk.37.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  344:           blk.37.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  345:             blk.38.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  346:             blk.38.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  347:             blk.38.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  348:        blk.38.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  349:          blk.38.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  350:           blk.38.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  351:           blk.38.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  352:             blk.38.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  353:           blk.38.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  354:             blk.39.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  355:             blk.39.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  356:             blk.39.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  357:        blk.39.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  358:          blk.39.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  359:           blk.39.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  360:           blk.39.ffn_down.weight q4_0     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  361:             blk.39.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  362:           blk.39.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str     
llama_model_loader: - kv   1:                               general.name str     
llama_model_loader: - kv   2:                        general.description str     
llama_model_loader: - kv   3:                       llama.context_length u32     
llama_model_loader: - kv   4:                     llama.embedding_length u32     
llama_model_loader: - kv   5:                          llama.block_count u32     
llama_model_loader: - kv   6:                  llama.feed_forward_length u32     
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32     
llama_model_loader: - kv   8:                 llama.attention.head_count u32     
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32     
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32     
llama_model_loader: - kv  11:                       tokenizer.ggml.model str     
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr     
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr     
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr     
llama_model_loader: - kv  15:            tokenizer.ggml.unknown_token_id u32     
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32     
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32     
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q4_0:  282 tensors
llm_load_print_meta: format         = GGUF V1 (latest)
llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32002
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 2048
llm_load_print_meta: n_ctx          = 512
llm_load_print_meta: n_embd         = 5120
llm_load_print_meta: n_head         = 40
llm_load_print_meta: n_head_kv      = 40
llm_load_print_meta: n_layer        = 40
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 1
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 5.0e-06
llm_load_print_meta: n_ff           = 13824
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 13B
llm_load_print_meta: model ftype    = mostly Q4_0 (guessed)
llm_load_print_meta: model size     = 13.02 B
llm_load_print_meta: general.name   = openorca-platypus2-13b.ggmlv3.q4_0.bin
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MB
llm_load_tensors: mem required  = 6983.74 MB (+  400.00 MB per state)
....................................................................................................
llama_new_context_with_model: kv self size  =  400.00 MB
ggml_metal_init: allocating
ggml_metal_init: loading '/Users/ggerganov/development/github/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x1206085c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_add_row                        0x120608d00 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul                            0x120609240 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_row                        0x120609890 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_scale                          0x120609dd0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_silu                           0x12060a310 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_relu                           0x12060a850 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_gelu                           0x12060ad90 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_soft_max                       0x12060b460 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_diag_mask_inf                  0x12060bae0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_f16                   0x12060c1b0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x12060c9f0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x12060d0c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x12060d790 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x12060de60 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x12060e530 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x12060ec00 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x12060f2d0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_rms_norm                       0x12060f9b0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_norm                           0x1206101f0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x120610ac0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x120611240 | th_max =  896 | th_width =   32
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x1206119c0 | th_max =  896 | th_width =   32
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x1206122c0 | th_max =  640 | th_width =   32
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x120612a40 | th_max =  704 | th_width =   32
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x1206131c0 | th_max =  576 | th_width =   32
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x120613940 | th_max =  576 | th_width =   32
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x1206142c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_f16_f32                 0x120614ce0 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q4_0_f32                0x1206154a0 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q4_1_f32                0x120615c60 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q2_K_f32                0x120616420 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q3_K_f32                0x120616960 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q4_K_f32                0x120617120 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q5_K_f32                0x1206178e0 | th_max =  704 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q6_K_f32                0x1206180a0 | th_max =  704 | th_width =   32
ggml_metal_init: loaded kernel_rope                           0x1206185e0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_alibi_f32                      0x120618ec0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x120619770 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x12061a020 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x12061a8d0 | th_max = 1024 | th_width =   32
ggml_metal_init: recommendedMaxWorkingSetSize  = 147456.00 MB
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: maxTransferRate               = built-in GPU
llama_new_context_with_model: compute buffer total size =   91.41 MB
llama_new_context_with_model: max tensor size =    87.90 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  6984.22 MB, ( 6984.66 / 147456.00)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =     1.42 MB, ( 6986.08 / 147456.00)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   402.00 MB, ( 7388.08 / 147456.00)
ggml_metal_add_buffer: allocated 'alloc           ' buffer, size =    90.02 MB, ( 7478.09 / 147456.00)

system_info: n_threads = 16 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
main: interactive mode on.
Reverse prompt: 'User:'
sampling: repeat_last_n = 64, repeat_penalty = 1.000000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 256, n_keep = 0


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

 Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.

User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User: List first 5 prime numbers
Bob: Of course! The first 5 prime numbers are: 2, 3, 5, 7, and 11.
User: Who is Richard Feynman?
Bob: Richard Feynman was a renowned American physicist. He made significant contributions to the field of quantum mechanics. He was awarded the Nobel Prize in Physics in 1965.
User: How many seconds are in a day?  
Bob: There are 86,400 seconds in a day.
User:

llama_print_timings:        load time =   456.06 ms
llama_print_timings:      sample time =     1.78 ms /    97 runs   (    0.02 ms per token, 54402.69 tokens per second)
llama_print_timings: prompt eval time =   696.48 ms /   126 tokens (    5.53 ms per token,   180.91 tokens per second)
llama_print_timings:        eval time =  2162.47 ms /    96 runs   (   22.53 ms per token,    44.39 tokens per second)
llama_print_timings:       total time = 66252.36 ms

@ProjectAtlantis-dev This sounds like a different issue from the one discussed initially. You can open a separate issue for it.

@ProjectAtlantis-dev
Author

Resolved
