Metal prompt processing / inference intermittently spins but doesn't produce output #2678
* metal: matrix-matrix multiplication kernel

  This commit removes MPS and uses custom matrix-matrix multiplication kernels for all quantization types. This commit also adds grouped-query attention to support llama2 70B.

* metal: fix performance degradation from gqa

  Integers are slow on the GPU, and 64-bit divides are extremely slow. In the context of GQA, we introduce a 64-bit divide that cannot be optimized out by the compiler, which results in a decrease of ~8% in inference performance. This commit fixes that issue by calculating a part of the offset with a 32-bit divide. Naturally, this limits the size of a single matrix to ~4GB. However, this limitation should suffice for the near future.

* metal: fix bugs for GQA and perplexity test

  I mixed up ne02 and nb02 in previous commit.
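To make the 32-bit divide idea in the commit message concrete, here is a minimal host-side C++ sketch, assuming the usual index-times-stride offset math; the names (`linear_idx`, `ne1`, `nb1`, `nb2`) are hypothetical stand-ins, and this is not the actual Metal kernel code:

```cpp
// Illustrative sketch only, not the real kernel: the GQA offset math forces a
// divide, and doing that divide in 64-bit is very slow on the GPU.
#include <cstdint>
#include <cstddef>

// Slow variant: the divide and modulo run in 64-bit.
static size_t offset_64bit(uint64_t linear_idx, uint64_t ne1, size_t nb1, size_t nb2) {
    uint64_t i2 = linear_idx / ne1;   // 64-bit divide: extremely slow on the GPU
    uint64_t i1 = linear_idx % ne1;
    return (size_t) (i2 * nb2 + i1 * nb1);
}

// Faster variant: do the divide in 32-bit, then widen for the byte offset.
// This assumes linear_idx fits in 32 bits, i.e. a single matrix stays under ~4GB.
static size_t offset_32bit(uint32_t linear_idx, uint32_t ne1, size_t nb1, size_t nb2) {
    uint32_t i2 = linear_idx / ne1;   // 32-bit divide: much cheaper
    uint32_t i1 = linear_idx % ne1;
    return (size_t) i2 * nb2 + (size_t) i1 * nb1;
}
```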
Okay, this is bizarre. I am looking at b5ffb28 more closely again - perplexity is stable, but it nonetheless also seems to abort inference prematurely, and after just a few interactions it is a tad dull / falls quickly into loops, so I think something else is going on. In many cases I get a string of non-printables (such as \x1C). I went back to 3ebb009 and things are working MUCH much better. I've noticed that censored models will sometimes just produce blanks when probed for something inappropriate (instead of getting into a fight), and I hope that isn't what's going on here.
Can confirm I also see this "hang" behavior with a192860 (current). Running a converted model on an M1 MBP 16GB.
Weird - it was supposed to be fixed with recent commits. I've done quite a lot of testing with vanilla LLaMA and Falcon models and haven't observed this issue.
Any diagnostics output / logs I could grab to help debug?
Has anyone checked out this repo? It might have a solution: https://github.com/jankais3r/LLaMA_MPS
I don't program much, so I figured I'd just throw it out there for the experts!
Here's an excerpt of what gets generated (Audreyana is the bot):
After this point, Audreyana stops generating anything at all. At the low level, the queue is flooded with jobs: a ton of various random node ranges being processed, far more than usual. Whatever is deciding to stuff the queue with these requests seems to be the culprit. It could be a self-feeding problem: llama hits a token it does not like and then spits out a bunch of control characters (as above), which then pollute the context for subsequent rounds. I'm not sure if the control characters are an artifact of C++ processing, if there is garbage in llama2 itself, or if some additional Unicode scrubbing is needed.
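For illustration, a hypothetical scrubbing pass along those lines might look like the following; this is not existing llama.cpp code, and it only strips ASCII control characters such as \x1C while leaving multi-byte UTF-8 sequences alone:

```cpp
// Hypothetical sketch of the kind of control-character scrubbing suggested
// above: filter a generated text piece before it is appended to the chat
// context, dropping C0 controls (e.g. \x1C) and DEL but keeping newlines/tabs.
#include <string>

static std::string strip_control_chars(const std::string & piece) {
    std::string out;
    out.reserve(piece.size());
    for (unsigned char c : piece) {
        const bool is_c0  = c < 0x20 && c != '\n' && c != '\t';
        const bool is_del = c == 0x7F;
        if (!is_c0 && !is_del) {
            out.push_back((char) c);
        }
    }
    return out;
}
```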
This model runs OK on M2 Ultra with latest
@ProjectAtlantis-dev This sounds like a different issue from the one discussed initially. You can open a separate issue for this.
Resolved
The outward symptom is that prompt processing / inference spins up the GPU and churns through a ton of busy work, but no tokens ever come out (at least no printables - I have seen a long string of \x1C before it stops responding entirely). It doesn't really "hang" forever, because it eventually stops generating. It may happen immediately on initial prompt processing or during chat interaction. However, once things go sour, it does not appear to recover with further input.
Under the hood, I see GPU usage spike but no tokens get produced. ggml_metal_graph_compute() decides to start encoding a ton of "stuff" (the queue is flooded with nodes to process, far more than appropriate), but ggml_metal_get_tensor() never extracts anything meaningful. I would guess that something in the context is getting trashed. Unfortunately, setting threads to 1 does not avoid it. Moreover, it seems that ALL threads in the pool suddenly get very busy, not just one.
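For anyone trying to reproduce this, a minimal diagnostic sketch along these lines, assuming the ggml-metal API of this era (it is not existing llama.cpp code), would just log how many nodes each graph hands to the Metal backend, to tell whether the graph itself is oversized or the encoder is re-submitting work:

```cpp
// Diagnostic sketch (assumed API, not existing llama.cpp code): log the node
// count of each compute graph before handing it to the Metal backend.
#include <cstdio>
#include "ggml.h"
#include "ggml-metal.h"

static void metal_compute_logged(struct ggml_metal_context * ctx, struct ggml_cgraph * gf) {
    fprintf(stderr, "metal: encoding graph with %d nodes\n", gf->n_nodes);
    ggml_metal_graph_compute(ctx, gf);
}
```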
UPDATE: Temp fix #2686 doesn't appear to solve the issue, just reduces thread churn
For me, this bug shows up most obviously after the Aug 16 commit bf83bff (see discussion), since 3ebb009 seems quite solid
Note that matrix multiply was moved to Metal/GPU at the beginning of August as a way to speed up prompt processing, but then Metal was slower with llama2 (GQA), so a custom matrix solution was developed. I'm way out of my depth here and probably not accurately describing the intent of these PRs:
* #2615
* Prompt processing #2428
I am using a 64GB M1 with a longer prompt (about 400 tokens). The file I used to test with is upstage-llama-2-70b-instruct-v2.ggmlv3.q5_K_M.bin. I am not using MPS.