Accelerate activation sparsity with activation compression #1920
Hi @jcaip, this seems like an interesting take on activation sparsity. I would like to know: if the model activations are highly sparse (>85% onwards), won't restricting them to 50% sparsity create a hard upper bound? I think an unstructured sparse kernel makes more sense in such scenarios, and it also makes a case for CPU inferencing.
@agrawal-aka Yes, that's correct: we have a max 2x acceleration with 2:4 sparsity at 50%, but theoretically we can push this higher. The difficulty with unstructured sparsity is that 1) it is hard to accelerate on GPU and 2) we need to efficiently create the metadata for the sparse matrix at runtime, since we don't have the activations beforehand. Doing so for a more general sparsity pattern is not something I've considered deeply, but it probably can be done (or at least it should be possible to figure out whether the approach is feasible). I've been thinking about combining this with something like https://openreview.net/pdf?id=gWHQQagPbN

CPU inferencing is a good point, but is this something that people care about? If so, I'd love to hear any insights you have here. I've been very GPU focused and am not super familiar with that space.
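For context on the runtime-metadata point, here is a rough sketch of what pruning an activation to 2:4 and building the sparse metadata on the fly could look like with PyTorch's built-in semi-structured sparsity support. This is not the fused torchao kernel mentioned above, just an illustration; it assumes a CUDA GPU with sparse tensor cores, fp16 inputs, and a recent PyTorch (>= 2.1).

```python
import torch
from torch.sparse import SparseSemiStructuredTensor, to_sparse_semi_structured

SparseSemiStructuredTensor._FORCE_CUTLASS = True  # use the CUTLASS backend

def prune_2_4(x: torch.Tensor) -> torch.Tensor:
    """Naive 2:4 pruning: zero the two smallest-magnitude values in each group of four."""
    groups = x.reshape(-1, 4)
    drop_idx = groups.abs().topk(2, dim=-1, largest=False).indices
    return groups.scatter(-1, drop_idx, 0.0).reshape(x.shape)

x = torch.randn(128, 128, dtype=torch.float16, device="cuda")  # activation
w = torch.randn(128, 128, dtype=torch.float16, device="cuda")  # weight

x_sparse = to_sparse_semi_structured(prune_2_4(x))  # 2:4 metadata built here, at runtime
out = torch.mm(x_sparse, w)                         # dispatched to the 2:4 sparse kernel
```

The pruning and metadata construction here are unfused, which is exactly the runtime overhead the dedicated kernels aim to hide.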
Hi @jcaip, Thanks for your response. I’m interested in exploring how activation compression might be integrated into model inference. Could you clarify at what point in the forward pass the compression and subsequent decompression should occur? Additionally, are there any specific task items or preliminary PR ideas you’re considering for this feature? On the topic of CPU inferencing, community work from ARM, AWS, Neural Magic, and Cerebras highlights a growing interest in efficiency improvements through quantization and sparsity. For example:
These examples indicate a notable momentum around CPU-based inference, suggesting that further investigation into activation compression could prove valuable across both GPU and CPU contexts.
From my understanding, activation compression would be of minimal use during inference, because you don't need to store the activations for the backward pass like you do during training. During training, instead of storing the full activation, you would compress it and store the compressed activation, and during your backward pass, you would decompress it. I think the only time this would help for inference is if your model activations don't fit inside your GPU memory, in which case you could load the compressed activations instead of the full ones when doing activation checkpointing. cc @janeyx99 who might know better here.
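To make the training-time idea concrete, here is a minimal sketch using sparse COO storage as a stand-in for a real compressor such as nvcomp; `CompressedReLU2` is a hypothetical name for illustration, not an existing torchao API.

```python
import torch

class CompressedReLU2(torch.autograd.Function):
    """Squared ReLU that saves its activation in compressed form for the backward pass."""

    @staticmethod
    def forward(ctx, x):
        relu_x = torch.relu(x)
        # relu_x is mostly zeros, so store it sparsely instead of densely
        # (a real implementation would use a GPU compressor here).
        ctx.save_for_backward(relu_x.to_sparse())
        return relu_x * relu_x

    @staticmethod
    def backward(ctx, grad_out):
        (compressed,) = ctx.saved_tensors
        relu_x = compressed.to_dense()  # decompress only when needed
        # d/dx relu(x)^2 = 2 * relu(x)  (zero where x <= 0)
        return grad_out * 2 * relu_x

x = torch.randn(16, 16, requires_grad=True)
CompressedReLU2.apply(x).sum().backward()
```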
I think the first step is to measure the overhead of these compression routines; I'm unfamiliar with them, so it would be good to know how much memory / loading time we would save. I'm not planning to work on this as I'm busy working on the kernels for 2:4 activation sparsity ATM, but if you're interested I would gladly accept a PR. Thanks for the links, I will check them out. I think for edge / CPU contexts specifically, there may be more room for speedups, as you are more likely to be memory bound than compute bound. cc @syed-shakib7 who might be interested in this as well.
Thanks @jcaip for the clarity about activation compression. From my understanding, for weight sparsity the format creation is a one-time overhead before inference, but since you mentioned there is development in progress for 2:4 activation sparsity, how is the format creation overhead handled in that case, if we have to do it at runtime?
Currently, I am inclined towards working on inference, especially for CPU use cases. Do let me know if there are any task items or preliminary PR ideas you have in mind from a weight/activation sparsity inference point of view. Would love to collaborate.
We've come up with a training recipe for 2:4 activation sparsity, which is outlined in this paper: https://openreview.net/pdf?id=O5feVk7p6Y
The gist of this approach is that squared-ReLU activations are naturally highly sparse, so we can prune them to 2:4 at runtime with minimal accuracy loss and use the sparse tensor cores for acceleration.
However, @janeyx99 pointed out to me that instead of accelerating the model using 2:4 sparsity, we can exploit that natural sparsity with activation compression instead. The idea here is that we can use something like nvcomp to compress the sparse squared-ReLU activations.
We should run some tests to see what compression ratio, and thus what memory savings, we could achieve, as well as whether there's additional overhead from the compression to account for.
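As a first back-of-the-envelope test, something like the sketch below could give a ballpark number, using zlib on CPU as a stand-in compressor; the real target would be a GPU compressor like nvcomp, which will behave quite differently, and with random inputs the activation is only ~50% zeros, whereas squared-ReLU activations in trained models are much sparser.

```python
import time
import zlib

import torch

x = torch.randn(4096, 4096)
act = torch.relu(x) ** 2                       # squared-ReLU activation
sparsity = (act == 0).float().mean().item()

raw = act.to(torch.float16).numpy().tobytes()  # dense fp16 baseline
start = time.perf_counter()
compressed = zlib.compress(raw, level=1)
elapsed = time.perf_counter() - start

print(f"sparsity:          {sparsity:.1%}")
print(f"compression ratio: {len(raw) / len(compressed):.2f}x")
print(f"compression time:  {elapsed * 1e3:.1f} ms")
```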