
Accelerate activation sparsity with activation compression #1920

Open
jcaip opened this issue Mar 18, 2025 · 5 comments
Labels
good first issue Good for newcomers

Comments

@jcaip
Contributor

jcaip commented Mar 18, 2025

We've come up with a training recipe for 2:4 activation sparsity, which is outlined in this paper: https://openreview.net/pdf?id=O5feVk7p6Y

The gist of this approach is that:

  1. We find high levels of activation sparsity (>85%) when training Squared-ReLU based FFNs instead of SwiGLU FFNs. These Squared-ReLU FFNs show minimal to no accuracy loss (see the sketch after this list).
  2. We accelerate the sparse activation x dense weight matmul with 2:4 sparsity. We can naively sparsify for the forward pass, dropping values that do not fit the 2:4 constraint. For the backward pass, we need some special sauce to maintain accuracy.
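
To make (1) concrete, here is a minimal sketch (not from the paper; the module and dimensions are made up) of a Squared-ReLU FFN with a quick measurement of how many activations are exactly zero. Note that with random weights the ReLU only zeros roughly half the entries; the >85% figure comes from trained models:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SquaredReLUFFN(nn.Module):
    """Hypothetical Squared-ReLU FFN: y = W2(relu(W1 x) ** 2)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        hidden = F.relu(self.w1(x)) ** 2
        # Fraction of exact zeros -- the activation sparsity we hope to exploit.
        print(f"activation sparsity: {(hidden == 0).float().mean().item():.2%}")
        return self.w2(hidden)

ffn = SquaredReLUFFN(d_model=512, d_ff=2048)
_ = ffn(torch.randn(4, 128, 512))
```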

However, @janeyx99 pointed out to me that instead of accelerating the model with 2:4 sparsity, we can seek to exploit (1) via activation compression. The idea here is that we can use something like nvcomp to compress the sparse Squared-ReLU activations.

We should run some tests to see what compression ratio, and thus what memory savings, we could achieve, as well as whether there's additional compression overhead to account for.
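
As a back-of-the-envelope starting point (this is just arithmetic, not an actual nvcomp call; the shapes and the 90% sparsity level are made up for illustration), one can bound the achievable ratio by assuming we store only the nonzero values plus a one-bit-per-element mask:

```python
import torch

def estimate_compression_ratio(act: torch.Tensor) -> float:
    """Rough upper bound: keep nonzero values + a 1-bit-per-element mask.
    A real compressor (e.g. nvcomp) will land somewhere below this."""
    dense_bytes = act.numel() * act.element_size()
    nnz = int((act != 0).sum())
    compressed_bytes = nnz * act.element_size() + act.numel() // 8
    return dense_bytes / compressed_bytes

# Example: an fp16 activation that is ~90% zeros.
act = torch.randn(4096, 11008, dtype=torch.float16)
act[torch.rand(act.shape) < 0.9] = 0
print(f"estimated compression ratio: {estimate_compression_ratio(act):.1f}x")
```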

@jcaip jcaip added the good first issue Good for newcomers label Mar 18, 2025
@agrawal-aka
Contributor

Hi @jcaip, this seems like an interesting take on activation sparsity. If the model activations are highly sparse (>85%), won't restricting them to 50% sparsity create a hard upper bound? I think an unstructured sparse kernel makes more sense in such scenarios, and it also makes a case for CPU inference.

@jcaip
Contributor Author

jcaip commented Apr 8, 2025

@agrawal-aka Yes, that's correct: 2:4 sparsity at 50% caps us at a maximum 2x acceleration, but theoretically we can push this higher. The difficulty with unstructured sparsity is that 1) it is hard to accelerate on GPU, and 2) we need to efficiently create the metadata for the sparse matrix at runtime, since we don't have the activations beforehand. Doing so for a more general sparsity pattern isn't something I've considered deeply, but it can probably be done (or at least it should be possible to figure out whether the approach is feasible). I've also been thinking about combining this with something like https://openreview.net/pdf?id=gWHQQagPbN
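
For reference, the naive forward-pass sparsification mentioned above amounts to something like the following (a sketch only; the function name is made up, and it does not produce the packed 2:4 format / metadata that the sparse tensor core kernels actually consume, which is the part that has to be made fast at runtime):

```python
import torch

def naive_two_four_prune(x: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude values in every contiguous group of 4
    along the last dim, zero out the rest."""
    orig_shape = x.shape
    groups = x.reshape(-1, 4)  # (num_groups, 4); assumes numel divisible by 4
    # Indices of the two smallest-magnitude entries per group.
    drop_idx = groups.abs().topk(2, dim=-1, largest=False).indices
    mask = torch.ones_like(groups)
    mask.scatter_(-1, drop_idx, 0.0)
    return (groups * mask).reshape(orig_shape)

x = torch.randn(8, 16)
print(naive_two_four_prune(x))
```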

CPU inferencing is a good point, but is this something that people care about? If so, I'd love to hear any insights you have here. I've been very GPU focused, so I'm not super familiar with that space.

@agrawal-aka
Contributor

Hi @jcaip,

Thanks for your response.

I’m interested in exploring how activation compression might be integrated into model inference. Could you clarify at what point in the forward pass the compression and subsequent decompression should occur? Additionally, are there any specific task items or preliminary PR ideas you’re considering for this feature?

On the topic of CPU inferencing, community work from ARM, AWS, Neural Magic, and Cerebras highlights a growing interest in efficiency improvements through quantization and sparsity. For example:

  1. ARM’s blog on LLM inference on the Neoverse V2 using int4 kernels
  2. AWS’s posts on optimized PyTorch 2.0 inference with Graviton processors, showing up to 50% cost savings
  3. AWS’s posts on SLM inferencing using CPUs
  4. Neural Magic’s demonstrations of significant speedups with fused sparse and quantized kernels
  5. Cerebras’s exploration of a 70% unstructured sparse LLaMA model achieving high accuracy with CPU inference via DeepSparse

These examples indicate a notable momentum around CPU-based inference, suggesting that further investigation into activation compression could prove valuable across both GPU and CPU contexts.
Looking forward to your thoughts!

@jcaip
Contributor Author

jcaip commented Apr 15, 2025

@agrawal-aka

Could you clarify at what point in the forward pass the compression and subsequent decompression should occur?

From my understanding, activation compression would be of minimal use during inference, because you don't need to store the activations for the backward pass like you do during training. During training, instead of storing the full activation, you would compress it and store the compressed activation, and during the backward pass you would decompress it.
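
To make that flow concrete, here is a minimal sketch using `torch.autograd.graph.saved_tensors_hooks`; the `pack`/`unpack` functions and the sparse-COO conversion are just a stand-in for a real compressor such as nvcomp (whose API I'm not reproducing here):

```python
import torch
from torch.autograd.graph import saved_tensors_hooks

def pack(t: torch.Tensor):
    # Only bother "compressing" large, mostly-zero tensors saved for backward.
    if t.numel() > 2**16 and (t == 0).float().mean() > 0.5:
        return ("sparse", t.to_sparse())
    return ("dense", t)

def unpack(packed):
    kind, t = packed
    return t.to_dense() if kind == "sparse" else t

lin1 = torch.nn.Linear(512, 2048)
lin2 = torch.nn.Linear(2048, 512)
x = torch.randn(256, 512, requires_grad=True)

with saved_tensors_hooks(pack, unpack):
    h = torch.relu(lin1(x)) ** 2   # sparse squared-ReLU activation
    y = lin2(h)
    y.sum().backward()             # saved activations are decompressed on demand here
```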

I think the only time that this would help for inference is if your model activations don't fit inside your GPU memory, in which case you could load the compressed activations instead of the full ones when doing activation checkpointing. cc @janeyx99 who might know better here.

Additionally, are there any specific task items or preliminary PR ideas you’re considering for this feature?

I think the first step is to measure the overhead of these compression routines; I'm unfamiliar with them, so it would be good to know how much memory / loading time we would save. I'm not planning to work on this myself, as I'm busy with the kernels for 2:4 activation sparsity at the moment, but if you're interested I would gladly accept a PR.
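
A hypothetical micro-benchmark for that first step, reusing the `pack`/`unpack` hooks from the sketch above (shapes are arbitrary and a CUDA device is assumed):

```python
import contextlib
import time
import torch
from torch.autograd.graph import saved_tensors_hooks

def step(use_hooks: bool, device: str = "cuda"):
    """Run one fwd/bwd pass and report (peak MiB, wall-clock seconds)."""
    lin1 = torch.nn.Linear(4096, 11008, device=device)
    lin2 = torch.nn.Linear(11008, 4096, device=device)
    x = torch.randn(4, 1024, 4096, device=device, requires_grad=True)

    # pack/unpack are the stand-in compression hooks defined earlier.
    ctx = saved_tensors_hooks(pack, unpack) if use_hooks else contextlib.nullcontext()
    torch.cuda.reset_peak_memory_stats(device)
    start = time.perf_counter()
    with ctx:
        y = lin2(torch.relu(lin1(x)) ** 2)
        y.sum().backward()
    torch.cuda.synchronize(device)
    return torch.cuda.max_memory_allocated(device) / 2**20, time.perf_counter() - start

print("baseline   (MiB, s):", step(use_hooks=False))
print("compressed (MiB, s):", step(use_hooks=True))
```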

Thanks for the links, will check them out. I think for edge / CPU contexts specifically, there may be more room for speedups as you are more likely to be memory bound than compute bound. cc @syed-shakib7 who might be interested in this as well.

ved1beta added a commit to ved1beta/ao that referenced this issue Apr 18, 2025
@agrawal-aka
Contributor

Thanks @jcaip for clarifying how activation compression would be used.

Also, from my understanding, for weight sparsity the format creation is a one-time overhead before inference. But as you mentioned, there is development in progress for 2:4 activation sparsity; how is the format-creation overhead being handled in that case, given that it has to be done at runtime?

As you mentioned,

From my understanding, activation compression would be of minimal use during inference

Currently, I am inclined towards working on inference, especially for CPU use cases. Do let me know if there are any task items or preliminary PR ideas you have in mind from a weight/activation sparsity inference point of view. Would love to collaborate.
