[Tracker] TorchAO activation sparsity acceleration 🚀 #2095

Open

2 of 9 tasks

jcaip opened this issue Apr 22, 2025 · 0 comments

jcaip commented Apr 22, 2025

This is a tracker issue for all the different ways we can accelerate training / inference with activation sparsity in TorchAO.

Inference

Accelerate memory-bound bs=1 decode use cases with a selective weight loading kernel, like that described in TEAL / CATS.
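
For reference, here is a rough, dependency-free sketch of the selective weight loading idea for bs=1 decode. It assumes plain magnitude thresholding of the activation vector (the general approach in TEAL / CATS); the function names and the threshold value are hypothetical, and this is a pure-Python illustration of the math, not the kernel itself.

```python
def sparsify_by_magnitude(x, threshold):
    """Zero out activations whose magnitude is below `threshold` (assumed rule)."""
    return [v if abs(v) >= threshold else 0.0 for v in x]


def selective_gemv(weight, x):
    """Compute y = W @ x, reading only the columns of W where x is nonzero.

    `weight` is a list of rows with shape (out_features, in_features).
    """
    active = [j for j, v in enumerate(x) if v != 0.0]
    y = [sum(row[j] * x[j] for j in active) for row in weight]
    return y, active


def dense_gemv(weight, x):
    """Reference y = W @ x that reads every weight column."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


W = [[1.0, 2.0, 3.0, 4.0],
     [0.5, -1.0, 2.0, 0.0],
     [-2.0, 1.0, 0.0, 3.0]]
x = [0.05, -1.5, 0.01, 2.0]  # a single decode token (bs=1)

x_sparse = sparsify_by_magnitude(x, threshold=0.1)
y_selective, active_cols = selective_gemv(W, x_sparse)
y_dense = dense_gemv(W, x_sparse)
```

Only the weight columns in `active_cols` are read, which is where the memory-bound decode case would save bandwidth; the result matches the dense product on the thresholded activations.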

Accelerate compute-bound bs=n prefill use cases with 2:4 activation sparsity, as we outlined in https://arxiv.org/pdf/2503.16672
- Add fast fused sparsification + fp8 rowwise + srelu kernels ([WIP] 2:4 activation sparsity #2012)
- David apparently also has a Triton kernel that does this, so we should benchmark the two and see which one is faster.
- Add rowwise-fp8 + 2:4 sparse CUTLASS kernel (Add CUTLASS-based row-wise scaled sparse FP8 kernel #1671)
- Add performance tuning configs for above kernel (Add config selection for row-wise scaled FP8 sparse CUTLASS-based kernel #1940)
- Add transposed support to the rowwise-fp8 sparse CUTLASS kernel. The above kernel assumes that the weight is 2:4 sparse. Since 2:4 sparsity is only supported for the first operand, I'm using the identity $xW^T = (Wx^T)^T$ to reuse the kernel for activation sparsity, but this means the kernel's output comes out in column-major format instead of row-major.
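
To make the 2:4 pattern and the transpose identity concrete, here is a small pure-Python sketch. It assumes the usual 2:4 rule (keep the 2 largest-magnitude values in each contiguous group of 4) and uses hypothetical helper names. It does not model fp8 scaling, srelu, or which operand slot the CUTLASS kernel treats as sparse; it only checks the layout argument.

```python
def prune_2_4(row):
    """Keep the 2 largest-magnitude values in every group of 4, zero the rest."""
    assert len(row) % 4 == 0
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


def matmul(a, b):
    """Plain row-major matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


# Activations: (tokens, in_features). Weights: (out_features, in_features).
x = [[0.1, -3.0, 2.0, 0.5, 1.0, 0.0, -0.2, 4.0],
     [5.0, 0.3, -0.4, 6.0, -1.0, 2.0, 0.1, 0.0]]
W = [[1.0, 0.0, 2.0, -1.0, 0.5, 1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0, 3.0, 2.0, 0.0, 1.0, -1.0],
     [2.0, -2.0, 0.0, 1.0, 0.0, 1.0, 3.0, 0.5]]

x_24 = [prune_2_4(row) for row in x]

y = matmul(x_24, transpose(W))   # x W^T, shape (tokens, out_features)
z = matmul(W, transpose(x_24))   # W x^T, shape (out_features, tokens)

# Flattening z row by row gives the same sequence as walking y column by column.
z_row_major = [v for row in z for v in row]
y_col_major = [y[i][j] for j in range(len(y[0])) for i in range(len(y))]
```

Here `transpose(z)` equals `y`, and `z_row_major` equals `y_col_major`: computing $Wx^T$ produces the right numbers, but in memory they are laid out as the column-major version of $xW^T$, which is the layout mismatch described in the last item.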

Training

Activation compression to accelerate 2:4 sparse training (#1920 activation sparsity + compression #2076) has an implementation that I need to benchmark / review.
Implement custom sparse training kernels outlined in our ICLR paper. Lower priority for now.

jcaip self-assigned this
