Add CUTLASS-based row-wise scaled sparse FP8 kernel #1671


Conversation

alexsamardzic
Collaborator

No description provided.


pytorch-bot bot commented Feb 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1671

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 75a6195 with merge base 711fa08:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed label Feb 5, 2025
@alexsamardzic
Collaborator Author

alexsamardzic commented Feb 5, 2025

The kernel is ready and passes a smoke test.

Remaining tasks:

  • Write a converter to SM90 sparse semi-structured format
  • Validate the kernel on proper test inputs
  • Write the benchmark
  • Write Python-side code: sparsify/quantize method, Llama generator extension, etc.
  • Ensure that the kernel is built with SM90a flags when torchao detects an H100 card as SM90
  • Further unify CUDA code with rowwise_scaled_linear_cutlass code
  • Implement meaningful config selection

@cpuhrsch @drisspg

@cpuhrsch cpuhrsch requested a review from jcaip February 5, 2025 21:00
@alexsamardzic alexsamardzic added the float8, sparsity, and topic: new feature labels Feb 6, 2025
@alexsamardzic alexsamardzic force-pushed the rowwise-scaled-sparse-fp8-cutlass branch 2 times, most recently from 5bbcc49 to 6d34b7e Compare February 6, 2025 23:41
@alexsamardzic alexsamardzic force-pushed the rowwise-scaled-sparse-fp8-cutlass branch 8 times, most recently from bd7288a to f11fae4 Compare February 13, 2025 22:38
@alexsamardzic alexsamardzic force-pushed the rowwise-scaled-sparse-fp8-cutlass branch 10 times, most recently from bf65c83 to c0368e3 Compare February 19, 2025 23:33
@alexsamardzic
Collaborator Author

Testing this PR revealed that the sparse compressor in CUTLASS is not treating -0.0 values as zeros. The upstream fix is proposed here.
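For context, a minimal sketch of the kind of workaround used in the meantime (the names below are illustrative, not the actual torchao code): negative zeros are rewritten to positive zeros before the dense tensor is handed to the CUTLASS sparse compressor, applied to the high-precision tensor before the FP8 cast.

import torch

def normalize_negative_zeros(dense: torch.Tensor) -> torch.Tensor:
    # -0.0 == 0.0 compares equal under IEEE 754, so this rewrites both to +0.0
    # while leaving all non-zero elements untouched.
    return torch.where(dense == 0, torch.zeros_like(dense), dense)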

@alexsamardzic alexsamardzic force-pushed the rowwise-scaled-sparse-fp8-cutlass branch from c0368e3 to 4c63c65 Compare February 20, 2025 19:10
@alexsamardzic
Collaborator Author

alexsamardzic commented Feb 24, 2025

This PR is ready for review. It contains:

  1. An implementation of two new CUTLASS-based operators:
    • Converter to sparse format for FP8 data and SM9x arch, in torchao/csrc/cuda/to_sparse_semi_structured_cutlass_sm9x.
    • Row-wise scaled linear operator implementation for sparse FP8 weight and FP8 activation in torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass. For parallel compilation, each operator template instantiation is in a separate .cu file.
  2. The test for the latter operator in test/test_rowwise_scaled_linear_sparse_cutlass.py (not all tests will pass at the moment because of [QST] About NaNs generated during FP16->FP8 quantization #1766), and the micro-benchmark in benchmarks/benchmark_rowwise_scaled_linear_sparse_cutlass.py.
  3. The corresponding layout and TensorImpl class implementations in torchao/dtypes/floatx/cutlass_semi_sparse_layout.py. Because of a CUTLASS issue with handling negative zero values when compressing a dense tensor to sparse, the from_plain() method here contains a temporary workaround (the fix for this CUTLASS issue is in the works: Treat negative zero as equivalent to positive zero in sm90_sparse_gemm_compressor.hpp NVIDIA/cutlass#2110).
  4. The remaining glue code on the Python side in torchao/ops.py, torchao/dtypes/affine_quantized_tensor.py and torchao/quantization/quant_api.py, including the definition of the new Float8DynamicActivationFloat8SemiSparseWeightConfig config for the quantize_() method (a usage sketch follows this list).
  5. An update to torchao/_models/llama/generate.py script, to make it possible to test the new quantization and linear operator within the context of Llama - run with python generate.py --compile --sparsity semi -q float8dq.
  6. Some minor updates for CUTLASS-based integer W4A4/W4A8 stuff.
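For illustration, a minimal usage sketch for item 4, assuming the API shape described above (the exact import paths and constructor arguments may differ from the final code):

import torch
from torchao.quantization import quantize_
from torchao.quantization.quant_api import (
    Float8DynamicActivationFloat8SemiSparseWeightConfig,
)

# A toy model with a single linear layer; the config sparsifies/quantizes the
# weight to 2:4-sparse FP8 with row-wise scales, while activations are
# dynamically quantized to FP8 at runtime.
model = torch.nn.Sequential(torch.nn.Linear(4096, 4096, bias=False)).cuda().bfloat16()
quantize_(model, Float8DynamicActivationFloat8SemiSparseWeightConfig())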

I'll address performance tuning (through CUTLASS run-time config selection), which is mentioned as a remaining task above, in a separate PR.

@drisspg The setup.py changes are about activating gencode flags for SM90a when the build is for SM90 - it's clumsy, but it works, so hopefully we can use this approach until we eventually switch to CMake builds for the extensions. I'm adding you as a reviewer because of this; also, please add whoever may be the most appropriate reviewer(s) for the Python side of the code.
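For reference, a rough sketch of the idea (illustrative only - the extension name and source list below are placeholders, not the actual setup.py): the CUTLASS SM90+ sources are built as their own extension so that the SM90a gencode flag can be applied to them alone.

from torch.utils.cpp_extension import CUDAExtension

cutlass_90a_sources = []  # the SM90+ CUTLASS .cu files, elided here

cutlass_90a_extension = CUDAExtension(
    name="torchao._C_cutlass_90a",  # hypothetical extension name
    sources=cutlass_90a_sources,
    extra_compile_args={
        "cxx": ["-O3"],
        # compute_90a exposes the SM90a-only features (e.g. the wgmma/TMA paths)
        # that the CUTLASS kernels rely on; plain sm_90 is not enough.
        "nvcc": ["-O3", "-gencode=arch=compute_90a,code=sm_90a"],
    },
)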

@jcaip If you think there is a need, we may discuss eventually exposing the new operators mentioned above through SparseSemiStructuredTensor.

@gau-nernst With this PR, it's possible to try the CUTLASS-based W4A4 operator from the Llama generator - run with python generate.py --compile --sparsity semi -q int4dq-4 (be sure to fetch the model beforehand - instructions are here). The output is not meaningful - maybe the quantization is just too tight - but we may want to investigate it further.

@alexsamardzic alexsamardzic force-pushed the rowwise-scaled-sparse-fp8-cutlass branch 5 times, most recently from d1d96f7 to 4eaece8 Compare February 27, 2025 16:02
Contributor

@jcaip jcaip left a comment

cc @alexsamardzic Took a first pass, but looks good so far :)

Did you have any numbers for generate.py? I can try grabbing some if not.

"-DCUTLASS_DEBUG_TRACE_LEVEL=0",
"--use_fast_math",
"--ftemplate-backtrace-limit=0",
# "--keep",
Contributor

What do these lines do?

Collaborator Author

The NDEBUG define should always be there for a non-debug build; I'm moving it up, among the other general nvcc flags. The rest of the -D defines are what CUTLASS itself uses for compilation; these won't affect non-CUTLASS .cu files. The --ftemplate-backtrace-limit=0 will print the full list of template instantiations in case of a compile error (the default is to print the first 5 and the last 5); as CUTLASS is a heavily templated library, this is really needed to understand and fix these errors. The --use_fast_math flag is not really needed and should not be there; I'm removing it now, as it affects non-CUTLASS .cu files too. The commented-out options are sometimes useful for developers to activate.

(I hope #1659 eventually gets merged, as all of these flags would be easier to handle with CMake. With CUDAExtension, it seems there needs to be a new extension whenever build flags differ, and for this reason I have a separate extension for CUTLASS-based SM90+ kernels. It would be good to have another one for the other CUTLASS-based kernels, to apply the flags discussed here to these only; but having more extensions slows down the build.)

// performance is really bad; on the other side, using
// KernelTmaWarpSpecializedPingpongFP8FastAccum doesn't seem to
// affect the precision much - thus, sticking with it.
using KernelSchedule =
Contributor

How much of a performance degradation is slow accumulation? Is it about the same 2x we see for the non-sparse version?

We have some recipes where this makes a difference for the final accuracy.

Collaborator Author

For Llama generate.py, it's indeed close to 2x; for the benchmark (benchmarks/benchmark_rowwise_scaled_linear_sparse_cutlass.py) of the operator only, it's even worse - up to 3x for some shapes.

I was thinking about making use_fast_accum an optional argument for the operator. However, that would mean eventually having it as a template argument with two options, which would double the number of templates instantiated during compilation and thus roughly double the compilation time. So I think it's better to keep it as is for now; it could be added if a need arises. (For the same reason, the number of options for scales/bias/output data types in the .cu files is kept to a minimum - this is stated explicitly in a comment in the .cuh file.)

@@ -110,8 +108,7 @@ def from_plain(
_layout: Layout,
):
assert zero_point is None or torch.all(zero_point == 0)

int_data_s4 = ((int_data[:, 1::2] & 0xF) << 4) | (int_data[:, 0::2] & 0xF)
int_data_s4 = ((int_data[..., 1::2] & 0xF) << 4) | (int_data[..., 0::2] & 0xF)
Contributor

just curious here, what's the difference between ... and :?

Collaborator Author

You need to look at 3D and higher-dimensional tensors to spot the difference: for example, for a 3D tensor x, x[..., -1] is x[:, :, -1], while x[:, -1] is x[:, -1, :].
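A quick demonstration (just an illustrative snippet, not code from this PR):

import torch

x = torch.arange(24).reshape(2, 3, 4)

# "..." expands to all remaining leading dimensions, so the index always
# applies to the last dimension, regardless of the tensor's rank:
assert torch.equal(x[..., -1], x[:, :, -1])  # shape (2, 3)

# A single ":" skips exactly one dimension, so here -1 indexes dimension 1:
assert torch.equal(x[:, -1], x[:, -1, :])    # shape (2, 4)

# This is presumably why the packing code above uses "...": it keeps the
# 4-bit packing along the last dimension correct for tensors of any rank.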

@alexsamardzic alexsamardzic force-pushed the rowwise-scaled-sparse-fp8-cutlass branch from 4eaece8 to e55367e Compare February 28, 2025 11:48
@alexsamardzic
Collaborator Author

Did you have any numbers for generate.py? I can try grabbing some if not.

The numbers are nothing to advertise at the moment: the baseline generate.py --compile gives around 180 tokens/sec on the H100 that I used for testing, while with this kernel generate.py --compile --sparsity semi -q float8dq produces just around 120 tokens/sec. But the run-time configs are not tuned at all for small batch sizes - that's the task I mentioned in other comments that I intend to take on next (note also that CUTLASS-based kernels are not a good fit for dynamic quantization at the moment, as they cannot be fused). On the other hand, the benchmark script shows a speedup over BF16/BF16 MM of up to 2x for larger shapes.

@alexsamardzic alexsamardzic force-pushed the rowwise-scaled-sparse-fp8-cutlass branch 3 times, most recently from e276d04 to 284fc37 Compare March 3, 2025 17:56
Contributor

@jcaip jcaip left a comment

lgtm, feel free to merge

@alexsamardzic alexsamardzic force-pushed the rowwise-scaled-sparse-fp8-cutlass branch from 284fc37 to 75a6195 Compare March 11, 2025 21:22
@alexsamardzic
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


vkuzo added a commit that referenced this pull request Mar 17, 2025
Summary:

This test was added by #1671.

This test doesn't pass on ROCm, skip it to unbreak CI and we can fix it
later

Test Plan: CI

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: c102e19
ghstack-comment-id: 2729651455
Pull Request resolved: #1906
@alexsamardzic alexsamardzic deleted the rowwise-scaled-sparse-fp8-cutlass branch March 18, 2025 21:18