Commit 85ec209

Add readme doc for experimental

Differential Revision: D64477890
Pull Request resolved: #1130

1 file changed: +50 −0

torchao/experimental/docs/readme.md

# TorchAO experimental

TorchAO experimental contains lowbit ARM CPU and Metal kernels for linear and embedding ops.

## Building ARM CPU kernels

To build torch ops that use the lowbit kernels, run `sh build_torchao_ops.sh <aten|executorch>` from torchao/experimental.

For example, to build the ATen ops, run `sh build_torchao_ops.sh aten` (this requires PyTorch). Similarly, to build the ExecuTorch ops, run `sh build_torchao_ops.sh executorch` (this requires ExecuTorch).
After running the script, the op libraries will be in:

```
cmake-out/lib/libtorchao_ops_aten.{dylib|so}  # ATen op library
cmake-out/lib/libtorchao_ops_executorch.a     # ExecuTorch op library
```
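Putting the build step together, a minimal session might look like this (a sketch; run from torchao/experimental, and note that the library suffix depends on your platform):

```bash
# Build the ATen ops (requires PyTorch)
sh build_torchao_ops.sh aten

# The built op library should now be under cmake-out/lib
ls cmake-out/lib
```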
## Quantizing models

Once the ATen ops are built, you can quantize PyTorch models with them. The quantized models can be run in eager mode, compiled, used with AOTI, or exported. The exported models can be lowered to ExecuTorch.
```python
import torch

# Load the ATen op library built above; make sure this path is correct on your machine
torch.ops.load_library("cmake-out/lib/libtorchao_ops_aten.dylib")

from torchao.experimental.quant_api import (
    Int8DynActIntxWeightLinearQuantizer,
    IntxWeightEmbeddingQuantizer,
)

my_model = Model()  # placeholder: replace with your own model

embedding_quantizer = IntxWeightEmbeddingQuantizer(
    device="cpu",
    precision=torch.float32,
    bitwidth=2,  # bitwidth to quantize embedding weights to (values 1-7 are supported)
    groupsize=32,  # groupsize for embedding weights (any multiple of 32 is supported)
)
quantized_model = embedding_quantizer.quantize(my_model)

linear_quantizer = Int8DynActIntxWeightLinearQuantizer(
    device="cpu",
    precision=torch.float32,
    bitwidth=4,  # bitwidth to quantize linear weights to (values 1-7 are supported)
    groupsize=256,  # groupsize for quantization (any multiple of 16 is supported)
    has_weight_zeros=False,  # whether to quantize weights with scales and zeros, or scales-only
)
quantized_model = linear_quantizer.quantize(quantized_model)
```
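As a quick sanity check, a minimal sketch along these lines exercises the quantized model in eager mode, compiled, and exported form (the input shape below is an assumption for illustration and must match your model's forward signature):

```python
example_input = torch.randn(1, 1024)  # assumed shape, for illustration only

# Eager mode
with torch.no_grad():
    eager_out = quantized_model(example_input)

# Compiled
compiled_model = torch.compile(quantized_model)
with torch.no_grad():
    compiled_out = compiled_model(example_input)

# Exported; the resulting ExportedProgram can then be lowered to ExecuTorch
exported = torch.export.export(quantized_model, (example_input,))
```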
If you get stuck on the above steps, working examples for both linear and embedding are in torchao/experimental/tests/test_linear_8bit_act_xbit_weight_quantizer.py and torchao/experimental/tests/test_embedding_xbit_quantizer.py. For example, running `python tests/test_linear_8bit_act_xbit_weight_quantizer.py` loads the ops, creates a toy model, quantizes it, runs it in eager mode, compiles it, runs it with AOTI, and exports it.

## Available in torchchat

TorchAO experimental kernels are [available in torchchat](https://github.com/pytorch/torchchat/blob/main/docs/quantization.md#experimental-torchao-lowbit-kernels), PyTorch's solution for running LLMs locally. The torchchat integration uses steps similar to those above.
