Commit cdefd50

[doc] add install tips (vllm-project#17373)

Authored by reidliu41, committed by Mu Huai
1 parent 4a8338f

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>

5 files changed: +29 −10 lines changed

docs/source/features/quantization/fp8.md

+7 −7

@@ -44,6 +44,12 @@ To produce performant FP8 quantized models with vLLM, you'll need to install the
 pip install llmcompressor
 ```
 
+Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:
+
+```console
+pip install vllm lm-eval==0.4.4
+```
+
 ## Quantization Process
 
 The quantization process involves three main steps:
@@ -86,20 +92,14 @@ recipe = QuantizationModifier(
 # Apply the quantization algorithm.
 oneshot(model=model, recipe=recipe)
 
-# Save the model.
+# Save the model: Meta-Llama-3-8B-Instruct-FP8-Dynamic
 SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
 model.save_pretrained(SAVE_DIR)
 tokenizer.save_pretrained(SAVE_DIR)
 ```
 
 ### 3. Evaluating Accuracy
 
-Install `vllm` and `lm-evaluation-harness`:
-
-```console
-pip install vllm lm-eval==0.4.4
-```
-
 Load and run the model in `vllm`:
 
 ```python
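The comment added above pins down the directory the save step writes. A minimal sketch of that derivation, assuming the `MODEL_ID` used in the fp8 guide (`meta-llama/Meta-Llama-3-8B-Instruct`; stated here as an assumption, not part of the diff):

```python
# Sketch: how SAVE_DIR in the doc snippet expands to the name in the new comment.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed from the fp8 guide

# split("/")[1] keeps only the repo name, dropping the org prefix.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
print(SAVE_DIR)  # -> Meta-Llama-3-8B-Instruct-FP8-Dynamic
```

The same pattern yields the directory names called out in the int4, int8, KV-cache, and Quark diffs below.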

docs/source/features/quantization/int4.md

+7 −1

@@ -18,6 +18,12 @@ To use INT4 quantization with vLLM, you'll need to install the [llm-compressor](
 pip install llmcompressor
 ```
 
+Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:
+
+```console
+pip install vllm lm-eval==0.4.4
+```
+
 ## Quantization Process
 
 The quantization process involves four main steps:
@@ -87,7 +93,7 @@ oneshot(
     num_calibration_samples=NUM_CALIBRATION_SAMPLES,
 )
 
-# Save the compressed model
+# Save the compressed model: Meta-Llama-3-8B-Instruct-W4A16-G128
 SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128"
 model.save_pretrained(SAVE_DIR, save_compressed=True)
 tokenizer.save_pretrained(SAVE_DIR)
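Since the install tip now brings in `vllm` up front, a quick smoke test of the saved checkpoint can follow the save step directly; a sketch, assuming the local `SAVE_DIR` named in the new comment:

```python
# Sketch: reload the compressed W4A16 checkpoint in vLLM and generate once.
from vllm import LLM

llm = LLM(model="Meta-Llama-3-8B-Instruct-W4A16-G128")  # local SAVE_DIR
outputs = llm.generate("Hello, my name is")
print(outputs[0].outputs[0].text)
```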

docs/source/features/quantization/int8.md

+7 −1

@@ -19,6 +19,12 @@ To use INT8 quantization with vLLM, you'll need to install the [llm-compressor](
 pip install llmcompressor
 ```
 
+Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:
+
+```console
+pip install vllm lm-eval==0.4.4
+```
+
 ## Quantization Process
 
 The quantization process involves four main steps:
@@ -91,7 +97,7 @@ oneshot(
     num_calibration_samples=NUM_CALIBRATION_SAMPLES,
 )
 
-# Save the compressed model
+# Save the compressed model: Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token
 SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token"
 model.save_pretrained(SAVE_DIR, save_compressed=True)
 tokenizer.save_pretrained(SAVE_DIR)
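The pinned `lm-eval==0.4.4` can also be driven from Python rather than the CLI; a sketch using its `simple_evaluate` entry point (the task choice and `limit` are illustrative, not from the diff):

```python
# Sketch: score the saved INT8 checkpoint on a small GSM8K subset.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token",
    tasks=["gsm8k"],
    num_fewshot=5,
    limit=250,  # subset only; a sanity check, not a full eval
)
print(results["results"]["gsm8k"])
```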

docs/source/features/quantization/quantized_kvcache.md

+1 −1

@@ -126,7 +126,7 @@ oneshot(
     num_calibration_samples=NUM_CALIBRATION_SAMPLES,
 )
 
-# Save quantized model
+# Save quantized model: Llama-3.1-8B-Instruct-FP8-KV
 SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-KV"
 model.save_pretrained(SAVE_DIR, save_compressed=True)
 tokenizer.save_pretrained(SAVE_DIR)
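For the KV-cache checkpoint, the saved directory is meant to be served with an FP8 KV cache; a sketch, assuming the `SAVE_DIR` named in the new comment:

```python
# Sketch: run the FP8-KV checkpoint with an FP8 KV cache in vLLM.
from vllm import LLM

llm = LLM(model="Llama-3.1-8B-Instruct-FP8-KV", kv_cache_dtype="fp8")
print(llm.generate("The capital of France is")[0].outputs[0].text)
```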

docs/source/features/quantization/quark.md

+7 −0

@@ -19,6 +19,12 @@ pip install amd-quark
 You can refer to [Quark installation guide](https://quark.docs.amd.com/latest/install.html)
 for more installation details.
 
+Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:
+
+```console
+pip install vllm lm-eval==0.4.4
+```
+
 ## Quantization Process
 
 After installing Quark, we will use an example to illustrate how to use Quark.
@@ -150,6 +156,7 @@ LLAMA_KV_CACHE_GROUP = ["*k_proj", "*v_proj"]
 export_config = ExporterConfig(json_export_config=JsonExporterConfig())
 export_config.json_export_config.kv_cache_group = LLAMA_KV_CACHE_GROUP
 
+# Model: Llama-2-70b-chat-hf-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant
 EXPORT_DIR = MODEL_ID.split("/")[1] + "-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant"
 exporter = ModelExporter(config=export_config, export_dir=EXPORT_DIR)
 with torch.no_grad():
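The Quark-exported checkpoint is likewise loadable back into vLLM for evaluation; a sketch, assuming the `EXPORT_DIR` named in the new comment and vLLM's Quark loader (`quantization="quark"` and the `tensor_parallel_size` value are assumptions for a 70B model, not part of the diff):

```python
# Sketch: load the Quark-exported FP8 checkpoint back into vLLM.
from vllm import LLM

llm = LLM(
    model="Llama-2-70b-chat-hf-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant",
    kv_cache_dtype="fp8",
    quantization="quark",     # assumed: vLLM's Quark checkpoint loader
    tensor_parallel_size=4,   # illustrative; size to your GPUs
)
```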
