Commit 0b7e701: [Docs] Update optimization.md doc (#17482)
Signed-off-by: mgoin <mgoin64@gmail.com>
1 parent: 947f2f5

1 file changed: docs/source/performance/optimization.md (+155 −32 lines)

# Optimization and Tuning

This guide covers optimization strategies and performance tuning for vLLM V1.

## Preemption

Due to the auto-regressive nature of the transformer architecture, there are times when KV cache space is insufficient to handle all batched requests.
In such cases, vLLM can preempt requests to free up KV cache space for other requests. Preempted requests are recomputed when sufficient KV cache space becomes
available again. When this occurs, you may see the following warning:

```text
WARNING 05-09 00:49:33 scheduler.py:1057 Sequence group 0 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_cumulative_preemption_cnt=1
```

While this mechanism ensures system robustness, preemption and recomputation can adversely affect end-to-end latency.
If you frequently encounter preemptions, consider the following actions (see the example after this list):

- Increase `gpu_memory_utilization`. vLLM pre-allocates GPU cache using this percentage of memory. By increasing utilization, you can provide more KV cache space.
- Decrease `max_num_seqs` or `max_num_batched_tokens`. This reduces the number of concurrent requests in a batch, thereby requiring less KV cache space.
- Increase `tensor_parallel_size`. This shards model weights across GPUs, allowing each GPU to have more memory available for KV cache. However, increasing this value may cause excessive synchronization overhead.
- Increase `pipeline_parallel_size`. This distributes model layers across GPUs, reducing the memory needed for model weights on each GPU, indirectly leaving more memory available for KV cache. However, increasing this value may cause latency penalties.
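
For example, here is a minimal sketch that combines several of these knobs in the offline `LLM` API; the model name and values are illustrative, not tuned recommendations:

```python
from vllm import LLM

# Illustrative values only -- tune for your own model and hardware.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    gpu_memory_utilization=0.95,   # reserve more GPU memory for the KV cache (default is 0.9)
    max_num_seqs=128,              # cap the number of concurrent sequences per batch
    max_num_batched_tokens=8192,   # cap the number of tokens scheduled per step
)
```
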
You can monitor the number of preemption requests through Prometheus metrics exposed by vLLM. Additionally, you can log the cumulative number of preemption requests by setting `disable_log_stats=False`.
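
For example, a minimal sketch of the latter (the model name is illustrative; `disable_log_stats` is forwarded to the engine arguments):

```python
from vllm import LLM

# Keep engine stats logging enabled so cumulative preemption counts show up in the logs.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", disable_log_stats=False)
```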

In vLLM V1, the default preemption mode is `RECOMPUTE` rather than `SWAP`, as recomputation has lower overhead in the V1 architecture.

(chunked-prefill)=

## Chunked Prefill

Chunked prefill allows vLLM to process large prefills in smaller chunks and batch them together with decode requests. This feature helps improve both throughput and latency by better balancing compute-bound (prefill) and memory-bound (decode) operations.

In vLLM V1, **chunked prefill is always enabled by default**. This is different from vLLM V0, where it was conditionally enabled based on model characteristics.

With chunked prefill enabled, the scheduling policy prioritizes decode requests. It batches all pending decode requests before scheduling any prefill operations. When there are available tokens in the `max_num_batched_tokens` budget, it schedules pending prefills. If a pending prefill request cannot fit into `max_num_batched_tokens`, it automatically chunks it.

This policy has two benefits:

- It improves ITL and generation decode because decode requests are prioritized.
- It helps achieve better GPU utilization by batching compute-bound (prefill) and memory-bound (decode) requests together.

### Performance Tuning with Chunked Prefill

You can tune the performance by adjusting `max_num_batched_tokens`:

- Smaller values (e.g., 2048) achieve better inter-token latency (ITL) because there are fewer prefills slowing down decodes.
- Higher values achieve better time to first token (TTFT) as you can process more prefill tokens in a batch.
- For optimal throughput, we recommend setting `max_num_batched_tokens > 8096`, especially for smaller models on large GPUs.
- If `max_num_batched_tokens` is the same as `max_model_len`, that's almost equivalent to the V0 default scheduling policy (except that it still prioritizes decodes).

```python
from vllm import LLM

# Set max_num_batched_tokens to tune performance
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_num_batched_tokens=16384)
```

See related papers for more details (<https://arxiv.org/pdf/2401.08671> or <https://arxiv.org/pdf/2308.16369>).

## Parallelism Strategies

vLLM supports multiple parallelism strategies that can be combined to optimize performance across different hardware configurations.

### Tensor Parallelism (TP)

Tensor parallelism shards model parameters across multiple GPUs within each model layer. This is the most common strategy for large model inference within a single node.

**When to use:**

- When the model is too large to fit on a single GPU
- When you need to reduce memory pressure per GPU to allow more KV cache space for higher throughput

```python
from vllm import LLM

# Split model across 4 GPUs
llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct", tensor_parallel_size=4)
```

For models that are too large to fit on a single GPU (like 70B parameter models), tensor parallelism is essential.

### Pipeline Parallelism (PP)

Pipeline parallelism distributes model layers across multiple GPUs. Each GPU processes different parts of the model in sequence.

**When to use:**

- When you've already maxed out efficient tensor parallelism but need to distribute the model further, or across nodes
- For very deep and narrow models where layer distribution is more efficient than tensor sharding

Pipeline parallelism can be combined with tensor parallelism for very large models:

```python
from vllm import LLM

# Combine pipeline and tensor parallelism
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=4,
    pipeline_parallel_size=2
)
```

### Expert Parallelism (EP)

Expert parallelism is a specialized form of parallelism for Mixture of Experts (MoE) models, where different expert networks are distributed across GPUs.

**When to use:**

- Specifically for MoE models (like DeepSeekV3, Qwen3MoE, Llama-4)
- When you want to balance the expert computation load across GPUs

Expert parallelism is enabled by setting `enable_expert_parallel=True`, which will use expert parallelism instead of tensor parallelism for MoE layers.
It uses the same degree of parallelism as you have set for tensor parallelism.
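
For example, a minimal sketch of enabling expert parallelism for an MoE model (the model name and parallel degree are illustrative):

```python
from vllm import LLM

# Use expert parallelism (instead of tensor parallelism) for the MoE layers.
# The expert-parallel degree matches tensor_parallel_size.
llm = LLM(
    model="Qwen/Qwen3-30B-A3B",  # example MoE model
    tensor_parallel_size=4,
    enable_expert_parallel=True,
)
```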

### Data Parallelism (DP)

Data parallelism replicates the entire model across multiple GPU sets and processes different batches of requests in parallel.

**When to use:**

- When you have enough GPUs to replicate the entire model
- When you need to scale throughput rather than model size
- In multi-user environments where isolation between request batches is beneficial

Data parallelism can be combined with the other parallelism strategies and is set by `data_parallel_size=N`.
Note that MoE layers will be sharded according to the product of the tensor parallel size and data parallel size.
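
For example, a minimal sketch that combines data parallelism with tensor parallelism, assuming 4 GPUs are available and that your vLLM version accepts `data_parallel_size` on the offline `LLM` entrypoint (the model name and sizes are illustrative):

```python
from vllm import LLM

# Two model replicas (data parallel), each sharded across 2 GPUs (tensor parallel).
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    tensor_parallel_size=2,
    data_parallel_size=2,
)
```
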
## Reducing Memory Usage

If you encounter out-of-memory issues, consider these strategies:

### Context Length and Batch Size

You can reduce memory usage by limiting the context length and batch size:

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=2048,  # Limit context window
    max_num_seqs=4       # Limit batch size
)
```

### Adjust CUDA Graph Compilation

CUDA graph compilation in V1 uses more memory than in V0. You can reduce memory usage by adjusting the compilation level:

```python
from vllm import LLM
from vllm.config import CompilationConfig, CompilationLevel

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    compilation_config=CompilationConfig(
        level=CompilationLevel.PIECEWISE,
        cudagraph_capture_sizes=[1, 2, 4, 8]  # Capture fewer batch sizes
    )
)
```

Or, if you are not concerned about latency or overall performance, disable CUDA graph compilation entirely with `enforce_eager=True`:

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enforce_eager=True  # Disable CUDA graph compilation
)
```

### Multimodal Models

For multi-modal models, you can reduce memory usage by limiting the number of images/videos per request:

```python
from vllm import LLM

# Accept up to 2 images per prompt
llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    limit_mm_per_prompt={"image": 2}
)
```
