
Commit a78473b

omer-dayan authored and DarkLight1337 committed
Add docs for runai_streamer_sharded (vllm-project#17093)
Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
1 parent 9a4bf1a commit a78473b

File tree

1 file changed: +26 -0 lines changed


docs/source/models/extensions/runai_model_streamer.md

@@ -51,3 +51,29 @@ vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer -
:::{note}
For further instructions about tunable parameters and additional parameters configurable through environment variables, read the [Environment Variables Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md).
:::

## Sharded Model Loading
vLLM also supports loading sharded models using Run:ai Model Streamer. This is particularly useful for large models that are split across multiple files. To enable it, pass the `--load-format runai_streamer_sharded` flag:

```console
vllm serve /path/to/sharded/model --load-format runai_streamer_sharded
```

The sharded loader expects model files to follow the same naming pattern as the regular sharded state loader: `model-rank-{rank}-part-{part}.safetensors`. You can customize this pattern using the `pattern` parameter in `--model-loader-extra-config`:

```console
vllm serve /path/to/sharded/model --load-format runai_streamer_sharded --model-loader-extra-config '{"pattern":"custom-model-rank-{rank}-part-{part}.safetensors"}'
```

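For example, instantiating the default pattern above for a checkpoint saved with two ranks and one part per rank (an illustrative layout, derived directly from the pattern) gives a directory containing:

```console
/path/to/sharded/model/model-rank-0-part-0.safetensors
/path/to/sharded/model/model-rank-1-part-0.safetensors
```
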
To create sharded model files, you can use the script provided in <gh-file:examples/offline_inference/save_sharded_state.py>. This script demonstrates how to save a model in the sharded format that is compatible with the Run:ai Model Streamer sharded loader.

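As a sketch, saving a sharded copy of a model might look like the following (the paths and argument values here are illustrative; check the script's `--help` output for its exact interface):

```console
python examples/offline_inference/save_sharded_state.py \
    --model /path/to/original/model \
    --tensor-parallel-size 4 \
    --output /path/to/sharded/model
```
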
The sharded loader supports all the same tunable parameters as the regular Run:ai Model Streamer, including `concurrency` and `memory_limit`. These can be configured in the same way:

```console
vllm serve /path/to/sharded/model --load-format runai_streamer_sharded --model-loader-extra-config '{"concurrency":16, "memory_limit":5368709120}'
```

Here `memory_limit` is specified in bytes; `5368709120` corresponds to 5 GiB.

:::{note}
The sharded loader is particularly efficient for tensor or pipeline parallel models, where each worker only needs to read its own shard rather than the entire checkpoint.
:::

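For instance, a two-way tensor parallel deployment could combine the sharded loader with vLLM's standard `--tensor-parallel-size` flag (an illustrative sketch; the checkpoint must have been saved with a matching number of ranks):

```console
vllm serve /path/to/sharded/model --load-format runai_streamer_sharded --tensor-parallel-size 2
```

In this setup, each of the two workers streams only the `model-rank-{rank}-part-{part}.safetensors` files for its own rank rather than the full checkpoint.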
