Commit 28007b0

Authored by: gshtras, njhill, Chenyaaang, gcalmettes, yangw-dev
Upstream merge 2025 04 25 (#524)
* [BugFix] Remove default multiproc executor `collective_rpc` timeout (vllm-project#17000) Signed-off-by: Nick Hill <nhill@redhat.com>
* [Core][V1][TPU] Enable structured decoding on TPU V1 (vllm-project#16499) Signed-off-by: Chenyaaang <chenyangli@google.com>
* [Bugfix] validate urls object for multimodal content parts (vllm-project#16990) Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>
* add Dockerfile build vllm against torch nightly (vllm-project#16936) Signed-off-by: Yang Wang <elainewy@meta.com>
* [Kernel][ROCM] Upstream prefix prefill speed up for vLLM V1 (vllm-project#13305) Signed-off-by: Sage Moore <sage@neuralmagic.com> Signed-off-by: root <root@banff-cyxtera-s73-5.ctr.dcgpu> Signed-off-by: Aleksandr Malyshev <maleksan@amd.com> Signed-off-by: root <root@banff-cyxtera-s65-4.amd.com> Signed-off-by: maleksan85 <maleksan@amd.com> Signed-off-by: <> Co-authored-by: Sage Moore <sage@neuralmagic.com> Co-authored-by: root <root@banff-cyxtera-s73-5.ctr.dcgpu> Co-authored-by: Aleksandr Malyshev <maleksan@amd.com> Co-authored-by: qli88 <qiang.li2@amd.com> Co-authored-by: root <root@banff-cyxtera-s65-4.amd.com>
* [V1][DP] More robust DP/EP dummy request coordination (vllm-project#16277) Signed-off-by: Nick Hill <nhill@redhat.com>
* [BugFix] Revert ROCm Custom Paged Attention Env Flag Check (vllm-project#17022) Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
* Revert "[Misc] Add S3 environment variables for better support of MinIO." (vllm-project#17021)
* [misc] tune some env vars for GB200 (vllm-project#16992) Signed-off-by: youkaichao <youkaichao@gmail.com>
* [INTEL-HPU][v0] Port delayed sampling to upstream (vllm-project#16949) Signed-off-by: Michal Adamczyk <michal.adamczyk@intel.com> Signed-off-by: Chendi Xue <chendi.xue@intel.com> Co-authored-by: Michal Adamczyk <madamczyk@habana.ai>
* [doc] add download path tips (vllm-project#17013) Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com>
* [Bugfix] Triton FA function takes no keyword arguments (vllm-project#16902) Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
* [V1] Avoid socket errors during shutdown when requests are in in-flight (vllm-project#16807) Signed-off-by: Nick Hill <nhill@redhat.com>
* [BugFix] llama4 fa3 fix - RuntimeError: scheduler_metadata must have shape (metadata_size) (vllm-project#16998) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
* [Misc] Improve readability of get_open_port function. (vllm-project#17024) Signed-off-by: gitover22 <qidizou88@gmail.com>
* [Bugfix] Fix AssertionError: skip_special_tokens=False is not supported for Mistral tokenizers (vllm-project#16964) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
* [CI] Run v1/test_serial_utils.py in CI (vllm-project#16996) Signed-off-by: Russell Bryant <rbryant@redhat.com>
* Mistral-format support for compressed-tensors (vllm-project#16803) Signed-off-by: mgoin <mgoin64@gmail.com>
* Categorize `tests/kernels/` based on kernel type (vllm-project#16799) Signed-off-by: mgoin <mgoin64@gmail.com>
* [Doc] Add top anchor and a note to quantization/bitblas.md (vllm-project#17042) Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
* Ensure that `pid` passed to `kill_process_tree` is `int` for `mypy` (vllm-project#17051) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [CI] Update structured-output label automation (vllm-project#17055) Signed-off-by: Russell Bryant <rbryant@redhat.com>
* Improve Transformers backend model loading QoL (vllm-project#17039) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* `CacheConfig.block_size` should always be `int` when used (vllm-project#17052) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* Use `@property` and private field for `data_parallel_rank_local` (vllm-project#17053) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Frontend] Support guidance:no-additional-properties for compatibility with xgrammar (vllm-project#15949) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
* [BugFix][V1] Fix int32 token index overflow when preparing input ids (vllm-project#16806)
* [V1][Spec Decode] Always use argmax for sampling draft tokens (vllm-project#16899) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
* [CI/Build] workaround for CI build failure (vllm-project#17070) Signed-off-by: csy1204 <josang1204@gmail.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>
* [Quantization]add prefix for commandA quantized model (vllm-project#17017)
* [Minor] Use larger batch sizes for A100/B100/B200/MI300x (vllm-project#17073) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
* [Bugfix] Enable V1 usage stats (vllm-project#16986) Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: Nick Hill <nhill@redhat.com> Co-authored-by: Nick Hill <nhill@redhat.com>
* More informative error when using Transformers backend (vllm-project#16988) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* Addendum Fix to support FIPS enabled machines with MD5 hashing (vllm-project#17043) Signed-off-by: sydarb <areebsyed237@gmail.com>
* [Bugfix][Core] add seq_id_to_seq_group clearing to avoid memory leak when s… (vllm-project#16472) Signed-off-by: 开哲 <kaizhe.zy@alibaba-inc.com> Co-authored-by: 开哲 <kaizhe.zy@alibaba-inc.com>
* [V1] Update structured output (vllm-project#16812) Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com>
* [doc] update to hyperlink (vllm-project#17096) Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com>
* Add docs for runai_streamer_sharded (vllm-project#17093) Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
* [Chore] Remove Sampler from Model Code (vllm-project#17084) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
* Disable enforce_eager for V1 TPU sampler and structured output tests (vllm-project#17016) Signed-off-by: mgoin <mgoin64@gmail.com>
* Simplify `TokenizerGroup` (vllm-project#16790) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* Fix OOT registration test (vllm-project#17099) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [V1][PP] Optimization: continue scheduling prefill chunks (vllm-project#17080) Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
* [Misc] Remove OLMo2 config copy (vllm-project#17066) Signed-off-by: Isotr0py <2037008807@qq.com>
* Improve static type checking in `LoRAModelRunnerMixin` (vllm-project#17104) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [V1][Structured Output] Clear xgrammar compiler object when engine core shut down to avoid nanobind leaked warning (vllm-project#16954) Signed-off-by: shen-shanshan <467638484@qq.com>
* [Frontend] Using matryoshka_dimensions control the allowed output dimensions. (vllm-project#16970)
* Add missing rocm_skinny_gemms kernel test to CI (vllm-project#17060) Signed-off-by: mgoin <mgoin64@gmail.com>
* [Misc] refactor example series - structured outputs (vllm-project#17040) Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com>
* [V1][Spec Decoding] Add num_drafts and num_accepted_tokens_per_position metrics (vllm-project#16665) Signed-off-by: Mark McLoughlin <markmc@redhat.com>
* [CI] Add automation for the `tool-calling` github label (vllm-project#17118) Signed-off-by: Russell Bryant <rbryant@redhat.com>
* Updating builkite job for IBM Power (vllm-project#17111) Signed-off-by: Aaruni Aggarwal <aaruniagg@gmail.com>
* existing torch installation pip command fix for docs (vllm-project#17059)
* Molmo Requirements (vllm-project#17026) Signed-off-by: Eyshika Agarwal <eyshikaengineer@gmail.com> Signed-off-by: eyshika <eyshikaengineer@gmail.com>
* Add `:markdownhelp:` to `EngineArgs` docs so markdown docstrings render properly (vllm-project#17124) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* Improve configs - `LoRAConfig` + `PromptAdapterConfig` (vllm-project#16980) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Docs] Generate correct github links for decorated functions (vllm-project#17125) Signed-off-by: Russell Bryant <rbryant@redhat.com>
* Add collective_rpc to llm engine (vllm-project#16999) Signed-off-by: Yinghai Lu <yinghai@thinkingmachines.ai>
* Add chat template for Llama 4 models (vllm-project#16428) Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
* [Misc] Add example to run DeepSeek with Ray Serve LLM (vllm-project#17134) Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
* Better error message for missing mistral params.json (vllm-project#17132) Signed-off-by: mgoin <mgoin64@gmail.com>
* Use custom address for listening socket (vllm-project#15988) Signed-off-by: Jens Glaser <glaserj@ornl.gov>
* [FEAT] [ROCm]: AITER Fused MOE V1 Support (vllm-project#16752) Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
* [Attention] FA3 decode perf improvement - single mma warp group support for head dim 128 (vllm-project#16864) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
* fix float16 support for kimi-vl (vllm-project#17156) Co-authored-by: zhouzaida <zhouzaida@msh.team>
* [Doc] V1 : Update LoRA status (vllm-project#17133) Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com> Co-authored-by: varun sundar rabindranath <vsundarr@redhat.com>
* [Docs] Fix True->true in supported_models.md (vllm-project#17141)
* Move missed `SchedulerConfig` args into scheduler config group in `EngineArgs` (vllm-project#17131) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Misc] Clean up redundant code in uniproc_executor.py (vllm-project#16762) Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>
* [Bugfix][Misc] Use TritonPlaceholderModule to defensively import triton (vllm-project#15099) Signed-off-by: Mengqing Cao <cmq0113@163.com>
* [Misc] Benchmark Serving Script Support Appending Results (vllm-project#17028) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
* [Perf]Optimize rotary_emb implementation to use Triton operator for improved inference performance (vllm-project#16457) Signed-off-by: cynthieye <yexin93@qq.com> Co-authored-by: MagnetoWang <magnetowang@outlook.com>
* [Bugfix] remove fallback in guided_json (int range, patterns) (vllm-project#16725) Signed-off-by: csy1204 <josang1204@gmail.com> Co-authored-by: 조상연[플레이스 AI] <sang-yeon.cho@navercorp.com>
* [Quantization][FP8] Add support for FP8 models with input_scale for output projection and QK quantization (vllm-project#15734) Signed-off-by: Randall Smith <Randall.Smith@amd.com> Signed-off-by: Luka Govedič <lgovedic@redhat.com> Co-authored-by: Luka Govedič <lgovedic@redhat.com>
* [Doc] Add headings to improve gptqmodel.md (vllm-project#17164) Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
* Only turn on FastIncrementalDetokenizer when tokenizers >= 0.21.1 (vllm-project#17158)
* [Doc] Add two links to disagg_prefill.md (vllm-project#17168) Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
* [Doc] Move todo out of beam search docstring (vllm-project#17183) Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
* [Bugfix] Fix mistral model tests (vllm-project#17181) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Bugfix] Fix Mistral ChatCompletionRequest Body Exception (vllm-project#16769) Signed-off-by: Jasmond Loh <Jasmond.Loh@hotmail.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
* Fix API typo and remove FP8 on V1 restriction

---------

Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Chenyaaang <chenyangli@google.com>
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>
Signed-off-by: Yang Wang <elainewy@meta.com>
Signed-off-by: Sage Moore <sage@neuralmagic.com>
Signed-off-by: root <root@banff-cyxtera-s73-5.ctr.dcgpu>
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Signed-off-by: root <root@banff-cyxtera-s65-4.amd.com>
Signed-off-by: maleksan85 <maleksan@amd.com>
Signed-off-by: <>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Michal Adamczyk <michal.adamczyk@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: reidliu41 <reid201711@gmail.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: gitover22 <qidizou88@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: csy1204 <josang1204@gmail.com>
Signed-off-by: sydarb <areebsyed237@gmail.com>
Signed-off-by: 开哲 <kaizhe.zy@alibaba-inc.com>
Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Aaruni Aggarwal <aaruniagg@gmail.com>
Signed-off-by: Eyshika Agarwal <eyshikaengineer@gmail.com>
Signed-off-by: eyshika <eyshikaengineer@gmail.com>
Signed-off-by: Yinghai Lu <yinghai@thinkingmachines.ai>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Jens Glaser <glaserj@ornl.gov>
Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com>
Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Signed-off-by: cynthieye <yexin93@qq.com>
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Jasmond Loh <Jasmond.Loh@hotmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Chenyaaang <42742451+Chenyaaang@users.noreply.github.com>
Co-authored-by: Guillaume Calmettes <gcalmettes@scaleway.com>
Co-authored-by: Yang Wang <elainewy@meta.com>
Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com>
Co-authored-by: Sage Moore <sage@neuralmagic.com>
Co-authored-by: root <root@banff-cyxtera-s73-5.ctr.dcgpu>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: qli88 <qiang.li2@amd.com>
Co-authored-by: root <root@banff-cyxtera-s65-4.amd.com>
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: Michal Adamczyk <madamczyk@habana.ai>
Co-authored-by: Reid <61492567+reidliu41@users.noreply.github.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: huafeng <qidizou88@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Michael Yao <haifeng.yao@daocloud.io>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Yong Hoon Shin <48474650+sarckk@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Sangyeon Cho <josang1204@gmail.com>
Co-authored-by: Chen Xia <cxia0209@gmail.com>
Co-authored-by: Areeb Syed <areebsyed237@gmail.com>
Co-authored-by: 张宇 <zhangyuygss@outlook.com>
Co-authored-by: 开哲 <kaizhe.zy@alibaba-inc.com>
Co-authored-by: omer-dayan <omdayan@nvidia.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: wang.yuqi <noooop@126.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Aaruni Aggarwal <47731267+AaruniAggarwal@users.noreply.github.com>
Co-authored-by: Atilla <48064466+atilla00@users.noreply.github.com>
Co-authored-by: Eyshika Agarwal <eyshikaengineer@gmail.com>
Co-authored-by: Yinghai Lu <yinghai@thinkingmachines.ai>
Co-authored-by: Maximilien de Bayser <mbayser@br.ibm.com>
Co-authored-by: jglaser <glaserj@ornl.gov>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>
Co-authored-by: zhouzaida <zhouzaida@msh.team>
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: varun sundar rabindranath <vsundarr@redhat.com>
Co-authored-by: Lifu Huang <lifu.hlf@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: yexin(叶鑫) <yexin93@qq.com>
Co-authored-by: MagnetoWang <magnetowang@outlook.com>
Co-authored-by: 조상연[플레이스 AI] <sang-yeon.cho@navercorp.com>
Co-authored-by: rasmith <Randall.Smith@amd.com>
Co-authored-by: Luka Govedič <lgovedic@redhat.com>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>
Co-authored-by: Alex Brooks <alex.brooks@ibm.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Jasmond L <120363110+JasmondL@users.noreply.github.com>
2 parents c3f61dd + a9e7a00 commit 28007b0

File tree

287 files changed: +4307 −4116 lines


.buildkite/lm-eval-harness/test_lm_eval_correctness.py (+1 −1)

@@ -16,7 +16,7 @@
 import pytest
 import yaml

-RTOL = 0.05
+RTOL = 0.08
 TEST_DATA_FILE = os.environ.get(
     "LM_EVAL_TEST_DATA_FILE",
     ".buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml")
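The change above loosens the relative tolerance for lm-eval score comparisons from 5% to 8%. As a rough sketch of what a relative-tolerance check does (the helper and the scores below are made up, not taken from the test suite):

```python
import math

RTOL = 0.08  # mirrors the relaxed tolerance in the diff above

def within_tolerance(measured: float, expected: float, rtol: float = RTOL) -> bool:
    """True if measured is within rtol of expected, relative to the larger value."""
    return math.isclose(measured, expected, rel_tol=rtol)

print(within_tolerance(0.74, 0.79))  # drift of about 6.3%: passes at 8%
print(within_tolerance(0.70, 0.79))  # drift of about 11.4%: fails
```

A score that would have failed the old 5% bound (like the first example) now passes, which is the point of relaxing RTOL for flaky accuracy checks.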

.buildkite/scripts/hardware_ci/run-cpu-test-ppc64le.sh (+11 −4)

@@ -5,25 +5,30 @@
 set -ex

 # Setup cleanup
-remove_docker_container() { podman rm -f cpu-test-ubi9-ppc || true; podman system prune -f; }
+remove_docker_container() {
+  if [[ -n "$container_id" ]]; then
+    podman rm -f "$container_id" || true
+  fi
+  podman system prune -f
+}
 trap remove_docker_container EXIT
 remove_docker_container

 # Try building the docker image
 podman build -t cpu-test-ubi9-ppc -f docker/Dockerfile.ppc64le .

 # Run the image
-podman run -itd --entrypoint /bin/bash -v /tmp/:/root/.cache/huggingface --privileged=true --network host -e HF_TOKEN --name cpu-test-ubi9-ppc cpu-test-ubi9-ppc
+container_id=$(podman run -itd --entrypoint /bin/bash -v /tmp/:/root/.cache/huggingface --privileged=true --network host -e HF_TOKEN cpu-test-ubi9-ppc)

 function cpu_tests() {

   # offline inference
-  podman exec cpu-test-ubi9-ppc bash -c "
+  podman exec -it "$container_id" bash -c "
     set -e
     python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m"

   # Run basic model test
-  podman exec cpu-test-ubi9-ppc bash -c "
+  podman exec -it "$container_id" bash -c "
     set -e
     pip install pytest pytest-asyncio einops peft Pillow soundfile transformers_stream_generator matplotlib
     pip install sentence-transformers datamodel_code_generator
@@ -33,6 +38,8 @@ function cpu_tests() {
 }

 # All of CPU tests are expected to be finished less than 40 mins.
+
+export container_id
 export -f cpu_tests
 timeout 40m bash -c cpu_tests

.buildkite/scripts/hardware_ci/run-tpu-v1-test.sh (+4 −1)

@@ -19,6 +19,7 @@ docker run --privileged --net host --shm-size=16G -it \
 vllm-tpu /bin/bash -c "python3 -m pip install git+https://github.com/thuml/depyf.git \
 && python3 -m pip install pytest pytest-asyncio tpu-info \
 && python3 -m pip install lm_eval[api]==0.4.4 \
+&& export VLLM_XLA_CACHE_PATH= \
 && export VLLM_USE_V1=1 \
 && export VLLM_XLA_CHECK_RECOMPILATION=1 \
 && echo HARDWARE \
@@ -44,7 +45,9 @@ docker run --privileged --net host --shm-size=16G -it \
 && echo TEST_9 \
 && pytest -s -v /workspace/vllm/tests/v1/tpu/test_multimodal.py \
 && echo TEST_10 \
-&& pytest -s -v /workspace/vllm/tests/v1/tpu/test_pallas.py" \
+&& pytest -s -v /workspace/vllm/tests/v1/tpu/test_pallas.py \
+&& echo TEST_11 \
+&& pytest -s -v /workspace/vllm/tests/v1/entrypoints/llm/test_struct_output_generate.py" \


 # TODO: This test fails because it uses RANDOM_SEED sampling

.buildkite/test-pipeline.yaml (+39 −3)

@@ -8,6 +8,7 @@
 # Documentation
 # label(str): the name of the test. emoji allowed.
 # fast_check(bool): whether to run this on each commit on fastcheck pipeline.
+# torch_nightly(bool): whether to run this on vllm against torch nightly pipeline.
 # fast_check_only(bool): run this test on fastcheck pipeline only
 # optional(bool): never run this test by default (i.e. need to unblock manually) unless it's scheduled nightly run.
 # command(str): the single command to run for tests. incompatible with commands.
@@ -70,6 +71,7 @@ steps:
 - label: Basic Correctness Test # 30min
   #mirror_hardwares: [amd]
   fast_check: true
+  torch_nightly: true
   source_file_dependencies:
   - vllm/
   - tests/basic_correctness/test_basic_correctness
@@ -106,6 +108,7 @@ steps:
 - label: Entrypoints Test # 40min
   working_dir: "/vllm-workspace/tests"
   fast_check: true
+  torch_nightly: true
   #mirror_hardwares: [amd]
   amd_gpus: 2 # Just for the sake of queue testing
   source_file_dependencies:
@@ -210,6 +213,7 @@ steps:
   - pytest -v -s v1/worker
   - pytest -v -s v1/structured_output
   - pytest -v -s v1/spec_decode
+  - pytest -v -s v1/test_serial_utils.py
   - pytest -v -s v1/test_stats.py
   - pytest -v -s v1/test_utils.py
   - pytest -v -s v1/test_oracle.py
@@ -327,11 +331,43 @@ steps:
   amd_gpus: 8
   source_file_dependencies:
   - csrc/
+  - tests/kernels/core
+  commands:
+  - pytest -v -s kernels/core
+
+- label: Kernels Attention Test %N
+  source_file_dependencies:
+  - csrc/attention/
   - vllm/attention
-  - tests/kernels
+  - vllm/v1/attention
+  - tests/kernels/attention
   commands:
-  - pytest -v -s kernels --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
-  parallelism: 4
+  - pytest -v -s kernels/attention --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
+  parallelism: 2
+
+- label: Kernels Quantization Test %N
+  source_file_dependencies:
+  - csrc/quantization/
+  - vllm/model_executor/layers/quantization
+  - tests/kernels/quantization
+  commands:
+  - pytest -v -s kernels/quantization --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
+  parallelism: 2
+
+- label: Kernels MoE Test
+  source_file_dependencies:
+  - csrc/moe/
+  - tests/kernels/moe
+  - vllm/model_executor/layers/fused_moe/
+  commands:
+  - pytest -v -s kernels/moe
+
+- label: Kernels Mamba Test
+  source_file_dependencies:
+  - csrc/mamba/
+  - tests/kernels/mamba
+  commands:
+  - pytest -v -s kernels/mamba

 - label: Tensorizer Test # 11min
   working_dir: "/vllm-workspace/tests"
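The sharded kernel jobs above split one test suite across parallel Buildkite agents via `--shard-id`/`--num-shards`. A minimal sketch of the idea, assuming a simple round-robin partition (the test names are hypothetical, and real pytest sharding plugins may partition differently):

```python
# Round-robin sharding: each parallel job keeps every num_shards-th test,
# starting at its own shard_id, so the shards are disjoint and cover all tests.
def shard(tests: list[str], shard_id: int, num_shards: int) -> list[str]:
    return [t for i, t in enumerate(tests) if i % num_shards == shard_id]

tests = [f"test_kernel_{i}" for i in range(5)]  # made-up test names
print(shard(tests, 0, 2))  # ['test_kernel_0', 'test_kernel_2', 'test_kernel_4']
print(shard(tests, 1, 2))  # ['test_kernel_1', 'test_kernel_3']
```

In the pipeline, `$$BUILDKITE_PARALLEL_JOB` and `$$BUILDKITE_PARALLEL_JOB_COUNT` play the roles of `shard_id` and `num_shards`, with `parallelism` setting the job count.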

.github/mergify.yml (+32 −2)

@@ -55,11 +55,19 @@ pull_request_rules:
   description: Automatically apply structured-output label
   conditions:
     - or:
+      - files~=^benchmarks/structured_schemas/
+      - files=benchmarks/benchmark_serving_structured_output.py
+      - files=benchmarks/run_structured_output_benchmark.sh
+      - files=docs/source/features/structured_outputs.md
+      - files=examples/offline_inference/structured_outputs.py
+      - files=examples/online_serving/openai_chat_completion_structured_outputs.py
+      - files=examples/online_serving/openai_chat_completion_structured_outputs_with_reasoning.py
       - files~=^vllm/model_executor/guided_decoding/
       - files=tests/model_executor/test_guided_processors.py
       - files=tests/entrypoints/llm/test_guided_generate.py
-      - files=benchmarks/benchmark_serving_guided.py
-      - files=benchmarks/benchmark_guided.py
+      - files~=^tests/v1/structured_output/
+      - files=tests/v1/entrypoints/llm/test_guided_generate.py
+      - files~=^vllm/v1/structured_output/
   actions:
     label:
       add:
@@ -118,6 +126,28 @@ pull_request_rules:
       remove:
         - tpu

+- name: label-tool-calling
+  description: Automatically add tool-calling label
+  conditions:
+    - or:
+      - files~=^tests/tool_use/
+      - files~=^tests/mistral_tool_use/
+      - files~=^tests/entrypoints/openai/tool_parsers/
+      - files=tests/entrypoints/openai/test_chat_with_tool_reasoning.py
+      - files~=^vllm/entrypoints/openai/tool_parsers/
+      - files=docs/source/features/tool_calling.md
+      - files=docs/source/getting_started/examples/openai_chat_completion_client_with_tools.md
+      - files=docs/source/getting_started/examples/chat_with_tools.md
+      - files~=^examples/tool_chat_*
+      - files=examples/offline_inference/chat_with_tools.py
+      - files=examples/online_serving/openai_chat_completion_client_with_tools_required.py
+      - files=examples/online_serving/openai_chat_completion_tool_calls_with_reasoning.py
+      - files=examples/online_serving/openai_chat_completion_client_with_tools.py
+  actions:
+    label:
+      add:
+        - tool-calling
+
 - name: ping author on conflicts and add 'needs-rebase' label
   conditions:
     - conflict
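In these Mergify rules, `files~=PATTERN` is a regex matched against each changed path while `files=PATH` is an exact match, and the `or:` block labels the PR when any condition hits. A small sketch of that evaluation, using only a subset of the `label-tool-calling` rule above (the evaluation loop itself is illustrative, not Mergify's actual engine):

```python
import re

# Subset of the label-tool-calling conditions: regex rules vs exact-path rules.
REGEX_RULES = [r"^tests/tool_use/", r"^vllm/entrypoints/openai/tool_parsers/"]
EXACT_RULES = ["docs/source/features/tool_calling.md"]

def needs_label(changed_files: list[str]) -> bool:
    """Return True if any changed path satisfies any condition (the 'or:' block)."""
    for path in changed_files:
        if path in EXACT_RULES:
            return True
        if any(re.search(pattern, path) for pattern in REGEX_RULES):
            return True
    return False

print(needs_label(["tests/tool_use/test_parsers.py"]))  # True: regex rule hits
print(needs_label(["README.md"]))                       # False: no rule hits
```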

benchmarks/benchmark_serving.py (+20 −11)

@@ -713,7 +713,7 @@ def main(args: argparse.Namespace):
     ))

     # Save config and results to json
-    if args.save_result:
+    if args.save_result or args.append_result:
         result_json: dict[str, Any] = {}

         # Setup
@@ -734,6 +734,14 @@ def main(args: argparse.Namespace):
                 raise ValueError(
                     "Invalid metadata format. Please use KEY=VALUE format."
                 )
+        # Traffic
+        result_json["request_rate"] = (args.request_rate if args.request_rate
+                                       < float("inf") else "inf")
+        result_json["burstiness"] = args.burstiness
+        result_json["max_concurrency"] = args.max_concurrency
+
+        # Merge with benchmark result
+        result_json = {**result_json, **benchmark_result}

         if not args.save_detailed:
             # Remove fields with too many data points
@@ -744,15 +752,6 @@ def main(args: argparse.Namespace):
                 if field in result_json:
                     del result_json[field]

-        # Traffic
-        result_json["request_rate"] = (args.request_rate if args.request_rate
-                                       < float("inf") else "inf")
-        result_json["burstiness"] = args.burstiness
-        result_json["max_concurrency"] = args.max_concurrency
-
-        # Merge with benchmark result
-        result_json = {**result_json, **benchmark_result}
-
         # Save to file
         base_model_id = model_id.split("/")[-1]
         max_concurrency_str = (f"-concurrency{args.max_concurrency}"
@@ -762,7 +761,12 @@ def main(args: argparse.Namespace):
         file_name = args.result_filename
         if args.result_dir:
             file_name = os.path.join(args.result_dir, file_name)
-        with open(file_name, "w", encoding='utf-8') as outfile:
+        with open(file_name,
+                  mode="a+" if args.append_result else "w",
+                  encoding='utf-8') as outfile:
+            # Append a newline.
+            if args.append_result and outfile.tell() != 0:
+                outfile.write("\n")
             json.dump(result_json, outfile)
         save_to_pytorch_benchmark_format(args, result_json, file_name)

@@ -894,6 +898,11 @@ def main(args: argparse.Namespace):
         help="When saving the results, whether to include per request "
         "information such as response, error, ttfs, tpots, etc.",
     )
+    parser.add_argument(
+        "--append-result",
+        action="store_true",
+        help="Append the benchmark result to the existing json file.",
+    )
     parser.add_argument(
         "--metadata",
         metavar="KEY=VALUE",
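The `--append-result` change above opens the file in `"a+"` mode and writes a newline separator whenever the file already has content, so repeated runs accumulate one JSON object per line (JSON Lines style). A self-contained sketch of just that file-handling pattern (the result dicts below are made up, not real benchmark output):

```python
import json
import os
import tempfile

def save_result(path: str, result: dict, append: bool) -> None:
    """Write result as JSON; in append mode, separate entries with a newline."""
    with open(path, mode="a+" if append else "w", encoding="utf-8") as outfile:
        # In "a+" mode the position starts at end-of-file, so tell() != 0
        # means the file already holds at least one result.
        if append and outfile.tell() != 0:
            outfile.write("\n")
        json.dump(result, outfile)

path = os.path.join(tempfile.mkdtemp(), "results.json")
save_result(path, {"request_rate": "inf", "burstiness": 1.0}, append=True)
save_result(path, {"request_rate": 10, "burstiness": 0.5}, append=True)
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(len(lines))  # 2
```

Each line then parses independently with `json.loads`, which is why the separator matters: plain concatenation of two objects would not be valid JSON.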

benchmarks/benchmark_serving_structured_output.py (+7 −7)

@@ -51,7 +51,7 @@
 except ImportError:
     from argparse import ArgumentParser as FlexibleArgumentParser

-from vllm.v1.structured_output.utils import (
+from vllm.v1.structured_output.backend_xgrammar import (
     has_xgrammar_unsupported_json_features)

 MILLISECONDS_TO_SECONDS_CONVERSION = 1000
@@ -150,17 +150,17 @@ def get_schema(index: int):

     elif args.dataset == "grammar":
         schema = """
-            ?start: select_statement
+            root ::= select_statement

-            ?select_statement: "SELECT " column_list " FROM " table_name
+            select_statement ::= "SELECT " column " from " table " where " condition

-            ?column_list: column_name ("," column_name)*
+            column ::= "col_1 " | "col_2 "

-            ?table_name: identifier
+            table ::= "table_1 " | "table_2 "

-            ?column_name: identifier
+            condition ::= column "= " number

-            ?identifier: /[a-zA-Z_][a-zA-Z0-9_]*/
+            number ::= "1 " | "2 "
         """
         prompt = "Generate an SQL query to show the 'username' \
             and 'email' from the 'users' table."

benchmarks/kernels/benchmark_lora.py (+8 −2)

@@ -17,8 +17,14 @@
 from utils import ArgPool, Bench, CudaGraphBenchParams
 from weight_shapes import WEIGHT_SHAPES

-from vllm.lora.ops.triton_ops import LoRAKernelMeta, lora_expand, lora_shrink
-from vllm.lora.ops.triton_ops.utils import _LORA_A_PTR_DICT, _LORA_B_PTR_DICT
+from vllm.triton_utils import HAS_TRITON
+
+if HAS_TRITON:
+    from vllm.lora.ops.triton_ops import (LoRAKernelMeta, lora_expand,
+                                          lora_shrink)
+    from vllm.lora.ops.triton_ops.utils import (_LORA_A_PTR_DICT,
+                                                _LORA_B_PTR_DICT)
+
 from vllm.utils import FlexibleArgumentParser

 DEFAULT_MODELS = list(WEIGHT_SHAPES.keys())
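The diff above guards the Triton-specific imports behind a `HAS_TRITON` flag so the benchmark can at least be imported on machines without Triton. The general optional-dependency pattern looks like this (here `fancylib` is a made-up module standing in for Triton, and the fallback logic is illustrative):

```python
# Optional-import guard: probe for the dependency once at import time,
# then branch on the flag instead of risking ImportError at call sites.
try:
    import fancylib  # hypothetical optional dependency
    HAS_FANCYLIB = True
except ImportError:
    HAS_FANCYLIB = False

def accelerated_or_fallback(x: int) -> int:
    if HAS_FANCYLIB:
        # Only reached when the import succeeded above.
        return fancylib.fast_double(x)
    return x * 2  # pure-Python fallback

print(accelerated_or_fallback(21))  # 42
```

Callers never need to know which path ran; the flag also lets test code skip benchmarks that genuinely require the accelerated kernels.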

cmake/external_projects/vllm_flash_attn.cmake (+1 −1)

@@ -38,7 +38,7 @@ else()
 FetchContent_Declare(
   vllm-flash-attn
   GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
-  GIT_TAG 0a721daebe4fa7149f06ecf3d3eabeb6dcd0f1fa
+  GIT_TAG 8798f27777fb57f447070301bf33a9f9c607f491
   GIT_PROGRESS TRUE
   # Don't share the vllm-flash-attn build between build types
   BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn

docker/Dockerfile (+6)

@@ -162,6 +162,9 @@ ENV UV_HTTP_TIMEOUT=500
 COPY requirements/lint.txt requirements/lint.txt
 COPY requirements/test.txt requirements/test.txt
 COPY requirements/dev.txt requirements/dev.txt
+# Workaround for #17068
+RUN --mount=type=cache,target=/root/.cache/uv \
+    uv pip install --system mamba-ssm==2.2.4 --no-build-isolation
 RUN --mount=type=cache,target=/root/.cache/uv \
     uv pip install --system -r requirements/dev.txt
 #################### DEV IMAGE ####################
@@ -265,6 +268,9 @@ ADD . /vllm-workspace/
 ENV UV_HTTP_TIMEOUT=500

 # install development dependencies (for testing)
+# Workaround for #17068
+RUN --mount=type=cache,target=/root/.cache/uv \
+    uv pip install --system mamba-ssm==2.2.4 --no-build-isolation
 RUN --mount=type=cache,target=/root/.cache/uv \
     uv pip install --system -r requirements/dev.txt
