[Feature]: Add support for Guided Decoding #177

Comments

Thanks for your investigation. If …

Guided Decoding seems to have subpar performance.

Steps to reproduce

Execution script

Also, this warning may be relevant: …

This is because PR vllm-project/vllm#13894 is not merged yet; you can modify the code locally and use …

This has been tested using xgrammar. The outcome was that xgrammar is now used, but an error stating "CUDA_HOME not set" was observed.

Which version of vllm do you use? Note that V1 currently only supports … Please show more details for analysis.

xgrammar==0.1.14
torch==2.5.1
torch-npu==2.5.1.dev20250320
vllm==0.7.3+empty
vllm_ascend==0.7.3rc1

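For anyone collecting the same details, a small sketch like the following dumps the installed versions; the distribution names used here are assumptions and may need adjusting for your environment:

```python
# Sketch: print the versions of the packages discussed in this thread.
# The distribution names below are assumptions; adjust them if pip lists
# these packages under different names in your environment.
import importlib.metadata as metadata

for dist in ("vllm", "vllm-ascend", "xgrammar", "torch", "torch-npu"):
    try:
        print(f"{dist}=={metadata.version(dist)}")
    except metadata.PackageNotFoundError:
        print(f"{dist}: not installed")
```
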
I have tried this with your version, and I find … After commenting out the code shown below in …:

```python
"""The kernels for XGrammar."""

import torch

from .apply_token_bitmask_inplace_cpu import apply_token_bitmask_inplace_cpu

apply_token_bitmask_inplace_kernels = {"cpu": apply_token_bitmask_inplace_cpu}

__all__ = ["apply_token_bitmask_inplace_kernels"]

# if torch.cuda.is_available():
#     from .apply_token_bitmask_inplace_cuda import apply_token_bitmask_inplace_cuda
#
#     apply_token_bitmask_inplace_kernels["cuda"] = apply_token_bitmask_inplace_cuda
#
# try:
#     from .apply_token_bitmask_inplace_triton import (  # isort: skip
#         apply_token_bitmask_inplace_triton,
#     )
#
#     apply_token_bitmask_inplace_kernels["triton"] = apply_token_bitmask_inplace_triton
# except ImportError:
#     # If triton is not installed, we can still use the CPU and CUDA implementations.
#     pass
```

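To illustrate why removing the CUDA/Triton entries matters on Ascend: with only the "cpu" key left in apply_token_bitmask_inplace_kernels, any dispatch over that table has to fall back to the CPU kernel. Below is a rough sketch of that fallback, not xgrammar's actual dispatcher; the import path and the two-argument kernel call are assumptions.

```python
import torch

# The dict is the one defined in the snippet above; the import path is an
# assumption about where that file lives inside the xgrammar package.
from xgrammar.kernels import apply_token_bitmask_inplace_kernels


def apply_token_bitmask(logits: torch.Tensor, bitmask: torch.Tensor) -> None:
    """Sketch: pick a masking kernel based on the device holding the logits."""
    device_type = logits.device.type  # e.g. "cuda", "cpu", or "npu"
    if device_type in apply_token_bitmask_inplace_kernels:
        # A native kernel is registered for this device; use it directly.
        apply_token_bitmask_inplace_kernels[device_type](logits, bitmask)
    else:
        # No kernel for this device (e.g. NPU): mask on the CPU and copy back.
        cpu_logits = logits.cpu()
        apply_token_bitmask_inplace_kernels["cpu"](cpu_logits, bitmask.cpu())
        logits.copy_(cpu_logits.to(logits.device))
```
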
@wilschoo You can use …

What are the vllm and vllm-ascend versions that you are using?

vllm==0.8.3.dev168+g18ed3132d.empty
vllm-ascend==0.1.dev119+g7beb433.d20250331

I have pulled and installed the main branch for both vllm and vllm_ascend:

vllm==0.1.dev1+g30d6a01.empty
vllm_ascend==0.1.dev1+g78083d4
xgrammar==0.1.16

Using my serve command

vllm serve Qwen/Qwen2.5-32B-Instruct -tp=2 --gpu-memory-utilization 0.95 --max-model-len 32678 --enforce-eager --guided-decoding-backend=xgrammar

my TPOT is now ~12 tokens/s, compared to ~1 token/s on the previous vllm and vllm_ascend versions.

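For anyone reproducing this, here is a request sketch against a server started with a command like the one above. It assumes vLLM's OpenAI-compatible endpoint on the default port and the guided_json extra parameter; host, port, model name, and schema should be adjusted to the actual deployment.

```python
# Sketch of a guided-decoding request against the OpenAI-compatible server.
# Host, port, API key, and the JSON schema are assumptions about the
# deployment; adjust them to match your server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

person_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-32B-Instruct",
    messages=[{"role": "user", "content": "Describe a fictional person as JSON."}],
    extra_body={"guided_json": person_schema},
)
print(response.choices[0].message.content)
```
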
Overview

In our roadmap, we plan to support guided decoding in 2025 Q1, as shown in #71.

Currently: I have tested vllm/examples/offline_inference/structured_outputs.py directly on an NPU device, and the results showed that guided decoding runs natively on NPU with the outlines backend (a minimal sketch of such a test follows below). In addition, I have analysed the code in vLLM and found that the tensors involved in guide logits computation are all on the npu device, which also indicates that guided decoding is natively supported on NPU. However, there are still some problems that need to be fixed, such as incomplete JSON output and slow inference speed.
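
A minimal sketch of such a test, in the spirit of the structured_outputs.py example mentioned above; the GuidedDecodingParams API is assumed to match recent vLLM releases, and the model name and prompt are placeholders:

```python
# Minimal guided-decoding smoke test. The model name and prompt are
# placeholders; GuidedDecodingParams is assumed to be available as in
# recent vLLM releases.
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# Constrain the output to a fixed set of choices.
guided = GuidedDecodingParams(choice=["Positive", "Negative"])
params = SamplingParams(guided_decoding=guided, temperature=0.0, max_tokens=10)

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
outputs = llm.generate(
    ["Classify the sentiment: guided decoding now runs on Ascend NPUs!"],
    sampling_params=params,
)
print(outputs[0].outputs[0].text)  # expected to be "Positive" or "Negative"
```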
Feel free to report issues you hit when using guided decoding with vllm-ascend, and we will try to fix them if we can.

Roadmap

🔥 Latest community news:

- structural_tag support using xgrammar vllm#17085 (🚧 TODO: needs testing on vllm-ascend)

🚀 Adaptation on vllm-ascend from the vllm community:

- [V1] Refactor for structured output module:
  - _validate_structured_output() vllm#16748
  - backend_xgrammar.py vllm#16578
  - supports_structured_output() in platform #531
  - supports_structured_output() method to Platform #475
  - supports_structured_output() method to Platform vllm#16148
- [V1] Bugfix for xgrammar backend:
  - apply_grammar_bitmask() method to model runner #555
- [V1] Bugfix for guidance backend:
  - … (>= 0.7.11) to avoid AttributeError (no StructTag) vllm#17839
- Add support for xgrammar backend on aarch64:
- Add support for reasoning model (DeepSeek-R1):