
[Feature]: Add support for Guided Decoding #177


Open · shen-shanshan opened this issue Feb 26, 2025 · 11 comments

@shen-shanshan (Collaborator) commented Feb 26, 2025

Overview

In our roadmap, we plan to support guided decoding in 2025 Q1 as shown here (#71).

Currently:

I have tested vllm/examples/offline_inference/structured_outputs.py directly on an NPU device, and the results showed that guided decoding works natively on NPU with the outlines backend.

In addition, I have analysed the vLLM code and found that the tensors involved in guide logits computation are all on the npu device, which also demonstrates that guided decoding is natively supported on NPU.

However, some problems still need to be fixed, such as incomplete JSON output and slow inference speed.

Feel free to report any issues you encounter when using guided decoding with vllm-ascend, and we will try to fix them if we can. A minimal sketch of the kind of offline test mentioned above is shown below.
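
For reference, a minimal offline test of this kind might look as follows. This is only a sketch (the model name is a placeholder), assuming vLLM's GuidedDecodingParams/SamplingParams API as used in vllm/examples/offline_inference/structured_outputs.py:

from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# Constrain the output to one of two choices via guided decoding.
guided = GuidedDecodingParams(choice=["Positive", "Negative"])
params = SamplingParams(guided_decoding=guided)

# Placeholder model; any chat model runnable on the NPU should work.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
outputs = llm.generate(
    prompts="Classify this sentiment: vllm-ascend is wonderful!",
    sampling_params=params,
)
print(outputs[0].outputs[0].text)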

Roadmap

🔥 Latest community news:


🚀 Adaptation on vllm-ascend from vllm community:


- [V1] Refactor for structured output module:
- [V1] Bugfix for xgrammar backend:
- [V1] Bugfix for guidance backend:
- Add support for xgrammar backend on aarch64:
- Add support for reasoning model (DeepSeek-R1):


@wangxiyuan (Collaborator)

Thanks for your investigation. If guided decoding works well with Ascend, please update the feature doc. Thanks.

@ej-hw commented Mar 26, 2025

Guided decoding seems to have subpar performance: ~10 tokens/s for normal completion vs ~1 token/s for guided decoding.

(screenshot attached)

Steps to reproduce

export IMAGE=quay.io/ascend/vllm-ascend:v0.7.3rc1

docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/  \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 5566:5566 \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE bash -c "vllm serve Qwen/Qwen2.5-32B-Instruct --host=0.0.0.0 --port=5566 --api-key=Aicoe123? -tp=2 --gpu-memory-utilization 0.95 --max-model-len 32678 --enforce-eager --served-model-name=qwen2-5-32b-it"

Execution script

from openai import OpenAI
client = OpenAI(
    base_url="XXXX/v1",
    api_key="XXXX",
)

from pydantic import BaseModel
from enum import Enum

class CarType(str, Enum):
    sedan = "sedan"
    suv = "SUV"
    truck = "Truck"
    coupe = "Coupe"


class CarDescription(BaseModel):
    brand: str
    model: str
    story: str
    car_type: CarType


json_schema = CarDescription.model_json_schema()

completion = client.chat.completions.create(
    model="qwen2-5-32b-it",
    messages=[
        {
            "role": "user",
            "content": "Generate a JSON with the brand, model, story and car_type of the most iconic car from the 90's",
        }
    ],
    extra_body={"guided_json": json_schema, 
        "guided_decoding_backend": "xgrammar",},
)
print(completion.choices[0].message.content)

Other info:
(screenshot attached)

Also, this warning may be relevant:

('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')

@shen-shanshan (Collaborator, Author) commented Mar 27, 2025

@ej-hw This is because PR vllm-project/vllm#13894 has not been merged yet. You can apply the change locally and try xgrammar>=0.1.14.

@wilschoo

> @ej-hw This is because PR vllm-project/vllm#13894 has not been merged yet. You can apply the change locally and try xgrammar>=0.1.14.

This has been tested with xgrammar 0.1.14 and 0.1.15, with the fallback-to-outlines code block commented out.

The outcome was that xgrammar is now used, but an error stating "CUDA_HOME not set" was observed.

(screenshot attached)

@shen-shanshan (Collaborator, Author)

> This has been tested with xgrammar 0.1.14 and 0.1.15, with the fallback-to-outlines code block commented out. The outcome was that xgrammar is now used, but an error stating "CUDA_HOME not set" was observed.

Which version of vllm do you use? Note that V1 currently only supports xgrammar; you can find more details here.

Please share more details for analysis.
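
If it helps, one quick way to collect the requested version details (my own convenience snippet, not part of the original request) is:

import importlib.metadata as md

# Print the installed versions of the packages relevant to this issue.
for pkg in ("vllm", "vllm-ascend", "torch", "torch-npu", "xgrammar"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")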

@wilschoo commented Apr 1, 2025

> Which version of vllm do you use? Note that V1 currently only supports xgrammar; you can find more details here.
> Please share more details for analysis.

xgrammar==0.1.14
torch==2.5.1
torch-npu==2.5.1.dev20250320
vllm==0.7.3+empty
vllm_ascend==0.7.3rc1

@shen-shanshan (Collaborator, Author)

@wilschoo I have tried this with your versions, and I found that xgrammar called apply_token_bitmask_inplace_cuda, which is weird (CUDA is not available in an NPU environment).

After commenting out the code shown below in xgrammar/kernels/__init__.py, it works:

"""The kernels for XGrammar."""

import torch

from .apply_token_bitmask_inplace_cpu import apply_token_bitmask_inplace_cpu

apply_token_bitmask_inplace_kernels = {"cpu": apply_token_bitmask_inplace_cpu}

__all__ = ["apply_token_bitmask_inplace_kernels"]

# if torch.cuda.is_available():
#     from .apply_token_bitmask_inplace_cuda import apply_token_bitmask_inplace_cuda

#     apply_token_bitmask_inplace_kernels["cuda"] = apply_token_bitmask_inplace_cuda

# try:
#     from .apply_token_bitmask_inplace_triton import (  # isort: skip
#         apply_token_bitmask_inplace_triton,
#     )

#     apply_token_bitmask_inplace_kernels["triton"] = apply_token_bitmask_inplace_triton
# except ImportError:
#     # If triton is not installed, we can still use the CPU and CUDA implementations.
#     pass
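
After this edit, only the CPU kernel should remain registered, so the token-bitmask application falls back to the CPU path on NPU. A quick sanity check (my addition, assuming the module layout shown above):

from xgrammar.kernels import apply_token_bitmask_inplace_kernels

# With the CUDA and Triton imports commented out, only the CPU kernel is left.
print(apply_token_bitmask_inplace_kernels.keys())  # expected: dict_keys(['cpu'])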

@shen-shanshan (Collaborator, Author)

@wilschoo You can try the main branches of vllm and vllm-ascend; they work well with xgrammar 0.1.16.

@wilschoo commented Apr 1, 2025

> You can try the main branches of vllm and vllm-ascend; they work well with xgrammar 0.1.16.

What are the vllm and vllm-ascend versions that you are using?

@shen-shanshan (Collaborator, Author) commented Apr 1, 2025

> What are the vllm and vllm-ascend versions that you are using?

vllm==0.8.3.dev168+g18ed3132d.empty
vllm-ascend==0.1.dev119+g7beb433.d20250331

@wilschoo commented Apr 1, 2025

> vllm==0.8.3.dev168+g18ed3132d.empty
> vllm-ascend==0.1.dev119+g7beb433.d20250331

I have pulled and installed the main branch of both vllm and vllm_ascend:

vllm==0.1.dev1+g30d6a01.empty
vllm_ascend==0.1.dev1+g78083d4
xgrammar==0.1.16

Using my serve command:

vllm serve Qwen/Qwen2.5-32B-Instruct -tp=2 --gpu-memory-utilization 0.95 --max-model-len 32678 --enforce-eager --guided-decoding-backend=xgrammar

I now see ~12 output tokens/s, compared to ~1 token/s on the previous vllm and vllm_ascend versions.
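
For reference, a rough way to reproduce a tokens/s figure like this against the same OpenAI-compatible endpoint (my own sketch, not from the thread; base_url, api_key and the schema are placeholders):

import time
from openai import OpenAI

client = OpenAI(base_url="XXXX/v1", api_key="XXXX")  # placeholders

# Minimal placeholder schema; the thread itself uses the CarDescription schema above.
json_schema = {
    "type": "object",
    "properties": {"brand": {"type": "string"}},
    "required": ["brand"],
}

start = time.perf_counter()
completion = client.chat.completions.create(
    model="qwen2-5-32b-it",
    messages=[{"role": "user", "content": "Generate a JSON describing an iconic 90's car."}],
    extra_body={"guided_json": json_schema},
)
elapsed = time.perf_counter() - start

# Output tokens per second over the whole request (a coarse throughput figure).
print(completion.usage.completion_tokens / elapsed, "tokens/s")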
