
[Feature]: Add support for Guided Decoding #177


Open · shen-shanshan opened this issue Feb 26, 2025 · 11 comments

@shen-shanshan (Collaborator) commented Feb 26, 2025

Overview

In our roadmap, we plan to support guided decoding in 2025 Q1 as shown here (#71).

Currently:

I have tested vllm/examples/offline_inference/structured_outputs.py directly on an NPU device, and the results showed that guided decoding works natively on NPU with the outlines backend.

In addition, I have analysed the vLLM code and found that the tensors involved in guide logits computation are all on the npu device, which also demonstrates that guided decoding is natively supported on NPU.

However, some problems still need to be fixed, such as incomplete JSON output and slow inference speed.

Feel free to report any issues you encounter when using guided decoding with vllm-ascend, and we will try to fix them if we can. A minimal sketch of the kind of offline test mentioned above is shown below.
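
For reference, a minimal offline test of this kind might look as follows. This is only a sketch (the model name is a placeholder), assuming vLLM's GuidedDecodingParams/SamplingParams API as used in vllm/examples/offline_inference/structured_outputs.py:

from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# Constrain the output to one of two choices via guided decoding.
guided = GuidedDecodingParams(choice=["Positive", "Negative"])
params = SamplingParams(guided_decoding=guided)

# Placeholder model; any chat model runnable on the NPU should work.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
outputs = llm.generate(
    prompts="Classify this sentiment: vllm-ascend is wonderful!",
    sampling_params=params,
)
print(outputs[0].outputs[0].text)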

Roadmap

🔥 Latest community news:


🚀 Adaptation on vllm-ascend from vllm community:


- [V1] Refactor for structured output module:
- [V1] Bugfix for xgrammar backend:
- [V1] Bugfix for guidance backend:
- Add support for xgrammar backend on aarch64:
- Add support for reasoning model (DeepSeek-R1):


@wangxiyuan (Collaborator)

Thanks for your investigation. If guided decoding works well with Ascend, please update the feature doc. Thanks.

@ej-hw commented Mar 26, 2025

Guided decoding seems to have subpar performance: ~10 tokens/s for normal completion vs ~1 token/s for guided decoding.

(screenshot attached)

Steps to reproduce

export IMAGE=quay.io/ascend/vllm-ascend:v0.7.3rc1

docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/  \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 5566:5566 \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE bash -c "vllm serve Qwen/Qwen2.5-32B-Instruct --host=0.0.0.0 --port=5566 --api-key=Aicoe123? -tp=2 --gpu-memory-utilization 0.95 --max-model-len 32678 --enforce-eager --served-model-name=qwen2-5-32b-it"

Execution script

from openai import OpenAI
client = OpenAI(
    base_url="XXXX/v1",
    api_key="XXXX",
)

from pydantic import BaseModel
from enum import Enum

class CarType(str, Enum):
    sedan = "sedan"
    suv = "SUV"
    truck = "Truck"
    coupe = "Coupe"


class CarDescription(BaseModel):
    brand: str
    model: str
    story: str
    car_type: CarType


json_schema = CarDescription.model_json_schema()

completion = client.chat.completions.create(
    model="qwen2-5-32b-it",
    messages=[
        {
            "role": "user",
            "content": "Generate a JSON with the brand, model, story and car_type of the most iconic car from the 90's",
        }
    ],
    extra_body={"guided_json": json_schema, 
        "guided_decoding_backend": "xgrammar",},
)
print(completion.choices[0].message.content)

Other info:
(screenshot attached)

Also, this warning may be relevant:

('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')

@shen-shanshan (Collaborator, Author) commented Mar 27, 2025

@ej-hw This is because PR vllm-project/vllm#13894 has not been merged yet. You can apply the change locally and try xgrammar>=0.1.14.

@wilschoo

> @ej-hw This is because PR vllm-project/vllm#13894 has not been merged yet. You can apply the change locally and try xgrammar>=0.1.14.

This has been tested with xgrammar 0.1.14 and 0.1.15, with the fallback-to-outlines code block commented out.

The outcome was that xgrammar is now used, but an error stating "CUDA_HOME not set" was observed.

(screenshot attached)

@shen-shanshan (Collaborator, Author)

> This has been tested with xgrammar 0.1.14 and 0.1.15, with the fallback-to-outlines code block commented out. The outcome was that xgrammar is now used, but an error stating "CUDA_HOME not set" was observed.

Which version of vllm do you use? Note that V1 currently only supports xgrammar; you can find more details here.

Please share more details for analysis.
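
If it helps, one quick way to collect the requested version details (my own convenience snippet, not part of the original request) is:

import importlib.metadata as md

# Print the installed versions of the packages relevant to this issue.
for pkg in ("vllm", "vllm-ascend", "torch", "torch-npu", "xgrammar"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")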

@wilschoo commented Apr 1, 2025

> Which version of vllm do you use? Note that V1 currently only supports xgrammar; you can find more details here.
> Please share more details for analysis.

xgrammar==0.1.14
torch==2.5.1
torch-npu==2.5.1.dev20250320
vllm==0.7.3+empty
vllm_ascend==0.7.3rc1

@shen-shanshan (Collaborator, Author)

@wilschoo I have tried this with your versions, and I found that xgrammar called apply_token_bitmask_inplace_cuda, which is weird (CUDA is not available in an NPU environment).

After commenting out the code shown below in xgrammar/kernels/__init__.py, it works:

"""The kernels for XGrammar."""

import torch

from .apply_token_bitmask_inplace_cpu import apply_token_bitmask_inplace_cpu

apply_token_bitmask_inplace_kernels = {"cpu": apply_token_bitmask_inplace_cpu}

__all__ = ["apply_token_bitmask_inplace_kernels"]

# if torch.cuda.is_available():
#     from .apply_token_bitmask_inplace_cuda import apply_token_bitmask_inplace_cuda

#     apply_token_bitmask_inplace_kernels["cuda"] = apply_token_bitmask_inplace_cuda

# try:
#     from .apply_token_bitmask_inplace_triton import (  # isort: skip
#         apply_token_bitmask_inplace_triton,
#     )

#     apply_token_bitmask_inplace_kernels["triton"] = apply_token_bitmask_inplace_triton
# except ImportError:
#     # If triton is not installed, we can still use the CPU and CUDA implementations.
#     pass
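
After this edit, only the CPU kernel should remain registered, so the token-bitmask application falls back to the CPU path on NPU. A quick sanity check (my addition, assuming the module layout shown above):

from xgrammar.kernels import apply_token_bitmask_inplace_kernels

# With the CUDA and Triton imports commented out, only the CPU kernel is left.
print(apply_token_bitmask_inplace_kernels.keys())  # expected: dict_keys(['cpu'])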

@shen-shanshan (Collaborator, Author)

@wilschoo You can try the main branches of vllm and vllm-ascend; they work well with xgrammar 0.1.16.

@wilschoo commented Apr 1, 2025

> You can try the main branches of vllm and vllm-ascend; they work well with xgrammar 0.1.16.

What are the vllm and vllm-ascend versions that you are using?

@shen-shanshan (Collaborator, Author) commented Apr 1, 2025

> What are the vllm and vllm-ascend versions that you are using?

vllm==0.8.3.dev168+g18ed3132d.empty
vllm-ascend==0.1.dev119+g7beb433.d20250331

@wilschoo commented Apr 1, 2025

> vllm==0.8.3.dev168+g18ed3132d.empty
> vllm-ascend==0.1.dev119+g7beb433.d20250331

I have pulled and installed the main branch of both vllm and vllm_ascend:

vllm==0.1.dev1+g30d6a01.empty
vllm_ascend==0.1.dev1+g78083d4
xgrammar==0.1.16

Using my serve command:

vllm serve Qwen/Qwen2.5-32B-Instruct -tp=2 --gpu-memory-utilization 0.95 --max-model-len 32678 --enforce-eager --guided-decoding-backend=xgrammar

I now see ~12 output tokens/s, compared to ~1 token/s on the previous vllm and vllm_ascend versions.
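
For reference, a rough way to reproduce a tokens/s figure like this against the same OpenAI-compatible endpoint (my own sketch, not from the thread; base_url, api_key and the schema are placeholders):

import time
from openai import OpenAI

client = OpenAI(base_url="XXXX/v1", api_key="XXXX")  # placeholders

# Minimal placeholder schema; the thread itself uses the CarDescription schema above.
json_schema = {
    "type": "object",
    "properties": {"brand": {"type": "string"}},
    "required": ["brand"],
}

start = time.perf_counter()
completion = client.chat.completions.create(
    model="qwen2-5-32b-it",
    messages=[{"role": "user", "content": "Generate a JSON describing an iconic 90's car."}],
    extra_body={"guided_json": json_schema},
)
elapsed = time.perf_counter() - start

# Output tokens per second over the whole request (a coarse throughput figure).
print(completion.usage.completion_tokens / elapsed, "tokens/s")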
