[Guide]: Usage on Graph mode #767
When I run it, I get the following output:

```
WARNING 05-14 02:54:07 [platform.py:125] NPU compilation support pending. Will be available in future CANN and torch_npu releases. NPU graph mode is currently experimental and disabled by default. You can just adopt additional_config={'enable_graph_mode': True} to serve deepseek models with NPU graph mode on vllm-ascend with V0 engine.
INFO 05-14 02:54:07 [platform.py:141] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
WARNING 05-14 02:54:07 [platform.py:179] Prefix caching is not supported for V1 now, disable prefix caching
INFO 05-14 02:54:08 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='/mnt/models/Qwen2.5-0.5B-Instruct', speculative_config=None, tokenizer='/mnt/models/Qwen2.5-0.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/mnt/models/Qwen2.5-0.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["all"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.unified_ascend_attention_with_output"],"use_inductor":false,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py:29: ResourceWarning: unclosed <socket.socket fd=19, family=AddressFamily.AF_INET, type=SocketKind.SOCK_DGRAM, proto=0, laddr=('10.246.91.186', 45099), raddr=('8.8.8.8', 80)>
get_ip(), get_open_port())
WARNING 05-14 02:54:10 [utils.py:2522] Methods add_lora,cache_config,determine_available_memory,determine_num_available_blocks,device_config,get_cache_block_size_bytes,list_loras,load_config,pin_lora,remove_lora,scheduler_config not implemented in <vllm_ascend.worker.worker_v1.NPUWorker object at 0xfffd0cfa2590>
[rank0]:[W514 02:54:15.919541630 ProcessGroupGloo.cpp:715] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
INFO 05-14 02:54:15 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-14 02:54:16 [model_runner_v1.py:852] Starting to load model /mnt/models/Qwen2.5-0.5B-Instruct...
ERROR 05-14 02:54:18 [core.py:396] EngineCore failed to start.
ERROR 05-14 02:54:18 [core.py:396] Traceback (most recent call last):
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
ERROR 05-14 02:54:18 [core.py:396] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 329, in __init__
ERROR 05-14 02:54:18 [core.py:396] super().__init__(vllm_config, executor_class, log_stats,
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 64, in __init__
ERROR 05-14 02:54:18 [core.py:396] self.model_executor = executor_class(vllm_config)
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 05-14 02:54:18 [core.py:396] self._init_executor()
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
ERROR 05-14 02:54:18 [core.py:396] self.collective_rpc("load_model")
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 05-14 02:54:18 [core.py:396] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/utils.py", line 2456, in run_method
ERROR 05-14 02:54:18 [core.py:396] return func(*args, **kwargs)
ERROR 05-14 02:54:18 [core.py:396] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 178, in load_model
ERROR 05-14 02:54:18 [core.py:396] self.model_runner.load_model()
ERROR 05-14 02:54:18 [core.py:396] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 855, in load_model
ERROR 05-14 02:54:18 [core.py:396] self.model = get_model(vllm_config=self.vllm_config)
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
ERROR 05-14 02:54:18 [core.py:396] return loader.load_model(vllm_config=vllm_config)
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 452, in load_model
ERROR 05-14 02:54:18 [core.py:396] model = _initialize_model(vllm_config=vllm_config)
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 133, in _initialize_model
ERROR 05-14 02:54:18 [core.py:396] return model_class(vllm_config=vllm_config, prefix=prefix)
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 436, in __init__
ERROR 05-14 02:54:18 [core.py:396] self.model = Qwen2Model(vllm_config=vllm_config,
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 162, in __init__
ERROR 05-14 02:54:18 [core.py:396] TorchCompileWrapperWithCustomDispatcher.__init__(
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/compilation/wrapper.py", line 42, in __init__
ERROR 05-14 02:54:18 [core.py:396] backend = vllm_config.compilation_config.init_backend(vllm_config)
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/config.py", line 3600, in init_backend
ERROR 05-14 02:54:18 [core.py:396] from vllm.compilation.backends import VllmBackend
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/compilation/backends.py", line 20, in <module>
ERROR 05-14 02:54:18 [core.py:396] from .compiler_interface import EagerAdaptor, InductorAdaptor
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/compilation/compiler_interface.py", line 11, in <module>
ERROR 05-14 02:54:18 [core.py:396] import torch._inductor.compile_fx
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 72, in <module>
ERROR 05-14 02:54:18 [core.py:396] from .fx_passes.joint_graph import joint_graph_passes
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_inductor/fx_passes/joint_graph.py", line 19, in <module>
ERROR 05-14 02:54:18 [core.py:396] from ..pattern_matcher import (
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 96, in <module>
ERROR 05-14 02:54:18 [core.py:396] from .lowering import fallback_node_due_to_unsupported_type
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_inductor/lowering.py", line 6430, in <module>
ERROR 05-14 02:54:18 [core.py:396] from . import kernel
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_inductor/kernel/__init__.py", line 1, in <module>
ERROR 05-14 02:54:18 [core.py:396] from . import mm, mm_common, mm_plus_mm, unpack_mixed_mm
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_inductor/kernel/mm.py", line 16, in <module>
ERROR 05-14 02:54:18 [core.py:396] from torch._inductor.codegen.cpp_gemm_template import CppPackedGemmTemplate
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_inductor/codegen/cpp_gemm_template.py", line 19, in <module>
ERROR 05-14 02:54:18 [core.py:396] from .cpp_micro_gemm import CppMicroGemmAMX, create_micro_gemm, LayoutType
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_inductor/codegen/cpp_micro_gemm.py", line 16, in <module>
ERROR 05-14 02:54:18 [core.py:396] from .cpp_template_kernel import CppTemplateKernel
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_inductor/codegen/cpp_template_kernel.py", line 20, in <module>
ERROR 05-14 02:54:18 [core.py:396] from .cpp_wrapper_cpu import CppWrapperCpu
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_inductor/codegen/cpp_wrapper_cpu.py", line 30, in <module>
ERROR 05-14 02:54:18 [core.py:396] from .wrapper import EnterSubgraphLine, ExitSubgraphLine, WrapperCodeGen
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_inductor/codegen/wrapper.py", line 46, in <module>
ERROR 05-14 02:54:18 [core.py:396] from ..runtime import triton_heuristics
ERROR 05-14 02:54:18 [core.py:396] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 55, in <module>
ERROR 05-14 02:54:18 [core.py:396] from triton import Config
ERROR 05-14 02:54:18 [core.py:396] ImportError: cannot import name 'Config' from 'triton' (unknown location)
Process EngineCore_0:
Traceback (most recent call last):
File "/usr/local/python3.10.17/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/local/python3.10.17/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 400, in run_engine_core
raise e
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
engine_core = EngineCoreProc(*args, **kwargs)
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 329, in __init__
super().__init__(vllm_config, executor_class, log_stats,
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 64, in __init__
self.model_executor = executor_class(vllm_config)
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 52, in __init__
self._init_executor()
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
self.collective_rpc("load_model")
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/utils.py", line 2456, in run_method
return func(*args, **kwargs)
File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 178, in load_model
self.model_runner.load_model()
File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 855, in load_model
self.model = get_model(vllm_config=self.vllm_config)
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
return loader.load_model(vllm_config=vllm_config)
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 452, in load_model
model = _initialize_model(vllm_config=vllm_config)
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 133, in _initialize_model
return model_class(vllm_config=vllm_config, prefix=prefix)
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 436, in __init__
self.model = Qwen2Model(vllm_config=vllm_config,
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 162, in __init__
TorchCompileWrapperWithCustomDispatcher.__init__(
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/compilation/wrapper.py", line 42, in __init__
backend = vllm_config.compilation_config.init_backend(vllm_config)
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/config.py", line 3600, in init_backend
from vllm.compilation.backends import VllmBackend
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/compilation/backends.py", line 20, in <module>
from .compiler_interface import EagerAdaptor, InductorAdaptor
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/compilation/compiler_interface.py", line 11, in <module>
import torch._inductor.compile_fx
File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 72, in <module>
from .fx_passes.joint_graph import joint_graph_passes
File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_inductor/fx_passes/joint_graph.py", line 19, in <module>
from ..pattern_matcher import (
File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 96, in <module>
from .lowering import fallback_node_due_to_unsupported_type
File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_inductor/lowering.py", line 6430, in <module>
from . import kernel
File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_inductor/kernel/__init__.py", line 1, in <module>
from . import mm, mm_common, mm_plus_mm, unpack_mixed_mm
File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_inductor/kernel/mm.py", line 16, in <module>
from torch._inductor.codegen.cpp_gemm_template import CppPackedGemmTemplate
File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_inductor/codegen/cpp_gemm_template.py", line 19, in <module>
from .cpp_micro_gemm import CppMicroGemmAMX, create_micro_gemm, LayoutType
File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_inductor/codegen/cpp_micro_gemm.py", line 16, in <module>
from .cpp_template_kernel import CppTemplateKernel
File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_inductor/codegen/cpp_template_kernel.py", line 20, in <module>
from .cpp_wrapper_cpu import CppWrapperCpu
File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_inductor/codegen/cpp_wrapper_cpu.py", line 30, in <module>
from .wrapper import EnterSubgraphLine, ExitSubgraphLine, WrapperCodeGen
File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_inductor/codegen/wrapper.py", line 46, in <module>
from ..runtime import triton_heuristics
File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 55, in <module>
from triton import Config
ImportError: cannot import name 'Config' from 'triton' (unknown location)
Traceback (most recent call last):
File "/vllm-workspace/./simple_test.py", line 15, in <module>
llm = LLM(model="/mnt/models/Qwen2.5-0.5B-Instruct")
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/utils.py", line 1161, in inner
return fn(*args, **kwargs)
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 247, in __init__
self.llm_engine = LLMEngine.from_engine_args(
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 138, in from_engine_args
return cls(vllm_config=vllm_config,
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 92, in __init__
self.engine_core = EngineCoreClient.make_client(
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 73, in make_client
return SyncMPClient(vllm_config, executor_class, log_stats)
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 494, in __init__
super().__init__(
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 398, in __init__
self._wait_for_engine_startup()
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 430, in _wait_for_engine_startup
raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above.
[ERROR] 2025-05-14-02:54:18 (PID:15115, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
/usr/local/python3.10.17/lib/python3.10/tempfile.py:869: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpso161dv_'>
_warnings.warn(warn_message, ResourceWarning)
```

https://github.com/vllm-project/vllm-ascend/pull/854/files <- this PR fixed the problem
Thanks for pointing this out!
How to Use Graph Mode on vLLM Ascend
Graph mode is supported experimentally:
1. Graph mode for DeepSeek model:
Software:
Usage:
Set `enable_graph_mode` to `True` in `additional_config` to enable graph mode for the DeepSeek model. For example:
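A minimal sketch of such a launch, assuming the V0 engine that the `platform.py` warning quoted above mentions; the model path, parallelism, prompt, and sampling settings are placeholders:

```python
import os

# Graph mode runs on the V0 engine, per the platform.py warning above.
os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/DeepSeek-model",  # placeholder checkpoint path
    tensor_parallel_size=1,           # adjust to your NPU count
    additional_config={"enable_graph_mode": True},
)

outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```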
Note: `enable_graph_mode` should only be enabled when inferencing with DeepSeek; other models are not supported.

2. Graph mode for dense model:
Software:
Usage:
Step 1: enable the V1 engine:
export VLLM_USE_V1=1
Step 2: modify `platform.py` in vllm-ascend to make graph mode work. The patch itself is not included here; a rough sketch follows.
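The original post does not show the modification, so the following is an illustrative guess only, grounded in the INFO log above ("PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode"): force piecewise compilation and keep Inductor off, so torch never tries to import triton:

```python
# Hypothetical sketch -- the real patch is not shown in this issue.
from vllm.config import CompilationLevel, VllmConfig


def check_and_update_config(cls, vllm_config: VllmConfig) -> None:
    compilation_config = vllm_config.compilation_config
    # Keep ACL graph capture, but never route through torch Inductor,
    # which would pull in triton (unavailable on NPU; see the
    # ImportError traceback above).
    compilation_config.level = CompilationLevel.PIECEWISE
    compilation_config.use_inductor = False
```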
Step 3: use `transfer_to_npu` as in the following script.
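The script itself was lost in extraction; below is a minimal sketch, reusing the Qwen2.5 model path from the log above. `transfer_to_npu` (from `torch_npu.contrib`) patches CUDA-targeted calls so they dispatch to the NPU:

```python
import os

# Step 1: enable the V1 engine (must be set before importing vllm).
os.environ["VLLM_USE_V1"] = "1"

import torch_npu  # noqa: F401
from torch_npu.contrib import transfer_to_npu  # noqa: F401

from vllm import LLM, SamplingParams

# Model path reused from the log above; replace with your own checkpoint.
llm = LLM(model="/mnt/models/Qwen2.5-0.5B-Instruct")
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```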