
[Usage] Qwen3 Usage Guide #17327


Open
simon-mo opened this issue Apr 28, 2025 · 77 comments
Labels
usage How to use vllm

Comments

@simon-mo
Collaborator

simon-mo commented Apr 28, 2025

vLLM v0.8.4 and higher natively supports all Qwen3 and Qwen3MoE models. Example command:
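A minimal sketch of such a command (the Qwen/Qwen3-8B checkpoint and the v0.8.5 reasoning flags that come up later in this thread are assumptions, not the literal command from the original post):

vllm serve Qwen/Qwen3-8B \
    --enable-reasoning --reasoning-parser deepseek_r1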

File ".../vllm/model_executor/parameter.py", line 149, in load_qkv_weight
    param_data = param_data.narrow(self.output_dim, shard_offset,
IndexError: start out of range (expected to be in range of [-18, 18], but got 2048)
  • If you are seeing the following error when running MoE models with fp8, you are running with too much tensor parallelize degree that the weights are not divisible. Consider --tensor-parallel-size 4 or --tensor-parallel-size 8 --enable-expert-parallel.
File ".../vllm/vllm/model_executor/layers/quantization/fp8.py", line 477, in create_weights
    raise ValueError(
ValueError: The output_size of gate's and up's weight = 192 is not divisible by weight quantization block_n = 128.
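For example, launches along these lines keep the fp8 quantization blocks divisible (a sketch; Qwen3-235B-A22B-FP8 stands in for whatever fp8 MoE checkpoint you are serving):

# lower the tensor-parallel degree so each partition's weights stay divisible by block_n
vllm serve Qwen/Qwen3-235B-A22B-FP8 --tensor-parallel-size 4

# or keep --tensor-parallel-size 8 but shard experts instead of slicing their weights
vllm serve Qwen/Qwen3-235B-A22B-FP8 --tensor-parallel-size 8 --enable-expert-parallel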
@sp1cae

sp1cae commented Apr 29, 2025

How can I use MCP with Qwen3?

@NaiveYan

Any plan for speculative decoding?

@wuzechuan

How can vLLM's launch arguments support enable_thinking=True?

@DarkLight1337
Member

DarkLight1337 commented Apr 29, 2025

How can vLLM's launch arguments support enable_thinking=True?

See https://qwen.readthedocs.io/en/latest/deployment/vllm.html#thinking-non-thinking-modes
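The per-request switch described at that link goes through chat_template_kwargs in the request body. A sketch against an OpenAI-compatible vLLM endpoint (the model name Qwen3-32B is just an example):

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen3-32B",
          "messages": [{"role": "user", "content": "Give me a short introduction to vLLM."}],
          "chat_template_kwargs": {"enable_thinking": false}
        }'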

@thiner

thiner commented Apr 29, 2025

Consider --tensor-parallel-size 4 or --tensor-parallel-size 8 --enable-expert-parallel.

I am running Qwen3-30B-A3B-FP8 on two A10 GPUs. tp=2 is enough to load the model; does vLLM support tp=2 in this case?

@cmpunk-bitw

How can I disable reasoning in generative models, i.e. using LLM.chat?

@LiuzRush

So if I upgrade my vLLM version from 0.8.4 to 0.8.5, I don't need to make this fix?

Image

@255doesnotexist

255doesnotexist commented Apr 29, 2025

So if I upgrade my vLLM version from 0.8.4 to 0.8.5, I don't need to make this fix?

Yes; refer to the release notes of 0.8.5 (top line):

Day 0 support for Qwen3 and Qwen3MoE. This release fixes fp8 weight loading (#17318) and adds tuned MoE configs (#17328).

The linear.py fixes already exist in #17318.

@LiuzRush


I often fail to follow up on relevant PRs in a timely manner. Thanks for your answer.

@Vincentdu-cn

How can vLLM's launch arguments support enable_thinking=True?

See https://qwen.readthedocs.io/en/latest/deployment/vllm.html#thinking-non-thinking-modes

The official documentation only describes how to turn off thinking mode per API call; it doesn't explain how to turn off thinking mode at vLLM startup. I tried changing the generation_config.json file to turn off thinking, but it didn't work; the model is still in thinking mode. I also tried adding "chat_template_kwargs": {"enable_thinking": false} via the --override-generation-config parameter, but I don't know the correct usage of this parameter and it keeps giving me errors. Here are my generation_config.json and docker startup command:
generation_config.json:

{
    "bos_token_id": 151643,
    "do_sample": true,
    "eos_token_id": [
        151645,
        151643
    ],
    "pad_token_id": 151643,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "max_tokens": 8192,
    "presence_penalty": 1.5,
    "chat_template_kwargs": {"enable_thinking": false},
    "transformers_version": "4.51.0"
}

docker run

 docker run -d --name Qwen --runtime nvidia  --gpus '"device=0,1,2,3"' \
      -v /home/models/Qwen3-32B:/root/.cache/modelscope/hub/Qwen3-32B \
      -p 18081:8081 \
      --ipc=host  \
      vllm/vllm-openai:v0.8.5 \
      --model /root/.cache/modelscope/hub/Qwen3-32B \
      --served-model-name Qwen3-32B \
      --enable-auto-tool-choice --tool-call-parser hermes \
      --chat-template examples/tool_chat_template_hermes.jinja \
      --gpu-memory-utilization 0.9 \
      --tensor-parallel-size 4 \
      --port 8081

ERROR:
--override-generation-config "{'temperature': 0.7,'top_p': 0.8,'top_k': 20,'max_tokens': 8192,'presence_penalty': 1.5,'chat_template_kwargs': {'enable_thinking': false}}"

api_server.py: error: argument --override-generation-config: invalid loads value: "{'temperature': 0.7,'top_p': 0.8,'top_k': 20,'max_tokens': 8192,'presence_penalty': 1.5,'chat_template_kwargs': {'enable_thinking': false}}"
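The invalid loads value error is a quoting problem: --override-generation-config must be given valid JSON, so keys and strings need double quotes, with the whole value wrapped in single quotes for the shell. A sketch of the corrected sampling override (whether enable_thinking can be forced through this flag is a separate question; the per-request chat_template_kwargs switch shown earlier in the thread is the documented route):

--override-generation-config '{"temperature": 0.7, "top_p": 0.8, "top_k": 20, "max_tokens": 8192, "presence_penalty": 1.5}'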

@DarkLight1337
Member

DarkLight1337 commented Apr 29, 2025

How can I disable reasoning in generative models, i.e. using LLM.chat?

I have opened #17356 to support this, can you try it?

@linnnff

linnnff commented Apr 29, 2025

How to use tools?

@GamePP

GamePP commented Apr 29, 2025

Qwen3 supports MCP. Add:

--enable-reasoning --reasoning-parser deepseek_r1 \
--enable-auto-tool-choice --tool-call-parser hermes \

I'm using the 0.8.5 Docker image for inference:

docker run --runtime nvidia --gpus all \
        -d \
        -v /root/models:/models \
        -p 5536:8000 \
        --env "HF_HUB_OFFLINE=1" \
        --ipc=host \
        vllm/vllm-openai:latest \
        --model /models/Qwen/Qwen3-235B-A22B-FP8 \
        --tokenizer /models/Qwen/Qwen3-235B-A22B-FP8 \
        --generation-config /models/Qwen/Qwen3-235B-A22B-FP8 \
        --served_model_name Qwen3-235B-A22B-FP8 \
        --gpu_memory_utilization 0.9 \
        --enable-reasoning --reasoning-parser deepseek_r1 \
        --enable-auto-tool-choice --tool-call-parser hermes \
        --host 0.0.0.0 \
        --port 8000 \
        --enable-expert-parallel \
        --tensor-parallel-size 8

GPUs:
4090 48G x 8
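With --enable-auto-tool-choice --tool-call-parser hermes set, tool use goes through the standard OpenAI tools field of the chat completions request. A sketch against the server above (the get_weather function is made up for illustration):

curl http://localhost:5536/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen3-235B-A22B-FP8",
          "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
          "tools": [{
            "type": "function",
            "function": {
              "name": "get_weather",
              "description": "Get the current weather for a city",
              "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"]
              }
            }
          }],
          "tool_choice": "auto"
        }'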

@Matriv-org

Am I doing something wrong, or is support not released yet?
I'm using vLLM 0.8.6dev
RTX 5090, Torch 2.7, cu128, bitsandbytes 0.45.5

    "vllm", "serve", "unsloth/Qwen3-30B-A3B-bnb-4bit",
    "--max-model-len", "2048",
    "--enable-reasoning",
    "--reasoning-parser", "deepseek_r1",
    "--download-dir", "./models",
    "--gpu-memory-utilization", "0.7",
    "--max-num-seqs", "5",
Error:
.../vllm/model_executor/layers/fused_moe/layer.py", line 499, in __init__
    assert self.quant_method is not None

@DarkLight1337
Member

When I use vLLM as a Python library, how can I switch Qwen to non-thinking mode?

See #17356

@2646308870

I was using the latest version of vLLM (0.8.5) and running Qwen3-14B Q5_K_M (GGUF). An error was reported. What's the problem? Does vLLM currently not support the GGUF format of Qwen3?
INFO 04-30 00:22:41 [__init__.py:239] Automatically detected platform cuda.
INFO 04-30 00:22:43 [api_server.py:1043] vLLM API server version 0.8.5
INFO 04-30 00:22:43 [api_server.py:1044] args: Namespace(host='wslkali', port=12345, ..., model='/home/kali/models/Qwen3-14B-GGUF/Qwen3-14B-Q5_K_M.gguf', ..., max_model_len=4096, reasoning_parser='deepseek_r1', ..., tensor_parallel_size=1, ..., served_model_name=['Qwen3-14B'], ..., enable_reasoning=True, ...)
Traceback (most recent call last):
  ...
  File "/root/miniconda3/envs/demo/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 303, in get_config
    config_dict, _ = PretrainedConfig.get_config_dict(
  File "/root/miniconda3/envs/demo/lib/python3.10/site-packages/transformers/configuration_utils.py", line 590, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/root/miniconda3/envs/demo/lib/python3.10/site-packages/transformers/configuration_utils.py", line 681, in _get_config_dict
    config_dict = load_gguf_checkpoint(resolved_config_file, return_tensors=False)["config"]
  File "/root/miniconda3/envs/demo/lib/python3.10/site-packages/transformers/modeling_gguf_pytorch_utils.py", line 401, in load_gguf_checkpoint
    raise ValueError(f"GGUF model with architecture {architecture} is not supported yet.")
ValueError: GGUF model with architecture qwen3 is not supported yet.

@muellerzr

Do we have a method yet to do similar decoding to what Qwen does with their demo via a "reasoning budget"? E.g. injecting in the /think after xyz tokens

@zh794390558

How to build 0.8.5 with CUDA 11.7?

@Silencezjl

My environment is Ubuntu 22.04, vLLM v0.8.5, PyTorch 2.6.0+cu124, and 4 x H20 96GB.

What launch parameters should I use to run Qwen3-235B-A22B? Running the default vllm serve ./Qwen3-235B-A22B --tensor-parallel-size 4 gives OOM (adjusting --gpu-memory-utilization did not help). With default parameters it runs successfully on 8 x H20, using about 440 GB of VRAM in total. Does that mean it cannot run on 4 x H20?

@GamePP

GamePP commented Apr 30, 2025

My environment is Ubuntu 22.04, vLLM v0.8.5, PyTorch 2.6.0+cu124, and 4 x H20 96GB.

What launch parameters should I use to run Qwen3-235B-A22B? Running the default vllm serve ./Qwen3-235B-A22B --tensor-parallel-size 4 gives OOM (adjusting --gpu-memory-utilization did not help). With default parameters it runs successfully on 8 x H20, using about 440 GB of VRAM in total. Does that mean it cannot run on 4 x H20?

@Silencezjl Two options:

  1. Use Qwen3-235B-A22B-FP8, similar to my 4090 48GB x 8 setup. Parameters:
docker run --runtime nvidia --gpus all \
        -d \
        -v /root/models:/models \
        -p 5536:8000 \
        --env "HF_HUB_OFFLINE=1" \
        --ipc=host \
        vllm/vllm-openai:latest \
        --model /models/Qwen/Qwen3-235B-A22B-FP8 \
        --tokenizer /models/Qwen/Qwen3-235B-A22B-FP8 \
        --generation-config /models/Qwen/Qwen3-235B-A22B-FP8 \
        --served_model_name Qwen3-235B-A22B-FP8 \
        --gpu_memory_utilization 0.9 \
        --enable-reasoning --reasoning-parser deepseek_r1 \
        --enable-auto-tool-choice --tool-call-parser hermes \
        --host 0.0.0.0 \
        --port 8000 \
        --enable-expert-parallel \
        --tensor-parallel-size 8

Change --tensor-parallel-size 8 to 4.

  2. Use the --cpu-offload-gb parameter to offload part of each GPU's share of the model weights to CPU memory (see the sketch below). Performance is limited by PCIe speed and memory bandwidth, because the offloaded weights are copied back into GPU memory on every forward pass.
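A sketch of option 2 (the 32 GB figure is only an assumption; set it to roughly the amount of VRAM each card is short):

vllm serve ./Qwen3-235B-A22B \
    --tensor-parallel-size 4 \
    --cpu-offload-gb 32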

@cpwan

cpwan commented Apr 30, 2025

vLLM: v0.8.5
Model: Qwen/Qwen3-30B-A3B
Hardware: A10*4, 96GB VRAM

Gives OOM even if I set max-model-len to 1024 and max-num-seqs to 1.

Works with enforce-eager, giving about 20 tokens per second.

logs
qwen3-1  | INFO 04-29 00:57:42 [__init__.py:239] Automatically detected platform cuda.
qwen3-1  | INFO 04-29 00:57:50 [api_server.py:1043] vLLM API server version 0.8.5
qwen3-1  | INFO 04-29 00:57:50 [api_server.py:1044] args: Namespace(host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='sk-secret', lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='Qwen/Qwen3-30B-A3B', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', max_model_len=8192, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=4, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.9, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=64, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
qwen3-1  | INFO 04-29 00:58:00 [config.py:717] This model supports multiple tasks: {'generate', 'classify', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
qwen3-1  | INFO 04-29 00:58:01 [config.py:1770] Defaulting to use mp for distributed inference
qwen3-1  | INFO 04-29 00:58:01 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.
qwen3-1  | INFO 04-29 00:58:07 [__init__.py:239] Automatically detected platform cuda.
qwen3-1  | INFO 04-29 00:58:10 [core.py:58] Initializing a V1 LLM engine (v0.8.5) with config: model='Qwen/Qwen3-30B-A3B', speculative_config=None, tokenizer='Qwen/Qwen3-30B-A3B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen/Qwen3-30B-A3B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
qwen3-1  | WARNING 04-29 00:58:10 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 24 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
qwen3-1  | INFO 04-29 00:58:10 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 10485760, 10, 'psm_00050c59'), local_subscribe_addr='ipc:///tmp/48ef9229-a171-45ae-9528-eff23807c2cd', remote_subscribe_addr=None, remote_addr_ipv6=False)
qwen3-1  | INFO 04-29 00:58:14 [__init__.py:239] Automatically detected platform cuda.
qwen3-1  | INFO 04-29 00:58:14 [__init__.py:239] Automatically detected platform cuda.
qwen3-1  | INFO 04-29 00:58:14 [__init__.py:239] Automatically detected platform cuda.
qwen3-1  | INFO 04-29 00:58:14 [__init__.py:239] Automatically detected platform cuda.
qwen3-1  | WARNING 04-29 00:58:19 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fa1d4edf740>
qwen3-1  | WARNING 04-29 00:58:19 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fed8a34d8e0>
qwen3-1  | WARNING 04-29 00:58:19 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fef3bf9f470>
qwen3-1  | (VllmWorker rank=0 pid=105) INFO 04-29 00:58:19 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_a5ec45d6'), local_subscribe_addr='ipc:///tmp/dccddf7d-b44d-4f23-9a23-51ce8ef80739', remote_subscribe_addr=None, remote_addr_ipv6=False)
qwen3-1  | (VllmWorker rank=0 pid=105) INFO 04-29 00:58:20 [utils.py:1055] Found nccl from library libnccl.so.2
qwen3-1  | (VllmWorker rank=0 pid=105) INFO 04-29 00:58:20 [pynccl.py:69] vLLM is using nccl==2.21.5
qwen3-1  | (VllmWorker rank=0 pid=105) WARNING 04-29 00:58:21 [custom_all_reduce.py:136] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
qwen3-1  | (VllmWorker rank=0 pid=105) INFO 04-29 00:58:21 [parallel_state.py:1004] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0
qwen3-1  | (VllmWorker rank=0 pid=105) INFO 04-29 00:58:21 [cuda.py:221] Using Flash Attention backend on V1 engine.
qwen3-1  | (VllmWorker rank=0 pid=105) INFO 04-29 00:58:21 [topk_topp_sampler.py:59] Using FlashInfer for top-p & top-k sampling.
qwen3-1  | (VllmWorker rank=0 pid=105) INFO 04-29 00:58:21 [gpu_model_runner.py:1329] Starting to load model Qwen/Qwen3-30B-A3B...
qwen3-1  | (VllmWorker rank=0 pid=105) INFO 04-29 00:58:22 [weight_utils.py:265] Using model weights format ['*.safetensors', '*.bin']
qwen3-1  | Loading safetensors checkpoint shards:   0% Completed | 0/16 [00:00<?, ?it/s]
qwen3-1  | Loading safetensors checkpoint shards: 100% Completed | 16/16 [09:18<00:00, 34.89s/it]
qwen3-1  | (VllmWorker rank=0 pid=105) INFO 04-29 01:07:40 [loader.py:458] Loading weights took 558.37 seconds
qwen3-1  | (VllmWorker rank=0 pid=105) INFO 04-29 01:07:41 [gpu_model_runner.py:1347] Model loading took 14.2474 GiB and 558.868858 seconds
qwen3-1  | (VllmWorker rank=0 pid=105) INFO 04-29 01:08:02 [backends.py:420] Using cache directory: /root/.cache/vllm/torch_compile_cache/5477567bed/rank_0_0 for vLLM's torch.compile
qwen3-1  | (VllmWorker rank=0 pid=105) INFO 04-29 01:08:02 [backends.py:430] Dynamo bytecode transform time: 21.05 s
qwen3-1  | (VllmWorker rank=0 pid=105) INFO 04-29 01:08:11 [backends.py:136] Cache the graph of shape None for later use
qwen3-1  | (VllmWorker rank=0 pid=105) INFO 04-29 01:09:01 [backends.py:148] Compiling a graph for general shape takes 57.65 s
qwen3-1  | (VllmWorker rank=0 pid=105) WARNING 04-29 01:09:06 [fused_moe.py:668] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=192,device_name=NVIDIA_A10.json
qwen3-1  | (VllmWorker rank=0 pid=105) INFO 04-29 01:09:47 [monitor.py:33] torch.compile takes 78.70 s in total
qwen3-1  | INFO 04-29 01:09:49 [kv_cache_utils.py:634] GPU KV cache size: 198,784 tokens
qwen3-1  | INFO 04-29 01:09:49 [kv_cache_utils.py:637] Maximum concurrency for 8,192 tokens per request: 24.27x
qwen3-1  | (VllmWorker rank=1 pid=106) ERROR 04-29 01:10:12 [multiproc_executor.py:470] WorkerProc hit an exception.
qwen3-1  | (VllmWorker rank=1 pid=106) ERROR 04-29 01:10:12 [multiproc_executor.py:470] Traceback (most recent call last):
qwen3-1  |   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 465, in worker_busy_loop
qwen3-1  |     output = func(*args, **kwargs)
qwen3-1  |   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 242, in compile_or_warm_up_model
qwen3-1  |     self.model_runner.capture_model()
qwen3-1  |   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1678, in capture_model
qwen3-1  |     self._dummy_run(num_tokens)
qwen3-1  |   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1497, in _dummy_run
qwen3-1  |     outputs = model(
qwen3-1  |   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 509, in forward
qwen3-1  |     hidden_states = self.model(input_ids, positions, intermediate_tensors,
qwen3-1  |   ...
qwen3-1  |   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 986, in inplace_fused_experts
qwen3-1  |     fused_experts_impl(hidden_states, w1, w2, topk_weights, topk_ids, True,
qwen3-1  |   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 1295, in fused_experts_impl
qwen3-1  |     cache13 = torch.empty(M * top_k_num * max(N, K),
qwen3-1  | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 1 has a total capacity of 21.99 GiB of which 5.44 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 18.91 GiB is allocated by PyTorch, with 31.88 MiB allocated in private pools (e.g., CUDA Graphs), and 128.44 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

@zh794390558
Copy link

How can I disable reasoning in generative models, i.e. using LLM.chat?

I have opened #17356 to support this, can you try it?

How do I disable thinking in the generate API?

@DarkLight1337
Copy link
Member

The thinking switch is implemented in the chat template. So if you must use LLM.generate instead of LLM.chat, you can call tokenizer.apply_chat_template manually (just like in the HF repo) before passing the prompt to LLM.generate.
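For example, here's a minimal sketch of that approach (the checkpoint id and sampling settings below are just placeholders; the enable_thinking kwarg is forwarded to Qwen3's chat template):

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# placeholder checkpoint; any Qwen3 model should behave the same way
model_id = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id)

messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
# render the chat template ourselves so we can pass enable_thinking=False
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
outputs = llm.generate([prompt], SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)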

@Silencezjl
Copy link

My environment is Ubuntu 22.04 & vLLM v0.8.5 & pytorch 2.6.0+cu124 & 4 * H20 96GB.
What launch arguments should I use to run Qwen3-235B-A22B? With the default vllm serve ./Qwen3-235B-A22B --tensor-parallel-size 4 I hit OOM (tuning --gpu-memory-utilization did not help). With the default arguments it runs successfully on 8*H20, consuming about 440GB of VRAM in total. Does that mean it cannot run on 4x H20?

@Silencezjl Two options:

  1. Use Qwen3-235B-A22B-FP8, similar to my 4090 48GB x 8 setup. Arguments:
docker run --runtime nvidia --gpus all \
        -d \
        -v /root/models:/models \
        -p 5536:8000 \
        --env "HF_HUB_OFFLINE=1" \
        --ipc=host \
        vllm/vllm-openai:latest \
        --model /models/Qwen/Qwen3-235B-A22B-FP8 \
        --tokenizer /models/Qwen/Qwen3-235B-A22B-FP8 \
        --generation-config /models/Qwen/Qwen3-235B-A22B-FP8 \
        --served_model_name Qwen3-235B-A22B-FP8 \
        --gpu_memory_utilization 0.9 \
        --enable-reasoning --reasoning-parser deepseek_r1 \
        --enable-auto-tool-choice --tool-call-parser hermes \
        --host 0.0.0.0 \
        --port 8000 \
        --enable-expert-parallel \
        --tensor-parallel-size 8

Change --tensor-parallel-size 8 to 4.

  2. Use the --cpu-offload-gb argument to offload the part of the model that would otherwise sit in each GPU's VRAM to CPU memory. Performance is limited by PCIe speed and memory bandwidth, since the offloaded weights are copied back into GPU memory on every forward pass.

@GamePP Thanks for the reply; FP8 indeed runs successfully.

@Gyangu
Copy link

Gyangu commented Apr 30, 2025

I have an issue with the thinking budget control in qwen3.

I noticed that Alibaba Cloud's API has this parameter called "thinking_budget" but I can't find anything like that in the open-source docs. When I try adding this parameter to my code, it doesn't seem to do anything. Does the open-source model have this parameter?

@yourchanges
Copy link

For Qwen3 to support MCP, add:

--enable-reasoning --reasoning-parser deepseek_r1 \
--enable-auto-tool-choice --tool-call-parser hermes \

I'm using the 0.8.5 Docker image for inference:

docker run --runtime nvidia --gpus all \
        -d \
        -v /root/models:/models \
        -p 5536:8000 \
        --env "HF_HUB_OFFLINE=1" \
        --ipc=host \
        vllm/vllm-openai:latest \
        --model /models/Qwen/Qwen3-235B-A22B-FP8 \
        --tokenizer /models/Qwen/Qwen3-235B-A22B-FP8 \
        --generation-config /models/Qwen/Qwen3-235B-A22B-FP8 \
        --served_model_name Qwen3-235B-A22B-FP8 \
        --gpu_memory_utilization 0.9 \
        --enable-reasoning --reasoning-parser deepseek_r1 \
        --enable-auto-tool-choice --tool-call-parser hermes \
        --host 0.0.0.0 \
        --port 8000 \
        --enable-expert-parallel \
        --tensor-parallel-size 8

GPUs: 4090 48G x 8

Could you share the token eval speed info?

@thnguyen996
Copy link

I was able to run the Qwen3-235B-A22B-FP8 model on 4 H100 GPUs, but the throughput is very low. I'm getting around 2 to 3 tokens/s, which is unusable. I tried switching to the V0 engine and the throughput improved to ~31 tokens/s; however, when the input prompt gets long, it falls back to low throughput, ~2 tokens/s. Does anyone have any idea why?
Here is my command:

export VLLM_USE_V1=0
vllm serve Qwen/... --enable-reasoning --reasoning-parser deepseek_r1 --tensor-parallel-size 4 --enable-expert-parallel
--max-model-len 10000

vllm version is 0.8.5

@bedovyy
Copy link

bedovyy commented May 7, 2025

nytopop/Qwen3-30B-A3B.w4a16

Did this work for you? I could not make it work on my 24 GB GPU.

Yes, here is the command I use to run it on an RTX 3090.

VLLM_USE_V1=0 vllm serve <PATH>/nytopop_Qwen3-30B-A3B.w4a16 --gpu-memory-utilization 0.9 --disable-log-request --max-num-seqs 8

and here's the log about memory usage.

INFO 05-07 22:17:23 [loader.py:458] Loading weights took 17.28 seconds
WARNING 05-07 22:17:23 [kv_cache.py:128] Using Q scale 1.0 and prob scale 1.0 with fp8 attention. This may cause accuracy issues. Please make sure Q/prob scaling factors are available in the fp8 checkpoint.
INFO 05-07 22:17:23 [model_runner.py:1140] Model loading took 15.6841 GiB and 17.586636 seconds
WARNING 05-07 22:17:24 [fused_moe.py:668] Using default MoE config. Performance might be sub-optimal! Config file not found at /home/bedovyy/Projects/vllm/venv/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=768,device_name=NVIDIA_GeForce_RTX_3090,dtype=int4_w4a16.json
INFO 05-07 22:17:25 [worker.py:287] Memory profiling takes 1.14 seconds
INFO 05-07 22:17:25 [worker.py:287] the current vLLM instance can use total_gpu_memory (23.56GiB) x gpu_memory_utilization (0.90) = 21.20GiB
INFO 05-07 22:17:25 [worker.py:287] model weights take 15.68GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 0.12GiB; the rest of the memory reserved for KV Cache is 5.34GiB.
INFO 05-07 22:17:25 [executor_base.py:112] # cuda blocks: 3645, # CPU blocks: 2730
INFO 05-07 22:17:25 [executor_base.py:117] Maximum concurrency for 40960 tokens per request: 1.42x
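For reference, the KV-cache number in that log is just arithmetic on the values above; a quick sanity check (numbers copied from the log, in GiB):

# values copied from the memory-profiling log above (GiB)
total_gpu_memory = 23.56
gpu_memory_utilization = 0.90
weights, non_torch, activation_peak = 15.68, 0.06, 0.12

budget = total_gpu_memory * gpu_memory_utilization          # ~21.20 GiB usable
kv_cache = budget - weights - non_torch - activation_peak   # ~5.34 GiB left for KV cache
print(f"usable: {budget:.2f} GiB, kv cache: {kv_cache:.2f} GiB")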

@422505006
Copy link

422505006 commented May 8, 2025

My environment is Ubuntu 22.04 & vLLM v0.8.5 & pytorch 2.6.0+cu124 & 4 * H20 96GB.
What launch arguments should I use to run Qwen3-235B-A22B? With the default vllm serve ./Qwen3-235B-A22B --tensor-parallel-size 4 I hit OOM (tuning --gpu-memory-utilization did not help). With the default arguments it runs successfully on 8*H20, consuming about 440GB of VRAM in total. Does that mean it cannot run on 4x H20?

@Silencezjl Two options:

  1. Use Qwen3-235B-A22B-FP8, similar to my 4090 48GB x 8 setup. Arguments:
docker run --runtime nvidia --gpus all \
        -d \
        -v /root/models:/models \
        -p 5536:8000 \
        --env "HF_HUB_OFFLINE=1" \
        --ipc=host \
        vllm/vllm-openai:latest \
        --model /models/Qwen/Qwen3-235B-A22B-FP8 \
        --tokenizer /models/Qwen/Qwen3-235B-A22B-FP8 \
        --generation-config /models/Qwen/Qwen3-235B-A22B-FP8 \
        --served_model_name Qwen3-235B-A22B-FP8 \
        --gpu_memory_utilization 0.9 \
        --enable-reasoning --reasoning-parser deepseek_r1 \
        --enable-auto-tool-choice --tool-call-parser hermes \
        --host 0.0.0.0 \
        --port 8000 \
        --enable-expert-parallel \
        --tensor-parallel-size 8

Change --tensor-parallel-size 8 to 4.

  2. Use the --cpu-offload-gb argument to offload the part of the model that would otherwise sit in each GPU's VRAM to CPU memory. Performance is limited by PCIe speed and memory bandwidth, since the offloaded weights are copied back into GPU memory on every forward pass.

I get the following error at startup; how did you solve it?
Using Q scale 1.0 and prob scale 1.0 with fp8 attention. This may cause accuracy issues. Please make sure Q/prob scaling factors are available in the fp8 checkpoint. [fp8_utils.py:431] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/configs/N=5120,K=5120,device_name=NVIDIA_GeForce_RTX_4090,dtype=fp8_w8a8,block_shape=[128,128].json

@sleepingcat4
Copy link

Consider --tensor-parallel-size 4 or --tensor-parallel-size 8 --enable-expert-parallel.

I am running Qwen3-30B-A3B-FP8 with two A10 GPUs. tp=2 is enough to load the model, does vllm support "tp=2" in this case?

WHAT! That's insane, because my Lambda instance with 8xA100 (80GB) broke and offloaded part of the model to CPU, LOL!

@psych0v0yager
Copy link

Are there any plans in the future to combine the reasoning parser with structured generation in offline mode? I.e., allow Qwen3 to generate freely between the thinking tags, then output structured JSON as the final answer?

@thiner
Copy link

thiner commented May 9, 2025

Consider --tensor-parallel-size 4 or --tensor-parallel-size 8 --enable-expert-parallel.

I am running Qwen3-30B-A3B-FP8 with two A10 GPUs. tp=2 is enough to load the model, does vllm support "tp=2" in this case?

WHAT! That's insane, because my Lambda instance with 8xA100 (80GB) broke and offloaded part of the model to CPU, LOL!

I believe you are running the 235B model, not the 30B one.

@iEddie-cmd
Copy link

I was able to run the Qwen3-235B-A22B-FP8 model on 4 H100 GPUs, but the throughput is very low. I'm getting around 2 to 3 tokens/s, which is unusable. I tried switching to the V0 engine and the throughput improved to ~31 tokens/s; however, when the input prompt gets long, it falls back to low throughput, ~2 tokens/s. Does anyone have any idea why? Here is my command:

export VLLM_USE_V1=0
vllm serve Qwen/... --enable-reasoning --reasoning-parser deepseek_r1 --tensor-parallel-size 4 --enable-expert-parallel
--max-model-len 10000
vllm version is 0.8.5

Same for me. Very low throughput.
VLLM_USE_V1=0 VLLM_USE_TRITON_FLASH_ATTN=1 vllm serve '/Qwen3-32B-autoround-4bit-gptq' --max_model_len 32000 --tensor-parallel-size 4 --gpu_memory_utilization 0.9 --max_num_seqs 1

@official-elinas
Copy link

official-elinas commented May 10, 2025

I'm getting ValueError: Model architectures ['Qwen3ForCausalLM'] failed to be inspected. Please check the logs for more details. even though I'm on vllm==0.8.5.post1 with the official FP8 checkpoint from Qwen.

vllm serve /media/elinas/nvme_1/models/Qwen3-32B-FP8/ \
  --host 0.0.0.0 \
  --port 5070 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --max-num-seqs 4 \
  --enable-reasoning \
  --reasoning-parser deepseek_r1 \
  --max-num-batched-tokens 32768

@DarkLight1337
Copy link
Member

As mentioned in the error message, if you read the full error logs it should tell you why it failed.

@official-elinas
Copy link

As mentioned in the error message, if you read the full error logs it should tell you why it failed.

Not sure what's up with the passive aggressiveness. I read the full stack. It says it's an unsupported architecture.

@DarkLight1337
Copy link
Member

DarkLight1337 commented May 10, 2025

I mean that: above that error message, there should be more lines that are being logged (the error message asks you to check the full logs). Can you show them?

@official-elinas
Copy link

I mean that: above that error message, there should be more lines that are being logged (the error message asks you to check the full logs). Can you show them?

My mistake then. I had vLLM working at one point, but Qwen3 is not working on other inference engines, so I came back to vLLM following the installation guide using uv: https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html

Full log

INFO 05-09 20:03:19 [__init__.py:239] Automatically detected platform cuda.
WARNING 05-09 20:03:19 [cuda.py:409] Detected different devices in the system: NVIDIA GeForce RTX 3090, NVIDIA GeForce RTX 3090, NVIDIA GeForce RTX 3090 Ti, NVIDIA GeForce RTX 3090. Please make sure to set `CUDA_DEVICE_ORDER=PCI_BUS_ID` to avoid unexpected behavior.
INFO 05-09 20:03:23 [api_server.py:1043] vLLM API server version 0.8.5
INFO 05-09 20:03:23 [api_server.py:1044] args: Namespace(subparser='serve', model_tag='/media/npetro/nvme_1/models/Qwen3-32B-FP8/', config='', host='0.0.0.0', port=5070, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/media/npetro/nvme_1/models/Qwen3-32B-FP8/', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', max_model_len=32768, guided_decoding_backend='auto', reasoning_parser='deepseek_r1', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=4, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.95, swap_space=4, kv_cache_dtype='fp8', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=32768, max_num_seqs=4, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=True, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, 
enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x728a6fcfae80>)
ERROR 05-09 20:03:27 [registry.py:355] Error in inspecting model architecture 'Qwen3ForCausalLM'
ERROR 05-09 20:03:27 [registry.py:355] Traceback (most recent call last):
ERROR 05-09 20:03:27 [registry.py:355]   File "/media/npetro/nvme_1/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/registry.py", line 587, in _run_in_subprocess
ERROR 05-09 20:03:27 [registry.py:355]     returned.check_returncode()
ERROR 05-09 20:03:27 [registry.py:355]   File "/home/linuxbrew/.linuxbrew/opt/python@3.12/lib/python3.12/subprocess.py", line 502, in check_returncode
ERROR 05-09 20:03:27 [registry.py:355]     raise CalledProcessError(self.returncode, self.args, self.stdout,
ERROR 05-09 20:03:27 [registry.py:355] subprocess.CalledProcessError: Command '['/media/npetro/nvme_1/vllm/.venv/bin/python3', '-m', 'vllm.model_executor.models.registry']' returned non-zero exit status 1.
ERROR 05-09 20:03:27 [registry.py:355] 
ERROR 05-09 20:03:27 [registry.py:355] The above exception was the direct cause of the following exception:
ERROR 05-09 20:03:27 [registry.py:355] 
ERROR 05-09 20:03:27 [registry.py:355] Traceback (most recent call last):
ERROR 05-09 20:03:27 [registry.py:355]   File "/media/npetro/nvme_1/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/registry.py", line 353, in _try_inspect_model_cls
ERROR 05-09 20:03:27 [registry.py:355]     return model.inspect_model_cls()
ERROR 05-09 20:03:27 [registry.py:355]            ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-09 20:03:27 [registry.py:355]   File "/media/npetro/nvme_1/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/registry.py", line 324, in inspect_model_cls
ERROR 05-09 20:03:27 [registry.py:355]     return _run_in_subprocess(
ERROR 05-09 20:03:27 [registry.py:355]            ^^^^^^^^^^^^^^^^^^^
ERROR 05-09 20:03:27 [registry.py:355]   File "/media/npetro/nvme_1/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/registry.py", line 590, in _run_in_subprocess
ERROR 05-09 20:03:27 [registry.py:355]     raise RuntimeError(f"Error raised in subprocess:\n"
ERROR 05-09 20:03:27 [registry.py:355] RuntimeError: Error raised in subprocess:
ERROR 05-09 20:03:27 [registry.py:355] Traceback (most recent call last):
ERROR 05-09 20:03:27 [registry.py:355]   File "<frozen runpy>", line 189, in _run_module_as_main
ERROR 05-09 20:03:27 [registry.py:355]   File "<frozen runpy>", line 112, in _get_module_details
ERROR 05-09 20:03:27 [registry.py:355]   File "/media/npetro/nvme_1/vllm/vllm/__init__.py", line 12, in <module>
ERROR 05-09 20:03:27 [registry.py:355]     from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
ERROR 05-09 20:03:27 [registry.py:355]   File "/media/npetro/nvme_1/vllm/vllm/engine/arg_utils.py", line 31, in <module>
ERROR 05-09 20:03:27 [registry.py:355]     from vllm.executor.executor_base import ExecutorBase
ERROR 05-09 20:03:27 [registry.py:355]   File "/media/npetro/nvme_1/vllm/vllm/executor/executor_base.py", line 16, in <module>
ERROR 05-09 20:03:27 [registry.py:355]     from vllm.model_executor.layers.sampler import SamplerOutput
ERROR 05-09 20:03:27 [registry.py:355]   File "/media/npetro/nvme_1/vllm/vllm/model_executor/layers/sampler.py", line 15, in <module>
ERROR 05-09 20:03:27 [registry.py:355]     from vllm.model_executor.layers.utils import apply_penalties
ERROR 05-09 20:03:27 [registry.py:355]   File "/media/npetro/nvme_1/vllm/vllm/model_executor/layers/utils.py", line 7, in <module>
ERROR 05-09 20:03:27 [registry.py:355]     from vllm import _custom_ops as ops
ERROR 05-09 20:03:27 [registry.py:355]   File "/media/npetro/nvme_1/vllm/vllm/_custom_ops.py", line 1423, in <module>
ERROR 05-09 20:03:27 [registry.py:355]     @register_fake("_moe_C::moe_wna16_marlin_gemm")
ERROR 05-09 20:03:27 [registry.py:355]      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-09 20:03:27 [registry.py:355]   File "/media/npetro/nvme_1/vllm/.venv/lib/python3.12/site-packages/torch/library.py", line 828, in register
ERROR 05-09 20:03:27 [registry.py:355]     use_lib._register_fake(op_name, func, _stacklevel=stacklevel + 1)
ERROR 05-09 20:03:27 [registry.py:355]   File "/media/npetro/nvme_1/vllm/.venv/lib/python3.12/site-packages/torch/library.py", line 198, in _register_fake
ERROR 05-09 20:03:27 [registry.py:355]     handle = entry.fake_impl.register(func_to_register, source)
ERROR 05-09 20:03:27 [registry.py:355]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-09 20:03:27 [registry.py:355]   File "/media/npetro/nvme_1/vllm/.venv/lib/python3.12/site-packages/torch/_library/fake_impl.py", line 31, in register
ERROR 05-09 20:03:27 [registry.py:355]     if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
ERROR 05-09 20:03:27 [registry.py:355]        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-09 20:03:27 [registry.py:355] RuntimeError: operator _moe_C::moe_wna16_marlin_gemm does not exist
ERROR 05-09 20:03:27 [registry.py:355] 
Traceback (most recent call last):
  File "/media/npetro/nvme_1/vllm/.venv/bin/vllm", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 53, in main
    args.dispatch_function(args)
  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 27, in cmd
    uvloop.run(run_server(args))
  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/home/linuxbrew/.linuxbrew/opt/python@3.12/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/home/linuxbrew/.linuxbrew/opt/python@3.12/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1078, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/home/linuxbrew/.linuxbrew/opt/python@3.12/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/home/linuxbrew/.linuxbrew/opt/python@3.12/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 166, in build_async_engine_client_from_engine_args
    vllm_config = engine_args.create_engine_config(usage_context=usage_context)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1099, in create_engine_config
    model_config = self.create_model_config()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 987, in create_model_config
    return ModelConfig(
           ^^^^^^^^^^^^
  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.12/site-packages/vllm/config.py", line 517, in __init__
    self.multimodal_config = self._init_multimodal_config(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.12/site-packages/vllm/config.py", line 586, in _init_multimodal_config
    if self.registry.is_multimodal_model(self.architectures):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/registry.py", line 505, in is_multimodal_model
    model_cls, _ = self.inspect_model_cls(architectures)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/registry.py", line 465, in inspect_model_cls
    return self._raise_for_unsupported(architectures)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/registry.py", line 415, in _raise_for_unsupported
    raise ValueError(
ValueError: Model architectures ['Qwen3ForCausalLM'] failed to be inspected. Please check the logs for more details.

I have a feeling this could be a CUDA / PyTorch version mismatch issue. This is what I have on my system:

ls /usr/local/ | grep cuda
cuda
cuda-11.8
cuda-12
cuda-12.0
cuda-12.1

@DarkLight1337
Copy link
Member

Can you show how you installed this version of vLLM?

@official-elinas
Copy link

I don't fully remember the exact order of what I did last time, because I had a regular venv with pip before, then switched to uv, and now I installed it like this:

uv venv --python 3.10.14 --seed
source .venv/bin/activate                                  
uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly

Versions

torch==2.7.0
transformers==4.51.3
vllm==0.8.5.dev572+g246e3e0a3

Logs

INFO 05-09 23:21:52 [__init__.py:248] Automatically detected platform cuda.
WARNING 05-09 23:21:52 [cuda.py:422] Detected different devices in the system: NVIDIA GeForce RTX 3090, NVIDIA GeForce RTX 3090, NVIDIA GeForce RTX 3090 Ti, NVIDIA GeForce RTX 3090. Please make sure to set `CUDA_DEVICE_ORDER=PCI_BUS_ID` to avoid unexpected behavior.
config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 651/651 [00:00<00:00, 3.43MB/s]
ERROR 05-09 23:22:02 [registry.py:357] Error in inspecting model architecture 'OPTForCausalLM'
ERROR 05-09 23:22:02 [registry.py:357] Traceback (most recent call last):
ERROR 05-09 23:22:02 [registry.py:357]   File "/media/npetro/nvme_1/vllm/.venv/lib/python3.10/site-packages/vllm/model_executor/models/registry.py", line 588, in _run_in_subprocess
ERROR 05-09 23:22:02 [registry.py:357]     returned.check_returncode()
ERROR 05-09 23:22:02 [registry.py:357]   File "/home/linuxbrew/.linuxbrew/opt/python@3.10/lib/python3.10/subprocess.py", line 457, in check_returncode
ERROR 05-09 23:22:02 [registry.py:357]     raise CalledProcessError(self.returncode, self.args, self.stdout,
ERROR 05-09 23:22:02 [registry.py:357] subprocess.CalledProcessError: Command '['/media/npetro/nvme_1/vllm/.venv/bin/python3', '-m', 'vllm.model_executor.models.registry']' returned non-zero exit status 1.
ERROR 05-09 23:22:02 [registry.py:357] 
ERROR 05-09 23:22:02 [registry.py:357] The above exception was the direct cause of the following exception:
ERROR 05-09 23:22:02 [registry.py:357] 
ERROR 05-09 23:22:02 [registry.py:357] Traceback (most recent call last):
ERROR 05-09 23:22:02 [registry.py:357]   File "/media/npetro/nvme_1/vllm/.venv/lib/python3.10/site-packages/vllm/model_executor/models/registry.py", line 355, in _try_inspect_model_cls
ERROR 05-09 23:22:02 [registry.py:357]     return model.inspect_model_cls()
ERROR 05-09 23:22:02 [registry.py:357]   File "/media/npetro/nvme_1/vllm/.venv/lib/python3.10/site-packages/vllm/model_executor/models/registry.py", line 326, in inspect_model_cls
ERROR 05-09 23:22:02 [registry.py:357]     return _run_in_subprocess(
ERROR 05-09 23:22:02 [registry.py:357]   File "/media/npetro/nvme_1/vllm/.venv/lib/python3.10/site-packages/vllm/model_executor/models/registry.py", line 591, in _run_in_subprocess
ERROR 05-09 23:22:02 [registry.py:357]     raise RuntimeError(f"Error raised in subprocess:\n"
ERROR 05-09 23:22:02 [registry.py:357] RuntimeError: Error raised in subprocess:
ERROR 05-09 23:22:02 [registry.py:357] Traceback (most recent call last):
ERROR 05-09 23:22:02 [registry.py:357]   File "/home/linuxbrew/.linuxbrew/opt/python@3.10/lib/python3.10/runpy.py", line 187, in _run_module_as_main
ERROR 05-09 23:22:02 [registry.py:357]     mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
ERROR 05-09 23:22:02 [registry.py:357]   File "/home/linuxbrew/.linuxbrew/opt/python@3.10/lib/python3.10/runpy.py", line 110, in _get_module_details
ERROR 05-09 23:22:02 [registry.py:357]     __import__(pkg_name)
ERROR 05-09 23:22:02 [registry.py:357]   File "/media/npetro/nvme_1/vllm/vllm/__init__.py", line 12, in <module>
ERROR 05-09 23:22:02 [registry.py:357]     from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
ERROR 05-09 23:22:02 [registry.py:357]   File "/media/npetro/nvme_1/vllm/vllm/engine/arg_utils.py", line 19, in <module>
ERROR 05-09 23:22:02 [registry.py:357]     from vllm.config import (BlockSize, CacheConfig, CacheDType, CompilationConfig,
ERROR 05-09 23:22:02 [registry.py:357]   File "/media/npetro/nvme_1/vllm/vllm/config.py", line 31, in <module>
ERROR 05-09 23:22:02 [registry.py:357]     from vllm.model_executor.layers.quantization import (QUANTIZATION_METHODS,
ERROR 05-09 23:22:02 [registry.py:357]   File "/media/npetro/nvme_1/vllm/vllm/model_executor/__init__.py", line 3, in <module>
ERROR 05-09 23:22:02 [registry.py:357]     from vllm.model_executor.parameter import (BasevLLMParameter,
ERROR 05-09 23:22:02 [registry.py:357]   File "/media/npetro/nvme_1/vllm/vllm/model_executor/parameter.py", line 9, in <module>
ERROR 05-09 23:22:02 [registry.py:357]     from vllm.distributed import get_tensor_model_parallel_rank
ERROR 05-09 23:22:02 [registry.py:357]   File "/media/npetro/nvme_1/vllm/vllm/distributed/__init__.py", line 3, in <module>
ERROR 05-09 23:22:02 [registry.py:357]     from .communication_op import *
ERROR 05-09 23:22:02 [registry.py:357]   File "/media/npetro/nvme_1/vllm/vllm/distributed/communication_op.py", line 8, in <module>
ERROR 05-09 23:22:02 [registry.py:357]     from .parallel_state import get_tp_group
ERROR 05-09 23:22:02 [registry.py:357]   File "/media/npetro/nvme_1/vllm/vllm/distributed/parallel_state.py", line 149, in <module>
ERROR 05-09 23:22:02 [registry.py:357]     from vllm.platforms import current_platform
ERROR 05-09 23:22:02 [registry.py:357]   File "/media/npetro/nvme_1/vllm/vllm/platforms/__init__.py", line 280, in __getattr__
ERROR 05-09 23:22:02 [registry.py:357]     _current_platform = resolve_obj_by_qualname(
ERROR 05-09 23:22:02 [registry.py:357]   File "/media/npetro/nvme_1/vllm/vllm/utils.py", line 2133, in resolve_obj_by_qualname
ERROR 05-09 23:22:02 [registry.py:357]     module = importlib.import_module(module_name)
ERROR 05-09 23:22:02 [registry.py:357]   File "/home/linuxbrew/.linuxbrew/opt/python@3.10/lib/python3.10/importlib/__init__.py", line 126, in import_module
ERROR 05-09 23:22:02 [registry.py:357]     return _bootstrap._gcd_import(name[level:], package, level)
ERROR 05-09 23:22:02 [registry.py:357]   File "/media/npetro/nvme_1/vllm/vllm/platforms/cuda.py", line 15, in <module>
ERROR 05-09 23:22:02 [registry.py:357]     import vllm._C  # noqa
ERROR 05-09 23:22:02 [registry.py:357] ImportError: /media/npetro/nvme_1/vllm/vllm/_C.abi3.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKSsb
ERROR 05-09 23:22:02 [registry.py:357] 
Traceback (most recent call last):
  File "/media/npetro/nvme_1/vllm/.venv/bin/vllm", line 10, in <module>
    sys.exit(main())
  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.10/site-packages/vllm/entrypoints/cli/main.py", line 45, in main
    cmd.subparser_init(subparsers).set_defaults(
  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.10/site-packages/vllm/entrypoints/cli/serve.py", line 55, in subparser_init
    return make_arg_parser(serve_parser)
  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.10/site-packages/vllm/entrypoints/openai/cli_args.py", line 246, in make_arg_parser
    parser = AsyncEngineArgs.add_cli_args(parser)
  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 1515, in add_cli_args
    parser = EngineArgs.add_cli_args(parser)
  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 799, in add_cli_args
    vllm_kwargs = get_kwargs(VllmConfig)
  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 173, in get_kwargs
    default = field.default_factory()
  File "<string>", line 42, in __init__
  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.10/site-packages/vllm/config.py", line 557, in __post_init__
    self.multimodal_config = self._init_multimodal_config()
  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.10/site-packages/vllm/config.py", line 618, in _init_multimodal_config
    if self.registry.is_multimodal_model(self.architectures):
  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.10/site-packages/vllm/model_executor/models/registry.py", line 506, in is_multimodal_model
    model_cls, _ = self.inspect_model_cls(architectures)
  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.10/site-packages/vllm/model_executor/models/registry.py", line 466, in inspect_model_cls
    return self._raise_for_unsupported(architectures)
  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.10/site-packages/vllm/model_executor/models/registry.py", line 416, in _raise_for_unsupported
    raise ValueError(
ValueError: Model architectures ['OPTForCausalLM'] failed to be inspected. Please check the logs for more details.

@DarkLight1337
Copy link
Member

Can you try doing a clean reinstall? It might be something wrong with your dependencies

@official-elinas
Copy link

official-elinas commented May 10, 2025

I just ran this 5 minutes ago, clean.

uv venv --python 3.10.14 --seed
source .venv/bin/activate                                  
uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly

Edit: It's not working at all now.

Edit 2: Based on your releases on GitHub, I did this and seem to get the original stack trace.

install

uv venv --python 3.10.14 --seed
source .venv/bin/activate   
uv pip install https://github.com/vllm-project/vllm/releases/download/v0.8.5.post1/vllm-0.8.5.post1+cu121-cp38-abi3-manylinux1_x86_64.whl

log/error/stack
TRUNCATED - see #17327 (comment)

  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.10/site-packages/vllm/model_executor/models/registry.py", line 505, in is_multimodal_model
    model_cls, _ = self.inspect_model_cls(architectures)
  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.10/site-packages/vllm/model_executor/models/registry.py", line 465, in inspect_model_cls
    return self._raise_for_unsupported(architectures)
  File "/media/npetro/nvme_1/vllm/.venv/lib/python3.10/site-packages/vllm/model_executor/models/registry.py", line 415, in _raise_for_unsupported
    raise ValueError(
ValueError: Model architectures ['Qwen3ForCausalLM'] failed to be inspected. Please check the logs for more details.

Edit 3: I did use 3.10.14 this time, not 3.12.x; not sure if that's related, it's the only difference.

@DarkLight1337
Copy link
Member

What is the error you get now?

@DarkLight1337
Copy link
Member

cc @mgoin

@arichiardi
Copy link

Apologies for mixing things up here, but I was wondering if anybody has had (and solved) this error:

Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=768,device_name=NVIDIA_GeForce_RTX_3090,dtype=int4_w4a16.json

#17619

@ChiNoel-osu
Copy link

I get an erroneous response when passing {"enable_thinking": False}. I run Qwen3-30B-A3B with --enable-auto-tool-choice --tool-call-parser hermes --enable-expert-parallel --enable-reasoning --reasoning-parser deepseek_r1 and call the endpoint with the code below:

from openai import OpenAI

# the client must be created first; base URL / API key here are placeholders for the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
response = client.chat.completions.create(
    model="qwen",
    messages=messages,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

With thinking mode disabled, the output is below:

In [11]: response
Out[11]: ChatCompletion(id='chatcmpl-2c6150e0b6724d1798f86861e3749d7c', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=None, refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[], reasoning_content='To compare 9.11 and 9.8, we can look at the numbers step by step:\n\n1. Both numbers have 9 as the whole number part.\n2. Now compare the decimal parts:\n - 9.11 has 0.11\n - 9.8 has 0.8\n\nSince 0.8 is greater than 0.11, 9.8 is greater than 9.11.\n\n### Final Answer:\n9.8 is greater than 9.11. ✅'), stop_reason=None)], created=1746100069, model='qwen', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=125, prompt_tokens=26, total_tokens=151, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None)

As it shows, the message content is None, and the real response is in reasoning_content.

Is this a bug?

Looks like one. The parser is not fully compatible with the model yet.

From Qwen3 docs:
[screenshot from the Qwen3 docs]
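Until the parser handles this case, a minimal client-side workaround (assuming the vLLM OpenAI server attaches reasoning_content to the message, as shown in the output above) is to fall back to it when content is None:

msg = response.choices[0].message
# fall back to reasoning_content if the parser routed the answer there
answer = msg.content if msg.content is not None else getattr(msg, "reasoning_content", None)
print(answer)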

@official-elinas
Copy link

@ChiNoel-osu have you tried the Qwen3-32B dense model?

@jeeHwon
Copy link

jeeHwon commented May 13, 2025

Apologies for mixing things up here but I was wondering if anybody has had and solved this error

Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=768,device_name=NVIDIA_GeForce_RTX_3090,dtype=int4_w4a16.json

#17619

I’m experiencing the same warning as well. The vLLM server runs fine, but this message still appears.

@ChiNoel-osu
Copy link

@ChiNoel-osu have you tried the Qwen3-32B dense model?

@official-elinas Yep. Tried it on Qwen3-32B GPTQ. The response only comes in reasoning_content.

@manfredwang093
Copy link

For Qwen3 to support MCP, add:

--enable-reasoning --reasoning-parser deepseek_r1 \
--enable-auto-tool-choice --tool-call-parser hermes \

I'm using the 0.8.5 Docker image for inference:

docker run --runtime nvidia --gpus all \
        -d \
        -v /root/models:/models \
        -p 5536:8000 \
        --env "HF_HUB_OFFLINE=1" \
        --ipc=host \
        vllm/vllm-openai:latest \
        --model /models/Qwen/Qwen3-235B-A22B-FP8 \
        --tokenizer /models/Qwen/Qwen3-235B-A22B-FP8 \
        --generation-config /models/Qwen/Qwen3-235B-A22B-FP8 \
        --served_model_name Qwen3-235B-A22B-FP8 \
        --gpu_memory_utilization 0.9 \
        --enable-reasoning --reasoning-parser deepseek_r1 \
        --enable-auto-tool-choice --tool-call-parser hermes \
        --host 0.0.0.0 \
        --port 8000 \
        --enable-expert-parallel \
        --tensor-parallel-size 8

GPUs: 4090 48G x 8

Hi @GamePP, when I tried this docker command, I got the error below about unrecognized arguments, which seems strange since I am pulling the latest vLLM image; I also checked that these two parameters are listed in the latest release: https://docs.vllm.ai/en/latest/serving/engine_args.html?utm_source=chatgpt.com#engine-arguments

api_server.py: error: unrecognized arguments: --enable-reasoning --reasoning-parser deepseek_r1

Could you kindly advise? I really appreciate it.

@GamePP
Copy link

GamePP commented May 14, 2025

For Qwen3 to support MCP, add:

--enable-reasoning --reasoning-parser deepseek_r1 \
--enable-auto-tool-choice --tool-call-parser hermes \

I'm using the 0.8.5 Docker image for inference:

docker run --runtime nvidia --gpus all \
        -d \
        -v /root/models:/models \
        -p 5536:8000 \
        --env "HF_HUB_OFFLINE=1" \
        --ipc=host \
        vllm/vllm-openai:latest \
        --model /models/Qwen/Qwen3-235B-A22B-FP8 \
        --tokenizer /models/Qwen/Qwen3-235B-A22B-FP8 \
        --generation-config /models/Qwen/Qwen3-235B-A22B-FP8 \
        --served_model_name Qwen3-235B-A22B-FP8 \
        --gpu_memory_utilization 0.9 \
        --enable-reasoning --reasoning-parser deepseek_r1 \
        --enable-auto-tool-choice --tool-call-parser hermes \
        --host 0.0.0.0 \
        --port 8000 \
        --enable-expert-parallel \
        --tensor-parallel-size 8

GPUs: 4090 48G x 8

Hi @GamePP, when I tried this docker command, I got the error below about unrecognized arguments, which seems strange since I am pulling the latest vLLM image; I also checked that these two parameters are listed in the latest release: https://docs.vllm.ai/en/latest/serving/engine_args.html?utm_source=chatgpt.com#engine-arguments

api_server.py: error: unrecognized arguments: --enable-reasoning --reasoning-parser deepseek_r1

Could you kindly advise? I really appreciate it.

@manfredwang093 maybe you should try --reasoning-parser qwen3?
From https://docs.vllm.ai/en/latest/serving/engine_args.html?utm_source=chatgpt.com#engine-arguments:
[DEPRECATED] The --enable-reasoning flag is deprecated as of v0.8.6. Use --reasoning-parser to specify the reasoning parser backend instead. This flag (--enable-reasoning) will be removed in v0.10.0. When --reasoning-parser is specified, reasoning mode is automatically enabled.

@manfredwang093
Copy link

manfredwang093 commented May 14, 2025

For Qwen3 to support MCP, add:

--enable-reasoning --reasoning-parser deepseek_r1 \
--enable-auto-tool-choice --tool-call-parser hermes \

I'm using the 0.8.5 Docker image for inference:

docker run --runtime nvidia --gpus all \
        -d \
        -v /root/models:/models \
        -p 5536:8000 \
        --env "HF_HUB_OFFLINE=1" \
        --ipc=host \
        vllm/vllm-openai:latest \
        --model /models/Qwen/Qwen3-235B-A22B-FP8 \
        --tokenizer /models/Qwen/Qwen3-235B-A22B-FP8 \
        --generation-config /models/Qwen/Qwen3-235B-A22B-FP8 \
        --served_model_name Qwen3-235B-A22B-FP8 \
        --gpu_memory_utilization 0.9 \
        --enable-reasoning --reasoning-parser deepseek_r1 \
        --enable-auto-tool-choice --tool-call-parser hermes \
        --host 0.0.0.0 \
        --port 8000 \
        --enable-expert-parallel \
        --tensor-parallel-size 8

GPUs: 4090 48G x 8

Hi @GamePP, when I tried this docker command, I got the error below about unrecognized arguments, which seems strange since I am pulling the latest vLLM image; I also checked that these two parameters are listed in the latest release: https://docs.vllm.ai/en/latest/serving/engine_args.html?utm_source=chatgpt.com#engine-arguments
api_server.py: error: unrecognized arguments: --enable-reasoning --reasoning-parser deepseek_r1
Could you kindly advise? I really appreciate it.

@manfredwang093 maybe you should try --reasoning-parser qwen3? From https://docs.vllm.ai/en/latest/serving/engine_args.html?utm_source=chatgpt.com#engine-arguments: [DEPRECATED] The --enable-reasoning flag is deprecated as of v0.8.6. Use --reasoning-parser to specify the reasoning parser backend instead. This flag (--enable-reasoning) will be removed in v0.10.0. When --reasoning-parser is specified, reasoning mode is automatically enabled.

Thank you so much for the quick response. I was following Qwen's official Hugging Face suggestion:
vllm serve Qwen/Qwen3-32B-AWQ --enable-reasoning --reasoning-parser deepseek_r1
Is there any place I can look up the difference between deepseek_r1 vs qwen3 as the reasoning parser? Or could you kindly share your understanding of the difference?

Besides this unrecognized-argument issue, I am also getting a Transformers-related error when using vllm/vllm-openai:latest, which I believe is the same as v0.8.5.post1, like:
ValueError: ..... but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date

When I switch to vllm/vllm-openai:v0.8.5, both issues are gone. @simon-mo, could you kindly confirm whether there is a Transformers package issue with the v0.8.5.post1 image?

Thank you both.

@theavgjojo
Copy link

Is there any place I can look up the difference between deepseek_r1 vs qwen3 as the reasoning parser? Or could you kindly share your understanding of the difference?

The reasoning parsers are here. But note the qwen3 parser didn't make it into 0.8.5.post1.

@jaspers1203
Copy link

jaspers1203 commented May 15, 2025

I don't fully remember the exact order of what I did last time, because I had a regular venv with pip before, then switched to uv, and now I installed it like this:

uv venv --python 3.10.14 --seed
source .venv/bin/activate                                  
uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly

Versions

torch==2.7.0
transformers==4.51.3
vllm==0.8.5.dev572+g246e3e0a3

Logs

TRUNCATED - identical to the log quoted above, ending in:

ERROR 05-09 23:22:02 [registry.py:357] ImportError: /media/npetro/nvme_1/vllm/vllm/_C.abi3.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKSsb
...
ValueError: Model architectures ['OPTForCausalLM'] failed to be inspected. Please check the logs for more details.

vLLM v0.8.5 requires torch==2.6.0.
I got the same error; reinstalling torch 2.6 fixed it and it works well now.

@puppetm4st3r
Copy link

Has anyone tried the GPTQ version of MoE Qwen3? I can't get it to run; tested on vLLM 0.9 and 0.8.5.

model: JunHowie/Qwen3-30B-A3B-GPTQ-Int8

vllm parameters:

--download-dir '/data' -tp 2 --model 'JunHowie/Qwen3-30B-A3B-GPTQ-Int8' --dtype auto \
  --gpu-memory-utilization 0.75 --max-model-len 32000 --use-v2-block-manager --enforce-eager \
  --max-log-len 100000 --disable-log-requests --api-key 123 \
  --served-model-name 'gpt-4o' --enable-chunked-prefill --max-num-batched-tokens 512 \
  --max-num-seqs 50 --enable-prefix-caching \
  --tokenizer-pool-size 4"

The stack trace is:

(VllmWorker rank=1 pid=115) WARNING 05-15 07:17:47 [utils.py:168] The model class Qwen3MoeForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
(VllmWorker rank=0 pid=114) WARNING 05-15 07:17:47 [utils.py:168] The model class Qwen3MoeForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
(VllmWorker rank=1 pid=115) INFO 05-15 07:17:47 [gptq_marlin.py:238] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(VllmWorker rank=0 pid=114) INFO 05-15 07:17:47 [gptq_marlin.py:238] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(VllmWorker rank=0 pid=114) INFO 05-15 07:17:48 [weight_utils.py:265] Using model weights format ['*.safetensors']
(VllmWorker rank=1 pid=115) INFO 05-15 07:17:48 [weight_utils.py:265] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/8 [00:00<?, ?it/s]
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435] WorkerProc failed to start.
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435] Traceback (most recent call last):
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 409, in worker_main
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]     worker = WorkerProc(*args, **kwargs)
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 306, in __init__
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]     self.worker.load_model()
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 162, in load_model
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]     self.model_runner.load_model()
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1332, in load_model
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]     self.model = get_model(vllm_config=self.vllm_config)
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]     return loader.load_model(vllm_config=vllm_config)
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 455, in load_model
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]     loaded_weights = model.load_weights(
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]                      ^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 528, in load_weights
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]     return loader.load_weights(weights)
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 261, in load_weights
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]     autoloaded_weights = set(self._load_module("", self.module, weights))
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 222, in _load_module
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]     yield from self._load_module(prefix,
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 195, in _load_module
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]     loaded_params = module_load_weights(weights)
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 470, in load_weights
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]     param = params_dict[name]
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435]             ~~~~~~~~~~~^^^^^^
(VllmWorker rank=0 pid=114) ERROR 05-15 07:17:49 [multiproc_executor.py:435] KeyError: 'layers.12.mlp.gate.g_idx'
Loading safetensors checkpoint shards:   0% Completed | 0/8 [00:00<?, ?it/s]
(VllmWorker rank=0 pid=114) 
[rank0]:[W515 07:17:49.461594615 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
ERROR 05-15 07:17:50 [core.py:396] EngineCore failed to start.
ERROR 05-15 07:17:50 [core.py:396] Traceback (most recent call last):
ERROR 05-15 07:17:50 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
ERROR 05-15 07:17:50 [core.py:396]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 05-15 07:17:50 [core.py:396]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-15 07:17:50 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 329, in __init__
Process EngineCore_0:
ERROR 05-15 07:17:50 [core.py:396]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 05-15 07:17:50 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 64, in __init__
ERROR 05-15 07:17:50 [core.py:396]     self.model_executor = executor_class(vllm_config)
ERROR 05-15 07:17:50 [core.py:396]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-15 07:17:50 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 05-15 07:17:50 [core.py:396]     self._init_executor()
ERROR 05-15 07:17:50 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 91, in _init_executor
ERROR 05-15 07:17:50 [core.py:396]     self.workers = WorkerProc.wait_for_ready(unready_workers)
ERROR 05-15 07:17:50 [core.py:396]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-15 07:17:50 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 370, in wait_for_ready
ERROR 05-15 07:17:50 [core.py:396]     raise e from None
ERROR 05-15 07:17:50 [core.py:396] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 400, in run_engine_core
    raise e
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 329, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 64, in __init__
    self.model_executor = executor_class(vllm_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 91, in _init_executor
    self.workers = WorkerProc.wait_for_ready(unready_workers)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 370, in wait_for_ready
    raise e from None
Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
Traceback (most recent call last):
  File "/usr/lib/python3.12/weakref.py", line 666, in _exitfunc
    f()
  File "/usr/lib/python3.12/weakref.py", line 590, in __call__
    return info.func(*info.args, **(info.kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 228, in shutdown
    for w in self.workers:
             ^^^^^^^^^^^^
AttributeError: 'MultiprocExecutor' object has no attribute 'workers'
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1130, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1078, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 150, in from_vllm_config
    return cls(
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 118, in __init__
    self.engine_core = core_client_class(
                       ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 642, in __init__
    super().__init__(
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 398, in __init__
    self._wait_for_engine_startup()
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 430, in _wait_for_engine_startup
    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above.
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
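
Note on the traceback above: the failure is `KeyError: 'layers.12.mlp.gate.g_idx'`, and `g_idx` is GPTQ metadata, so this looks like a GPTQ export in which the MoE router gate (`mlp.gate`) was quantized as well, while vLLM's Qwen3 MoE weight loader only registers a plain `weight` tensor for that module. A quick way to confirm is to inspect the checkpoint shards directly. Below is a minimal diagnostic sketch (the checkpoint directory path is a placeholder, not from the report above):

import glob
from safetensors import safe_open

# Path to the locally downloaded GPTQ checkpoint (placeholder, adjust as needed).
ckpt_dir = "/models/Qwen3-30B-A3B-GPTQ"

for shard in sorted(glob.glob(f"{ckpt_dir}/*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        for key in f.keys():
            # GPTQ tensors (g_idx / qweight / qzeros / scales) on the router
            # gate mean the quantizer also converted `mlp.gate`, which the
            # Qwen3 MoE loader does not expect.
            if ".mlp.gate." in key:
                print(shard, key, f.get_slice(key).get_shape())

If the router gate does show GPTQ tensors, the checkpoint itself is the likely culprit rather than the engine arguments; a quantized export that keeps the MoE router gate in full precision (or the official FP8 / unquantized weights) should avoid this particular KeyError.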
