[Usage] Qwen3 Usage Guide #17327
Comments
How to use MCP with Qwen3? |
Any plan for speculative decoding? |
How can enable_thinking=True be set via vLLM's startup arguments? |
See https://qwen.readthedocs.io/en/latest/deployment/vllm.html#thinking-non-thinking-modes |
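For reference, a minimal sketch of the per-request toggle those docs describe, assuming an OpenAI-compatible vLLM server on localhost:8000 and the openai Python client; the model name is illustrative:

from openai import OpenAI

# Point the client at the vLLM OpenAI-compatible server (the API key can be a dummy value).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",  # illustrative model name
    messages=[{"role": "user", "content": "Give me a short introduction to vLLM."}],
    # Forwarded to the chat template; set to True to enable thinking mode instead.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)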
I am running |
How can I disable reasoning in generative models, i.e. using |
I often fail to follow up on relevant PRs in a timely manner. Thanks for your answer. |
The official documentation only describes how to turn off Thinking mode when the API is called; it doesn't describe how to turn off Thinking mode when vLLM itself is started. I tried changing the docker run command and got an ERROR: |
I have opened #17356 to support this, can you try it? |
How to use tools? |
Does Qwen3 support MCP? I'm using the 0.8.5 Docker image for inference. GPUs: |
Am I doing something wrong, or is support not released at the moment? |
See #17356 |
I was using the latest version of vLLM (0.8.5) and running Qwen3-14B Q5_K_M (GGUF). An error was reported. What's the problem? Does vLLM currently not support the GGUF format of Qwen3? |
Do we have a method yet to do similar decoding to what Qwen does with their demo via a "reasoning budget"? E.g. injecting in the |
How to build 0.8.5 with CUDA 11.7? |
My environment is Ubuntu 22.04, vLLM v0.8.5, PyTorch 2.6.0+cu124, and 4 × H20 96GB. What startup arguments should I use to run Qwen3-235B-A22B? I used the default |
@Silencezjl Two options:
1. Change --tensor-parallel-size 8 to 4.
|
vLLM v0.8.5 gives OOM even when I set max-model-len to 1024 with max-num-seqs=1. It works with enforce-eager, giving 20 tokens per second. Logs: |
How to disable thinking in |
The thinking switch is based on the chat template. So if you must use |
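As a rough illustration of that point, one way to control the switch outside the chat API is to render the chat template yourself and feed the raw prompt to generate. A minimal sketch assuming the Hugging Face tokenizer for an illustrative Qwen3 checkpoint:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "Qwen/Qwen3-8B"  # illustrative model name
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Render the chat template with thinking disabled; the Qwen3 template accepts enable_thinking.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain the KV cache in one sentence."}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

# Pass the pre-rendered prompt straight to offline generation.
llm = LLM(model=model_id)
outputs = llm.generate([prompt], SamplingParams(temperature=0.7, max_tokens=128))
print(outputs[0].outputs[0].text)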
@GamePP Thanks for the reply. FP8 does indeed run. |
I have an issue with the thinking budget control in qwen3. I noticed that Alibaba Cloud's API has this parameter called "thinking_budget" but I can't find anything like that in the open-source docs. When I try adding this parameter to my code, it doesn't seem to do anything. Does the open-source model have this parameter? |
Could you share the token eval speed info? |
I was able to run the Qwen3-235B-A22B-FP8 model on 4 H100 GPUs, but the throughput is very low: around 2 to 3 tokens/s, which is unusable. I tried switching to the V0 engine and the throughput improved to ~31 tokens/s; however, when the input prompt gets long, it falls back to ~2 tokens/s. Does anyone have any idea why?
export VLLM_USE_V1=0
vllm serve Qwen/... --enable-reasoning --reasoning-parser deepseek_r1 --tensor-parallel-size 4 --enable-expert-parallel --max-model-len 10000
The vLLM version is 0.8.5. |
Yes, here are the options I'm running with on an RTX 3090.
And here's the log about memory usage.
|
I get the following error at startup. How did you solve it? |
WHAT! That's insane cuz my lambda instance with 8xA100 (80GB) broke and offloaded part of the model on CPU, LOL! |
Are there any plans in the future to combine the reasoning parser with structured generation in offline mode? I.e., allow Qwen3 to generate freely in between the thinking tags, then output structured JSON as the final answer? |
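For context, structured output on its own already works in offline mode via guided decoding (this does not cover the free-form thinking followed by JSON that the question asks about). A minimal sketch for recent vLLM versions, with an illustrative model name and schema:

from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# Constrain the final output to a simple JSON schema (illustrative).
schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}
params = SamplingParams(
    max_tokens=256,
    guided_decoding=GuidedDecodingParams(json=schema),
)

llm = LLM(model="Qwen/Qwen3-8B")  # illustrative model name
out = llm.generate(["Answer as JSON: what is vLLM?"], params)
print(out[0].outputs[0].text)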
I believe you are running the 267B model, not the 30B one. |
Same for me. Very low throughput. |
I'm getting
|
As mentioned in the error message, if you read the full error logs it should tell you why it failed. |
Not sure what's up with the passive aggressiveness. I read the full stack. It says it's an unsupported architecture. |
I mean that: above that error message, there should be more lines that are being logged (the error message asks you to check the full logs). Can you show them? |
My mistake then. I had vLLM working at one point, but Qwen3 is not working on other inference engines, so I came back to vLLM following the guide. Full log: |
I have a feeling this could be a CUDA / PyTorch version mismatch issue. This is what I have on my system: |
Can you show how you installed this version of vLLM? |
I don't fully remember the exact order of what I did last time, because I had a regular venv with pip before, then switched to uv, and now installed it like this: |
Versions
Logs
|
Can you try doing a clean reinstall? It might be something wrong with your dependencies |
I just ran this 5 minutes ago, clean.
Edit: It's not working at all now.
Edit 2: Based on your releases on GitHub, I did this and seem to get the original stack trace.
install
log/error/stack
Edit 3: I used Python 3.10.14 this time, not 3.12.x; not sure if that's related, but it's the only difference. |
What is the error you get now? |
cc @mgoin |
Apologies for mixing things up here, but I was wondering if anybody has had and solved this error: |
Looks like one. The parser is not fully compatible with the model yet. |
@ChiNoel-osu have you tried the Qwen-32B dense model? |
I’m experiencing the same warning as well. The vLLM server runs fine, but this message still appears. |
@official-elinas Yep. Tried on Qwen-32B GPTQ. The response only comes in |
Hi @GamePP, when I tried this docker command I got the below error about an argument not being found, which seems strange since I am pulling the latest vLLM image. I also checked that these two parameters are in the latest release: https://docs.vllm.ai/en/latest/serving/engine_args.html#engine-arguments
Could you kindly advise? Really appreciate it. |
@manfredwang093 maybe you should try |
Thank you so much for the quick response. I was following Qwen's official Hugging Face suggestion: Besides this parameter-not-found issue, I am also getting a transformers-related error when using When I switch to Thank you both. |
The reasoning parsers are here. But note the qwen3 parser didn't make it into 0.8.5.post1. |
vLLM v0.8.5 requires torch==2.6.0 |
Has anyone tried the GPTQ version of MoE Qwen3? I can't get it to run; tested on vLLM 0.9 and 0.8.5. Model: JunHowie/Qwen3-30B-A3B-GPTQ-Int8. vLLM parameters:
The stack is:
|
vLLM v0.8.4 and higher natively supports all Qwen3 and Qwen3MoE models. Example command:
vllm serve Qwen/... --enable-reasoning --reasoning-parser deepseek_r1
For larger models, add --tensor-parallel-size 4 or --tensor-parallel-size 8 --enable-expert-parallel.
If you are seeing the following error when running fp8 dense models, you are running on vLLM v0.8.4. Please upgrade to v0.8.5.
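When serving with --enable-reasoning and a reasoning parser as above, the parsed thinking text is returned on the message's reasoning_content field of the chat completions response. A minimal sketch with the openai Python client, assuming the server runs on localhost:8000; the model name is illustrative:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",  # illustrative model name
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
msg = resp.choices[0].message
# reasoning_content holds the separated thinking portion; content holds the final answer.
print("reasoning:", getattr(msg, "reasoning_content", None))
print("answer:", msg.content)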