[New Model]: Add support for rhymes-ai/Aria #337


Open
pkuhxy opened this issue Mar 15, 2025 · 4 comments

Comments

@pkuhxy

pkuhxy commented Mar 15, 2025

Your current environment

Ascend 910B3; dependencies: torch 2.5.1
torch_npu-2.5.1.dev20250308

vllm 0.7.3+empty

vllm_ascend 0.1.dev1+g233246d

transformers 4.49.0

🐛 Describe the bug

Loading the LLM:

model_id_or_path = "rhymes-ai/Aria"
llm = LLM(
    model=model_id_or_path,
    tokenizer=model_id_or_path,
    dtype="bfloat16",
    limit_mm_per_prompt={"image": 256},
    enforce_eager=True,
    trust_remote_code=True,
    max_model_len=38400,
    gpu_memory_utilization=0.6,
    tensor_parallel_size=4,
    distributed_executor_backend="mp",
)

Error message:

INFO 03-15 11:48:25 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 03-15 11:48:25 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 03-15 11:48:25 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 03-15 11:48:25 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 03-15 11:48:25 __init__.py:44] plugin ascend loaded.
INFO 03-15 11:48:25 __init__.py:198] Platform plugin ascend is activated
INFO 03-15 11:48:25 __init__.py:30] Available plugins for group vllm.general_plugins:
INFO 03-15 11:48:25 __init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 03-15 11:48:25 __init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 03-15 11:48:25 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 03-15 11:48:25 __init__.py:44] plugin ascend_enhanced_model loaded.
WARNING 03-15 11:48:25 _custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 03-15 11:48:26 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 03-15 11:48:26 registry.py:351] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:CustomQwen2VLForConditionalGeneration.
INFO 03-15 11:48:26 config.py:2444] Downcasting torch.float32 to torch.bfloat16.
INFO 03-15 11:48:35 config.py:549] This model supports multiple tasks: {'reward', 'classify', 'generate', 'embed', 'score'}. Defaulting to 'generate'.
WARNING 03-15 11:48:35 arg_utils.py:1197] The model has a long context length (38400). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
INFO 03-15 11:48:35 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/work/share1/hxy/checkpoints/mllm/Aria-25B', speculative_config=None, tokenizer='/work/share1/hxy/checkpoints/mllm/Aria-25B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=38400, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/work/share1/hxy/checkpoints/mllm/Aria-25B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False, 
/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py:29: ResourceWarning: unclosed <socket.socket fd=7, family=AddressFamily.AF_INET, type=SocketKind.SOCK_DGRAM, proto=0, laddr=('10.225.16.203', 54171), raddr=('8.8.8.8', 80)>
  get_ip(), get_open_port())
WARNING 03-15 11:48:37 utils.py:2262] Methods add_lora,add_prompt_adapter,cache_config,compilation_config,current_platform,list_loras,list_prompt_adapters,load_config,pin_lora,pin_prompt_adapter,remove_lora,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffaeb0df370>
INFO 03-15 11:48:42 config.py:3054] cudagraph sizes specified by model runner [] is overridden by config []
Loading safetensors checkpoint shards:   0% Completed | 0/12 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   8% Completed | 1/12 [00:03<00:40,  3.67s/it]
Loading safetensors checkpoint shards:  17% Completed | 2/12 [00:07<00:37,  3.77s/it]
Loading safetensors checkpoint shards:  25% Completed | 3/12 [00:09<00:25,  2.84s/it]
Loading safetensors checkpoint shards:  33% Completed | 4/12 [00:12<00:24,  3.05s/it]
Loading safetensors checkpoint shards:  42% Completed | 5/12 [00:14<00:18,  2.64s/it]
Loading safetensors checkpoint shards:  50% Completed | 6/12 [00:16<00:13,  2.30s/it]
Loading safetensors checkpoint shards:  58% Completed | 7/12 [00:19<00:13,  2.64s/it]
Loading safetensors checkpoint shards:  67% Completed | 8/12 [00:23<00:12,  3.06s/it]
Loading safetensors checkpoint shards:  75% Completed | 9/12 [00:27<00:10,  3.41s/it]
Loading safetensors checkpoint shards:  83% Completed | 10/12 [00:31<00:07,  3.50s/it]
Loading safetensors checkpoint shards:  92% Completed | 11/12 [00:35<00:03,  3.59s/it]
Loading safetensors checkpoint shards: 100% Completed | 12/12 [00:39<00:00,  3.82s/it]
Loading safetensors checkpoint shards: 100% Completed | 12/12 [00:39<00:00,  3.29s/it]

Computed max_num_seqs (min(256, 38400 // 65536)) to be less than 1. Setting it to the minimum value of 1.
WARNING 03-15 11:49:54 profiling.py:192] The context length (38400) of the model is too short to hold the multi-modal embeddings in the worst case (65536 tokens in total, out of which {'image': 65536} are reserved for multi-modal embeddings). This may cause certain multi-modal inputs to fail during inference, even when the input text is short. To avoid this, you should increase `max_model_len`, reduce `max_num_seqs`, and/or reduce `mm_counts`.
[rank0]: Traceback (most recent call last):
[rank0]:     llm = LLM(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 1022, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 242, in __init__
[rank0]:     self.llm_engine = self.engine_class.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 489, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 276, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 421, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 102, in determine_num_available_blocks
[rank0]:     results = self.collective_rpc("determine_num_available_blocks")
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]:     answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2196, in run_method
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/work/share/projects/hxy/docker/vllm-ascend/vllm_ascend/worker/worker.py", line 227, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/work/share/projects/hxy/docker/vllm-ascend/vllm_ascend/worker/model_runner.py", line 1360, in profile_run
[rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/work/share/projects/hxy/docker/vllm-ascend/vllm_ascend/worker/model_runner.py", line 1140, in execute_model
[rank0]:     hidden_or_intermediate_states = model_executable(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/aria.py", line 643, in forward
[rank0]:     hidden_states = self.language_model(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/compilation/decorators.py", line 172, in __call__
[rank0]:     return self.forward(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 368, in forward
[rank0]:     hidden_states, residual = layer(positions, hidden_states,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 290, in forward
[rank0]:     hidden_states = self.mlp(hidden_states)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/aria.py", line 284, in forward
[rank0]:     sparse_expert_output = self.experts(hidden_states, router_output)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 586, in forward
[rank0]:     final_hidden_states = self.quant_method.apply(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 120, in apply
[rank0]:     return self.forward(x=x,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/custom_op.py", line 25, in forward
[rank0]:     return self._forward_method(*args, **kwargs)
[rank0]:   File "/work/share/projects/hxy/docker/vllm-ascend/vllm_ascend/ops/fused_moe.py", line 171, in forward_oot
[rank0]:     topk_weights, topk_ids = group_topk(
[rank0]:   File "/work/share/projects/hxy/docker/vllm-ascend/vllm_ascend/ops/fused_moe.py", line 56, in group_topk
[rank0]:     group_scores = scores.view(num_token, num_expert_group,
[rank0]: RuntimeError: shape '[38400, 0, -1]' is invalid for input of size 2457600
[ERROR] 2025-03-15-11:50:02 (PID:720632, Device:0, RankID:-1) ERR99999 UNKNOWN applicaiton exception

When loading the Aria model (MoE) on the NPU, it fails with "shape is invalid for input of size", but the same model loads without errors on GPU. @Yikun
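
For context, the failing call in vllm_ascend/ops/fused_moe.py reshapes the router scores by num_expert_group, and the error shape shows that value arriving as 0. A minimal sketch of what the view sees (assuming 64 routed experts per token, since 2457600 / 38400 = 64, and assuming Aria's MoE config defines no expert groups, so num_expert_group ends up as 0):

import torch

num_token = 38400      # dummy tokens in the profiling run (= max_model_len)
num_experts = 64       # assumed: 2457600 router scores / 38400 tokens
num_expert_group = 0   # assumed: Aria's config defines no expert groups

scores = torch.randn(num_token, num_experts)
# Raises: RuntimeError: shape '[38400, 0, -1]' is invalid for input of size 2457600
group_scores = scores.view(num_token, num_expert_group, -1)

On GPU the model goes through vLLM's built-in fused MoE path, which only applies grouped top-k when the model requests it, while the forward_oot override in vllm_ascend/ops/fused_moe.py appears to call group_topk unconditionally; that would explain why the same load only fails on the NPU.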

@pkuhxy pkuhxy added the bug Something isn't working label Mar 15, 2025
@Yikun
Collaborator

Yikun commented Mar 15, 2025

@pkuhxy Thanks for your feedback; this might be an issue in the fused_moe top-k implementation.

cc @SidaoY Would you mind taking a look?

@SidaoY
Contributor

SidaoY commented Mar 15, 2025

Currently, only models that work with group topk (like DeepSeek v2 and v3) are supported.
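
For reference, one possible direction (a rough sketch only, not vllm-ascend's actual implementation; the function and parameter names are illustrative): fall back to a plain torch.topk when the model defines no expert groups, and keep the DeepSeek-style grouped path otherwise.

from typing import Optional, Tuple

import torch

def select_experts(scores: torch.Tensor,
                   top_k: int,
                   num_expert_group: Optional[int] = None,
                   topk_group: Optional[int] = None) -> Tuple[torch.Tensor, torch.Tensor]:
    # Models without expert groups (e.g. Aria) use an ordinary top-k over all experts.
    if not num_expert_group or not topk_group:
        weights, ids = torch.topk(scores, k=top_k, dim=-1)
        return weights, ids

    num_token, num_experts = scores.shape
    # DeepSeek-style grouped top-k: rank groups by their best expert,
    # keep the top groups, then pick top_k experts inside those groups.
    group_scores = scores.view(num_token, num_expert_group, -1).max(dim=-1).values
    group_idx = torch.topk(group_scores, k=topk_group, dim=-1).indices
    group_mask = torch.zeros_like(group_scores).scatter_(1, group_idx, 1)
    score_mask = (group_mask.unsqueeze(-1)
                  .expand(num_token, num_expert_group,
                          num_experts // num_expert_group)
                  .reshape(num_token, -1))
    masked_scores = scores.masked_fill(score_mask == 0, float("-inf"))
    weights, ids = torch.topk(masked_scores, k=top_k, dim=-1)
    return weights, ids

Whether the fallback should also renormalize or softmax the selected weights depends on how the model's router is defined, so that part is left out of the sketch.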

@963658029

@SidaoY Could you help fix it? I also want to use Aria on Ascend. Many thanks!

@pkuhxy
Author

pkuhxy commented Mar 15, 2025

Currently, only models that work with group topk (like DeepSeek v2 and v3) are supported.

Thanks for the reply. Could you help fix it? @SidaoY

@wangxiyuan wangxiyuan added new model and removed bug Something isn't working labels Mar 17, 2025
@pkuhxy pkuhxy changed the title [Bug]: vLLM reports an error when loading the model on NPU, while it loads normally on GPU. Add support for rhymes-ai/Aria Mar 17, 2025
@wangxiyuan wangxiyuan changed the title Add support for rhymes-ai/Aria [New Model]: Add support for rhymes-ai/Aria May 14, 2025