[bug]: GGUF models no longer work on MacOS, tensors on cpu not on mps #7939
I meant to post this; it's the full log from startup to failure.

```
[2025-04-18 10:50:17,475]::[InvokeAI]::INFO --> Using torch device: MPS
objc[2186]: Class CaptureDelegate is implemented in both /Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/cv2/cv2.abi3.so (0x1633e66b8) and /opt/local/lib/opencv4/libopencv_videoio.4.9.0.dylib (0x16b7d07d0). One of the two will be used. Which one is undefined.
objc[2186]: Class CVWindow is implemented in both /Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/cv2/cv2.abi3.so (0x1633e6708) and /opt/local/lib/opencv4/libopencv_highgui.4.9.0.dylib (0x1646b8a78). One of the two will be used. Which one is undefined.
objc[2186]: Class CVView is implemented in both /Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/cv2/cv2.abi3.so (0x1633e6730) and /opt/local/lib/opencv4/libopencv_highgui.4.9.0.dylib (0x1646b8aa0). One of the two will be used. Which one is undefined.
objc[2186]: Class CVSlider is implemented in both /Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/cv2/cv2.abi3.so (0x1633e6758) and /opt/local/lib/opencv4/libopencv_highgui.4.9.0.dylib (0x1646b8ac8). One of the two will be used. Which one is undefined.
[2025-04-18 10:50:22,155]::[InvokeAI]::INFO --> Patchmatch initialized
[2025-04-18 10:50:22,746]::[InvokeAI]::INFO --> Loading node pack StableCascade
[2025-04-18 10:50:22,755]::[InvokeAI]::INFO --> Loading node pack CosXLDenoiseLantents
[2025-04-18 10:50:22,756]::[InvokeAI]::ERROR --> Failed to load node pack CosXLDenoiseLantents (may have partially loaded):
Traceback (most recent call last):
File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/app/invocations/load_custom_nodes.py", line 69, in load_custom_nodes
spec.loader.exec_module(module)
File "<frozen importlib._bootstrap_external>", line 940, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/Users/davidburnett/invokeai/nodes/CosXLDenoiseLantents/__init__.py", line 1, in <module>
from .cosXLDenoiseLantents import CosXLDenoiseLatentsInvocation
File "/Users/davidburnett/invokeai/nodes/CosXLDenoiseLantents/cosXLDenoiseLantents.py", line 5, in <module>
from invokeai.app.invocations.latent import DenoiseLatentsInvocation
ModuleNotFoundError: No module named 'invokeai.app.invocations.latent'
[2025-04-18 10:50:22,756]::[InvokeAI]::INFO --> Loading node pack StylePrompts
[2025-04-18 10:50:22,758]::[InvokeAI]::INFO --> Loaded 2 node packs from /Users/davidburnett/invokeai/nodes: StableCascade, StylePrompts
[2025-04-18 10:50:22,794]::[InvokeAI]::INFO --> InvokeAI version 5.10.0
[2025-04-18 10:50:22,794]::[InvokeAI]::INFO --> Root directory = /Users/davidburnett/invokeai
[2025-04-18 10:50:22,795]::[InvokeAI]::INFO --> Initializing database at /Users/davidburnett/invokeai/databases/invokeai.db
[2025-04-18 10:50:22,806]::[ModelManagerService]::INFO --> [MODEL CACHE] Calculated model RAM cache size: 10240.00 MB. Heuristics applied: [1].
[2025-04-18 10:50:22,840]::[ModelInstallService]::WARNING --> Missing model file: terminus-xl-velocity-v2 at sdxl/main/terminus-xl-velocity-v2
[2025-04-18 10:50:22,841]::[InvokeAI]::INFO --> Pruned 5 finished queue items
[2025-04-18 10:50:22,903]::[InvokeAI]::INFO --> Invoke running on http://127.0.0.1:9090 (Press CTRL+C to quit)
[2025-04-18 10:52:40,796]::[InvokeAI]::INFO --> Executing queue item 5816, session b7166adc-671c-4073-b657-b19192178f4c
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 5.98it/s]
[2025-04-18 10:52:55,897]::[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model 'bd775cab-88f6-4034-bc4d-46af7d686812:text_encoder_2' (T5EncoderModel) onto mps device in 14.69s. Total model size: 9083.39MB, VRAM: 9083.39MB (100.0%)
[2025-04-18 10:52:55,997]::[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model 'bd775cab-88f6-4034-bc4d-46af7d686812:tokenizer_2' (T5Tokenizer) onto mps device in 0.00s. Total model size: 0.03MB, VRAM: 0.00MB (0.0%)
[2025-04-18 10:52:59,151]::[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model '706b4bbb-35c6-4eaf-bb53-a46942dfcc76:text_encoder' (CLIPTextModel) onto mps device in 0.11s. Total model size: 469.44MB, VRAM: 469.44MB (100.0%)
[2025-04-18 10:52:59,229]::[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model '706b4bbb-35c6-4eaf-bb53-a46942dfcc76:tokenizer' (CLIPTokenizer) onto mps device in 0.00s. Total model size: 0.00MB, VRAM: 0.00MB (0.0%)
/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/backend/quantization/gguf/loaders.py:15: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:209.)
torch_tensor = torch.from_numpy(tensor.data)
[2025-04-18 10:53:15,313]::[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model 'b25b076a-3489-4ffa-b8f3-f2a667f9beb8:transformer' (Flux) onto mps device in 15.55s. Total model size: 12119.51MB, VRAM: 12119.51MB (100.0%)
0%| | 0/25 [00:00<?, ?it/s]
[2025-04-18 10:53:15,385]::[InvokeAI]::ERROR --> Error while invoking session b7166adc-671c-4073-b657-b19192178f4c, invocation 29c85863-56cd-4fbc-9cc0-0d7c8cb97656 (flux_denoise): Tensor for argument weight is on cpu but expected on mps
[2025-04-18 10:53:15,385]::[InvokeAI]::ERROR --> Traceback (most recent call last):
File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/app/services/session_processor/session_processor_default.py", line 129, in run_node
output = invocation.invoke_internal(context=context, services=self._services)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/app/invocations/baseinvocation.py", line 212, in invoke_internal
output = self.invoke(context)
^^^^^^^^^^^^^^^^^^^^
File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/app/invocations/flux_denoise.py", line 155, in invoke
latents = self._run_diffusion(context)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/app/invocations/flux_denoise.py", line 379, in _run_diffusion
x = denoise(
^^^^^^^^
File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/backend/flux/denoise.py", line 75, in denoise
pred = model(
^^^^^^
File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/backend/flux/model.py", line 110, in forward
img = self.img_in(img)
^^^^^^^^^^^^^^^^
File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/backend/model_manager/load/model_cache/torch_module_autocast/custom_modules/custom_linear.py", line 84, in forward
return super().forward(input)
^^^^^^^^^^^^^^^^^^^^^^
File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 125, in forward
return F.linear(input, self.weight, self.bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/backend/quantization/gguf/ggml_tensor.py", line 161, in __torch_dispatch__
return GGML_TENSOR_OP_TABLE[func](func, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/backend/quantization/gguf/ggml_tensor.py", line 22, in dequantize_and_run
return func(*dequantized_args, **dequantized_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/torch/_ops.py", line 723, in __call__
return self._op(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Tensor for argument weight is on cpu but expected on mps
[2025-04-18 10:53:15,391]::[InvokeAI]::INFO --> Graph stats: b7166adc-671c-4073-b657-b19192178f4c
Node                 Calls   Seconds  VRAM Used
flux_model_loader        1    0.006s     0.000G
flux_text_encoder        1   18.537s     0.000G
collect                  1    0.000s     0.000G
flux_denoise             1   16.035s     0.000G
TOTAL GRAPH EXECUTION TIME: 34.579s
TOTAL GRAPH WALL TIME: 34.581s
RAM used by InvokeAI process: 0.85G (+0.240G)
RAM used to load models: 21.16G
RAM cache statistics:
Model cache hits: 5
Model cache misses: 5
Models cached: 1
Models cleared from cache: 4
Cache high water mark: 11.84/0.00G
```
Here's my yaml file in case it's due to memory settings:
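A hedged reconstruction of the relevant settings, inferred from the `set_config_fields` list and the matching values in the config dump further down (this is an inference, not the original attachment):

```yaml
# Reconstructed from set_config_fields in the config dump below;
# not the original file.
precision: bfloat16
outputs_dir: /Users/davidburnett/invokeai/outputs
keep_ram_copy_of_weights: false
attention_type: torch-sdp
attention_slice_size: 1
legacy_models_yaml_path: null
device: mps
```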
I think it's the `keep_ram_copy_of_weights` setting; deleting it from invoke.yaml makes it run again. Keeping a CPU copy is not really a thing on unified-memory devices. I switched it off because CogView4 was using loads of swap space, when I knew that if you don't keep any extra models cached, and unload the text encoders after use, it just about runs without swap on my 24GB iMac.
I've confirmed that the issue is `keep_ram_copy_of_weights: false`, and that when this value is used the GGUF state dict is not moved to the mps device; I can't find any code that would load it there. With a full-fat model the state dict is moved by the `self._model.to(self._compute_device)` call. If I wrap that in debug code:

```python
if self._cpu_state_dict is not None:
    new_state_dict: dict[str, torch.Tensor] = {}
    for k, v in self._cpu_state_dict.items():
        new_state_dict[k] = v.to(self._compute_device, copy=True)
    self._model.load_state_dict(new_state_dict, assign=True)

debug_key = next(iter(self._model.state_dict()))
print(f"{__name__}: {self._model.state_dict()[debug_key]}")
self._model.to(self._compute_device)
print(f"{__name__}: {self._model.state_dict()[debug_key]}")
```

for a full-fat model I get the tensor on the mps device after the move, as expected.
For a GGUF model I have to change the debug print to:

```python
debug_key = next(iter(self._model.state_dict()))
debug_tensor = self._model.state_dict()[debug_key]
if hasattr(debug_tensor, 'get_dequantized_tensor'):
    print(f"{__name__}: {debug_tensor.get_dequantized_tensor()}")
```

and I get the tensor on the cpu device both before and after the move.
I'm not sure how this isn't broken on CUDA; I can't see why it would work there without the default device being set to a cuda device. If I modify the code to force a copy of the state dict to the compute device, it works again:

```python
if self._cpu_state_dict is not None:
    new_state_dict: dict[str, torch.Tensor] = {}
    for k, v in self._cpu_state_dict.items():
        new_state_dict[k] = v.to(self._compute_device, copy=True)
    self._model.load_state_dict(new_state_dict, assign=True)
self._model.to(self._compute_device)

# force a copy of the state dict to the compute device
new_state_dict: dict[str, torch.Tensor] = {}
for k, v in self._model.state_dict().items():
    new_state_dict[k] = v.to(self._compute_device, copy=True)
self._model.load_state_dict(new_state_dict, assign=True)
```

Hopefully there are better ways to do that if it ends up being necessary, as it seems to be a bunch of extra copying of data; perhaps an override of `to`.
Okay, I think I've got to the bottom of it. Adding an override for `to` fixes it:

```python
@overload
def to(self, *args, **kwargs) -> torch.Tensor: ...

def to(self, *args, **kwargs):
    self.quantized_data = self.quantized_data.to(*args, **kwargs)
    return self
```

Note that the PyTorch docs say that …
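To see why the override is enough, here's a self-contained sketch with a hypothetical `MiniGGML` class (illustrative only; it uses the private `_make_subclass` helper to keep things short, and the real GGMLTensor in `invokeai/backend/quantization/gguf/ggml_tensor.py` is more involved):

```python
import torch

class MiniGGML(torch.Tensor):
    # Hypothetical wrapper tensor: the real payload lives in `quantized_data`.
    @staticmethod
    def __new__(cls, quantized_data: torch.Tensor):
        t = torch.Tensor._make_subclass(cls, torch.empty(0))
        t.quantized_data = quantized_data
        return t

    def to(self, *args, **kwargs):
        # Relocate the payload ourselves; stock Tensor.to() knows nothing
        # about it. Caveat: this mutates in place and returns self, unlike
        # stock Tensor.to(), which returns a (possibly new) tensor.
        self.quantized_data = self.quantized_data.to(*args, **kwargs)
        return self

q = MiniGGML(torch.zeros(8))
device = "mps" if torch.backends.mps.is_available() else "cpu"
q.to(device)
print(q.quantized_data.device)  # the payload now follows the move
```

Since module-level moves end up calling `.to()` on each parameter, this also lets `Module.to()` relocate GGUF weights.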
@Vargol jeez! Nice detective work. Would you mind PRing the fix?
Is there an existing issue for this problem?
Operating system
macOS
GPU vendor
Apple Silicon (MPS)
GPU model
M3
GPU VRAM
24
Version number
5.10.0
Browser
Safari 18.3.1
Python dependencies
```json
{
"version": "5.10.0",
"dependencies": {
"accelerate" : "1.6.0" ,
"compel" : "2.0.2" ,
"cuda" : null ,
"diffusers" : "0.33.0" ,
"numpy" : "1.26.4" ,
"opencv" : "4.9.0.80",
"onnx" : "1.16.1" ,
"pillow" : "11.2.1" ,
"python" : "3.11.10" ,
"torch" : "2.6.0" ,
"torchvision" : "0.21.0" ,
"transformers": "4.51.3" ,
"xformers" : null
},
"config": {
"schema_version": "4.0.2",
"legacy_models_yaml_path": null,
"host": "127.0.0.1",
"port": 9090,
"allow_origins": [],
"allow_credentials": true,
"allow_methods": [""],
"allow_headers": [""],
"ssl_certfile": null,
"ssl_keyfile": null,
"log_tokenization": false,
"patchmatch": true,
"models_dir": "models",
"convert_cache_dir": "models/.convert_cache",
"download_cache_dir": "models/.download_cache",
"legacy_conf_dir": "configs",
"db_dir": "databases",
"outputs_dir": "/Users/davidburnett/invokeai/outputs",
"custom_nodes_dir": "nodes",
"style_presets_dir": "style_presets",
"workflow_thumbnails_dir": "workflow_thumbnails",
"log_handlers": ["console"],
"log_format": "color",
"log_level": "info",
"log_sql": false,
"log_level_network": "warning",
"use_memory_db": false,
"dev_reload": false,
"profile_graphs": false,
"profile_prefix": null,
"profiles_dir": "profiles",
"max_cache_ram_gb": null,
"max_cache_vram_gb": null,
"log_memory_usage": false,
"device_working_mem_gb": 3,
"enable_partial_loading": false,
"keep_ram_copy_of_weights": false,
"ram": null,
"vram": null,
"lazy_offload": true,
"pytorch_cuda_alloc_conf": null,
"device": "mps",
"precision": "bfloat16",
"sequential_guidance": false,
"attention_type": "torch-sdp",
"attention_slice_size": 1,
"force_tiled_decode": false,
"pil_compress_level": 1,
"max_queue_size": 10000,
"clear_queue_on_startup": false,
"allow_nodes": null,
"deny_nodes": null,
"node_cache_size": 512,
"hashing_algorithm": "blake3_single",
"remote_api_tokens": null,
"scan_models_on_startup": false
},
"set_config_fields": [
"precision" , "outputs_dir" , "keep_ram_copy_of_weights", "attention_type" ,
"attention_slice_size" , "legacy_models_yaml_path" , "device"
]
}
```
What happened
Running a simple Linear UI Flux render using a GGUF-based model now fails with `Tensor for argument weight is on cpu but expected on mps`.

I've tried multiple GGUF-based models and they've all failed with the same error; an OG non-quantised Flux model works fine.

The full backtrace is…

I've created a debug version of `dequantize_and_run`, which suggests the weight and bias are on the CPU device after dequantization, and were on the CPU before dequantization too.
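For reference, a hedged sketch of the kind of instrumentation described; the function name and final call match the traceback (`invokeai/backend/quantization/gguf/ggml_tensor.py`), but the dequantization helper here is an assumption, not InvokeAI's actual implementation:

```python
import torch

def dequantize_and_run(func, args, kwargs):
    # Dequantize any GGML tensors among the arguments (hypothetical
    # duck-typed check via get_dequantized_tensor(), mirroring the
    # debug print used earlier in this thread).
    def dq(v):
        return v.get_dequantized_tensor() if hasattr(v, "get_dequantized_tensor") else v

    dequantized_args = [dq(a) for a in args]
    dequantized_kwargs = {k: dq(v) for k, v in kwargs.items()}

    # Debug: report where each tensor argument actually lives.
    for a in dequantized_args:
        if isinstance(a, torch.Tensor):
            print("dequantize_and_run arg device:", a.device)

    return func(*dequantized_args, **dequantized_kwargs)
```

On the failing setup this prints cpu for the weight and bias, matching the `Tensor for argument weight is on cpu but expected on mps` error.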
What you expected to happen
I expected the GGUF models to work and produce an image
How to reproduce the problem
Attempt to generate an image using a GGUF-quantised model, even a simple Linear UI render with no control models or LoRAs.
Additional context
No response
Discord username
Vargol