[bug]: GGUF models no longer work on MacOS, tensors on cpu not on mps #7939

Open

Vargol opened this issue Apr 18, 2025 · 8 comments · May be fixed by #7949
Labels
bug Something isn't working

Comments

@Vargol
Contributor

Vargol commented Apr 18, 2025

Is there an existing issue for this problem?

  • I have searched the existing issues

Operating system

macOS

GPU vendor

Apple Silicon (MPS)

GPU model

M3

GPU VRAM

24

Version number

5.10.0

Browser

Safari 18.3.1

Python dependencies

{
"version": "5.10.0",
"dependencies": {
"accelerate" : "1.6.0" ,
"compel" : "2.0.2" ,
"cuda" : null ,
"diffusers" : "0.33.0" ,
"numpy" : "1.26.4" ,
"opencv" : "4.9.0.80",
"onnx" : "1.16.1" ,
"pillow" : "11.2.1" ,
"python" : "3.11.10" ,
"torch" : "2.6.0" ,
"torchvision" : "0.21.0" ,
"transformers": "4.51.3" ,
"xformers" : null
},
"config": {
"schema_version": "4.0.2",
"legacy_models_yaml_path": null,
"host": "127.0.0.1",
"port": 9090,
"allow_origins": [],
"allow_credentials": true,
"allow_methods": [""],
"allow_headers": ["
"],
"ssl_certfile": null,
"ssl_keyfile": null,
"log_tokenization": false,
"patchmatch": true,
"models_dir": "models",
"convert_cache_dir": "models/.convert_cache",
"download_cache_dir": "models/.download_cache",
"legacy_conf_dir": "configs",
"db_dir": "databases",
"outputs_dir": "/Users/davidburnett/invokeai/outputs",
"custom_nodes_dir": "nodes",
"style_presets_dir": "style_presets",
"workflow_thumbnails_dir": "workflow_thumbnails",
"log_handlers": ["console"],
"log_format": "color",
"log_level": "info",
"log_sql": false,
"log_level_network": "warning",
"use_memory_db": false,
"dev_reload": false,
"profile_graphs": false,
"profile_prefix": null,
"profiles_dir": "profiles",
"max_cache_ram_gb": null,
"max_cache_vram_gb": null,
"log_memory_usage": false,
"device_working_mem_gb": 3,
"enable_partial_loading": false,
"keep_ram_copy_of_weights": false,
"ram": null,
"vram": null,
"lazy_offload": true,
"pytorch_cuda_alloc_conf": null,
"device": "mps",
"precision": "bfloat16",
"sequential_guidance": false,
"attention_type": "torch-sdp",
"attention_slice_size": 1,
"force_tiled_decode": false,
"pil_compress_level": 1,
"max_queue_size": 10000,
"clear_queue_on_startup": false,
"allow_nodes": null,
"deny_nodes": null,
"node_cache_size": 512,
"hashing_algorithm": "blake3_single",
"remote_api_tokens": null,
"scan_models_on_startup": false
},
"set_config_fields": [
"precision" , "outputs_dir" , "keep_ram_copy_of_weights", "attention_type" ,
"attention_slice_size" , "legacy_models_yaml_path" , "device"
]
}

What happened

Running a simple Linear UI Flux render using a GGUF based model now fails with

  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/torch/_ops.py", line 723, in __call__
    return self._op(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Tensor for argument weight is on cpu but expected on mps

I've tried multiple GGUF-based models and they've all failed with the same error; an OG non-quantised Flux model works fine.
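For reference, the same class of device mismatch can be reproduced outside InvokeAI with plain PyTorch (a minimal sketch, assuming an MPS-enabled build; the shapes mirror the img_in layer shown in the debug dumps below):

import torch
import torch.nn.functional as F

x = torch.randn(1, 64, device="mps")   # activations already on the mps device
w = torch.randn(3072, 64)              # weight left behind on the cpu device
b = torch.randn(3072)

# Mixing devices in F.linear raises a RuntimeError much like the one above.
y = F.linear(x, w, b)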

The full backtrace is...

[2025-04-18 11:56:59,089]::[InvokeAI]::ERROR --> Error while invoking session ead91b2d-b83d-4fef-a13d-bf9bf9923340, invocation 4c89d2a6-b5e1-466d-95bc-cf2c3aa14333 (flux_denoise): Tensor for argument weight is on cpu but expected on mps
[2025-04-18 11:56:59,089]::[InvokeAI]::ERROR --> Traceback (most recent call last):
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/app/services/session_processor/session_processor_default.py", line 129, in run_node
    output = invocation.invoke_internal(context=context, services=self._services)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/app/invocations/baseinvocation.py", line 212, in invoke_internal
    output = self.invoke(context)
             ^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/app/invocations/flux_denoise.py", line 155, in invoke
    latents = self._run_diffusion(context)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/app/invocations/flux_denoise.py", line 379, in _run_diffusion
    x = denoise(
        ^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/backend/flux/denoise.py", line 75, in denoise
    pred = model(
           ^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/backend/flux/model.py", line 110, in forward
    img = self.img_in(img)
          ^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/backend/model_manager/load/model_cache/torch_module_autocast/custom_modules/custom_linear.py", line 84, in forward
    return super().forward(input)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 125, in forward
    return F.linear(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/backend/quantization/gguf/ggml_tensor.py", line 187, in __torch_dispatch__
    return GGML_TENSOR_OP_TABLE[func](func, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/backend/quantization/gguf/ggml_tensor.py", line 37, in dequantize_and_run_debug
    return func(*dequantized_args, **dequantized_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/torch/_ops.py", line 723, in __call__
    return self._op(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Tensor for argument weight is on cpu but expected on mps

I've created a debug version of dequantize_and_run which suggests the weight and bias are on the CPU device after dequantizing.

tensor([[[-0.0454,  1.5703,  0.4180,  ..., -0.7031,  0.2451,  2.5156],
         [ 0.5703,  0.5234,  0.4609,  ..., -0.7969,  0.1670, -1.1016],
         [ 0.4414, -0.2070, -0.1963,  ..., -0.6367, -2.0938, -0.9922],
         ...,
         [-0.2500,  0.3066,  0.0148,  ..., -0.1113,  0.7812, -0.3320],
         [ 1.6719,  1.1016,  0.0967,  ...,  1.0781,  0.2119, -0.0154],
         [-0.3008, -0.4980,  0.7500,  ...,  0.2148, -0.4492, -0.9922]]],
       device='mps:0', dtype=torch.bfloat16)
---------------------------------------
tensor([[-0.0280,  0.0266, -0.0262,  ...,  0.0250, -0.0146, -0.0339],
        [-0.0029, -0.0022, -0.0571,  ..., -0.0233,  0.0320,  0.0762],
        [-0.0317, -0.0228,  0.0294,  ...,  0.0176, -0.0413,  0.0415],
        ...,
        [ 0.0291, -0.0141, -0.0147,  ..., -0.0237,  0.0273,  0.0167],
        [-0.0153,  0.0361,  0.0374,  ...,  0.0039, -0.0464,  0.0461],
        [-0.0737,  0.1211, -0.1138,  ...,  0.0767, -0.0947, -0.0762]],
       dtype=torch.bfloat16)
---------------------------------------
tensor([ 0.0081,  0.0062,  0.0003,  ..., -0.0205,  0.0298, -0.0289],
       dtype=torch.bfloat16)
---------------------------------------

and are on the CPU before dequantisation too

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
tensor([[[-0.0454,  1.5703,  0.4180,  ..., -0.7031,  0.2451,  2.5156],
         [ 0.5703,  0.5234,  0.4609,  ..., -0.7969,  0.1670, -1.1016],
         [ 0.4414, -0.2070, -0.1963,  ..., -0.6367, -2.0938, -0.9922],
         ...,
         [-0.2500,  0.3066,  0.0148,  ..., -0.1113,  0.7812, -0.3320],
         [ 1.6719,  1.1016,  0.0967,  ...,  1.0781,  0.2119, -0.0154],
         [-0.3008, -0.4980,  0.7500,  ...,  0.2148, -0.4492, -0.9922]]],
       device='mps:0', dtype=torch.bfloat16)
---------------------------------------
GGMLTensor(type=F32, dequantized_shape=(torch.Size([3072, 64]))
tensor([[-0.0280,  0.0266, -0.0262,  ...,  0.0250, -0.0146, -0.0339],
        [-0.0029, -0.0022, -0.0571,  ..., -0.0233,  0.0320,  0.0762],
        [-0.0317, -0.0228,  0.0294,  ...,  0.0176, -0.0413,  0.0415],
        ...,
        [ 0.0291, -0.0141, -0.0147,  ..., -0.0237,  0.0273,  0.0167],
        [-0.0153,  0.0361,  0.0374,  ...,  0.0039, -0.0464,  0.0461],
        [-0.0737,  0.1211, -0.1138,  ...,  0.0767, -0.0947, -0.0762]])
---------------------------------------
GGMLTensor(type=F32, dequantized_shape=(torch.Size([3072]))
tensor([ 0.0081,  0.0062,  0.0003,  ..., -0.0205,  0.0298, -0.0289])
---------------------------------------
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
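For context, the debug wrapper that produced the dumps above was roughly of this shape (a hedged sketch, not the exact code in ggml_tensor.py; it only assumes the get_dequantized_tensor() accessor used later in this thread):

import torch


def dequantize_and_run_debug(func, args, kwargs):
    """Dequantize any GGMLTensor arguments, log them, then run the op."""

    def to_plain(x):
        # GGMLTensor keeps quantized storage and exposes get_dequantized_tensor().
        return x.get_dequantized_tensor() if hasattr(x, "get_dequantized_tensor") else x

    dequantized_args = [to_plain(a) for a in args]
    dequantized_kwargs = {k: to_plain(v) for k, v in (kwargs or {}).items()}

    # Print each tensor argument before dispatching; the repr only shows a
    # device= field when the tensor is not on the cpu.
    for t in list(dequantized_args) + list(dequantized_kwargs.values()):
        if isinstance(t, torch.Tensor):
            print(t)
            print("---------------------------------------")

    return func(*dequantized_args, **dequantized_kwargs)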

What you expected to happen

I expected the GGUF models to work and produce an image.

How to reproduce the problem

Attempt to generate an image using a GGUF-quantised model; even a simple Linear UI render with no control models or LoRAs fails.

Additional context

No response

Discord username

Vargol

@Vargol Vargol added the bug Something isn't working label Apr 18, 2025
@Vargol
Contributor Author

Vargol commented Apr 18, 2025

I meant to post this; it's the full log from startup to failure.

[2025-04-18 10:50:17,475]::[InvokeAI]::INFO --> Using torch device: MPS
objc[2186]: Class CaptureDelegate is implemented in both /Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/cv2/cv2.abi3.so (0x1633e66b8) and /opt/local/lib/opencv4/libopencv_videoio.4.9.0.dylib (0x16b7d07d0). One of the two will be used. Which one is undefined.
objc[2186]: Class CVWindow is implemented in both /Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/cv2/cv2.abi3.so (0x1633e6708) and /opt/local/lib/opencv4/libopencv_highgui.4.9.0.dylib (0x1646b8a78). One of the two will be used. Which one is undefined.
objc[2186]: Class CVView is implemented in both /Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/cv2/cv2.abi3.so (0x1633e6730) and /opt/local/lib/opencv4/libopencv_highgui.4.9.0.dylib (0x1646b8aa0). One of the two will be used. Which one is undefined.
objc[2186]: Class CVSlider is implemented in both /Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/cv2/cv2.abi3.so (0x1633e6758) and /opt/local/lib/opencv4/libopencv_highgui.4.9.0.dylib (0x1646b8ac8). One of the two will be used. Which one is undefined.
[2025-04-18 10:50:22,155]::[InvokeAI]::INFO --> Patchmatch initialized
[2025-04-18 10:50:22,746]::[InvokeAI]::INFO --> Loading node pack StableCascade
[2025-04-18 10:50:22,755]::[InvokeAI]::INFO --> Loading node pack CosXLDenoiseLantents
[2025-04-18 10:50:22,756]::[InvokeAI]::ERROR --> Failed to load node pack CosXLDenoiseLantents (may have partially loaded):
Traceback (most recent call last):
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/app/invocations/load_custom_nodes.py", line 69, in load_custom_nodes
    spec.loader.exec_module(module)
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/Users/davidburnett/invokeai/nodes/CosXLDenoiseLantents/__init__.py", line 1, in <module>
    from .cosXLDenoiseLantents import CosXLDenoiseLatentsInvocation
  File "/Users/davidburnett/invokeai/nodes/CosXLDenoiseLantents/cosXLDenoiseLantents.py", line 5, in <module>
    from invokeai.app.invocations.latent import DenoiseLatentsInvocation
ModuleNotFoundError: No module named 'invokeai.app.invocations.latent'

[2025-04-18 10:50:22,756]::[InvokeAI]::INFO --> Loading node pack StylePrompts
[2025-04-18 10:50:22,758]::[InvokeAI]::INFO --> Loaded 2 node packs from /Users/davidburnett/invokeai/nodes: StableCascade, StylePrompts
[2025-04-18 10:50:22,794]::[InvokeAI]::INFO --> InvokeAI version 5.10.0
[2025-04-18 10:50:22,794]::[InvokeAI]::INFO --> Root directory = /Users/davidburnett/invokeai
[2025-04-18 10:50:22,795]::[InvokeAI]::INFO --> Initializing database at /Users/davidburnett/invokeai/databases/invokeai.db
[2025-04-18 10:50:22,806]::[ModelManagerService]::INFO --> [MODEL CACHE] Calculated model RAM cache size: 10240.00 MB. Heuristics applied: [1].
[2025-04-18 10:50:22,840]::[ModelInstallService]::WARNING --> Missing model file: terminus-xl-velocity-v2 at sdxl/main/terminus-xl-velocity-v2
[2025-04-18 10:50:22,841]::[InvokeAI]::INFO --> Pruned 5 finished queue items
[2025-04-18 10:50:22,903]::[InvokeAI]::INFO --> Invoke running on http://127.0.0.1:9090 (Press CTRL+C to quit)
[2025-04-18 10:52:40,796]::[InvokeAI]::INFO --> Executing queue item 5816, session b7166adc-671c-4073-b657-b19192178f4c
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  5.98it/s]
[2025-04-18 10:52:55,897]::[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model 'bd775cab-88f6-4034-bc4d-46af7d686812:text_encoder_2' (T5EncoderModel) onto mps device in 14.69s. Total model size: 9083.39MB, VRAM: 9083.39MB (100.0%)
[2025-04-18 10:52:55,997]::[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model 'bd775cab-88f6-4034-bc4d-46af7d686812:tokenizer_2' (T5Tokenizer) onto mps device in 0.00s. Total model size: 0.03MB, VRAM: 0.00MB (0.0%)
[2025-04-18 10:52:59,151]::[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model '706b4bbb-35c6-4eaf-bb53-a46942dfcc76:text_encoder' (CLIPTextModel) onto mps device in 0.11s. Total model size: 469.44MB, VRAM: 469.44MB (100.0%)
[2025-04-18 10:52:59,229]::[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model '706b4bbb-35c6-4eaf-bb53-a46942dfcc76:tokenizer' (CLIPTokenizer) onto mps device in 0.00s. Total model size: 0.00MB, VRAM: 0.00MB (0.0%)
/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/backend/quantization/gguf/loaders.py:15: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:209.)
  torch_tensor = torch.from_numpy(tensor.data)
[2025-04-18 10:53:15,313]::[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model 'b25b076a-3489-4ffa-b8f3-f2a667f9beb8:transformer' (Flux) onto mps device in 15.55s. Total model size: 12119.51MB, VRAM: 12119.51MB (100.0%)
  0%|                                                                                            | 0/25 [00:00<?, ?it/s]
[2025-04-18 10:53:15,385]::[InvokeAI]::ERROR --> Error while invoking session b7166adc-671c-4073-b657-b19192178f4c, invocation 29c85863-56cd-4fbc-9cc0-0d7c8cb97656 (flux_denoise): Tensor for argument weight is on cpu but expected on mps
[2025-04-18 10:53:15,385]::[InvokeAI]::ERROR --> Traceback (most recent call last):
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/app/services/session_processor/session_processor_default.py", line 129, in run_node
    output = invocation.invoke_internal(context=context, services=self._services)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/app/invocations/baseinvocation.py", line 212, in invoke_internal
    output = self.invoke(context)
             ^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/app/invocations/flux_denoise.py", line 155, in invoke
    latents = self._run_diffusion(context)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/app/invocations/flux_denoise.py", line 379, in _run_diffusion
    x = denoise(
        ^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/backend/flux/denoise.py", line 75, in denoise
    pred = model(
           ^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/backend/flux/model.py", line 110, in forward
    img = self.img_in(img)
          ^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/backend/model_manager/load/model_cache/torch_module_autocast/custom_modules/custom_linear.py", line 84, in forward
    return super().forward(input)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 125, in forward
    return F.linear(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/backend/quantization/gguf/ggml_tensor.py", line 161, in __torch_dispatch__
    return GGML_TENSOR_OP_TABLE[func](func, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/invokeai/backend/quantization/gguf/ggml_tensor.py", line 22, in dequantize_and_run
    return func(*dequantized_args, **dequantized_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.11/site-packages/torch/_ops.py", line 723, in __call__
    return self._op(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Tensor for argument weight is on cpu but expected on mps

[2025-04-18 10:53:15,391]::[InvokeAI]::INFO --> Graph stats: b7166adc-671c-4073-b657-b19192178f4c
                          Node   Calls   Seconds  VRAM Used
             flux_model_loader       1    0.006s     0.000G
             flux_text_encoder       1   18.537s     0.000G
                       collect       1    0.000s     0.000G
                  flux_denoise       1   16.035s     0.000G
TOTAL GRAPH EXECUTION TIME:  34.579s
TOTAL GRAPH WALL TIME:  34.581s
RAM used by InvokeAI process: 0.85G (+0.240G)
RAM used to load models: 21.16G
RAM cache statistics:
   Model cache hits: 5
   Model cache misses: 5
   Models cached: 1
   Models cleared from cache: 4
   Cache high water mark: 11.84/0.00G

@psychedelicious
Collaborator

I cannot reproduce this on my M1 Pro using these models (same "main" model):

[screenshot of the models used]

Generation is successful.

Here is my Settings (gear icon @ bottom left) -> About:

{
    "version": "5.10.0", 
    "dependencies": {
        "accelerate"  : "1.6.0"   , 
        "compel"      : "2.0.2"   , 
        "cuda"        : null      , 
        "diffusers"   : "0.33.0"  , 
        "numpy"       : "1.26.4"  , 
        "opencv"      : "4.9.0.80", 
        "onnx"        : "1.16.1"  , 
        "pillow"      : "11.2.1"  , 
        "python"      : "3.12.9"  , 
        "torch"       : "2.6.0"   , 
        "torchvision" : "0.21.0"  , 
        "transformers": "4.51.3"  , 
        "xformers"    : null        
    }, 
    "config": {
        "schema_version": "4.0.2", 
        "legacy_models_yaml_path": null, 
        "host": "127.0.0.1", 
        "port": 9090, 
        "allow_origins": [], 
        "allow_credentials": true, 
        "allow_methods": ["*"], 
        "allow_headers": ["*"], 
        "ssl_certfile": null, 
        "ssl_keyfile": null, 
        "log_tokenization": false, 
        "patchmatch": true, 
        "models_dir": "models", 
        "convert_cache_dir": "models/.convert_cache", 
        "download_cache_dir": "models/.download_cache", 
        "legacy_conf_dir": "configs", 
        "db_dir": "databases", 
        "outputs_dir": "outputs", 
        "custom_nodes_dir": "nodes", 
        "style_presets_dir": "style_presets", 
        "workflow_thumbnails_dir": "workflow_thumbnails", 
        "log_handlers": ["console"], 
        "log_format": "color", 
        "log_level": "info", 
        "log_sql": false, 
        "log_level_network": "warning", 
        "use_memory_db": false, 
        "dev_reload": false, 
        "profile_graphs": false, 
        "profile_prefix": null, 
        "profiles_dir": "profiles", 
        "max_cache_ram_gb": null, 
        "max_cache_vram_gb": null, 
        "log_memory_usage": false, 
        "device_working_mem_gb": 3, 
        "enable_partial_loading": false, 
        "keep_ram_copy_of_weights": true, 
        "ram": null, 
        "vram": null, 
        "lazy_offload": true, 
        "pytorch_cuda_alloc_conf": null, 
        "device": "auto", 
        "precision": "float16", 
        "sequential_guidance": false, 
        "attention_type": "auto", 
        "attention_slice_size": "auto", 
        "force_tiled_decode": false, 
        "pil_compress_level": 1, 
        "max_queue_size": 10000, 
        "clear_queue_on_startup": false, 
        "allow_nodes": null, 
        "deny_nodes": null, 
        "node_cache_size": 512, 
        "hashing_algorithm": "blake3_single", 
        "remote_api_tokens": null, 
        "scan_models_on_startup": false
    }, 
    "set_config_fields": ["precision", "legacy_models_yaml_path"]
}

On Python 3.12.9.

@Vargol
Contributor Author

Vargol commented Apr 18, 2025

Here's my yaml file in case it's due to memory settings:

# Internal metadata - do not edit:
schema_version: 4.0.2

# Put user settings here - see https://invoke-ai.github.io/InvokeAI/features/CONFIGURATION/:
outputs_dir: /Users/xxxx/invokeai/outputs
# ram: 11.0
device: mps
precision: bfloat16
attention_type: torch-sdp
attention_slice_size: 1
keep_ram_copy_of_weights: false
# force_tiled_decode: true

@Vargol
Contributor Author

Vargol commented Apr 18, 2025

I think it's keep_ram_copy_of_weights: false that's causing the issue: the state_dict is loaded onto the CPU device.
When a CPU copy is kept, the state_dict gets copied to cpu_state_dict, which is moved to the mps device and copied back into the state_dict when the cached model is 'moved' from RAM to VRAM.

When keep_ram_copy_of_weights: false is set, the state_dict is never copied to cpu_state_dict and so never gets moved to the 'mps' device. The model.to call doesn't move it either.

Deleting the setting from invoke.yaml makes it run again.

Keeping a CPU copy isn't really useful on unified-memory devices. I switched it off because CogView4 was using loads of swap space, when I knew that if you don't keep any extra models cached and unload the text encoders after use, it just about runs without swap on my 24GB iMac.

@Vargol
Contributor Author

Vargol commented Apr 19, 2025

I've confirmed that the issue is keep_ram_copy_of_weights: false, and that when this value is used the GGUF state dict is not moved to the mps device; I can't find any code that would load it there. With a full-fat model the state dict is moved by the self._model.to(self._compute_device) call in invokeai.backend.model_manager.load.model_cache.cached_model.cached_model_only_full_load.full_load_to_vram.

If I wrap that in debug code

        if self._cpu_state_dict is not None:
            new_state_dict: dict[str, torch.Tensor] = {}
            for k, v in self._cpu_state_dict.items():
                new_state_dict[k] = v.to(self._compute_device, copy=True)
            self._model.load_state_dict(new_state_dict, assign=True)

        # Debug: print the first tensor in the state dict before and after the move.
        debug_key = next(iter(self._model.state_dict()))
        print(f"{__name__}: {self._model.state_dict()[debug_key]}")

        self._model.to(self._compute_device)

        print(f"{__name__}: {self._model.state_dict()[debug_key]}")

for a full fat model I get

invokeai.backend.model_manager.load.model_cache.cached_model.cached_model_only_full_load: tensor([[-0.0310,  0.0192, -0.0266,  ...,  0.0311, -0.0151, -0.0347],
        [-0.0016, -0.0024, -0.0593,  ..., -0.0270,  0.0374,  0.0674],
        [-0.0320, -0.0266,  0.0275,  ...,  0.0168, -0.0439,  0.0500],
        ...,
        [ 0.0339, -0.0172, -0.0156,  ..., -0.0282,  0.0259,  0.0175],
        [-0.0245,  0.0359,  0.0344,  ...,  0.0069, -0.0444,  0.0447],
        [-0.0688,  0.1187, -0.1260,  ...,  0.0928, -0.1108, -0.0776]],
       dtype=torch.bfloat16)
invokeai.backend.model_manager.load.model_cache.cached_model.cached_model_only_full_load: tensor([[-0.0310,  0.0192, -0.0266,  ...,  0.0311, -0.0151, -0.0347],
        [-0.0016, -0.0024, -0.0593,  ..., -0.0270,  0.0374,  0.0674],
        [-0.0320, -0.0266,  0.0275,  ...,  0.0168, -0.0439,  0.0500],
        ...,
        [ 0.0339, -0.0172, -0.0156,  ..., -0.0282,  0.0259,  0.0175],
        [-0.0245,  0.0359,  0.0344,  ...,  0.0069, -0.0444,  0.0447],
        [-0.0688,  0.1187, -0.1260,  ...,  0.0928, -0.1108, -0.0776]],
       device='mps:0', dtype=torch.bfloat16)

for a GGUF model I have to change the debug print to

        debug_key = next(iter(self._model.state_dict()))
        debug_tensor = self._model.state_dict()[debug_key]
        if hasattr(debug_tensor, 'get_dequantized_tensor'):
            print(f"{__name__}: {debug_tensor.get_dequantized_tensor()}")

and I get the tensor on the cpu device both before and after.

invokeai.backend.model_manager.load.model_cache.cached_model.cached_model_only_full_load: tensor([[-0.0310,  0.0192, -0.0266,  ...,  0.0311, -0.0151, -0.0347],
        [-0.0016, -0.0024, -0.0593,  ..., -0.0270,  0.0374,  0.0674],
        [-0.0320, -0.0266,  0.0275,  ...,  0.0168, -0.0439,  0.0500],
        ...,
        [ 0.0339, -0.0172, -0.0156,  ..., -0.0282,  0.0259,  0.0175],
        [-0.0245,  0.0359,  0.0344,  ...,  0.0069, -0.0444,  0.0447],
        [-0.0688,  0.1187, -0.1260,  ...,  0.0928, -0.1108, -0.0776]],
       dtype=torch.bfloat16)
invokeai.backend.model_manager.load.model_cache.cached_model.cached_model_only_full_load: tensor([[-0.0310,  0.0192, -0.0266,  ...,  0.0311, -0.0151, -0.0347],
        [-0.0016, -0.0024, -0.0593,  ..., -0.0270,  0.0374,  0.0674],
        [-0.0320, -0.0266,  0.0275,  ...,  0.0168, -0.0439,  0.0500],
        ...,
        [ 0.0339, -0.0172, -0.0156,  ..., -0.0282,  0.0259,  0.0175],
        [-0.0245,  0.0359,  0.0344,  ...,  0.0069, -0.0444,  0.0447],
        [-0.0688,  0.1187, -0.1260,  ...,  0.0928, -0.1108, -0.0776]],
       dtype=torch.bfloat16)

I'm not sure how this isn't broken on CUDA; I can't see why it would work there unless the default device is set to a cuda device.

If I remove keep_ram_copy_of_weights: false, or if I force a move of the state dict to the mps device, then GGUF works:

        if self._cpu_state_dict is not None:
            new_state_dict: dict[str, torch.Tensor] = {}
            for k, v in self._cpu_state_dict.items():
                new_state_dict[k] = v.to(self._compute_device, copy=True)
            self._model.load_state_dict(new_state_dict, assign=True)

        self._model.to(self._compute_device)

        # Force a copy of the state dict to the compute device.
        new_state_dict: dict[str, torch.Tensor] = {}
        for k, v in self._model.state_dict().items():
            new_state_dict[k] = v.to(self._compute_device, copy=True)
        self._model.load_state_dict(new_state_dict, assign=True)

Hopefully there are better ways to do that if it ends up being necessary, as it involves a bunch of extra data copying; perhaps an override of to in the GGMLTensor class?

@Vargol
Contributor Author

Vargol commented Apr 19, 2025

Thanks to Tiwaz for testing: it's broken on CUDA too, if partial loading and the keep-RAM-copy setting are both disabled.

[screenshot of the same error on CUDA]

@Vargol
Contributor Author

Vargol commented Apr 21, 2025

Okay, I think I've got to the bottom of it.
The GGMLTensor implementation stores the quantised data in quantized_data instead of the inherited data attribute, so when the self._model.to call is made, the conversion of the state dict doesn't move the quantised tensor data to the new device.

Adding an override for Tensor.to in the GGMLTensor implementation fixes the issue:

    @overload
    def to(self, *args, **kwargs) -> torch.Tensor: ...

    def to(self, *args, **kwargs):
        # Move the underlying quantized storage in place; Module.to appears to
        # expect the tensor to be modified rather than replaced (see note below).
        self.quantized_data = self.quantized_data.to(*args, **kwargs)
        return self

Note that the PyTorch docs say Tensor.to can return a new tensor, and strictly it should in this case, but torch.nn.Module.to seems to expect a self-modification. I originally wrote the override to return a new GGMLTensor, but although the move to 'mps' worked fine when checked from inside GGMLTensor.to, checking the state dictionary after the self._model.to call showed no sign of the updated GGMLTensors.
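To double-check that the override takes effect, the earlier state-dict probe can be reused after loading (a small sketch; model stands in for whichever Flux transformer instance the cache has just loaded):

# Hypothetical sanity check: after model.to("mps") the dequantized weight
# should report mps:0 rather than cpu.
state_dict = model.state_dict()
debug_key = next(iter(state_dict))
debug_tensor = state_dict[debug_key]
if hasattr(debug_tensor, "get_dequantized_tensor"):
    print(debug_tensor.get_dequantized_tensor().device)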

@psychedelicious
Collaborator

@Vargol jeez! Nice detective work. Would you mind PRing the fix?
