convert : experimental support for `--mmproj` flag #13023
Conversation
convert_hf_to_gguf.py
Outdated
```python
# for vision encoders
mmproj: bool
ignore_vision: bool = False  # subclasses may overwrite this
mtmd_model: MultimodalModel | None = None
```
@compilade currently, the `GGUFWriter` for the mmproj file is wrapped inside `MultimodalModel`. The consequence is that `MultimodalModel` is now an attribute of `Model`.

Another way is to make `MultimodalModel` inherit `Model`, but this seems a bit complicated to think about. Not sure which way you prefer?
> currently, the `GGUFWriter` for mmproj file is wrapped inside `MultimodalModel`.

This means that when `self.mmproj` is true, then `self.gguf_writer` is unused (but still created (!)), and another `GGUFWriter` is created somewhere in `self.mtmd_model`. It works because the output files are no longer opened/created as soon as the `GGUFWriter` is instantiated since #7827. (but there's still some unnecessary metadata keys set and ignored)
There's probably some way to simplify this.
What seems to be needed (eventually, to make this cleaner) is some more general abstraction to convert submodels (unless I'm misunderstanding the problem).
A submodel is part of a model, and a model is one or more submodels. Not quite sure how that should interact with model architectures, though. Each submodel could have its own architecture and tensor mappings, but I don't know what the main model architecture would be (the first submodel? a meta-model? or maybe there doesn't need to be a main one).
Since model loading doesn't quite support sub-models yet (we'll need to figure out namespaces or other ideas from #13028), only one submodel can be exported at a time, but at least conceptually it might be simpler to adapt such an abstraction to actually include multiple submodels in a single GGUF file once we've figured that out.
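To make the containment idea concrete, something like this might work (a hypothetical sketch; none of these names exist in the codebase, and the fields are guesses):

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class Submodel:
    arch: str                       # e.g. "llama" for the LLM, "clip" for the projector
    tensor_map: dict[str, str]      # HF tensor name -> GGUF tensor name
    tensors: dict[str, np.ndarray] = field(default_factory=dict)


@dataclass
class ConvertedModel:
    # a model is one or more submodels; which submodel (if any) provides
    # the "main" architecture is exactly the open question above
    submodels: list[Submodel] = field(default_factory=list)
```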
> Another way is to make `MultimodalModel` inherit `Model`, but this seems a bit complicated to think about. Not sure which way you prefer?
I think I prefer the way you currently did it for now, because you're right that `Model` does a lot, and refactoring multimodal support will be simpler by duplicating some parts of `Model` in a smaller class like with `MultimodalModel` until we figure out something cleaner.

(I also don't know how `MultimodalModel` could cleanly subclass `Model` in this case)
> A submodel is part of a model, and a model is one or more submodels. Not quite sure how that should interact with model architectures, though. Each submodel could have its own architecture and tensor mappings, but I don't know what the main model architecture would be (the first submodel? a meta-model? or maybe there doesn't need to be a main one).

Those are some interesting questions. What I'm thinking is:
- A model is currently equal to one HF repo. So for example with multimodal, one model contains both the text model's tensors and the vision/audio/etc model's tensors
- A submodel is equal to a `llama_model` or `clip_model`, which only loads the tensors that it needs
- Submodels compatible with `libllama` are distinguished by model arch, so currently each submodel has one `submodel.arch`. But this can be tricky in the case of models for `clip.cpp`, which does not care about arch (the equivalent is the notion of "projector type")
> (I also don't know how `MultimodalModel` could cleanly subclass `Model` in this case)

So from my POV above, what I'm thinking is that a submodel is just a `Model` with a custom list of tensors and metadata.
One idea could be (rough sketch below):
- Having a generic `Model` that provides some basic functions like reading safetensors, `GGUFWriter`, etc.
- Moving LLM-specific logic (like vocab) into a base class `TextModel` that inherits `Model`
- Finally, `MultimodalModel` inherits `Model`
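A minimal sketch of that hierarchy (names taken from the list above; the bodies and the `set_projector_params` helper are placeholders, not the actual implementation):

```python
import gguf


class Model:
    # generic parts: reading hparams/safetensors, owning the GGUFWriter, ...
    def __init__(self, fname_out: str, arch: str):
        self.gguf_writer = gguf.GGUFWriter(fname_out, arch)


class TextModel(Model):
    # LLM-specific logic (vocab, etc.) moves down here
    def set_vocab(self) -> None:
        raise NotImplementedError


class MultimodalModel(Model):
    # projector-specific metadata and tensors go here
    def set_projector_params(self) -> None:  # hypothetical helper
        raise NotImplementedError
```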
> Since model loading doesn't quite support sub-models yet (we'll need to figure out namespaces or other ideas from #13028)

Please note that this PR and the mentioned issue are not quite related atm. The main problem is that mmproj is currently not supported by `libllama`, so it's currently not possible to bundle mmproj + LLM.
My current PR is made mostly for 2 purposes:
- To provide a more intuitive way to get the mmproj file, since the current way requires first doing a surgery, then converting 2 models
- To unify all conversion scripts under `examples/llava`, since some of them are very hacky and it's better to just abandon them
> - Having a generic `Model` that provides some basic functions like reading safetensors, `GGUFWriter`, etc.
> - Moving LLM-specific logic (like vocab) into a base class `TextModel` that inherits `Model`
> - Finally, `MultimodalModel` inherits `Model`
Ok, so I ended up doing this and it seems to be more generic (while being less hacky at the same time); please have a look at this commit: ddd7920
The main idea is to have `VisionModel` and `TextModel` both inherit the `Model` super class, and existing text models inherit `TextModel` (hence why you see many LOC changed in the commit, but most of them are just changing `Model` --> `TextModel`).
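For example, a text model class would only change its base class (illustrative fragment; the real diff is in ddd7920):

```python
class LlamaModel(TextModel):  # previously: class LlamaModel(Model)
    model_arch = gguf.MODEL_ARCH.LLAMA
```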
Btw, it would be nice if we can finalize this during the week, so I can go ahead and add SmolVLM support. The `clip.cpp` implementation should be very straightforward; the only thing blocking me is that the current `mmproj` conversion script is a nightmare to work with 😂 So it would be nice if we can finally use `convert_hf_to_gguf` to get the `mmproj`.
> The main idea is to have `VisionModel` and `TextModel` both inherit the `Model` super class
Right, this does feel much better, especially with how quant types are overridden in the intended way.
`LoraModel` will likely need adaptation, though. Not sure if it should still be based on `TextModel` or `Model`. (Does it make sense to have LoRA adapters of mmproj?)
> (Does it make sense to have LoRA adapters of mmproj?)
I haven't seen anyone doing this, so I guess it doesn't make sense practically. In most (if not all) cases, people are interested in doing LoRA for the text model because it's easier to prepare the dataset.
And since `LoraModel` uses `Model.from_model_architecture`, which returns the `TextModel` subclass by default, I think it will continue to work as-is. Can you think of any cases which need to be adapted?
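For reference, that lookup is roughly this shape (simplified from memory of `convert_hf_to_gguf.py`, where `_model_classes` is populated by the `@Model.register` decorator):

```python
@classmethod
def from_model_architecture(cls, arch: str) -> type[Model]:
    try:
        # registered subclasses are collected by the @Model.register decorator
        return cls._model_classes[arch]
    except KeyError:
        raise NotImplementedError(f'Architecture {arch!r} not supported!') from None
```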
```python
self.gguf_writer.add_uint32(gguf.Keys.ClipVision.IMAGE_SIZE, self.find_hparam(["image_size"]))
self.gguf_writer.add_uint32(gguf.Keys.ClipVision.PATCH_SIZE, self.find_hparam(["patch_size"]))
self.gguf_writer.add_uint32(gguf.Keys.ClipVision.EMBEDDING_LENGTH, self.find_hparam(["hidden_size"]))
self.gguf_writer.add_uint32(gguf.Keys.ClipVision.FEED_FORWARD_LENGTH, self.find_hparam(["intermediate_size"]))
self.gguf_writer.add_uint32(gguf.Keys.ClipVision.BLOCK_COUNT, self.find_hparam(["num_hidden_layers"]))
self.gguf_writer.add_uint32(gguf.Keys.ClipVision.Attention.HEAD_COUNT, self.find_hparam(["num_attention_heads"]))
```
Note that I didn't add `gguf_writer.add_*` wrapper functions because I don't yet have the full list of keys. This can be done in a follow-up PR.
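Presumably such a wrapper would look something like this (hypothetical method names, mirroring the existing `GGUFWriter.add_*` helpers):

```python
# hypothetical additions to gguf.GGUFWriter
def add_vision_image_size(self, value: int) -> None:
    self.add_uint32(gguf.Keys.ClipVision.IMAGE_SIZE, value)

def add_vision_patch_size(self, value: int) -> None:
    self.add_uint32(gguf.Keys.ClipVision.PATCH_SIZE, value)
```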
"In the future, when mmproj and text model can be merged into single gguf (proposal: #11292), we can remove or deprecate this flag" This is a good idea in theory, however from an enduser perspective, some might not use vision for multimodal models like Gemma 3 and for those, always loading the mmproj would slow down their experience by a lot as the vision model takes up quite a few of GPU layers. When integrating the vision model into the model GGUF file, I think it would make sense to give the users the option to load the included mmproj in the GGUF with a command like "--load-mmproj" while the default shall not load it. That way users can continue to decide to prioritize speed or vision capabilities, depending on their needs and hardware configuration. Just a small suggestion. Thank you for your great work! |
Once a GGUF can contain more than 1 model, it will be simple to add the ability to load only part of it (and even if I don't add this feature, someone will, so I think we don't need to worry too much about that).

The more complicated thing is how to get there. New models have started to have not just vision but also audio input, and some may output more than just text. At some point we will have to specify a bunch of things just to run a single model. For example, in theory, phi-4-mm alone requires 5 GGUFs to run both audio+vision, so having the ability to export a single GGUF is still much better than nothing.
convert_hf_to_gguf.py
Outdated
```python
TEXT = 1
VISION = 2


AnyModel = TypeVar("AnyModel", bound="type[Model]")


class Model:
```
On second thought, would it be better to name this base class `ModelBase`? (Or maybe something more function-oriented, like `ModelWriter` or `ModelConverter`.)

This is to prevent other open PRs from accidentally keeping their models inheriting `Model` upon merging to master.
Yes, renaming `Model` is a good idea since it did change role. `ModelBase` seems good.

That will also trigger type errors in `convert_lora_to_gguf.py`, though.
I updated both the `convert_lora` and `convert_hf` scripts in 93b5f71
convert_hf_to_gguf.py
Outdated
```python
undo_permute = False
ignore_vision = True
```
This isn't used anywhere
```python
ignore_vision = True
```
Unrelated to the conversation, but Llama 4 vision support also seems to be a low-hanging fruit. They no longer use cross-attn like in Llama 3; here it's just simple embeddings passed from encoder to decoder, so it would also be a nice thing to try out.
(Noting here so I remember)
convert_hf_to_gguf.py
Outdated
```diff
 class Mistral3Model(LlamaModel):
     model_arch = gguf.MODEL_ARCH.LLAMA

     # we need to merge the text_config into the root level of hparams
     def __init__(self, *args, **kwargs):
-        hparams = kwargs["hparams"] if "hparams" in kwargs else Model.load_hparams(args[0])
+        hparams = kwargs["hparams"] if "hparams" in kwargs else ModelBase.load_hparams(args[0])
         if "text_config" in hparams:
             hparams = {**hparams, **hparams["text_config"]}
         kwargs["hparams"] = hparams
```
This might not be necessary anymore, since `ModelBase.load_hparams` has handled `text_config` since d5e03e6
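i.e. the merge presumably now happens once in the base class, roughly like this (a sketch assuming the usual `config.json` layout, not the exact code from d5e03e6):

```python
import json
from pathlib import Path
from typing import Any


class ModelBase:
    @staticmethod
    def load_hparams(dir_model: Path) -> dict[str, Any]:
        with open(dir_model / "config.json", encoding="utf-8") as f:
            hparams = json.load(f)
        # flatten text_config into the root level, like Mistral3Model used to do
        if "text_config" in hparams:
            hparams = {**hparams, **hparams["text_config"]}
        return hparams
```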
Thanks for noticing, removed in e37dec6
I have never looked into the Pixtral / Mistral "vision" series until now. They use a quite clever approach: add a special text token in between image embeddings to preserve the spatial info. Seems like a low-hanging fruit to try.
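If I understand correctly, the interleaving looks roughly like this (an illustrative sketch, not `clip.cpp` code; all names are made up):

```python
def interleave_rows(patch_rows: list[list], img_break_embed) -> list:
    # patch_rows: per-row lists of image patch embeddings;
    # img_break_embed: the embedding of the special row-break text token
    out = []
    for row in patch_rows:
        out.extend(row)
        out.append(img_break_embed)  # marks end-of-row, preserving the 2D layout
    return out
```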
Co-authored-by: compilade <git@compilade.net>
* convert : experimental support for `--mmproj` flag
* fix bad ctrl+f replace
* fix style
* split into subclasses TextModel and VisionModel
* rename Mode --> ModelBase
* small fix
* correct CLIP_VISION arch name (because existing GGUF already use it)
* Apply suggestions from code review

Co-authored-by: compilade <git@compilade.net>

* fix Mistral3Model
* fix typo

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: compilade <git@compilade.net>
This PR brings first support for getting the `mmproj` file directly from the `convert_hf_to_gguf.py` script.

It is enabled via an additional flag `--mmproj`. Once specified, the multimodal projector will be converted to a dedicated GGUF file, eliminating the need for a surgery script and a dedicated mmproj conversion script.

Currently, ONLY Gemma 3 is supported.
Demo:
Design decisions:
- `MODEL_ARCH.CLIP_VISION` is added to allow using the `tensor_mapping.py` infrastructure. This may look quite ugly, but currently it's the easiest way
- The `GGUFWriter` for the mmproj file is wrapped inside a `MultimodalModel` class. This reduces the complexity of the main `Model` object, but I could be wrong. To be discussed