convert : experimental support for `--mmproj` flag #13023
Conversation
convert_hf_to_gguf.py
Outdated
```python
# for vision encoders
mmproj: bool
ignore_vision: bool = False  # subclasses may overwrite this
mtmd_model: MultimodalModel | None = None
```
@compilade currently, the `GGUFWriter` for the mmproj file is wrapped inside `MultimodalModel`. The consequence is that `MultimodalModel` is now an attribute of `Model`.

Another way is to make `MultimodalModel` inherit `Model`, but this seems a bit complicated to think about. Not sure which way you prefer?
> currently, the `GGUFWriter` for mmproj file is wrapped inside `MultimodalModel`.

This means that when `self.mmproj` is true, then `self.gguf_writer` is unused (but still created (!)), and another `GGUFWriter` is created somewhere in `self.mtmd_model`. It works because the output files are no longer opened/created as soon as the `GGUFWriter` is instantiated since #7827. (but there's still some unnecessary metadata keys set and ignored)
There's probably some way to simplify this.
What seems to be needed (eventually, to make this cleaner) is some more general abstraction to convert submodels (unless I'm misunderstanding the problem).
A submodel is part of a model, and a model is one or more submodels. Not quite sure how that should interact with model architectures, though. Each submodel could have its own architecture and tensor mappings, but I don't know what the main model architecture would be (the first submodel? a meta-model? or maybe there doesn't need to be a main one).
Since model loading doesn't quite support sub-models yet (we'll need to figure out namespaces or other ideas from #13028), only one submodel can be exported at a time, but at least conceptually it might be simpler to adapt such an abstraction to actually include multiple submodels in a single GGUF file once we've figured that out.
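To make the containment idea concrete, something like this might work (a hypothetical sketch; none of these names exist in the codebase, and the fields are guesses):

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class Submodel:
    arch: str                       # e.g. "llama" for the LLM, "clip" for the projector
    tensor_map: dict[str, str]      # HF tensor name -> GGUF tensor name
    tensors: dict[str, np.ndarray] = field(default_factory=dict)


@dataclass
class ConvertedModel:
    # a model is one or more submodels; which submodel (if any) provides
    # the "main" architecture is exactly the open question above
    submodels: list[Submodel] = field(default_factory=list)
```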
> Another way is to make `MultimodalModel` inherit `Model`, but this seems a bit complicated to think about. Not sure which way you prefer?
I think I prefer the way you currently did it for now, because you're right that `Model` does a lot, and refactoring multimodal support will be simpler by duplicating some parts of `Model` in a smaller class like with `MultimodalModel` until we figure out something cleaner.

(I also don't know how `MultimodalModel` could cleanly subclass `Model` in this case)
> A submodel is part of a model, and a model is one or more submodels. Not quite sure how that should interact with model architectures, though. Each submodel could have its own architecture and tensor mappings, but I don't know what the main model architecture would be (the first submodel? a meta-model? or maybe there doesn't need to be a main one).

Those are some interesting questions. What I'm thinking is:
- A model is currently equal to one HF repo. So for example with multimodal, one model contains both the text model's tensors and the vision/audio/etc model's tensors
- A submodel is equal to a `llama_model` or `clip_model`, which only loads the tensors that it needs
- Submodels compatible with `libllama` are distinguished by model arch, so currently each submodel has one `submodel.arch`. But this can be tricky in the case of models for `clip.cpp`, which does not care about arch (the equivalent is the notion of "projector type")
> (I also don't know how `MultimodalModel` could cleanly subclass `Model` in this case)

So from my POV above, what I'm thinking is that a submodel is just a `Model` with a custom list of tensors and metadata.
One idea could be (rough sketch below):
- Having a generic `Model` that provides some basic functions like reading safetensors, `GGUFWriter`, etc.
- Moving LLM-specific logic (like vocab) into a base class `TextModel` that inherits `Model`
- Finally, `MultimodalModel` inherits `Model`
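A minimal sketch of that hierarchy (names taken from the list above; the bodies and the `set_projector_params` helper are placeholders, not the actual implementation):

```python
import gguf


class Model:
    # generic parts: reading hparams/safetensors, owning the GGUFWriter, ...
    def __init__(self, fname_out: str, arch: str):
        self.gguf_writer = gguf.GGUFWriter(fname_out, arch)


class TextModel(Model):
    # LLM-specific logic (vocab, etc.) moves down here
    def set_vocab(self) -> None:
        raise NotImplementedError


class MultimodalModel(Model):
    # projector-specific metadata and tensors go here
    def set_projector_params(self) -> None:  # hypothetical helper
        raise NotImplementedError
```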
> Since model loading doesn't quite support sub-models yet (we'll need to figure out namespaces or other ideas from #13028)

Please note that this PR and the mentioned issue are not quite related atm. The main problem is that mmproj is currently not supported by `libllama`, so it's currently not possible to bundle mmproj + LLM.
My current PR is made mostly for 2 purposes:
- To provide a more intuitive way to get the mmproj file, since the current way requires first doing a surgery, then converting 2 models
- To unify all conversion scripts under `examples/llava`, since some of them are very hacky and it's better to just abandon them
> - Having a generic `Model` that provides some basic functions like reading safetensors, `GGUFWriter`, etc.
> - Moving LLM-specific logic (like vocab) into a base class `TextModel` that inherits `Model`
> - Finally, `MultimodalModel` inherits `Model`
Ok, so I ended up doing this and it seems to be more generic (while being less hacky at the same time); please have a look at this commit: ddd7920
The main idea is to have `VisionModel` and `TextModel` both inherit the `Model` super class, and existing text models inherit `TextModel` (hence why you see many LOC changed in the commit, but most of them are just changing `Model` --> `TextModel`).
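For example, a text model class would only change its base class (illustrative fragment; the real diff is in ddd7920):

```python
class LlamaModel(TextModel):  # previously: class LlamaModel(Model)
    model_arch = gguf.MODEL_ARCH.LLAMA
```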
Btw, it would be nice if we can finalize this during the week, so I can go ahead and add SmolVLM support. The `clip.cpp` implementation should be very straightforward; the only thing blocking me is that the current `mmproj` conversion script is a nightmare to work with 😂 So it would be nice if we can finally use `convert_hf_to_gguf` to get the `mmproj`.
> The main idea is to have `VisionModel` and `TextModel` both inherit the `Model` super class
Right, this does feel much better, especially with how quant types are overridden in the intended way.
`LoraModel` will likely need adaptation, though. Not sure if it should still be based on `TextModel` or `Model`. (Does it make sense to have LoRA adapters of mmproj?)
> (Does it make sense to have LoRA adapters of mmproj?)
I haven't seen anyone doing this, so I guess it doesn't make sense practically. In most (if not all) cases, people are interested in doing LoRA for the text model because it's easier to prepare the dataset.
And since `LoraModel` uses `Model.from_model_architecture`, which returns the `TextModel` subclass by default, I think it will continue to work as-is. Can you think of any cases which need to be adapted?
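For reference, that lookup is roughly this shape (simplified from memory of `convert_hf_to_gguf.py`, where `_model_classes` is populated by the `@Model.register` decorator):

```python
@classmethod
def from_model_architecture(cls, arch: str) -> type[Model]:
    try:
        # registered subclasses are collected by the @Model.register decorator
        return cls._model_classes[arch]
    except KeyError:
        raise NotImplementedError(f'Architecture {arch!r} not supported!') from None
```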
```python
self.gguf_writer.add_uint32(gguf.Keys.ClipVision.IMAGE_SIZE, self.find_hparam(["image_size"]))
self.gguf_writer.add_uint32(gguf.Keys.ClipVision.PATCH_SIZE, self.find_hparam(["patch_size"]))
self.gguf_writer.add_uint32(gguf.Keys.ClipVision.EMBEDDING_LENGTH, self.find_hparam(["hidden_size"]))
self.gguf_writer.add_uint32(gguf.Keys.ClipVision.FEED_FORWARD_LENGTH, self.find_hparam(["intermediate_size"]))
self.gguf_writer.add_uint32(gguf.Keys.ClipVision.BLOCK_COUNT, self.find_hparam(["num_hidden_layers"]))
self.gguf_writer.add_uint32(gguf.Keys.ClipVision.Attention.HEAD_COUNT, self.find_hparam(["num_attention_heads"]))
```
Note that I didn't add `gguf_writer.add_*` wrapper functions because I don't yet have the full list of keys. This can be done in a follow-up PR.
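Presumably such a wrapper would look something like this (hypothetical method names, mirroring the existing `GGUFWriter.add_*` helpers):

```python
# hypothetical additions to gguf.GGUFWriter
def add_vision_image_size(self, value: int) -> None:
    self.add_uint32(gguf.Keys.ClipVision.IMAGE_SIZE, value)

def add_vision_patch_size(self, value: int) -> None:
    self.add_uint32(gguf.Keys.ClipVision.PATCH_SIZE, value)
```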
"In the future, when mmproj and text model can be merged into single gguf (proposal: #11292), we can remove or deprecate this flag" This is a good idea in theory, however from an enduser perspective, some might not use vision for multimodal models like Gemma 3 and for those, always loading the mmproj would slow down their experience by a lot as the vision model takes up quite a few of GPU layers. When integrating the vision model into the model GGUF file, I think it would make sense to give the users the option to load the included mmproj in the GGUF with a command like "--load-mmproj" while the default shall not load it. That way users can continue to decide to prioritize speed or vision capabilities, depending on their needs and hardware configuration. Just a small suggestion. Thank you for your great work! |
Once a GGUF can contain more than 1 model, it will be simple to add the ability to load only part of it (and even if I don't add this feature, someone will, so I think we don't need to worry too much about that).

The more complicated thing is how to get there. New models have started to have not just vision but also audio input, and some may output more than just text. At some point we will have to specify a bunch of things just to run a single model. For example, in theory, phi-4-mm alone requires 5 GGUFs to run both audio+vision, so having the ability to export a single GGUF is still much better than nothing.
convert_hf_to_gguf.py
Outdated
```python
TEXT = 1
VISION = 2


AnyModel = TypeVar("AnyModel", bound="type[Model]")


class Model:
```
On second thought, would it be better to name this base class `ModelBase`? (Or maybe something more function-oriented, like `ModelWriter` or `ModelConverter`.)

This is to prevent other open PRs from accidentally keeping their models inheriting `Model` upon merging to master.
Yes, renaming `Model` is a good idea since it did change role. `ModelBase` seems good.

That will also trigger type errors in `convert_lora_to_gguf.py`, though.
I updated both the `convert_lora` and `convert_hf` scripts in 93b5f71
convert_hf_to_gguf.py
Outdated
```python
undo_permute = False
ignore_vision = True
```
This isn't used anywhere
```python
ignore_vision = True
```
Unrelated to the conversation, but Llama 4 vision support also seems to be a low-hanging fruit. They no longer use cross-attn like in Llama 3; here it's just simple embeddings passed from encoder to decoder, so it would also be a nice thing to try out.
(Noting here so I remember)
convert_hf_to_gguf.py
Outdated
```diff
 class Mistral3Model(LlamaModel):
     model_arch = gguf.MODEL_ARCH.LLAMA

     # we need to merge the text_config into the root level of hparams
     def __init__(self, *args, **kwargs):
-        hparams = kwargs["hparams"] if "hparams" in kwargs else Model.load_hparams(args[0])
+        hparams = kwargs["hparams"] if "hparams" in kwargs else ModelBase.load_hparams(args[0])
         if "text_config" in hparams:
             hparams = {**hparams, **hparams["text_config"]}
         kwargs["hparams"] = hparams
```
This might not be necessary anymore, since `ModelBase.load_hparams` has handled `text_config` since d5e03e6
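i.e. the merge presumably now happens once in the base class, roughly like this (a sketch assuming the usual `config.json` layout, not the exact code from d5e03e6):

```python
import json
from pathlib import Path
from typing import Any


class ModelBase:
    @staticmethod
    def load_hparams(dir_model: Path) -> dict[str, Any]:
        with open(dir_model / "config.json", encoding="utf-8") as f:
            hparams = json.load(f)
        # flatten text_config into the root level, like Mistral3Model used to do
        if "text_config" in hparams:
            hparams = {**hparams, **hparams["text_config"]}
        return hparams
```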
Thanks for noticing, removed in e37dec6
I have never looked into the Pixtral / Mistral "vision" series until now. They use a quite clever approach: add a special text token in between image embeddings to preserve the spatial info. Seems like a low-hanging fruit to try.
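If I understand correctly, the interleaving looks roughly like this (an illustrative sketch, not `clip.cpp` code; all names are made up):

```python
def interleave_rows(patch_rows: list[list], img_break_embed) -> list:
    # patch_rows: per-row lists of image patch embeddings;
    # img_break_embed: the embedding of the special row-break text token
    out = []
    for row in patch_rows:
        out.extend(row)
        out.append(img_break_embed)  # marks end-of-row, preserving the 2D layout
    return out
```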
Co-authored-by: compilade <git@compilade.net>
* convert : experimental support for `--mmproj` flag
* fix bad ctrl+f replace
* fix style
* split into subclasses TextModel and VisionModel
* rename Mode --> ModelBase
* small fix
* correct CLIP_VISION arch name (because existing GGUF already use it)
* Apply suggestions from code review

Co-authored-by: compilade <git@compilade.net>

* fix Mistral3Model
* fix typo

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: compilade <git@compilade.net>
This PR brings first support for getting the `mmproj` file directly from the `convert_hf_to_gguf.py` script.

It is enabled via an additional flag `--mmproj`. Once specified, the multimodal projector will be converted to a dedicated GGUF file, eliminating the need for a surgery script and a dedicated mmproj conversion script.

Currently, ONLY Gemma 3 is supported.
Demo:
Design decisions:
- `MODEL_ARCH.CLIP_VISION` is added to allow using the `tensor_mapping.py` infrastructure. This may look quite ugly, but currently it's the easiest way
- The `GGUFWriter` for the mmproj file is wrapped inside a `MultimodalModel` class. This reduces the complexity of the main `Model` object, but I could be wrong. To be discussed