llava : introduce libmtmd #12849

Merged: 12 commits into ggml-org:master, Apr 10, 2025

Conversation

@ngxson (Collaborator) commented Apr 9, 2025

This PR introduces a new library called mtmd (lib MulTi-MoDal), which aims to provide a unified vision API and eliminate the need for a new CLI for each model.

(Naming: both libmm and libmmd are already taken, so I went with libmtmd.)

Motivation

While there is already some work on #11292, I think it is still far from finished. The problems are:

  • It is very hard to design a proper vision API (or, more generally, a multimodal API) without prior trial & error
  • Even if we could do that, it would take too much time, and in the meantime many new models are being released, which would make past API designs outdated
  • And most importantly, users want to use it today, not in weeks or even months

So the philosophy that I'm proposing here is to:

  1. Continue experimenting on the existing code (while making it more stable) --> the main reason why I designed mtmd
  2. This API should be easy enough to implement in llama-server (YES, finally!!)
  3. Once we are happy with it, and once we have APIs like llama_batch_ext ready, we will move this API into libllama

Design goals of mtmd

  1. It will expose both C++ and C-style APIs in a single header. For now, only C++ is supported, but the C-style API will be added soon.
  2. It is linked against llama.h and clip.h
  3. It aims to support more than just text+image. Many audio-input models are already out there (like phi-4-mm and qwen-omni). Have a look at mtmd_input_chunk to see how this is handled
  4. The API should resemble the one from llama : second attempt to refactor vision API #11292, so that in the future we can have a drop-in replacement

On a high level, here is how it works:

  1. The user provides an input prompt with "markers". Currently we support the <__image__> marker, so an example prompt could be: user: what is the difference between this image <__image__> and this image <__image__>
  2. mtmd_tokenize tokenizes the prompt into chunks; for example, the prompt above produces 5 chunks:
    • text: user: what is the difference between this image <start_of_image>
    • image tokens
    • text: <end_of_image> and this image <start_of_image>
    • image tokens
    • text: <end_of_image>
  3. mtmd_encode() is called on each image chunk; the output can be fetched using mtmd_get_output_embd()
  4. Run mtmd_decode() to decode the output embeddings

NOTE:

  • To support audio, we can simply add a new marker <__audio__> and update mtmd_tokenize and mtmd_encode to support it
  • The tokens around images, like <start_of_image> and <end_of_image>, are specific to a given model, so mtmd_tokenize must know which model it is working with
  • There is a helper function mtmd_helper_eval that allows the user to do steps 3+4 more easily (a short sketch using it follows the code demo below)

Code demo

llama_model * lmodel = ...; // this is the text model
mtmd_context_ptr ctx_vision = mtmd_init_from_file("mmproj.gguf", lmodel, params);

// prepare text
mtmd_input_text text;
text.text          = "What is this: <__image__>\n";
text.add_special   = add_bos;
text.parse_special = true;

// read image
mtmd_bitmap bitmap;
mtmd_bitmap_init_from_file("my_image.jpg", bitmap);

// tokenize everything
std::vector<mtmd_input_chunk> chunks;
mtmd_tokenize(ctx_vision, chunks, text, {bitmap});

// encode it
// NOTE: this may need to be repeated for each chunk of image
mtmd_encode(ctx_vision, chunks[1]);
float * embd = mtmd_get_output_embd(ctx_vision);

// create batch and decode
// NOTE: this may need to be repeated for each chunk of text and image
llama_batch batch = ...; // put embd into the batch
llama_decode(lctx, batch);
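
For completeness, here is a hedged sketch of the same flow using the mtmd_helper_eval helper mentioned in the note above; the signature follows the gemma3-cli excerpt further down in this thread and may not match the final API exactly:

// sketch only: the same flow, letting mtmd_helper_eval handle the per-chunk
// encode + decode loop (steps 3 + 4 above); n_past and n_batch are assumed
// to be defined by the caller, and the exact signatures may differ
std::vector<mtmd_input_chunk> chunks;
mtmd_tokenize(ctx_vision, chunks, text, {bitmap});

if (mtmd_helper_eval(ctx_vision, lctx, chunks, n_past, /* seq_id */ 0, n_batch)) {
    LOG_ERR("Unable to eval prompt\n");
    return 1;
}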

What I've done in this PR:

  • Added the mtmd.cpp implementation and C++ header
  • Added the CMakeLists config for it
  • Migrated gemma3-cli.cpp to use it

@ngxson ngxson requested a review from ggerganov April 9, 2025 10:35
Comment on lines 55 to 56
target_compile_options(llava PRIVATE -Wno-cast-qual) # stb_image.h
target_compile_options(llava2 PRIVATE -Wno-cast-qual) # stb_image.h
@ngxson (Collaborator Author)

Probably better to wrap stb_image.h into a compilation unit (will do this in another PR). The only functionality we use from stb_image.h is decoding an image into a bitmap anyway, so I think it's fine to place it behind a wrapper.
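
As a rough illustration of that idea (file and function names here are hypothetical, not from the actual follow-up PR), the wrapper could look something like this:

// image_io.h -- hypothetical wrapper interface; callers never see stb_image.h
#pragma once
#include <cstdint>
#include <vector>

// decode an image file into an 8-bit RGB bitmap; returns false on failure
bool image_load_rgb(const char * fname, std::vector<uint8_t> & data, int & nx, int & ny);

// image_io.cpp -- the single compilation unit that includes stb_image.h,
// so -Wno-cast-qual only needs to be applied to this one file
#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"
#include "image_io.h"

bool image_load_rgb(const char * fname, std::vector<uint8_t> & data, int & nx, int & ny) {
    int nc = 0;
    unsigned char * px = stbi_load(fname, &nx, &ny, &nc, 3); // force 3 channels (RGB)
    if (!px) {
        return false;
    }
    data.assign(px, px + (size_t) nx * ny * 3);
    stbi_image_free(px);
    return true;
}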

Comment on lines +204 to +220
struct clip_image_u8_deleter {
    void operator()(clip_image_u8 * val) { clip_image_u8_free(val); }
};

struct clip_image_f32_deleter {
    void operator()(clip_image_f32 * val) { clip_image_f32_free(val); }
};

struct clip_image_f32_batch_deleter {
    void operator()(clip_image_f32_batch * val) { clip_image_f32_batch_free(val); }
};

typedef std::unique_ptr<clip_image_u8, clip_image_u8_deleter> clip_image_u8_ptr;
typedef std::unique_ptr<clip_image_f32, clip_image_f32_deleter> clip_image_f32_ptr;
typedef std::unique_ptr<clip_image_f32_batch, clip_image_f32_batch_deleter> clip_image_f32_batch_ptr;

// TODO @ngxson : we're currently having a naming clash between struct clip_image_size and function clip_image_size()
@ngxson (Collaborator Author) commented Apr 9, 2025

This is what I was talking about in #12834 (comment)

In a follow-up PR, I'll use this inside clip.cpp
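
For context, a small hedged example of how the RAII wrappers above could be used (the clip_image_* init/load functions are the existing ones from clip.h; the surrounding flow is illustrative only):

// illustrative sketch: with the typedefs above, intermediate images are freed
// automatically on every exit path instead of via manual *_free calls
clip_image_u8_ptr  img_u8(clip_image_u8_init());
clip_image_f32_ptr img_f32(clip_image_f32_init());

if (!clip_image_load_from_file("my_image.jpg", img_u8.get())) {
    return false; // both images are released here automatically
}
// ... preprocess img_u8 into img_f32 and run the CLIP encoder ...
// both images are also released when the pointers go out of scope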

@cmp-nct (Contributor) commented Apr 9, 2025

It's nice to see some progress here. Please also look into the backend support of clip: it's just a two-liner to add CUDA and similar acceleration to CLIP, but that code was disabled because people kept creating issues about some of the more exotic accelerators failing.

Another thing is the naming convention. LLAVA-2 is actually the name of LLAVA-Next, I think, and it's just one specific visual architecture among many.
Some developers/researchers did not like having "llava" in the name of their architecture integration.
I recommend choosing a name that's not "branded", like MM_VISION or similar.

@ngxson (Collaborator Author) commented Apr 10, 2025

It's nice to see some progress here. Please also look into the backend support of clip, it's just a two-liner to add CUDA and similar acceleration to CLIP but that code was disabled as people kept creating issues for some of the more exotic accelerators failing.

You mean this? #12322

I recommend choosing a name that's not "branded", like MM_VISION or similar.

I was thinking about libmm / mm.cpp, but it could be confused with mul_mat in this context (i.e. gemm/sgemm/dgemm).

But on second thought, I think it's ok, since people have been using the mm_ prefix for multimodal projector tensor names. WDYT @ggerganov ?

The second best choice could be mmd, mmdal or mtmd

@ggerganov (Member)

@ngxson This seems like a good proposal. This approach would allow us to make faster progress on adding multi-modality support to the examples, without being too worried about breaking the libllama API in the process, and it would provide a lot of useful feedback about the best way to extend the API in the future.

It will expose both C++ and C-style APIs in a single header. For now, only C++ is supported, but the C-style API will be added soon.

Yes, so at least at the start, 3rd-party apps would not be able to use this multi-modality functionality through a low-level C-style API, but it would be available through the server at some point. Later on, we will expand the C API as needed. I think that eventually libllama should be able to support all the multi-modal functionality.

Btw, maybe it's a good time to finally start separating the actual "tools" like llama-server from the "examples".

Overall I feel positive about this suggestion as this will lift a lot of the pressure to make various important changes in the libllama API and the KV cache implementation that are needed to unblock the multi-modality implementation. These changes are important and they need time to make them right and avoid going back and breaking the developer experience. But at the same time, we don't want to hold back the addition of new interesting features such as multi-modality for too long.

@slaren Would appreciate hearing your opinion on this proposal. Do you have any recommendations/concerns?

@ngxson ngxson changed the title from "llava : introduce llava2 library" to "llava : introduce libmm" on Apr 10, 2025
@ngxson (Collaborator Author) commented Apr 10, 2025

Seems like libmm and libmmd are both taken, but libmtmd is not, so I'm going with that name.

@ngxson ngxson changed the title from "llava : introduce libmm" to "llava : introduce libmtmd" on Apr 10, 2025
@ngxson ngxson requested a review from slaren April 10, 2025 13:50
@isaac-mcfadyen (Contributor) commented Apr 10, 2025

  • Currently we support <__image__> marker, so an example prompt could be: user: what is the difference between this image <__image__> and this image <__image__>

Not sure if this is an implementation detail or something exposed to users in any form, but a lot of models already use <|image|>, e.g. LLaMa 3.2, Phi 4 (with a number added denoting the image number), and Qwen2.5 VL (without the || braces).

Is there a specific reason that <__image__> was chosen?

@ngxson (Collaborator Author) commented Apr 10, 2025

Is there a specific reason that <__image__> was chosen?

No one is using it, so that's why I took it. But this can be customized when loading the model; there is a param for that.

This token is used internally anyway (via the chat template), so it's not important which one is used.

If we want to be even more future-proof, we could even generate a random token each time, something like <image_a6625fa>, and then pass the token via the mtmd_bitmap object.
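
A minimal sketch of that idea (purely hypothetical, not part of this PR):

#include <cstdio>
#include <random>
#include <string>

// hypothetical: generate a per-image marker such as <image_a6625fa> so it can
// never collide with a marker token a model already defines
static std::string random_image_marker() {
    static std::mt19937 rng(std::random_device{}());
    char buf[32];
    std::snprintf(buf, sizeof(buf), "<image_%07x>", (unsigned) (rng() & 0xfffffff));
    return buf;
}

// the generated marker would then be attached to the (hypothetical) bitmap
// object so mtmd_tokenize can match it back to the right image:
//   bitmap.id = random_image_marker();
//   prompt   += bitmap.id;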

Comment on lines +45 to +53
add_library(mtmd_static STATIC $<TARGET_OBJECTS:mtmd>)
if (BUILD_SHARED_LIBS)
    set_target_properties(mtmd PROPERTIES POSITION_INDEPENDENT_CODE ON)
    target_compile_definitions(mtmd PRIVATE LLAMA_SHARED LLAMA_BUILD)
    add_library(mtmd_shared SHARED $<TARGET_OBJECTS:mtmd>)
    target_link_libraries(mtmd_shared PRIVATE ggml llama ${CMAKE_THREAD_LIBS_INIT})
    install(TARGETS mtmd_shared LIBRARY)
endif()

Member

I'm not really sure what the need is for building both static and shared libs here. Probably something to revisit and simplify.

Comment on lines +183 to +197
mtmd_input_text text;
text.text = formatted_chat.prompt;
text.add_special = add_bos;
text.parse_special = true;
mtmd_input_chunks_ptr chunks(mtmd_tokenize(ctx.ctx_vision.get(), text, bitmaps));
if (chunks == nullptr) {
    LOG_ERR("Unable to tokenize prompt\n");
    return 1;
}

if (mtmd_helper_eval(ctx.ctx_vision.get(), ctx.lctx, chunks.get(), ctx.n_past, 0, ctx.n_batch)) {
    LOG_ERR("Unable to eval prompt\n");
    return 1;
}

Member

I think the main question will be how the interaction between libmtmd and libllama will work. In the case of Gemma 3, we already have the necessary API, so all is good and we don't need changes in libllama to make it work. But for other models, there will be changes needed to the API, and this is where we have to carefully think it through; it might take more effort to provide the necessary support.

@ngxson (Collaborator Author) replied Apr 10, 2025

In the short term, my idea is to have libllama consume input from libmtmd. In other words, libllama only needs to know about text tokens and embeddings, while libmtmd will handle image/audio/etc. and convert them into tokens/embeddings.

This approach will keep these two libraries completely separate for now. When we bring mtmd into libllama, we will mostly focus on how to pass data between the two parts more efficiently (for example, passing by tensor as @slaren suggested).

For now, some mtmd_helper_* functions will be added, which may use libllama under the hood. This is because we don't yet have a notion similar to common.h specifically for mtmd, but it will be simple to add in the future.

@ngxson (Collaborator Author) replied Apr 10, 2025

But for other models, there will be changes needed to the API, and this is where we have to carefully think it through; it might take more effort to provide the necessary support.

This is something I'll try to avoid at this stage, possibly by using hooks/hacks like what I did in my CSM impl. Then, when more models are using it, we can make an API for it.

@ngxson ngxson merged commit 8b9cc7c into ggml-org:master Apr 10, 2025
51 checks passed
colout pushed a commit to colout/llama.cpp that referenced this pull request Apr 21, 2025
* wip llava2

* migrated gemma3 to llava2

* add timings

* correct pre/postfix

* fix missing include

* fix compilation unused var warn

* update llava2_tokenize

* change name llava2 --> mtmd

* improve api

* refine helpers

* Update examples/llava/mtmd.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>