llava : introduce libmtmd #12849

Merged: 12 commits into ggml-org:master, Apr 10, 2025

Conversation

@ngxson (Collaborator) commented Apr 9, 2025

This PR introduces a new library called mtmd (lib MulTi-MoDal), which aims to provide a unified vision API and eliminate the need for a new CLI for each model.

(Naming: both libmm and libmmd are already taken, so I went with libmtmd.)

Motivation

While there is already some work on #11292, I think it is still far from finished. The problems are:

  • It is very hard to design a proper vision API (or, more generally, a multimodal API) without prior trial & error
  • Even if we could do that, it would take too much time, and in the meantime many new models are being released, which would make past API designs outdated
  • And most importantly, users want to use it today, not in weeks or even months

So the philosophy that I'm proposing here is to:

  1. Continue experimenting on the existing code (while making it more stable) --> the main reason why I designed mtmd
  2. This API should be easy enough to implement in llama-server (YES, finally!!)
  3. Once we are happy with it, and once we have APIs like llama_batch_ext ready, we will move this API into libllama

Design goals of mtmd

  1. It will expose both C++ and C-style APIs in a single header. For now, only C++ is supported, but the C-style API will be added soon.
  2. It is linked against llama.h and clip.h
  3. It aims to support more than just text+image. Many audio-input models are already out there (like phi-4-mm and qwen-omni). Have a look at mtmd_input_chunk to see how this is handled
  4. The API should resemble the one from llama : second attempt to refactor vision API #11292, so that in the future we can have a drop-in replacement

On a high level, here is how it works:

  1. The user provides an input prompt with "markers". Currently we support the <__image__> marker, so an example prompt could be: user: what is the difference between this image <__image__> and this image <__image__>
  2. mtmd_tokenize tokenizes the prompt into chunks; for example, the prompt above produces 5 chunks:
    • text: user: what is the difference between this image <start_of_image>
    • image tokens
    • text: <end_of_image> and this image <start_of_image>
    • image tokens
    • text: <end_of_image>
  3. mtmd_encode() is called on each image chunk; the output can be fetched using mtmd_get_output_embd()
  4. Run mtmd_decode() to decode the output embeddings

NOTE:

  • To support audio, we can simply add a new marker <__audio__> and update mtmd_tokenize and mtmd_encode to support it
  • The tokens around images, like <start_of_image> and <end_of_image>, are specific to a given model, so mtmd_tokenize must know which model it is working with
  • There is a helper function mtmd_helper_eval that allows the user to do steps 3+4 more easily (a short sketch using it follows the code demo below)

Code demo

llama_model * lmodel = ...; // this is the text model
mtmd_context_ptr ctx_vision = mtmd_init_from_file("mmproj.gguf", lmodel, params);

// prepare text
mtmd_input_text text;
text.text          = "What is this: <__image__>\n";
text.add_special   = add_bos;
text.parse_special = true;

// read image
mtmd_bitmap bitmap;
mtmd_bitmap_init_from_file("my_image.jpg", bitmap);

// tokenize everything
std::vector<mtmd_input_chunk> chunks;
mtmd_tokenize(ctx_vision, chunks, text, {bitmap});

// encode it
// NOTE: this may need to be repeated for each chunk of image
mtmd_encode(ctx_vision, chunks[1]);
float * embd = mtmd_get_output_embd(ctx_vision);

// create batch and decode
// NOTE: this may need to be repeated for each chunk of text and image
llama_batch batch = ...; // put embd into the batch
llama_decode(lctx, batch);
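
For completeness, here is a hedged sketch of the same flow using the mtmd_helper_eval helper mentioned in the note above; the signature follows the gemma3-cli excerpt further down in this thread and may not match the final API exactly:

// sketch only: the same flow, letting mtmd_helper_eval handle the per-chunk
// encode + decode loop (steps 3 + 4 above); n_past and n_batch are assumed
// to be defined by the caller, and the exact signatures may differ
std::vector<mtmd_input_chunk> chunks;
mtmd_tokenize(ctx_vision, chunks, text, {bitmap});

if (mtmd_helper_eval(ctx_vision, lctx, chunks, n_past, /* seq_id */ 0, n_batch)) {
    LOG_ERR("Unable to eval prompt\n");
    return 1;
}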

What I've done in this PR:

  • Added the mtmd.cpp implementation and C++ header
  • Added the CMakeLists config for it
  • Migrated gemma3-cli.cpp to use it

@ngxson ngxson requested a review from ggerganov April 9, 2025 10:35
Comment on lines 55 to 56
target_compile_options(llava PRIVATE -Wno-cast-qual) # stb_image.h
target_compile_options(llava2 PRIVATE -Wno-cast-qual) # stb_image.h
@ngxson (Collaborator Author)

Probably better to wrap stb_image.h into a compilation unit (will do this in another PR). The only functionality we use from stb_image.h is decoding an image into a bitmap anyway, so I think it's fine to place it behind a wrapper.
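
As a rough illustration of that idea (file and function names here are hypothetical, not from the actual follow-up PR), the wrapper could look something like this:

// image_io.h -- hypothetical wrapper interface; callers never see stb_image.h
#pragma once
#include <cstdint>
#include <vector>

// decode an image file into an 8-bit RGB bitmap; returns false on failure
bool image_load_rgb(const char * fname, std::vector<uint8_t> & data, int & nx, int & ny);

// image_io.cpp -- the single compilation unit that includes stb_image.h,
// so -Wno-cast-qual only needs to be applied to this one file
#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"
#include "image_io.h"

bool image_load_rgb(const char * fname, std::vector<uint8_t> & data, int & nx, int & ny) {
    int nc = 0;
    unsigned char * px = stbi_load(fname, &nx, &ny, &nc, 3); // force 3 channels (RGB)
    if (!px) {
        return false;
    }
    data.assign(px, px + (size_t) nx * ny * 3);
    stbi_image_free(px);
    return true;
}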

Comment on lines +204 to +220
struct clip_image_u8_deleter {
    void operator()(clip_image_u8 * val) { clip_image_u8_free(val); }
};

struct clip_image_f32_deleter {
    void operator()(clip_image_f32 * val) { clip_image_f32_free(val); }
};

struct clip_image_f32_batch_deleter {
    void operator()(clip_image_f32_batch * val) { clip_image_f32_batch_free(val); }
};

typedef std::unique_ptr<clip_image_u8, clip_image_u8_deleter> clip_image_u8_ptr;
typedef std::unique_ptr<clip_image_f32, clip_image_f32_deleter> clip_image_f32_ptr;
typedef std::unique_ptr<clip_image_f32_batch, clip_image_f32_batch_deleter> clip_image_f32_batch_ptr;

// TODO @ngxson : we're currently having a naming clash between struct clip_image_size and function clip_image_size()
@ngxson (Collaborator Author) commented Apr 9, 2025

This is what I was talking about in #12834 (comment)

In a follow-up PR, I'll use this inside clip.cpp
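
For context, a small hedged example of how the RAII wrappers above could be used (the clip_image_* init/load functions are the existing ones from clip.h; the surrounding flow is illustrative only):

// illustrative sketch: with the typedefs above, intermediate images are freed
// automatically on every exit path instead of via manual *_free calls
clip_image_u8_ptr  img_u8(clip_image_u8_init());
clip_image_f32_ptr img_f32(clip_image_f32_init());

if (!clip_image_load_from_file("my_image.jpg", img_u8.get())) {
    return false; // both images are released here automatically
}
// ... preprocess img_u8 into img_f32 and run the CLIP encoder ...
// both images are also released when the pointers go out of scope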

@cmp-nct (Contributor) commented Apr 9, 2025

It's nice to see some progress here. Please also look into the backend support of clip: it's just a two-liner to add CUDA and similar acceleration to CLIP, but that code was disabled because people kept creating issues about some of the more exotic accelerators failing.

Another thing is the naming convention. LLAVA-2 is actually the name of LLAVA-Next, I think, and it's just one specific visual architecture among many.
Some developers/researchers did not like having "llava" in the name of their architecture integration.
I recommend choosing a name that's not "branded", like MM_VISION or similar.

@ngxson (Collaborator Author) commented Apr 10, 2025

It's nice to see some progress here. Please also look into the backend support of clip, it's just a two-liner to add CUDA and similar acceleration to CLIP but that code was disabled as people kept creating issues for some of the more exotic accelerators failing.

You mean this? #12322

I recommend choosing a name that's not "branded", like MM_VISION or similar.

I was thinking about libmm / mm.cpp, but it could be confused with mul_mat in this context (i.e. gemm/sgemm/dgemm).

But on second thought, I think it's ok, since people have been using the mm_ prefix for multimodal projector tensor names. WDYT @ggerganov ?

The second best choice could be mmd, mmdal or mtmd

@ggerganov (Member)

@ngxson This seems like a good proposal. This approach would allow us to make faster progress on adding multi-modality support to the examples, without being too worried about breaking the libllama API in the process, and it would provide a lot of useful feedback about the best way to extend the API in the future.

It will expose both C++ and C-style APIs in a single header. For now, only C++ is supported, but the C-style API will be added soon.

Yes, so at least at the start, 3rd-party apps would not be able to use this multi-modality functionality through a low-level C-style API, but it would be available through the server at some point. Later on, we will expand the C API as needed. I think that eventually libllama should be able to support all the multi-modal functionality.

Btw, maybe it's a good time to finally start separating the actual "tools" like llama-server from the "examples".

Overall I feel positive about this suggestion as this will lift a lot of the pressure to make various important changes in the libllama API and the KV cache implementation that are needed to unblock the multi-modality implementation. These changes are important and they need time to make them right and avoid going back and breaking the developer experience. But at the same time, we don't want to hold back the addition of new interesting features such as multi-modality for too long.

@slaren Would appreciate hearing your opinion on this proposal. Do you have any recommendations/concerns?

@ngxson ngxson changed the title from "llava : introduce llava2 library" to "llava : introduce libmm" on Apr 10, 2025
@ngxson (Collaborator Author) commented Apr 10, 2025

Seems like libmm and libmmd are both taken, but libmtmd is not, so I'm going with that name.

@ngxson ngxson changed the title from "llava : introduce libmm" to "llava : introduce libmtmd" on Apr 10, 2025
@ngxson ngxson requested a review from slaren April 10, 2025 13:50
@isaac-mcfadyen (Contributor) commented Apr 10, 2025

  • Currently we support <__image__> marker, so an example prompt could be: user: what is the difference between this image <__image__> and this image <__image__>

Not sure if this is an implementation detail or something exposed to users in any form, but a lot of models already use <|image|>, e.g. LLaMa 3.2, Phi 4 (with a number added denoting the image number), and Qwen2.5 VL (without the || braces).

Is there a specific reason that <__image__> was chosen?

@ngxson (Collaborator Author) commented Apr 10, 2025

Is there a specific reason that <__image__> was chosen?

No one is using it, so that's why I took it. But this can be customized when loading the model; there is a param for that.

This token is used internally anyway (via the chat template), so it's not important which one is used.

If we want to be even more future-proof, we could even generate a random token each time, something like <image_a6625fa>, and then pass the token via the mtmd_bitmap object.
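
A minimal sketch of that idea (purely hypothetical, not part of this PR):

#include <cstdio>
#include <random>
#include <string>

// hypothetical: generate a per-image marker such as <image_a6625fa> so it can
// never collide with a marker token a model already defines
static std::string random_image_marker() {
    static std::mt19937 rng(std::random_device{}());
    char buf[32];
    std::snprintf(buf, sizeof(buf), "<image_%07x>", (unsigned) (rng() & 0xfffffff));
    return buf;
}

// the generated marker would then be attached to the (hypothetical) bitmap
// object so mtmd_tokenize can match it back to the right image:
//   bitmap.id = random_image_marker();
//   prompt   += bitmap.id;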

Comment on lines +45 to +53
add_library(mtmd_static STATIC $<TARGET_OBJECTS:mtmd>)
if (BUILD_SHARED_LIBS)
    set_target_properties(mtmd PROPERTIES POSITION_INDEPENDENT_CODE ON)
    target_compile_definitions(mtmd PRIVATE LLAMA_SHARED LLAMA_BUILD)
    add_library(mtmd_shared SHARED $<TARGET_OBJECTS:mtmd>)
    target_link_libraries(mtmd_shared PRIVATE ggml llama ${CMAKE_THREAD_LIBS_INIT})
    install(TARGETS mtmd_shared LIBRARY)
endif()

Member

I'm not really sure what the need is for building both static and shared libs here. Probably something to revisit and simplify.

Comment on lines +183 to +197
mtmd_input_text text;
text.text = formatted_chat.prompt;
text.add_special = add_bos;
text.parse_special = true;
mtmd_input_chunks_ptr chunks(mtmd_tokenize(ctx.ctx_vision.get(), text, bitmaps));
if (chunks == nullptr) {
    LOG_ERR("Unable to tokenize prompt\n");
    return 1;
}

if (mtmd_helper_eval(ctx.ctx_vision.get(), ctx.lctx, chunks.get(), ctx.n_past, 0, ctx.n_batch)) {
    LOG_ERR("Unable to eval prompt\n");
    return 1;
}

Member

I think the main question will be how the interaction between libmtmd and libllama will work. In the case of Gemma 3, we already have the necessary API, so all is good and we don't need changes in libllama to make it work. But for other models, there will be changes needed to the API, and this is where we have to carefully think it through; it might take more effort to provide the necessary support.

@ngxson (Collaborator Author) replied Apr 10, 2025

In the short term, my idea is to have libllama consume input from libmtmd. In other words, libllama only needs to know about text tokens and embeddings, while libmtmd will handle image/audio/etc. and convert them into tokens/embeddings.

This approach will keep these two libraries completely separate for now. When we bring mtmd into libllama, we will mostly focus on how to pass data between the two parts more efficiently (for example, passing by tensor as @slaren suggested).

For now, some mtmd_helper_* functions will be added, which may use libllama under the hood. This is because we don't yet have a notion similar to common.h specifically for mtmd, but it will be simple to add in the future.

@ngxson (Collaborator Author) replied Apr 10, 2025

But for other models, there will be changes needed to the API, and this is where we have to carefully think it through; it might take more effort to provide the necessary support.

This is something I'll try to avoid at this stage, possibly by using hooks/hacks like what I did in my CSM impl. Then, when more models are using it, we can make an API for it.

@ngxson ngxson merged commit 8b9cc7c into ggml-org:master Apr 10, 2025
51 checks passed
colout pushed a commit to colout/llama.cpp that referenced this pull request Apr 21, 2025
* wip llava2

* migrated gemma3 to llava2

* add timings

* correct pre/postfix

* fix missing include

* fix compilation unused var warn

* update llava2_tokenize

* change name llava2 --> mtmd

* improve api

* refine helpers

* Update examples/llava/mtmd.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>