llava : introduce libmtmd #12849
Conversation
examples/llava/CMakeLists.txt

```cmake
target_compile_options(llava  PRIVATE -Wno-cast-qual) # stb_image.h
target_compile_options(llava2 PRIVATE -Wno-cast-qual) # stb_image.h
```
Probably better to wrap `stb_image.h` into a compilation unit (will do this in another PR). The only functionality we use from `stb_image.h` is decoding an image into a bitmap anyway, so I think it's fine to place it behind a wrapper.
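A minimal sketch of what such a wrapper could look like (hypothetical file and function names, not part of this PR): a header that knows nothing about stb, and a single compilation unit that is the only place including `stb_image.h`.

```cpp
// image_decode.h — hypothetical wrapper interface; no stb_image.h in sight
#pragma once
#include <cstddef>
#include <cstdint>
#include <vector>

// decode an encoded image buffer (png/jpg/...) into 8-bit RGB;
// returns false if the buffer cannot be decoded
bool decode_image_to_bitmap(const uint8_t * buf, size_t len,
                            std::vector<uint8_t> & rgb, int & nx, int & ny);

// image_decode.cpp — the only compilation unit that includes stb_image.h
#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"
#include "image_decode.h"

bool decode_image_to_bitmap(const uint8_t * buf, size_t len,
                            std::vector<uint8_t> & rgb, int & nx, int & ny) {
    int nc = 0; // original channel count (unused, we force RGB)
    uint8_t * data = stbi_load_from_memory(buf, (int) len, &nx, &ny, &nc, 3);
    if (data == nullptr) {
        return false;
    }
    rgb.assign(data, data + (size_t) nx * ny * 3);
    stbi_image_free(data);
    return true;
}
```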
```cpp
struct clip_image_u8_deleter {
    void operator()(clip_image_u8 * val) { clip_image_u8_free(val); }
};

struct clip_image_f32_deleter {
    void operator()(clip_image_f32 * val) { clip_image_f32_free(val); }
};

struct clip_image_f32_batch_deleter {
    void operator()(clip_image_f32_batch * val) { clip_image_f32_batch_free(val); }
};

typedef std::unique_ptr<clip_image_u8, clip_image_u8_deleter> clip_image_u8_ptr;
typedef std::unique_ptr<clip_image_f32, clip_image_f32_deleter> clip_image_f32_ptr;
typedef std::unique_ptr<clip_image_f32_batch, clip_image_f32_batch_deleter> clip_image_f32_batch_ptr;

// TODO @ngxson : we're currently having a naming clash between struct clip_image_size and function clip_image_size()
```
This is what I was talking about in #12834 (comment). In a follow-up PR, I'll use this inside clip.cpp.
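For illustration, here is how these typedefs change a call site (a sketch; `clip_image_u8_init` and `clip_image_load_from_file` are existing clip.h helpers, the surrounding logic is made up):

```cpp
// before: every early return must remember to free
clip_image_u8 * img = clip_image_u8_init();
if (!clip_image_load_from_file(fname, img)) {
    clip_image_u8_free(img); // easy to forget
    return false;
}
// ... use img ...
clip_image_u8_free(img);

// after: the custom deleter frees automatically on scope exit
clip_image_u8_ptr img2(clip_image_u8_init());
if (!clip_image_load_from_file(fname, img2.get())) {
    return false; // clip_image_u8_free runs here
}
// ... use img2.get() ...
```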
It's nice to see some progress here. Please also look into the backend support of clip: it's just a two-liner to add CUDA and similar acceleration to CLIP, but that code was disabled because people kept creating issues about some of the more exotic accelerators failing. Another thing is the naming convention: LLAVA-2 is actually the name of LLAVA-Next, I think, and it's just one specific visual architecture, one of many.
You mean this? #12322
I was thinking about … But on second thought, I think it's ok, since people have been using … The second best choice could be …
@ngxson This seems like a good proposal. This approach would allow us to make faster progress on adding multi-modality support to the examples without being too worried about breaking the …
Yes, so at least at the start, 3rd-party apps would not be able to use this multi-modality functionality through a low-level C-style API, but it would be available through the server at some point. Later on, we will expand the C API as needed. I think that eventually …

Btw, maybe it's a good time to finally start separating the actual "tools" like …

Overall I feel positive about this suggestion, as it will lift a lot of the pressure to make various important changes in the …

@slaren Would appreciate hearing your opinion on this proposal. Do you have any recommendations/concerns?
Not sure if this is an implementation detail or something exposed to users in any form, but a lot of models already use … Is there a specific reason that `<__image__>` was chosen?
No one is using it, so that's why I took it. But this can be customized when loading the model; there is a param for that. This token is used internally anyway (via the chat template), so it's not important which one is used. If we want to be even more future-proof, we could even generate a random token each time, something like the sketch below.
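For instance, something in this spirit (purely illustrative; neither the function nor the marker format exists in this PR):

```cpp
#include <cstdio>
#include <random>
#include <string>

// hypothetical: regenerate the internal image marker once per model load,
// making a collision with user-supplied text practically impossible
static std::string random_image_marker() {
    std::mt19937_64 rng(std::random_device{}());
    char buf[40];
    snprintf(buf, sizeof(buf), "<__image_%016llx__>", (unsigned long long) rng());
    return std::string(buf);
}
```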
```cmake
add_library(mtmd_static STATIC $<TARGET_OBJECTS:mtmd>)
if (BUILD_SHARED_LIBS)
    set_target_properties(mtmd PROPERTIES POSITION_INDEPENDENT_CODE ON)
    target_compile_definitions(mtmd PRIVATE LLAMA_SHARED LLAMA_BUILD)
    add_library(mtmd_shared SHARED $<TARGET_OBJECTS:mtmd>)
    target_link_libraries(mtmd_shared PRIVATE ggml llama ${CMAKE_THREAD_LIBS_INIT})
    install(TARGETS mtmd_shared LIBRARY)
endif()
```
I'm not really sure why we need to build both static and shared libs here. Probably something to revisit and simplify.
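One possible simplification (a sketch of the idea, not what this PR does) is a single target whose type follows `BUILD_SHARED_LIBS`:

```cmake
# hypothetical: one mtmd target; CMake picks STATIC or SHARED
# from BUILD_SHARED_LIBS, no parallel mtmd_static/mtmd_shared needed
add_library(mtmd mtmd.cpp)
target_link_libraries(mtmd PRIVATE ggml llama ${CMAKE_THREAD_LIBS_INIT})
set_target_properties(mtmd PROPERTIES POSITION_INDEPENDENT_CODE ON)
if (BUILD_SHARED_LIBS)
    target_compile_definitions(mtmd PRIVATE LLAMA_SHARED LLAMA_BUILD)
endif()
install(TARGETS mtmd)
```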
```cpp
// tokenize the text prompt together with the image bitmaps into chunks
mtmd_input_text text;
text.text          = formatted_chat.prompt;
text.add_special   = add_bos;
text.parse_special = true;
mtmd_input_chunks_ptr chunks(mtmd_tokenize(ctx.ctx_vision.get(), text, bitmaps));
if (chunks == nullptr) {
    LOG_ERR("Unable to tokenize prompt\n");
    return 1;
}

// encode the image chunks and decode everything in one call
if (mtmd_helper_eval(ctx.ctx_vision.get(), ctx.lctx, chunks.get(), ctx.n_past, 0, ctx.n_batch)) {
    LOG_ERR("Unable to eval prompt\n");
    return 1;
}
```
I think the main question will be how the interaction between `libmtmd` and `libllama` will work. In the case of Gemma 3, we already have the necessary API, so all is good and we don't need changes in `libllama` to make it work. But for other models, changes to the API will be needed, and this is where we have to think it through carefully; it might take more effort to provide the necessary support.
In the short term, my idea is to have `libllama` consume input from `libmtmd`. In other words, `libllama` only needs to know about text tokens and embeddings, while `libmtmd` handles images/audio/etc. and converts them into tokens/embeddings. This keeps the two libraries completely separate for now. When we bring mtmd into `libllama`, we will mostly focus on how to pass data between the two parts more efficiently (for example, passing tensors as @slaren suggested).

For now, some `mtmd_helper_*` functions will be added, which may use `libllama` under the hood. Indeed, this is because we don't yet have a notion similar to `common.h` specifically for mtmd, but it will be simple to add in the future.
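To make the split concrete, here is a hedged sketch of what "`libllama` only sees tokens and embeddings" means in practice. The `mtmd_*` calls are the ones introduced in this PR; the batch wiring uses `llama.h`'s existing `llama_batch_init`, and the variable and parameter names are assumptions:

```cpp
#include <cstring>
#include "llama.h"
#include "mtmd.h"

// hedged sketch: feed mtmd's image embeddings into llama_decode
static int eval_image_embeddings(llama_context * lctx, mtmd_context * ctx_vision,
                                 mtmd_image_tokens * image_chunk,
                                 int n_tokens, int n_embd, int n_past) {
    // libmtmd side: run the vision encoder on one image chunk
    if (mtmd_encode(ctx_vision, image_chunk) != 0) {
        return 1; // vision encoder failed
    }
    float * embd = mtmd_get_output_embd(ctx_vision); // n_tokens x n_embd floats

    // libllama side: it never sees the image, only a batch of embeddings
    llama_batch batch = llama_batch_init(n_tokens, n_embd, 1);
    batch.n_tokens = n_tokens;
    std::memcpy(batch.embd, embd, sizeof(float) * n_tokens * n_embd);
    for (int i = 0; i < n_tokens; i++) {
        batch.pos[i]       = n_past + i;   // continue after the text so far
        batch.n_seq_id[i]  = 1;
        batch.seq_id[i][0] = 0;
        batch.logits[i]    = false;        // no logits needed mid-image
    }
    const int ret = llama_decode(lctx, batch);
    llama_batch_free(batch);
    return ret;
}
```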
> But for other models, changes to the API will be needed, and this is where we have to think it through carefully; it might take more effort to provide the necessary support.

This is something I'll try to avoid at this stage, probably even by using hooks/hacks like what I did in my CSM impl. Then, when more models use it, we can make a proper API for it.
* wip llava2
* migrated gemma3 to llava2
* add timings
* correct pre/postfix
* fix missing include
* fix compilation unused var warn
* update llava2_tokenize
* change name llava2 --> mtmd
* improve api
* refine helpers
* Update examples/llava/mtmd.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This PR introduces a new library called `mtmd` (lib MulTi-MoDal), which aims to provide a unified vision API and eliminate the need for a new CLI for each model. (Naming: both libmm and libmmd are already taken, so I went with `libmtmd`.)

Motivation
While there is already some work in #11292, I think it is still far from finished. The problems are: …
So the philosophy that I'm proposing here is to:

* start simple with a dedicated multimodal library, `mtmd`
* then bring it to `llama-server` (YES, finally!!)
* finally, once `llama_batch_ext` is ready, we will move this API into libllama

Design goals of mtmd
* sit on top of `llama.h` and `clip.h`
* represent mixed text/image input as chunks; check `mtmd_input_chunk` to see how it is handled

On a high level, here is how it works:
1. The user prompt contains one `<__image__>` marker per image, so an example prompt could be:
   `user: what is the difference between this image <__image__> and this image <__image__>`
2. `mtmd_tokenize` tokenizes the prompt into chunks. For example, the prompt above gives 5 chunks:
   * text: `user: what is the difference between this image <start_of_image>`
   * image
   * text: `<end_of_image> and this image <start_of_image>`
   * image
   * text: `<end_of_image>`
3. `mtmd_encode()` is called on image chunks; the output can be fetched using `mtmd_get_output_embd()`
4. call `mtmd_decode()` to decode the output embeddings

NOTE:
* In the future, we can add `<__audio__>` and update `mtmd_tokenize`, `mtmd_encode` to support it.
* `<start_of_image>`, `<end_of_image>` are specific to a given model, so `mtmd_tokenize` must know which model it is working on.
* There is a helper `mtmd_helper_eval` that allows the user to do steps 3+4 more easily.

Code demo
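For reference, running the migrated CLI looks roughly like this (flags follow the existing llava examples; the model and image paths are placeholders):

```sh
./llama-gemma3-cli -m gemma-3-4b-it.gguf --mmproj mmproj-gemma-3-4b.gguf \
    --image input.png -p "describe this image in detail"
```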
What I've done in this PR:

* `mtmd.cpp` implementation and C++ header
* migrated `gemma3-cli.cpp` to using it