gguf : add findNearestQuantType #1421
Conversation
// This function finds the nearest quantization type that is less than or equal to the given quantization type.
// It returns undefined if no such quantization type is found.
export function findNearestQuantType(
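For context, here is a rough sketch of how such a lookup could behave, written to match the comment above and the test expectations later in the thread; it is not necessarily the implementation that was merged, and the import paths are illustrative:

```ts
import { GGMLFileQuantizationType } from "@huggingface/tasks"; // illustrative import path
import { GGUF_QUANT_ORDER } from "./gguf"; // illustrative import path

// Sketch only: pick the biggest available quant that is not bigger than the
// requested one; if every available quant is bigger, fall back to the smallest.
export function findNearestQuantType(
	quant: GGMLFileQuantizationType,
	availableQuants: GGMLFileQuantizationType[]
): GGMLFileQuantizationType | undefined {
	// Position of each quant in GGUF_QUANT_ORDER (biggest to smallest)
	const order = new Map(GGUF_QUANT_ORDER.map((q, i) => [q, i] as const));
	const target = order.get(quant);
	if (target === undefined) return undefined;

	// Keep only known quants, sorted from biggest to smallest
	const candidates = availableQuants
		.filter((q) => order.has(q))
		.sort((a, b) => order.get(a)! - order.get(b)!);
	if (candidates.length === 0) return undefined;

	// First candidate at or past the target index = biggest quant <= the request
	for (const q of candidates) {
		if (order.get(q)! >= target) return q;
	}
	// Every candidate is bigger than the request (e.g. Q4_K_M vs [Q8_0, F16, BF16]):
	// return the smallest available one (Q8_0 in that example)
	return candidates[candidates.length - 1];
}
```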
full disclosure: this function is written by gemini 2.5 pro 😂
Is it necessarily true that a user may want to save ~400MB on the vision part (by going to Q8_0) if they choose a smaller quant? Though I'm guessing this may be an optional flag? Super interesting for the f16 vs bf16!
Hmm yeah, that's a valid question. I usually use Q8_0 when developing locally because kernels for Q8_0 are significantly faster than F16. Even without that, saving 400MB is also significant (of course this is quite subjective, but personally I think 400MB is big for a vision model 😂).
This may not be needed, because if you think F16 vs Q8_0 doesn't differ too much, you can simply skip producing Q8_0 for it. For example, if you have a 400MB model in F16, converting to Q8_0 will only save 200MB, which may not be worth it. But let's see; if people think otherwise, we can iterate on this later on.
Fair enough! Just thinking for people who might download multiple and want to test if they see a difference, it would be good to have a CLI flag that allows you to specify which specific mmproj you want to load up. I also may be mistaken and that's exactly what we can already do; I haven't opened up the code for this PR yet and probably won't till I'm at a computer.
it("should find the nearest quant (vision model)", () => { | ||
const visionQuants = [GGMLFileQuantizationType.Q8_0, GGMLFileQuantizationType.F16, GGMLFileQuantizationType.BF16]; | ||
let nearestQuant; | ||
// text = Q4_K_M | ||
nearestQuant = findNearestQuantType(GGMLFileQuantizationType.Q4_K_M, visionQuants); | ||
expect(nearestQuant).toEqual(GGMLFileQuantizationType.Q8_0); | ||
// text = Q8_0 | ||
nearestQuant = findNearestQuantType(GGMLFileQuantizationType.Q8_0, visionQuants); | ||
expect(nearestQuant).toEqual(GGMLFileQuantizationType.Q8_0); | ||
// text = F16 | ||
nearestQuant = findNearestQuantType(GGMLFileQuantizationType.F16, visionQuants); | ||
expect(nearestQuant).toEqual(GGMLFileQuantizationType.F16); | ||
}); |
Btw @bartowski1182, this test case is inspired by a real-world scenario where we have vision quantized to F16/BF16/Q8_0, and the text can be anything else.
Feel free to suggest other test cases if you can think of any!
Alternatively, we can add support for a "component" in the tag name.
// order of quantization, from biggest to smallest
// this list must be in sync with the order in GGMLFileQuantizationType
// the gguf.spec.ts tests are used to verify that the order is correct
export const GGUF_QUANT_ORDER: GGMLFileQuantizationType[] = [
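As an illustration of the kind of check gguf.spec.ts could run to keep the list in sync (the actual tests in the PR may differ; the test runner, import paths and enum members used here are assumptions):

```ts
import { describe, expect, it } from "vitest"; // assumed test runner
import { GGMLFileQuantizationType } from "@huggingface/tasks"; // illustrative import path
import { GGUF_QUANT_ORDER } from "./gguf"; // illustrative import path

describe("GGUF_QUANT_ORDER", () => {
	it("contains no duplicates", () => {
		expect(new Set(GGUF_QUANT_ORDER).size).toEqual(GGUF_QUANT_ORDER.length);
	});

	it("is ordered from biggest to smallest", () => {
		const idx = (q: GGMLFileQuantizationType) => GGUF_QUANT_ORDER.indexOf(q);
		// spot-check a few well-known size relationships
		expect(idx(GGMLFileQuantizationType.F16)).toBeLessThan(idx(GGMLFileQuantizationType.Q8_0));
		expect(idx(GGMLFileQuantizationType.Q8_0)).toBeLessThan(idx(GGMLFileQuantizationType.Q4_K_M));
		expect(idx(GGMLFileQuantizationType.Q4_K_M)).toBeLessThan(idx(GGMLFileQuantizationType.Q2_K));
	});
});
```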
btw interested in improving the ordering in the quant selector here: https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF?local-app=llama.cpp
Yeah, that's a good idea. This list is already exported and ready to be used in the Hub UI. Can you think of any other improvements?
No, I think we can start with it (maybe there are a few variants missing).
I already synced this list with the latest llama.cpp code, so it should be good.
I'm merging this PR later today (unless someone opposes).
In this PR:
- GGMLFileQuantizationType to tasks
- GGMLFileQuantizationType (NOTE: a file can contain multiple quants, for example a Q4_K_M file is Q4_K + Q6_K)
- GGMLQuantizationType: add TQ1_0 and TQ2_0 ternary quants
- findNearestQuantType (see below)

findNearestQuantType
This function is useful for the /v2 registry endpoint with text + vision models, in case we want to pick the corresponding vision model that can be paired with a text model. The main issue is that a text model can go lower than Q4, like Q3/Q2/Q1, but that is not the case for a vision model, as vision models are quite sensitive to quantization.
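For example, the pairing could look roughly like this (a hedged sketch mirroring the test case earlier in the thread; the import path and the registry's actual selection logic are assumptions):

```ts
import { findNearestQuantType, GGMLFileQuantizationType } from "@huggingface/gguf"; // illustrative import path

// Quants actually published for the vision part (mmproj) of a model
const visionQuants = [
	GGMLFileQuantizationType.BF16,
	GGMLFileQuantizationType.F16,
	GGMLFileQuantizationType.Q8_0,
];

// The text-model quant the user requested
const textQuant = GGMLFileQuantizationType.Q4_K_M;

// Pick the vision quant to pair with it; since nothing smaller than Q8_0 is
// published for the vision part, this resolves to Q8_0 here
const visionQuant = findNearestQuantType(textQuant, visionQuants);
```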
On @bartowski1182's repos, most vision models will have BF16, F16 and maybe Q8_0 versions. The idea is: