gguf : add findNearestQuantType #1421
Conversation
// This function finds the nearest quantization type that is less than or equal to the given quantization type.
// It returns undefined if no such quantization type is found.
export function findNearestQuantType(
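For context, here is a rough sketch of how such a lookup could behave, written to match the comment above and the test expectations later in the thread; it is not necessarily the implementation that was merged, and the import paths are illustrative:

```ts
import { GGMLFileQuantizationType } from "@huggingface/tasks"; // illustrative import path
import { GGUF_QUANT_ORDER } from "./gguf"; // illustrative import path

// Sketch only: pick the biggest available quant that is not bigger than the
// requested one; if every available quant is bigger, fall back to the smallest.
export function findNearestQuantType(
	quant: GGMLFileQuantizationType,
	availableQuants: GGMLFileQuantizationType[]
): GGMLFileQuantizationType | undefined {
	// Position of each quant in GGUF_QUANT_ORDER (biggest to smallest)
	const order = new Map(GGUF_QUANT_ORDER.map((q, i) => [q, i] as const));
	const target = order.get(quant);
	if (target === undefined) return undefined;

	// Keep only known quants, sorted from biggest to smallest
	const candidates = availableQuants
		.filter((q) => order.has(q))
		.sort((a, b) => order.get(a)! - order.get(b)!);
	if (candidates.length === 0) return undefined;

	// First candidate at or past the target index = biggest quant <= the request
	for (const q of candidates) {
		if (order.get(q)! >= target) return q;
	}
	// Every candidate is bigger than the request (e.g. Q4_K_M vs [Q8_0, F16, BF16]):
	// return the smallest available one (Q8_0 in that example)
	return candidates[candidates.length - 1];
}
```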
full disclosure: this function is written by gemini 2.5 pro 😂
Is it necessarily true that a user may want to save ~400MB on the vision part (by going to Q8_0) if they choose a smaller quant? Though I'm guessing this may be an optional flag? Super interesting for the f16 vs bf16!
Hmm yeah, that's a valid question. I usually use Q8_0 when developing locally because kernels for Q8_0 are significantly faster than F16. Even without that, saving 400MB is also significant (of course this is quite subjective, but personally I think 400MB is big for a vision model 😂).
This may not be needed, because if you think F16 vs Q8_0 doesn't differ too much, you can simply skip producing Q8_0 for it. For example, if you have a 400MB model in F16, converting to Q8_0 will only save 200MB, which may not be worth it. But let's see; if people think otherwise, we can iterate on this later on.
Fair enough! Just thinking for people who might download multiple and want to test if they see a difference, it would be good to have a CLI flag that allows you to specify which specific mmproj you want to load up. I also may be mistaken and that's exactly what we can already do; I haven't opened up the code for this PR yet and probably won't till I'm at a computer.
it("should find the nearest quant (vision model)", () => { | ||
const visionQuants = [GGMLFileQuantizationType.Q8_0, GGMLFileQuantizationType.F16, GGMLFileQuantizationType.BF16]; | ||
let nearestQuant; | ||
// text = Q4_K_M | ||
nearestQuant = findNearestQuantType(GGMLFileQuantizationType.Q4_K_M, visionQuants); | ||
expect(nearestQuant).toEqual(GGMLFileQuantizationType.Q8_0); | ||
// text = Q8_0 | ||
nearestQuant = findNearestQuantType(GGMLFileQuantizationType.Q8_0, visionQuants); | ||
expect(nearestQuant).toEqual(GGMLFileQuantizationType.Q8_0); | ||
// text = F16 | ||
nearestQuant = findNearestQuantType(GGMLFileQuantizationType.F16, visionQuants); | ||
expect(nearestQuant).toEqual(GGMLFileQuantizationType.F16); | ||
}); |
Btw @bartowski1182, this test case is inspired by a real-world scenario where we have vision quantized to F16/BF16/Q8_0, and the text can be anything else.
Feel free to suggest other test cases if you can think of any!
Alternatively, we can add support for a "component" in the tag name.
// order of quantization, from biggest to smallest
// this list must be in sync with the order in GGMLFileQuantizationType
// the gguf.spec.ts tests are used to verify that the order is correct
export const GGUF_QUANT_ORDER: GGMLFileQuantizationType[] = [
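As an illustration of the kind of check gguf.spec.ts could run to keep the list in sync (the actual tests in the PR may differ; the test runner, import paths and enum members used here are assumptions):

```ts
import { describe, expect, it } from "vitest"; // assumed test runner
import { GGMLFileQuantizationType } from "@huggingface/tasks"; // illustrative import path
import { GGUF_QUANT_ORDER } from "./gguf"; // illustrative import path

describe("GGUF_QUANT_ORDER", () => {
	it("contains no duplicates", () => {
		expect(new Set(GGUF_QUANT_ORDER).size).toEqual(GGUF_QUANT_ORDER.length);
	});

	it("is ordered from biggest to smallest", () => {
		const idx = (q: GGMLFileQuantizationType) => GGUF_QUANT_ORDER.indexOf(q);
		// spot-check a few well-known size relationships
		expect(idx(GGMLFileQuantizationType.F16)).toBeLessThan(idx(GGMLFileQuantizationType.Q8_0));
		expect(idx(GGMLFileQuantizationType.Q8_0)).toBeLessThan(idx(GGMLFileQuantizationType.Q4_K_M));
		expect(idx(GGMLFileQuantizationType.Q4_K_M)).toBeLessThan(idx(GGMLFileQuantizationType.Q2_K));
	});
});
```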
btw interested in improving the ordering in the quant selector here: https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF?local-app=llama.cpp
Yeah, that's a good idea. This list is already exported and ready to be used in the Hub UI. Can you think of any other improvements?
No, I think we can start with it (maybe there are a few variants missing).
I already synced this list with the latest llama.cpp code, so it should be good.
I'm merging this PR later today (unless someone opposes).
In this PR:
- GGMLFileQuantizationType to tasks
- GGMLFileQuantizationType (NOTE: a file can contain multiple quants, for example a Q4_K_M file is Q4_K + Q6_K)
- GGMLQuantizationType: add TQ1_0 and TQ2_0 ternary quants
- findNearestQuantType (see below)

findNearestQuantType
This function is useful for the /v2 registry endpoint with text + vision models, in case we want to pick the corresponding vision model that can be paired with a text model. The main issue is that a text model can go lower than Q4, like Q3/Q2/Q1, but that is not the case for a vision model, as vision models are quite sensitive to quantization.
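For example, the pairing could look roughly like this (a hedged sketch mirroring the test case earlier in the thread; the import path and the registry's actual selection logic are assumptions):

```ts
import { findNearestQuantType, GGMLFileQuantizationType } from "@huggingface/gguf"; // illustrative import path

// Quants actually published for the vision part (mmproj) of a model
const visionQuants = [
	GGMLFileQuantizationType.BF16,
	GGMLFileQuantizationType.F16,
	GGMLFileQuantizationType.Q8_0,
];

// The text-model quant the user requested
const textQuant = GGMLFileQuantizationType.Q4_K_M;

// Pick the vision quant to pair with it; since nothing smaller than Q8_0 is
// published for the vision part, this resolves to Q8_0 here
const visionQuant = findNearestQuantType(textQuant, visionQuants);
```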
On @bartowski1182's repos, most vision models will have BF16, F16 and maybe Q8_0 versions. The idea is: