ggml-quants : weighted rounding algorithms with cumulative search #12557
Conversation
Slightly faster than the previous method.
Weirdly, it seems like in practice replacing this instance is not better. This is probably because of its interaction with `make_qkx3_quants`.
There seem to be some problems with both Metal and Vulkan tests when copying.
I think this may be caused by the new quantization algorithm, and I'm not sure how to fix this other than making the CPU quantization simpler for the copy path.
This looks really interesting and I will read through it this week if I get time - I'm still keen to see if we can find a better way to regularise the weights. You might find this interesting:
This bothered me as I couldn't see any good reason why it shouldn't be (after recentering) an odd function.

It's not as clear in this form, but it actually parameterises a whole family of odd functions (which may be useful as an extension). With some more manipulation you get a form which IIRC is related to a well-known approximation to the symmetric beta quantile function (inverse CDF), and has been discussed on John D. Cook's blog before: https://www.johndcook.com/blog/ (which is sadly so badly organised it's near impossible to find again lol).

Anyway, I thought it might be interesting since you are looking at the rounding - it may be that it now comes out as an odd function if you were to rerun the k-means clustering on the fixed rounding?
Can you explain how you created the plots in more detail? My father was a geography teacher and map projections were one of his favourite topics, but I still can't 100% see the relationship here! :) Are the "maximums" you are referring to the edges of the cube? I can see we could create a 2D heightmap of the error.

This is really fascinating BTW!
@jukofyork The weighted cosine similarity only affects the color gradient of the plot.
@ikawrakow kindly explained where this came from here:

The reason this bothers me so much is because the formula doesn't act as a regulariser at the two extremes:
Then if you look at the experts in a MoE model, we should be weighting more or less towards the prior depending on the relative sample sizes, and so on. Or put another way: there should be a tunable parameter here.

There are a multitude of different ways you can estimate the optimal value - see the textbooks by James E. Gentle for an overview of this.

I'm going to dip out now as, like I said in the other thread, I've nothing to gain from this and may have come across badly, which certainly wasn't my intention! :) I think the work @ikawrakow did on the quants …
I think that the CPY operations that involve quantization of the source data should remain simple, because these are difficult to implement efficiently on the GPU and other devices. So using the fast shortcut-taking implementation during copy should be the better option here.
I did a quick perplexity test with a base Gemma 3 4B and observed an improvement.

Though I agree that KLD is a better metric to track, especially for tuned models. I think after we resolve the failing tests, we can proceed to merge. Great work on this @compilade!
ikawrakow had some extensive comments on this at ikawrakow/ik_llama.cpp#288 (comment). For example, he points out that the `IQ4_NL` changes make it 5x slower without apparent benefit.
I'm aware that my initial approach is too slow and too exhaustive, and I'm working on making it faster by reducing the range of the cumulative search. I'll run more tests before pushing here, but in the meantime a faster version is available in https://github.com/compilade/llama.cpp/tree/compilade/optimal-rounding. Not sure yet how it affects perplexity and KL-divergence, which I will test soon-ish (and I'll also update the equirectangular plots).

I'm changing this to a draft until I make the proper changes here (and also those related to the failing tests).
There's been quite a few posts about the new QAT method, but this one today gave me an idea related to your pictures:
I must admit I still don't fully get what the pictures are showing, but I do wonder if the calculation used to generate them could actually be used to create a custom regularisation function which could be added to the loss to drive the weights towards the bin centres of a chosen K-quant (or legacy quant).

Sadly the Wikipedia page explains this horribly, but it's actually not hard at all to drive weights to values other than zero: sometimes you can do it via transformations (i.e. for log-normal scale priors), but also directly by changing the gradient formula to take the difference from your chosen value instead of zero.

I'm not 100% sure it would work, as each valley in your pictures may create a very hard-to-escape local minimum (i.e. a bit like trying to fit points to a sine wave), but you could solve this using other means (like annealing the lambda or random restarts).

Can you adapt your calculation to give a level of "K_X-quant-ish-ness", with zero being a perfect 1:1 mapping between the real-valued weights and the final K_X-quant that will result?
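A minimal, hypothetical C sketch of the regularisation idea described above: instead of pulling weights towards zero, pull each weight towards the nearest bin centre of the target quant type. The `centres` array, `lambda`, and function names are assumptions for illustration; in practice this would be implemented in the training framework's loss/gradient code.

```c
#include <math.h>
#include <stddef.h>

// Find the representable value (bin centre) closest to w.
// `centres` is an assumed, per-block list of dequantized values.
static float nearest_centre(const float * centres, size_t n, float w) {
    float best   = centres[0];
    float best_d = fabsf(w - centres[0]);
    for (size_t i = 1; i < n; ++i) {
        const float d = fabsf(w - centres[i]);
        if (d < best_d) { best_d = d; best = centres[i]; }
    }
    return best;
}

// Add lambda * (w - nearest_centre(w)) to each weight's gradient: the pull is
// zero exactly at a bin centre and grows as the weight drifts away from it,
// unlike plain L2 regularisation which always pulls towards zero.
static void add_bin_centre_reg_grad(float * grad, const float * w, size_t n_weights,
                                    const float * centres, size_t n_centres, float lambda) {
    for (size_t i = 0; i < n_weights; ++i) {
        grad[i] += lambda * (w[i] - nearest_centre(centres, n_centres, w[i]));
    }
}
```

As noted above, annealing `lambda` or using random restarts would be one way to deal with the local minima this creates between adjacent centres.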
Yes! This is actually pretty much what is done to make the plots, except it's weighted cosine similarity instead of weighted squared error (but both are related).

Something to note though: I did not yet generalize this to the K-quants which have a minimum (an offset). I'm currently experimenting with a much faster sorting algorithm (since the scale-sorting step is the main bottleneck in these cumulative search algorithms), but once I'm done with that, I'll try generalizing to offset quantization.
Hmm, I wonder if this being weighted rounding could help with some of this. Finding the "importance" of the channels while training may or may not be easy, though, and may not help at all. In that case, neutral weights of 1 can be used.

I think this would not be more prone to getting stuck than the other QAT error functions, but I could be wrong. I'll see what is required for a PyTorch module exposing these quantization algorithms to help with testing this idea. Currently it does kind of work with Numpy in https://github.com/compilade/rounding-experiments, but the functions are not very ergonomic, since the purpose of the bindings was mostly to simplify plotting the errors and not much else.
Are you sorting a fixed set of 32 values? If so, have you heard of sorting networks?

https://en.m.wikipedia.org/wiki/Sorting_network
https://bertdobbelaere.github.io/sorting_networks.html

You can turn this into a C macro:
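A minimal sketch of the kind of compare-exchange macro being described, shown for 4 elements for brevity (a 32-element network from the links above is just a longer fixed sequence of the same compare-exchange steps):

```c
// Compare-exchange: swap a[i] and a[j] if they are out of order.
#define CSWAP(a, i, j)                  \
    do {                                \
        if ((a)[(i)] > (a)[(j)]) {      \
            float t = (a)[(i)];         \
            (a)[(i)] = (a)[(j)];        \
            (a)[(j)] = t;               \
        }                               \
    } while (0)

// Optimal 4-element sorting network (5 compare-exchanges); a 32-element
// network uses the same pattern with more (and partly parallel) CSWAP pairs.
#define SORT4(a)        \
    do {                \
        CSWAP(a, 0, 1); \
        CSWAP(a, 2, 3); \
        CSWAP(a, 0, 2); \
        CSWAP(a, 1, 3); \
        CSWAP(a, 1, 2); \
    } while (0)
```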
and it will compile down to basically the fastest possible 32-element sorting algorithm.
It was nearly 20 years ago now, but you can read how we used sorting networks to speed up poker hand evaluations (sadly some of the images no longer show, but it should still make sense I think).
It definitely would have some problems due to local minima - just think about when a weight is right on the boundary of 2 bins: the gradient might pull it to one side or the other, and then the "attractor" at the bin centre will pull it towards itself (as opposed to standard Tikhonov regularization, which has one clear attractor it heads towards regardless of the actual gradient's pull). Or from a Bayesian perspective:

It probably won't be a huge problem, but it will get stuck (you can see something similar when you run the EM algorithm on Gaussian mixture models or k-means clustering [which is a restricted version of the EM algorithm]). It may be more of a problem due to the sinusoidal patterns though - in gradient space these will be very narrow valleys that require many/all variables to move in lockstep. It certainly would be an interesting thing to try!
Not quite; the number of values sorted varies. It's not always a multiple of the sub-block size, because the search doesn't necessarily start at the first scale (because the first half are redundant with the second half for linear quants), and doesn't necessarily end at the last scale (because of some clamping criterion which gave good results).

The algorithm I've been trying is a hybrid of a non-comparative partial sort (e.g. counting sort) and an adaptive comparative sort algorithm (e.g. insertion sort). It seems promising for now, but I did not try it with actual model weights yet to compare with the heap sort currently implemented in this PR. I did not finish adapting it yet.

Sorting networks do seem extremely interesting, but they are not particularly easy to use in this case, unless what is sorted has a more constant size. But I will need to make a constant-sort-size version of these algorithms anyway to implement them in …
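For illustration only, a rough sketch of the kind of hybrid mentioned above (not the actual implementation): a non-comparative bucketing pass followed by an insertion sort within each bucket, which is adaptive when the buckets come out nearly ordered. The bucket count and size limits here are arbitrary assumptions.

```c
#include <string.h>

// Sorts x[0..n) in place, assuming n <= 64 and all values in [min_val, max_val].
static void hybrid_bucket_insertion_sort(float * x, int n, float min_val, float max_val) {
    enum { NBUCKETS = 16, CAP = 64 };
    float buckets[NBUCKETS][CAP];
    int   counts[NBUCKETS] = {0};
    const float scale = max_val > min_val ? (NBUCKETS - 1) / (max_val - min_val) : 0.0f;

    // Non-comparative pass: scatter the values into coarse, ordered buckets.
    for (int i = 0; i < n; ++i) {
        const int b = (int)((x[i] - min_val) * scale);
        buckets[b][counts[b]++] = x[i];
    }
    // Comparative pass: insertion sort inside each bucket, then concatenate.
    int k = 0;
    for (int b = 0; b < NBUCKETS; ++b) {
        for (int i = 1; i < counts[b]; ++i) {
            const float v = buckets[b][i];
            int j = i - 1;
            while (j >= 0 && buckets[b][j] > v) { buckets[b][j + 1] = buckets[b][j]; --j; }
            buckets[b][j + 1] = v;
        }
        memcpy(x + k, buckets[b], counts[b] * sizeof(float));
        k += counts[b];
    }
}
```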
Sure, but current QAT algorithms also need to handle this problem since they mostly use absmax round-to-nearest quantization.
My cumulative search algorithms were made to ensure that if a value is on the boundary of 2 bins, either choice should result in the same weighted error. I think it would be possible to prove this behavior with exact numbers, or within a small epsilon with floating point numbers.
The sinusoidal pattern might be due to the unwrapped equirectangular projections. When viewed inside the sphere they represent, there is no sinusoidal pattern: https://blobs.compilade.net/pannellum.htm#panorama=equirectangular-qkxh-2048.png (Unless you meant the bumpy pattern made by the quantization error, which does have ridges.)

For sure there are many places where the variables need to move in lockstep and/or get stuck, but that is also true of the more widely-used absmax quantization. I really appreciate your insights, @jukofyork! I have some reading to do on Tikhonov regularization.
This adds proper `imatrix` support to `TQ1_0` and `TQ2_0`, in addition to improving the rounding algorithm used for `Q3_K`, `IQ4_NL`, and `IQ4_XS` (both with and without `imatrix`), as well as when using `imatrix` with `Q4_0` and `Q5_0`.

This is backward and forward compatible with other versions of `llama.cpp`. Since this doesn't change the format of the types, only how the values are rounded when quantized, even previous (or current) versions of `llama.cpp` can use quants made with this PR.

Affected types
When using `imatrix`, all the types mentioned in the table below are affected. When not using `imatrix`, a change was only made where "Yes" is in the table below.

| Type | Changed with `imatrix` | Changed without `imatrix` |
| --- | --- | --- |
| `TQ1_0` | Yes | No |
| `TQ2_0` | Yes | No |
| `Q3_K` | Yes | Yes |
| `IQ4_NL` | Yes | Yes |
| `IQ4_XS` | Yes | Yes |
| `Q4_0` | Yes | No |
| `Q5_0` | Yes | No |
KL-Divergence

The following tests were made with `wiki.test.raw` from `wikitext-2-raw`, using chunks of 512 tokens. Quantization was done using the `imatrix` files made by @bartowski1182. Since this doesn't affect how `imatrix` files are made, older ones can still be used for quantization.

Important: All the following tests use PURE quantization to avoid testing multiple changed types at once, to be sure that the changes are measured on their own.

`$ ./bin/llama-quantize --imatrix <some-file.imatrix> --token-embedding-type q8_0 --output-tensor-type q8_0 --pure <source.gguf> <quant.gguf> <quant-type>`
Qwen2.5-Coder-3B-Instruct

With `imatrix` from https://huggingface.co/bartowski/Qwen2.5-Coder-3B-Instruct-GGUF/blob/main/Qwen2.5-Coder-3B-Instruct.imatrix, KL-divergence (lower is better) was measured for `TQ1_0`, `TQ2_0`, `Q3_K`, `IQ4_NL`, `IQ4_XS`, `Q4_0`, and `Q5_0`.

*: `TQ1_0` and `TQ2_0` KL-divergence was calculated on the first 8 chunks.

Note how `Q3_K` was previously very broken for this model. There was a Reddit thread about broken `Q3_K` for this model.

Full KL-Divergence results

Without `imatrix` (lower is better): `Q3_K`, `IQ4_NL`, and `IQ4_XS`. The other types were not changed.
Full KL-Divergence results

Llama-3.1-8B-Instruct
Same tests, using `Llama-3.1-8B-Instruct`, with `imatrix` from https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/blob/main/Meta-Llama-3.1-8B-Instruct.imatrix

Again, note that the quantizations here are pure (without mixing types apart from `Q8_0` token embeddings and output tensor).

KL-divergence on `wiki.test.raw` (lower is better) was measured for `TQ1_0`, `TQ2_0`, `Q3_K`, `IQ4_NL`, `IQ4_XS`, `Q4_0`, and `Q5_0`.

*: `TQ1_0` and `TQ2_0` KL-divergence was calculated on the first 8 chunks.

Full KL-Divergence results

Without `imatrix`: `Q3_K`, `IQ4_NL`, and `IQ4_XS`.

Full KL-Divergence results

The improvements are more apparent with `imatrix`, where this is a strict improvement. Without `imatrix`, it's a bit less clear.

What changed in the algorithms?
There's a neat way to visualize rounding algorithms with equirectangular projections of their errors in a particular 3D space.
Here's an equirectangular projection from the algorithm used in `Q4_0` (which uses integers between -8 and 7):

This plots the weighted cosine similarity between the quantized vectors and the full-precision vectors which correspond to each pixel of the projection. Less error is more yellow, while more error is more blue. Unless otherwise noted, the projections I'm including here always use $w_i = 1$.
Note that this doesn't fully capture the behavior of more complex rounding algorithms at higher dimensions, since this fundamentally is a 3D view of the rounding space (which in practice is more like 16D, 32D, or even 256D), but it is enough to make some problems more easily identifiable.
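As a rough illustration of what each pixel represents (this is my own simplified rendition, not the actual plotting script): map the pixel's longitude and latitude to a unit 3D vector, quantize that vector with the rounding algorithm under test (a plain absmax round-to-nearest in `[-8, 7]` is used as a stand-in below), and color the pixel by the weighted cosine similarity between the original and the dequantized vector.

```c
#include <math.h>

// Map an equirectangular pixel (longitude u in [-pi, pi],
// latitude v in [-pi/2, pi/2]) to a 3D unit vector.
static void pixel_to_dir(float u, float v, float dir[3]) {
    dir[0] = cosf(v) * cosf(u);
    dir[1] = cosf(v) * sinf(u);
    dir[2] = sinf(v);
}

// Weighted cosine similarity between x and its round-tripped quantization,
// using absmax round-to-nearest with integers in [-8, 7] as a stand-in
// for the rounding algorithm being visualized.
static float weighted_cos_sim_q4_0_like(const float * x, const float * w, int n) {
    float amax = 0.0f, max = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (fabsf(x[i]) > amax) { amax = fabsf(x[i]); max = x[i]; }
    }
    const float iscale = max != 0.0f ? -8.0f / max : 0.0f;
    float dot = 0.0f, nx = 0.0f, ny = 0.0f;
    for (int i = 0; i < n; ++i) {
        float q = nearbyintf(x[i] * iscale);
        q = q < -8.0f ? -8.0f : q > 7.0f ? 7.0f : q;
        const float y = iscale != 0.0f ? q / iscale : 0.0f; // dequantized value
        dot += w[i] * x[i] * y;
        nx  += w[i] * x[i] * x[i];
        ny  += w[i] * y * y;
    }
    return nx > 0.0f && ny > 0.0f ? dot / sqrtf(nx * ny) : 0.0f;
}
```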
Non-ideal rounding algorithms have discontinuities in their weighted cosine similarity plots (for `Q4_0`, the bluer line is caused by how the max scale is handled since #729).

Algorithms used on `master`

Let's start with what is used in the types on `master`, so that we have some baseline to compare with.

`make_q3_quants`
This algorithm is used only with `Q3_K` when there is no `imatrix` provided. It's a bit broken for some models, notably with `Qwen2.5-Coder-3B-Instruct`. It doesn't seem quite right (this will become clearer later when more ideal algorithms are illustrated).
Notice how vectors with positive or negative maximums are handled completely differently.
In practice, the rounding weights it uses are the square of the vectors, which looks more like this:
`make_qx_quants`

This algorithm is used in a lot of types. In this example it's used with `[-8, 7]` as the range of integers:

I did not replace all of its uses yet, because in some places it's good enough (e.g. `Q6_K`).

`make_qp_quants`
This is almost like `make_qx_quants`, but assumes unsigned quantization (from 0 to `nmax`) with a positive scale. That it only works with unsigned quantization makes visualizing it a bit different, since only the positive quadrant can be explored. Still, if we limit the viewing range to the positive quadrant of a face of a cube, here's what it looks like:

Note that the top left corner is `[1, 0, 0]`, while the bottom right corner is `[1, 1, 1]` in this cube face projection.

`quantize_row_iq4_nl_impl`
This is used in both `IQ4_NL` and `IQ4_XS`. Notice how there are many discontinuities, although the error is mostly small.
Algorithms from this PR
The weighted vector rounding algorithms I'm introducing all share a similar theory. It's possible to use a cumulative sum to enumerate all weighted dot products for each distinct initial scale. This requires sorting the possible inverse scales so that each step changes only a single integer in the candidate quantized vector. In practice, using a max-heap of the scales seems to be faster than using `qsort`, which is why I've added `struct k_heap` (which is basically a binary max-heap).

I've been exploring this idea in https://github.com/compilade/rounding-experiments, which is also where the equirectangular visualization script comes from (it's `equirectangular.py` in that repo).

I will eventually publish a more complete explanation of the algorithms, but there are still some unsolved problems, like how to generalize this to offset quantization types like `Q4_K` (which loosely have the form `q[i] * s - m`). If you'd like to help research this kind of quantization algorithm, or help formalize it, reach out.
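To make the cumulative-search idea more concrete, here's a heavily simplified sketch under some assumptions of my own: a symmetric linear type with integers in `[-nmax, nmax]`, plain `qsort` instead of the `struct k_heap` used in the PR, and weighted squared error as the objective (for the optimal scale, minimizing it is equivalent to maximizing `(Σ w·x·q)² / Σ w·q²`). It is not the PR's actual implementation.

```c
#include <math.h>
#include <stdint.h>
#include <stdlib.h>

struct candidate { float iscale; int i; };

static int cmp_candidate(const void * a, const void * b) {
    const float fa = ((const struct candidate *) a)->iscale;
    const float fb = ((const struct candidate *) b)->iscale;
    return (fa > fb) - (fa < fb);
}

// For each element, |q[i]| increases by exactly one when the inverse scale
// crosses (k + 0.5)/|x[i]|, so sorting these thresholds lets the sums of
// w*x*q and w*q*q be updated cumulatively, one integer change at a time.
// Writes the chosen integers to q and returns the corresponding scale.
static float make_q_cumulative(int n, int nmax, const float * x, const float * w, int8_t * q) {
    struct candidate * cand = malloc(sizeof(*cand) * (size_t)n * nmax);
    int ncand = 0;
    for (int i = 0; i < n; ++i) {
        if (x[i] == 0.0f) { continue; }
        for (int k = 0; k < nmax; ++k) {
            cand[ncand].iscale = (k + 0.5f) / fabsf(x[i]);
            cand[ncand].i      = i;
            ncand++;
        }
    }
    qsort(cand, ncand, sizeof(*cand), cmp_candidate);

    float sumlx = 0.0f;  // running sum of w[i] * |x[i]| * |q[i]|
    float suml2 = 0.0f;  // running sum of w[i] * q[i]^2
    float best = 0.0f, best_scale = 0.0f, best_iscale = 0.0f;
    int * absq = calloc(n, sizeof(int));

    for (int c = 0; c < ncand; ++c) {
        const int i = cand[c].i;
        // |q[i]| goes from absq[i] to absq[i] + 1; update the sums incrementally.
        sumlx += w[i] * fabsf(x[i]);
        suml2 += w[i] * (float)(2*absq[i] + 1);
        absq[i] += 1;
        // (sum w*x*q)^2 / (sum w*q*q) is the weighted squared norm of the
        // projection of x onto q; bigger is better.
        if (suml2 > 0.0f && sumlx*sumlx > best*suml2) {
            best        = sumlx*sumlx / suml2;
            best_scale  = sumlx / suml2;
            best_iscale = cand[c].iscale;
        }
    }
    for (int i = 0; i < n; ++i) {
        // Re-round with the best inverse scale (the boundary value rounds up,
        // matching the step at which the best candidate was recorded).
        int v = (int)floorf(fabsf(x[i]) * best_iscale + 0.5f);
        if (v > nmax) { v = nmax; }
        q[i] = (int8_t)(x[i] < 0.0f ? -v : v);
    }
    free(absq);
    free(cand);
    return best_scale;
}
```

Replacing the `qsort` with a max-heap of the candidate inverse scales changes how the next threshold is picked, but not the cumulative updates themselves.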
make_qkxh_quants
This is very similar to `make_qx_quants`, but it does a more exhaustive cumulative search instead of a grid search (it's still not quite fully exhaustive, but close).

It's general enough to be a replacement for both `make_qx_quants` and `make_qp_quants`, since it supports arbitrary min and max representable values, instead of assuming the negative side goes one further than the positive (in the case of `make_qx_quants`), or assuming the min is zero (for `make_qp_quants`). It does assume that zero is part of the representable range, though.
make_qkxsh_quants
This is almost the same as `make_qkxh_quants`, but it behaves differently for some distributions of `imatrix` weights where the best sign for the scale is not the sign of the absolute max value. For example, when the representable integer range is `[-2, 7]`, and the weights are `[1, 8, 8]` instead of `[1, 1, 1]`, `make_qkxh_quants` shows some discontinuities at the boundaries where the max changes. But `make_qkxsh_quants` doesn't have this problem:

In practice, though, it doesn't seem to impact the quality of the quantization that much, except for very asymmetric types.
This is used with `TQ2_0` with `imatrix`, since it's quite asymmetric, because it can store `{-1, 0, 1, 2}`.

`make_qkxh_nl_quants`
A more exhaustive general non-linear quantization function (which can technically be used for more than just the `IQ4_NL` kvalues if other non-linear types are introduced). There are some variants.

One doesn't assume the sign of the best scale. This is the slowest, but highest quality, and is used when an `imatrix` file is provided.

Another one assumes the sign of the best scale should make the absolute max value have the same sign as the absolute max `kvalue` of the non-linear mapping. This is used when no `imatrix` is provided, since it's faster than trying both signs.
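As a rough sketch of what "non-linear" means here (my own illustration, not the PR's `make_qkxh_nl_quants`): each stored index selects an entry from a fixed lookup table of kvalues rather than being a consecutive integer, so rounding a value for a given scale means picking the nearest table entry.

```c
#include <math.h>
#include <stdint.h>

// Index of the kvalue closest to x * iscale.
static int nearest_kvalue_index(float x, float iscale, const int8_t * kvalues, int nk) {
    const float v = x * iscale;
    int   best   = 0;
    float best_d = fabsf(v - (float) kvalues[0]);
    for (int k = 1; k < nk; ++k) {
        const float d = fabsf(v - (float) kvalues[k]);
        if (d < best_d) { best_d = d; best = k; }
    }
    return best;
}

// Quantize a block against a kvalues table (e.g. the 16-entry IQ4_NL table):
// q[i] is a table index, and the dequantized value is kvalues[q[i]] / iscale.
static void quantize_nl_block(int n, const float * x, float iscale,
                              const int8_t * kvalues, int nk, uint8_t * q) {
    for (int i = 0; i < n; ++i) {
        q[i] = (uint8_t) nearest_kvalue_index(x[i], iscale, kvalues, nk);
    }
}
```

In the linear sketch earlier, the candidate inverse scales come from half-integer boundaries; with a lookup table they would instead come from the midpoints between adjacent kvalues, which is where the nearest entry changes.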
Notes

- The weights used with `imatrix` are of the form `qw[i] * (sigma2 + x[i] * x[i])`, which I think may be interesting for @jukofyork.
- Since some types have improved (notably `Q3_K` with `imatrix`), this may affect decisions regarding types chosen in mixes (@ddh0, @KerfuffleV2, @Nexesenex).
- `make_qkxh_nl_quants` is general enough to be useful in @ikawrakow's other non-linear types too (`IQ2_K`, `IQ3_K`, etc., in https://github.com/ikawrakow/ik_llama.cpp), although since they use multiple lookup tables for some types instead of only one, it might be more complicated than for `IQ4_NL` (and need some modifications).
- The variants using `qsort` instead of a binary max-heap might be easier to understand, and were last in these lines from an older commit in this PR: `llama.cpp/ggml/src/ggml-quants.c`, lines 631 to 1107 in `0c9e442`.
TODO in future PRs

- Replace the remaining uses of `make_qx_quants` and `make_qp_quants` with `make_qkxh_quants`.
- The other i-quants (`IQ1_S`, `IQ1_M`, `IQ2_XXS`, `IQ2_XS`, `IQ2_S`, `IQ3_XXS`, `IQ3_S`).
- `make_qkx3_quants` (`q[i] * s - m` quants).
- `qw[i] * (sigma2 + x[i] * x[i])` if possible.
- `TQ2_0` in a quant mix (it's near `IQ1_S` quality-wise, but faster).