ggml-quants : weighted rounding algorithms with cumulative search #12557
Conversation
Slightly faster than the previous method.
Weirdly, it seems like in practice replacing this instance is not better. This is probably because of its interaction with `make_qkx3_quants`.
There seem to be some problems with both Metal and Vulkan tests when copying.
I think this may be caused by the new quantization algorithm, and I'm not sure how to fix this other than making the CPU quantization simpler for the copy path.
This looks really interesting and I will read through it this week if I get time - I'm still keen to see if we can find a better way to regularise the weights. You might find this interesting:
This bothered me as I couldn't see any good reason why it shouldn't be (after recentering) an odd function.

It's not as clear in this form, but it actually parameterises a whole family of odd functions (which may be useful as an extension). With some more manipulation you get a form which IIRC is related to a well-known approximation to the symmetric beta quantile function (inverse CDF), and has been discussed on John D. Cook's blog before: https://www.johndcook.com/blog/ (which is sadly so badly organised it's near impossible to find again lol).

Anyway, I thought it might be interesting since you are looking at the rounding - it may be that it now comes out as an odd function if you were to rerun the k-means clustering on the fixed rounding?
Can you explain how you created the plots in more detail? My father was a geography teacher and map projections were one of his favourite topics, but I still can't 100% see the relationship here! :) Are the "maximums" you are referring to the edges of the cube? I can see we could create a 2D heightmap of the error.

This is really fascinating BTW!
@jukofyork The weighted cosine similarity only affects the color gradient of the plot.
@ikawrakow kindly explained where this came from here:

The reason this bothers me so much is because the formula doesn't act as a regulariser at the two extremes:
Then if you look at the experts in a MoE model, we should be weighting more or less towards the prior depending on the relative sample sizes, and so on. Or put another way: there should be a tunable parameter here.

There are a multitude of different ways you can estimate the optimal value - see the textbooks by James E. Gentle for an overview of this.

I'm going to dip out now as, like I said in the other thread, I've nothing to gain from this and may have come across badly, which certainly wasn't my intention! :) I think the work @ikawrakow did on the quants …
I think that the CPY operations that involve quantization of the source data should remain simple, because these are difficult to implement efficiently on the GPU and other devices. So using the fast shortcut-taking implementation during copy should be the better option here.
I did a quick perplexity test with a base Gemma 3 4B and observed an improvement.

Though I agree that KLD is a better metric to track, especially for tuned models. I think after we resolve the failing tests, we can proceed to merge. Great work on this @compilade!
ikawrakow had some extensive comments on this at ikawrakow/ik_llama.cpp#288 (comment). For example, he points out that the `IQ4_NL` changes make it 5x slower without apparent benefit.
I'm aware that my initial approach is too slow and too exhaustive, and I'm working on making it faster by reducing the range of the cumulative search. I'll run more tests before pushing here, but in the meantime a faster version is available in https://github.com/compilade/llama.cpp/tree/compilade/optimal-rounding. Not sure yet how it affects perplexity and KL-divergence, which I will test soon-ish (and I'll also update the equirectangular plots).

I'm changing this to a draft until I make the proper changes here (and also those related to the failing tests).
There's been quite a few posts about the new QAT method, but this one today gave me an idea related to your pictures:
I must admit I still don't fully get what the pictures are showing, but I do wonder if the calculation used to generate them could actually be used to create a custom regularisation function which could be added to the loss to drive the weights towards the bin centres of a chosen K-quant (or legacy quant).

Sadly the Wikipedia page explains this horribly, but it's actually not hard at all to drive weights to values other than zero: sometimes you can do it via transformations (i.e. for log-normal scale priors), but also directly by changing the gradient formula to take the difference from your chosen value instead of zero.

I'm not 100% sure it would work, as each valley in your pictures may create a very hard-to-escape local minimum (i.e. a bit like trying to fit points to a sine wave), but you could solve this using other means (like annealing the lambda or random restarts).

Can you adapt your calculation to give a level of "K_X-quant-ish-ness", with zero being a perfect 1:1 mapping between the real-valued weights and the final K_X-quant that will result?
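A minimal, hypothetical C sketch of the regularisation idea described above: instead of pulling weights towards zero, pull each weight towards the nearest bin centre of the target quant type. The `centres` array, `lambda`, and function names are assumptions for illustration; in practice this would be implemented in the training framework's loss/gradient code.

```c
#include <math.h>
#include <stddef.h>

// Find the representable value (bin centre) closest to w.
// `centres` is an assumed, per-block list of dequantized values.
static float nearest_centre(const float * centres, size_t n, float w) {
    float best   = centres[0];
    float best_d = fabsf(w - centres[0]);
    for (size_t i = 1; i < n; ++i) {
        const float d = fabsf(w - centres[i]);
        if (d < best_d) { best_d = d; best = centres[i]; }
    }
    return best;
}

// Add lambda * (w - nearest_centre(w)) to each weight's gradient: the pull is
// zero exactly at a bin centre and grows as the weight drifts away from it,
// unlike plain L2 regularisation which always pulls towards zero.
static void add_bin_centre_reg_grad(float * grad, const float * w, size_t n_weights,
                                    const float * centres, size_t n_centres, float lambda) {
    for (size_t i = 0; i < n_weights; ++i) {
        grad[i] += lambda * (w[i] - nearest_centre(centres, n_centres, w[i]));
    }
}
```

As noted above, annealing `lambda` or using random restarts would be one way to deal with the local minima this creates between adjacent centres.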
Yes! This is actually pretty much what is done to make the plots, except it's weighted cosine similarity instead of weighted squared error (but both are related).

Something to note though: I did not yet generalize this to the K-quants which have a minimum (an offset). I'm currently experimenting with a much faster sorting algorithm (since the scale-sorting step is the main bottleneck in these cumulative search algorithms), but once I'm done with that, I'll try generalizing to offset quantization.
Hmm, I wonder if this being weighted rounding could help with some of this. Finding the "importance" of the channels while training may or may not be easy, though, and may not help at all. In that case, neutral weights of 1 can be used.

I think this would not be more prone to getting stuck than the other QAT error functions, but I could be wrong. I'll see what is required for a PyTorch module exposing these quantization algorithms to help with testing this idea. Currently it does kind of work with Numpy in https://github.com/compilade/rounding-experiments, but the functions are not very ergonomic, since the purpose of the bindings was mostly to simplify plotting the errors and not much else.
Are you sorting a fixed set of 32 values? If so, have you heard of sorting networks?

https://en.m.wikipedia.org/wiki/Sorting_network
https://bertdobbelaere.github.io/sorting_networks.html

You can turn this into a C macro:
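A minimal sketch of the kind of compare-exchange macro being described, shown for 4 elements for brevity (a 32-element network from the links above is just a longer fixed sequence of the same compare-exchange steps):

```c
// Compare-exchange: swap a[i] and a[j] if they are out of order.
#define CSWAP(a, i, j)                  \
    do {                                \
        if ((a)[(i)] > (a)[(j)]) {      \
            float t = (a)[(i)];         \
            (a)[(i)] = (a)[(j)];        \
            (a)[(j)] = t;               \
        }                               \
    } while (0)

// Optimal 4-element sorting network (5 compare-exchanges); a 32-element
// network uses the same pattern with more (and partly parallel) CSWAP pairs.
#define SORT4(a)        \
    do {                \
        CSWAP(a, 0, 1); \
        CSWAP(a, 2, 3); \
        CSWAP(a, 0, 2); \
        CSWAP(a, 1, 3); \
        CSWAP(a, 1, 2); \
    } while (0)
```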
and it will compile down to basically the fastest possible 32-element sorting algorithm.
It was nearly 20 years ago now, but you can read how we used sorting networks to speed up poker hand evaluations (sadly some of the images no longer show, but it should still make sense I think).
It definitely would have some problems due to local minima - just think about when a weight is right on the boundary of 2 bins: the gradient might pull it to one side or the other, and then the "attractor" at the bin centre will pull it towards itself (as opposed to standard Tikhonov regularization, which has one clear attractor it heads towards regardless of the actual gradient's pull). Or from a Bayesian perspective:

It probably won't be a huge problem, but it will get stuck (you can see something similar when you run the EM algorithm on Gaussian mixture models or k-means clustering [which is a restricted version of the EM algorithm]). It may be more of a problem due to the sinusoidal patterns though - in gradient space these will be very narrow valleys that require many/all variables to move in lockstep. It certainly would be an interesting thing to try!
Not quite; the number of values sorted varies. It's not always a multiple of the sub-block size, because the search doesn't necessarily start at the first scale (because the first half are redundant with the second half for linear quants), and doesn't necessarily end at the last scale (because of some clamping criterion which gave good results).

The algorithm I've been trying is a hybrid of a non-comparative partial sort (e.g. counting sort) and an adaptive comparative sort algorithm (e.g. insertion sort). It seems promising for now, but I did not try it with actual model weights yet to compare with the heap sort currently implemented in this PR. I did not finish adapting it yet.

Sorting networks do seem extremely interesting, but they are not particularly easy to use in this case, unless what is sorted has a more constant size. But I will need to make a constant-sort-size version of these algorithms anyway to implement them in …
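For illustration only, a rough sketch of the kind of hybrid mentioned above (not the actual implementation): a non-comparative bucketing pass followed by an insertion sort within each bucket, which is adaptive when the buckets come out nearly ordered. The bucket count and size limits here are arbitrary assumptions.

```c
#include <string.h>

// Sorts x[0..n) in place, assuming n <= 64 and all values in [min_val, max_val].
static void hybrid_bucket_insertion_sort(float * x, int n, float min_val, float max_val) {
    enum { NBUCKETS = 16, CAP = 64 };
    float buckets[NBUCKETS][CAP];
    int   counts[NBUCKETS] = {0};
    const float scale = max_val > min_val ? (NBUCKETS - 1) / (max_val - min_val) : 0.0f;

    // Non-comparative pass: scatter the values into coarse, ordered buckets.
    for (int i = 0; i < n; ++i) {
        const int b = (int)((x[i] - min_val) * scale);
        buckets[b][counts[b]++] = x[i];
    }
    // Comparative pass: insertion sort inside each bucket, then concatenate.
    int k = 0;
    for (int b = 0; b < NBUCKETS; ++b) {
        for (int i = 1; i < counts[b]; ++i) {
            const float v = buckets[b][i];
            int j = i - 1;
            while (j >= 0 && buckets[b][j] > v) { buckets[b][j + 1] = buckets[b][j]; --j; }
            buckets[b][j + 1] = v;
        }
        memcpy(x + k, buckets[b], counts[b] * sizeof(float));
        k += counts[b];
    }
}
```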
Sure, but current QAT algorithms also need to handle this problem since they mostly use absmax round-to-nearest quantization.
My cumulative search algorithms were made to ensure that if a value is on the boundary of 2 bins, either choice should result in the same weighted error. I think it would be possible to prove this behavior with exact numbers, or within a small epsilon with floating point numbers.
The sinusoidal pattern might be due to the unwrapped equirectangular projections. When viewed inside the sphere they represent, there is no sinusoidal pattern: https://blobs.compilade.net/pannellum.htm#panorama=equirectangular-qkxh-2048.png (Unless you meant the bumpy pattern made by the quantization error, which does have ridges.)

For sure there are many places where the variables need to move in lockstep and/or get stuck, but that is also true of the more widely-used absmax quantization. I really appreciate your insights, @jukofyork! I have some reading to do on Tikhonov regularization.
This adds proper `imatrix` support to `TQ1_0` and `TQ2_0`, in addition to improving the rounding algorithm used for `Q3_K`, `IQ4_NL`, and `IQ4_XS` (both with and without `imatrix`), as well as when using `imatrix` with `Q4_0` and `Q5_0`.

This is backward and forward compatible with other versions of `llama.cpp`. Since this doesn't change the format of the types, only how the values are rounded when quantized, even previous (or current) versions of `llama.cpp` can use quants made with this PR.

Affected types
When using `imatrix`, all the types mentioned in the table below are affected. When not using `imatrix`, a change was only made where "Yes" is in the table below.

| Type | Changed with `imatrix` | Changed without `imatrix` |
| --- | --- | --- |
| `TQ1_0` | Yes | No |
| `TQ2_0` | Yes | No |
| `Q3_K` | Yes | Yes |
| `IQ4_NL` | Yes | Yes |
| `IQ4_XS` | Yes | Yes |
| `Q4_0` | Yes | No |
| `Q5_0` | Yes | No |
KL-Divergence

The following tests were made with `wiki.test.raw` from `wikitext-2-raw`, using chunks of 512 tokens. Quantization was done using the `imatrix` files made by @bartowski1182. Since this doesn't affect how `imatrix` files are made, older ones can still be used for quantization.

Important: All the following tests use PURE quantization to avoid testing multiple changed types at once, to be sure that the changes are measured on their own.

`$ ./bin/llama-quantize --imatrix <some-file.imatrix> --token-embedding-type q8_0 --output-tensor-type q8_0 --pure <source.gguf> <quant.gguf> <quant-type>`
Qwen2.5-Coder-3B-Instruct

With `imatrix` from https://huggingface.co/bartowski/Qwen2.5-Coder-3B-Instruct-GGUF/blob/main/Qwen2.5-Coder-3B-Instruct.imatrix, KL-divergence (lower is better) was measured for `TQ1_0`, `TQ2_0`, `Q3_K`, `IQ4_NL`, `IQ4_XS`, `Q4_0`, and `Q5_0`.

*: `TQ1_0` and `TQ2_0` KL-divergence was calculated on the first 8 chunks.

Note how `Q3_K` was previously very broken for this model. There was a Reddit thread about broken `Q3_K` for this model.

Full KL-Divergence results

Without `imatrix` (lower is better): `Q3_K`, `IQ4_NL`, and `IQ4_XS`. The other types were not changed.
Full KL-Divergence results

Llama-3.1-8B-Instruct
Same tests, using `Llama-3.1-8B-Instruct`, with `imatrix` from https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/blob/main/Meta-Llama-3.1-8B-Instruct.imatrix

Again, note that the quantizations here are pure (without mixing types apart from `Q8_0` token embeddings and output tensor).

KL-divergence on `wiki.test.raw` (lower is better) was measured for `TQ1_0`, `TQ2_0`, `Q3_K`, `IQ4_NL`, `IQ4_XS`, `Q4_0`, and `Q5_0`.

*: `TQ1_0` and `TQ2_0` KL-divergence was calculated on the first 8 chunks.

Full KL-Divergence results

Without `imatrix`: `Q3_K`, `IQ4_NL`, and `IQ4_XS`.

Full KL-Divergence results

The improvements are more apparent with `imatrix`, where this is a strict improvement. Without `imatrix`, it's a bit less clear.

What changed in the algorithms?
There's a neat way to visualize rounding algorithms with equirectangular projections of their errors in a particular 3D space.
Here's an equirectangular projection from the algorithm used in `Q4_0` (which uses integers between -8 and 7):

This plots the weighted cosine similarity between the quantized vectors and the full-precision vectors which correspond to each pixel of the projection. Less error is more yellow, while more error is more blue. Unless otherwise noted, the projections I'm including here always use $w_i = 1$.
Note that this doesn't fully capture the behavior of more complex rounding algorithms at higher dimensions, since this fundamentally is a 3D view of the rounding space (which in practice is more like 16D, 32D, or even 256D), but it is enough to make some problems more easily identifiable.
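As a rough illustration of what each pixel represents (this is my own simplified rendition, not the actual plotting script): map the pixel's longitude and latitude to a unit 3D vector, quantize that vector with the rounding algorithm under test (a plain absmax round-to-nearest in `[-8, 7]` is used as a stand-in below), and color the pixel by the weighted cosine similarity between the original and the dequantized vector.

```c
#include <math.h>

// Map an equirectangular pixel (longitude u in [-pi, pi],
// latitude v in [-pi/2, pi/2]) to a 3D unit vector.
static void pixel_to_dir(float u, float v, float dir[3]) {
    dir[0] = cosf(v) * cosf(u);
    dir[1] = cosf(v) * sinf(u);
    dir[2] = sinf(v);
}

// Weighted cosine similarity between x and its round-tripped quantization,
// using absmax round-to-nearest with integers in [-8, 7] as a stand-in
// for the rounding algorithm being visualized.
static float weighted_cos_sim_q4_0_like(const float * x, const float * w, int n) {
    float amax = 0.0f, max = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (fabsf(x[i]) > amax) { amax = fabsf(x[i]); max = x[i]; }
    }
    const float iscale = max != 0.0f ? -8.0f / max : 0.0f;
    float dot = 0.0f, nx = 0.0f, ny = 0.0f;
    for (int i = 0; i < n; ++i) {
        float q = nearbyintf(x[i] * iscale);
        q = q < -8.0f ? -8.0f : q > 7.0f ? 7.0f : q;
        const float y = iscale != 0.0f ? q / iscale : 0.0f; // dequantized value
        dot += w[i] * x[i] * y;
        nx  += w[i] * x[i] * x[i];
        ny  += w[i] * y * y;
    }
    return nx > 0.0f && ny > 0.0f ? dot / sqrtf(nx * ny) : 0.0f;
}
```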
Non-ideal rounding algorithms have discontinuities in their weighted cosine similarity plots (for `Q4_0`, the bluer line is caused by how the max scale is handled since #729).

Algorithms used on `master`

Let's start with what is used in the types on `master`, so that we have some baseline to compare with.

`make_q3_quants`
This algorithm is used only with `Q3_K` when there is no `imatrix` provided. It's a bit broken for some models, notably with `Qwen2.5-Coder-3B-Instruct`. It doesn't seem quite right (this will become clearer later when more ideal algorithms are illustrated).
Notice how vectors with positive or negative maximums are handled completely differently.
In practice, the rounding weights it uses are the square of the vectors, which looks more like this:
`make_qx_quants`

This algorithm is used in a lot of types. In this example it's used with `[-8, 7]` as the range of integers:

I did not replace all of its uses yet, because in some places it's good enough (e.g. `Q6_K`).

`make_qp_quants`
This is almost like `make_qx_quants`, but assumes unsigned quantization (from 0 to `nmax`) with a positive scale. That it only works with unsigned quantization makes visualizing it a bit different, since only the positive quadrant can be explored. Still, if we limit the viewing range to the positive quadrant of a face of a cube, here's what it looks like:

Note that the top left corner is `[1, 0, 0]`, while the bottom right corner is `[1, 1, 1]` in this cube face projection.

`quantize_row_iq4_nl_impl`
This is used in both `IQ4_NL` and `IQ4_XS`. Notice how there are many discontinuities, although the error is mostly small.
Algorithms from this PR
The weighted vector rounding algorithms I'm introducing all share a similar theory. It's possible to use a cumulative sum to enumerate all weighted dot products for each distinct initial scale. This requires sorting the possible inverse scales so that each step changes only a single integer in the candidate quantized vector. In practice, using a max-heap of the scales seems to be faster than using `qsort`, which is why I've added `struct k_heap` (which is basically a binary max-heap).

I've been exploring this idea in https://github.com/compilade/rounding-experiments, which is also where the equirectangular visualization script comes from (it's `equirectangular.py` in that repo).

I will eventually publish a more complete explanation of the algorithms, but there are still some unsolved problems, like how to generalize this to offset quantization types like `Q4_K` (which loosely have the form `q[i] * s - m`). If you'd like to help research this kind of quantization algorithm, or help formalize it, reach out.
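To make the cumulative-search idea more concrete, here's a heavily simplified sketch under some assumptions of my own: a symmetric linear type with integers in `[-nmax, nmax]`, plain `qsort` instead of the `struct k_heap` used in the PR, and weighted squared error as the objective (for the optimal scale, minimizing it is equivalent to maximizing `(Σ w·x·q)² / Σ w·q²`). It is not the PR's actual implementation.

```c
#include <math.h>
#include <stdint.h>
#include <stdlib.h>

struct candidate { float iscale; int i; };

static int cmp_candidate(const void * a, const void * b) {
    const float fa = ((const struct candidate *) a)->iscale;
    const float fb = ((const struct candidate *) b)->iscale;
    return (fa > fb) - (fa < fb);
}

// For each element, |q[i]| increases by exactly one when the inverse scale
// crosses (k + 0.5)/|x[i]|, so sorting these thresholds lets the sums of
// w*x*q and w*q*q be updated cumulatively, one integer change at a time.
// Writes the chosen integers to q and returns the corresponding scale.
static float make_q_cumulative(int n, int nmax, const float * x, const float * w, int8_t * q) {
    struct candidate * cand = malloc(sizeof(*cand) * (size_t)n * nmax);
    int ncand = 0;
    for (int i = 0; i < n; ++i) {
        if (x[i] == 0.0f) { continue; }
        for (int k = 0; k < nmax; ++k) {
            cand[ncand].iscale = (k + 0.5f) / fabsf(x[i]);
            cand[ncand].i      = i;
            ncand++;
        }
    }
    qsort(cand, ncand, sizeof(*cand), cmp_candidate);

    float sumlx = 0.0f;  // running sum of w[i] * |x[i]| * |q[i]|
    float suml2 = 0.0f;  // running sum of w[i] * q[i]^2
    float best = 0.0f, best_scale = 0.0f, best_iscale = 0.0f;
    int * absq = calloc(n, sizeof(int));

    for (int c = 0; c < ncand; ++c) {
        const int i = cand[c].i;
        // |q[i]| goes from absq[i] to absq[i] + 1; update the sums incrementally.
        sumlx += w[i] * fabsf(x[i]);
        suml2 += w[i] * (float)(2*absq[i] + 1);
        absq[i] += 1;
        // (sum w*x*q)^2 / (sum w*q*q) is the weighted squared norm of the
        // projection of x onto q; bigger is better.
        if (suml2 > 0.0f && sumlx*sumlx > best*suml2) {
            best        = sumlx*sumlx / suml2;
            best_scale  = sumlx / suml2;
            best_iscale = cand[c].iscale;
        }
    }
    for (int i = 0; i < n; ++i) {
        // Re-round with the best inverse scale (the boundary value rounds up,
        // matching the step at which the best candidate was recorded).
        int v = (int)floorf(fabsf(x[i]) * best_iscale + 0.5f);
        if (v > nmax) { v = nmax; }
        q[i] = (int8_t)(x[i] < 0.0f ? -v : v);
    }
    free(absq);
    free(cand);
    return best_scale;
}
```

Replacing the `qsort` with a max-heap of the candidate inverse scales changes how the next threshold is picked, but not the cumulative updates themselves.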
make_qkxh_quants
This is very similar to `make_qx_quants`, but it does a more exhaustive cumulative search instead of a grid search (it's still not quite fully exhaustive, but close).

It's general enough to be a replacement for both `make_qx_quants` and `make_qp_quants`, since it supports arbitrary min and max representable values, instead of assuming the negative side goes one further than the positive (in the case of `make_qx_quants`), or assuming the min is zero (for `make_qp_quants`). It does assume that zero is part of the representable range, though.
make_qkxsh_quants
This is almost the same as `make_qkxh_quants`, but it behaves differently for some distributions of `imatrix` weights where the best sign for the scale is not the sign of the absolute max value. For example, when the representable integer range is `[-2, 7]`, and the weights are `[1, 8, 8]` instead of `[1, 1, 1]`, `make_qkxh_quants` shows some discontinuities at the boundaries where the max changes. But `make_qkxsh_quants` doesn't have this problem:

In practice, though, it doesn't seem to impact the quality of the quantization that much, except for very asymmetric types.
This is used with `TQ2_0` with `imatrix`, since it's quite asymmetric, because it can store `{-1, 0, 1, 2}`.

`make_qkxh_nl_quants`
A more exhaustive general non-linear quantization function (which can technically be used for more than just the `IQ4_NL` kvalues if other non-linear types are introduced). There are some variants.

One doesn't assume the sign of the best scale. This is the slowest, but highest quality, and is used when an `imatrix` file is provided.

Another one assumes the sign of the best scale should make the absolute max value have the same sign as the absolute max `kvalue` of the non-linear mapping. This is used when no `imatrix` is provided, since it's faster than trying both signs.
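As a rough sketch of what "non-linear" means here (my own illustration, not the PR's `make_qkxh_nl_quants`): each stored index selects an entry from a fixed lookup table of kvalues rather than being a consecutive integer, so rounding a value for a given scale means picking the nearest table entry.

```c
#include <math.h>
#include <stdint.h>

// Index of the kvalue closest to x * iscale.
static int nearest_kvalue_index(float x, float iscale, const int8_t * kvalues, int nk) {
    const float v = x * iscale;
    int   best   = 0;
    float best_d = fabsf(v - (float) kvalues[0]);
    for (int k = 1; k < nk; ++k) {
        const float d = fabsf(v - (float) kvalues[k]);
        if (d < best_d) { best_d = d; best = k; }
    }
    return best;
}

// Quantize a block against a kvalues table (e.g. the 16-entry IQ4_NL table):
// q[i] is a table index, and the dequantized value is kvalues[q[i]] / iscale.
static void quantize_nl_block(int n, const float * x, float iscale,
                              const int8_t * kvalues, int nk, uint8_t * q) {
    for (int i = 0; i < n; ++i) {
        q[i] = (uint8_t) nearest_kvalue_index(x[i], iscale, kvalues, nk);
    }
}
```

In the linear sketch earlier, the candidate inverse scales come from half-integer boundaries; with a lookup table they would instead come from the midpoints between adjacent kvalues, which is where the nearest entry changes.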
Notes

- The weights used with `imatrix` are of the form `qw[i] * (sigma2 + x[i] * x[i])`, which I think may be interesting for @jukofyork.
- Since some types have improved (notably `Q3_K` with `imatrix`), this may affect decisions regarding types chosen in mixes (@ddh0, @KerfuffleV2, @Nexesenex).
- `make_qkxh_nl_quants` is general enough to be useful in @ikawrakow's other non-linear types too (`IQ2_K`, `IQ3_K`, etc., in https://github.com/ikawrakow/ik_llama.cpp), although since they use multiple lookup tables for some types instead of only one, it might be more complicated than for `IQ4_NL` (and need some modifications).
- The variants using `qsort` instead of a binary max-heap might be easier to understand, and were last in these lines from an older commit in this PR: `llama.cpp/ggml/src/ggml-quants.c`, lines 631 to 1107 in `0c9e442`.
TODO in future PRs

- Replace the remaining uses of `make_qx_quants` and `make_qp_quants` with `make_qkxh_quants`.
- The other i-quants (`IQ1_S`, `IQ1_M`, `IQ2_XXS`, `IQ2_XS`, `IQ2_S`, `IQ3_XXS`, `IQ3_S`).
- `make_qkx3_quants` (`q[i] * s - m` quants).
- `qw[i] * (sigma2 + x[i] * x[i])` if possible.
- `TQ2_0` in a quant mix (it's near `IQ1_S` quality-wise, but faster).