
Commit ef0307e

Clamp out of range values in K quantizer
This assertion fails when quantizing Mixtral 8x7b as Q5_K_M, because I used `convert.py --outtype f32` and the Mixtral weights use bf16, which has a much larger exponent range than the K quantizer expects. If `--outtype f16` is used, the assert doesn't fail. See ggml-org/llama.cpp#2982 cc: @JohannesGaessler
1 parent a8b0b15 commit ef0307e

1 file changed: +5 -1 lines changed
llama.cpp/ggml-quants.c

Lines changed: 5 additions & 1 deletion
@@ -1314,7 +1314,11 @@ void dequantize_row_q8_0(const block_q8_0 * restrict x, float * restrict y, int
 // ===================== Helper functions
 //
 static inline int nearest_int(float fval) {
-    assert(fval <= 4194303.f);
+
+    // [jart] https://github.com/ggerganov/llama.cpp/issues/2982
+    // assert(fval <= 4194303.f);
+    fval = fminf(fval, 4194303.f);
+
     float val = fval + 12582912.f;
     int i; memcpy(&i, &val, sizeof(int));
     return (i & 0x007fffff) - 0x00400000;
