
Misc. bug: Sporadic MUL_MAT Failures in test-backend-ops for Nvidia backend #11972


Closed

ShanoToni opened this issue Feb 20, 2025 · 10 comments

@ShanoToni

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA A100-PCIE-40GB)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz)
version: 4667 (d2fe216)
built with gcc (GCC) 12.2.0 for x86_64-pc-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

Test code

Command line

`./bin/test-backend-ops`

Problem description & steps to reproduce

A test failure was encountered while running MUL_MAT through test-backend-ops.

  • The failing mul_mat configuration was identified as MUL_MAT(type_a=q5_1,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]); a test case was created for it here.
  • Failures seemed random: consecutive runs of test-backend-ops did not reproduce the error. Modifying test-backend-ops.cpp to add the mul_mat test case 1000 times reproduced the failure consistently (at least a few of the 1000 cases would fail):
// Example of adding failing mul_mat case
    for (int i = 0; i < 1000; i++) {
        test_cases.emplace_back(new test_mul_mat(GGML_TYPE_Q5_1, GGML_TYPE_F32, 16, 1, 256, {1,  1}, {1, 1}));
    }
  • The test fails due to NMSE being over the maximum error threshold.
  • Example error output:
  MUL_MAT(type_a=q5_1,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.000508874 > 0.000500000     
    0  0.948417  1.035245, diff = -0.086828
    1 -2.924956 -2.844111, diff = -0.080845
    2 -1.777758 -1.695090, diff = -0.082667
    3  0.450649  0.537106, diff = -0.086457
    4 -4.114096 -4.030904, diff = -0.083191
    5 -0.682358 -0.596930, diff = -0.085428
    6 -8.252451 -8.167437, diff = -0.085014
    7 -0.692235 -0.606851, diff = -0.085384
    8 -5.382234 -5.304606, diff = -0.077628
    9  3.467584  3.552903, diff = -0.085320
   10 -7.941753 -7.861615, diff = -0.080138
   11  3.101702  3.186424, diff = -0.084722
   12  0.954475  1.037351, diff = -0.082876
   13  2.353770  2.437956, diff = -0.084186
   14 -1.223359 -1.139174, diff = -0.084185
   15  0.853322  0.939753, diff = -0.086431
  • The Nvidia backend seems to convert src1 to Q8_1 and then run mul_mat with Q5_1 and Q8_1 inputs. Could this be causing the precision issue?

  • The largest NMSE encountered across 20000 runs was 0.001409.

  • Is this degree of precision loss expected? The maximum error for the mul_mat tests is set to 5e-4; should this be modified?
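
For reference, NMSE here is the normalized mean squared error that test-backend-ops reports. Below is a minimal sketch of how such a value can be computed, assuming the usual definition sum((a-b)^2) / sum(b^2); it is an illustration, not a copy of the code in test-backend-ops.cpp.

    #include <cstddef>

    // Illustrative NMSE between a backend result `a` and a reference `b`,
    // assuming NMSE = sum((a-b)^2) / sum(b^2). The failing run above
    // reported NMSE = 0.000508874 against the 5e-4 threshold.
    static double nmse(const float * a, const float * b, size_t n) {
        double sum_err = 0.0; // sum of squared differences
        double sum_ref = 0.0; // sum of squared reference values
        for (size_t i = 0; i < n; i++) {
            const double d = (double) a[i] - (double) b[i];
            sum_err += d * d;
            sum_ref += (double) b[i] * (double) b[i];
        }
        return sum_err / sum_ref;
    }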

First Bad Commit

Due to the sporadic nature of the test failure, the first bad commit has not been identified; d2fe216 is the commit on which the failure was first encountered. The latest commit that was tested and on which the error was reproduced is 4806498.

Relevant log output

(Same as the example error output above.)
@JohannesGaessler
Collaborator

test-backend-ops uses a random seed for generating the test data. It is expected that the NMSE will sometimes exceed the threshold.

@ShanoToni
Author

Is there a reason for not increasing the threshold to prevent the random failures?

@JohannesGaessler
Collaborator

Generally speaking, for any classifier there is a tradeoff between false positives and false negatives. It's difficult to increase the thresholds further without risking not detecting some potential bug. Part of the problem is that the different backends have slightly different implementations, whose differences can be amplified by unlucky RNG. The tests could instead be run on larger sample sizes, where the likelihood of randomly diverging results is smaller, but then the tests would also take longer.

@slaren
Member

slaren commented Feb 27, 2025

Part of the problem is that the different backends have slightly different implementations, whose differences can be amplified by unlucky RNG.

I think a significant part of the problem here is that the CPU backend is being used as the reference implementation, and a reference implementation should attempt to minimize the error so that it can be used to accurately check the error of the other implementations. However, that's not the case in matrix multiplications with quantized types, since the CPU backend always quantizes the activations first, introducing additional error, and thus it is not really suitable as a reference implementation.

It may be better to use an implementation that dequantizes the weights to F32 and performs the matrix multiplication in F32. The BLAS backend does precisely that. This would allow us to accurately estimate the error and confidently set a higher bound for the maximum acceptable error.
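
As a rough sketch of what such an F32 reference path could look like (an illustration under assumptions, not the actual BLAS backend code; the per-type dequantization routine is supplied by the caller and its wiring is assumed):

    #include <cstdint>
    #include <functional>
    #include <vector>

    // Sketch of a higher-precision reference matmul: dequantize each quantized
    // weight row of src0 to F32 and accumulate the dot products in double,
    // instead of quantizing the F32 activations (src1) to Q8_1.
    // `dequant_row(row_index, out_f32)` is assumed to wrap the type-specific
    // ggml dequantization routine for src0's type.
    void ref_mul_mat_f32(const std::function<void(int64_t, float *)> & dequant_row,
                         const float * src1, float * dst,
                         int64_t m, int64_t n, int64_t k) {
        std::vector<float> row(k);          // one dequantized weight row
        for (int64_t i = 0; i < m; i++) {
            dequant_row(i, row.data());     // src0 row i -> F32
            for (int64_t j = 0; j < n; j++) {
                double acc = 0.0;           // double accumulation for the reference
                for (int64_t l = 0; l < k; l++) {
                    acc += (double) row[l] * (double) src1[j*k + l];
                }
                dst[j*m + i] = (float) acc; // dst is m x n: element (i, j)
            }
        }
    }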

@ShanoToni
Author

Would adding an option for a fixed seed to prevent the sporadic failures be viable? The thought behind this is that it might be preferable when the tests are run in CI.
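
Purely as an illustration of the proposal, a hypothetical sketch of an opt-in fixed seed (the GGML_TEST_SEED environment variable and the helper name are assumptions, not an existing test-backend-ops option):

    #include <cstdlib>
    #include <random>

    // Hypothetical opt-in fixed seed: if GGML_TEST_SEED is set, seed the RNG
    // from it so a failure can be reproduced deterministically; otherwise keep
    // the current behaviour of a fresh random seed on every run.
    static std::mt19937 make_test_rng() {
        if (const char * env = std::getenv("GGML_TEST_SEED")) {
            return std::mt19937((unsigned int) std::strtoul(env, nullptr, 10));
        }
        return std::mt19937(std::random_device{}());
    }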

@slaren
Member

slaren commented Mar 5, 2025

I think it is preferable to continue using random seeds, because this is not a spurious error that should be ignored.

@Rbiessy
Collaborator

Rbiessy commented Mar 7, 2025

this is not a spurious error that should be ignored.

@slaren can you clarify what you mean by that? To me this looks exactly like a spurious error that should be ignored, in the sense that it does not point to an issue with a backend. Concretely, our CI jobs testing the SYCL backend on Nvidia hardware sometimes fail with an error that we need to ignore. We reported the issue only for the CUDA backend, but we are looking for a solution that would also solve the issue with SYCL.

I understand that changing the reference for mul_mat could fix the issue, but I feel this brings more questions. Do you suggest using the BLAS backend only for mul_mat? Is it acceptable to add a dependency in order to run the tests? Are we sure this change would make sense for every backend, or do you suggest using this solution only for some backends? I don't think we need to answer these questions as part of this issue.

We were thinking that introducing an option to use a fixed seed would have the least impact. It would also be useful for CI and for debugging, to more easily reproduce any issue (not just those affecting mul_mat). Just to clarify, would anyone be opposed to a PR that introduces an option to use a fixed seed if we were to submit the change?
If that's not possible, we will look into other options to make our CI green.
Thanks for your input!

@slaren
Member

slaren commented Mar 7, 2025

test-backend-ops does not perform an exhaustive test; it takes a single random sample and compares it to a reference implementation. Therefore an incorrect implementation failing some tests only sporadically is completely expected, and is still indicative of an incorrect implementation, or at best of a flaw in the testing methodology. Using a fixed seed increases the possibility of an implementation error being hidden purely by chance, and IMO is not a good solution to this problem. My question to you is, why do you think it is acceptable for a backend to exceed the error threshold for some inputs?

If it is a bug due to the reference implementation used by test-backend-ops, then that should be fixed. If instead transparent quantization of the activations by backends causes unacceptable error, then that is still something that must be considered and eventually fixed. Ignoring the error does no good to anyone.

I do not think that using BLAS as a general reference implementation for test-backend-ops is acceptable; I only mentioned it because it could be useful to pinpoint the issue. Adding an option to use a fixed seed just to silence a CI seems completely wrong to me, but it may be interesting as a way to reliably reproduce an issue.

@Rbiessy
Collaborator

Rbiessy commented Mar 10, 2025

My question to you is, why do you think it is acceptable for a backend to exceed the error threshold for some inputs?

I think the issue is more likely to be a flaw in the testing methodology than an incorrect implementation. Updating the reference to match other backends would be good, but I think it is difficult to do in a project supporting multiple backends that are allowed to have different precision requirements.
I believe allowing the use of a fixed seed would be easier and would still be useful even if this issue were fixed with your suggestion. If there is not enough interest in this option then no matter. Thanks again.


This issue was closed because it has been inactive for 14 days since being marked as stale.
