
Misc. bug: Sporadic MUL_MAT Failures in test-backend-ops for Nvidia backend #11972


Closed

ShanoToni opened this issue Feb 20, 2025 · 10 comments

@ShanoToni

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA A100-PCIE-40GB)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz)
version: 4667 (d2fe216)
built with gcc (GCC) 12.2.0 for x86_64-pc-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

Test code

Command line

`./bin/test-backend-ops`

Problem description & steps to reproduce

A test failure was encountered while running MUL_MAT through test-backend-ops.

  • The failing mul_mat configuration was identified as MUL_MAT(type_a=q5_1,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]); a test case was created for it here.
  • Failures seemed random: consecutive runs of test-backend-ops did not reproduce the error. Modifying test-backend-ops.cpp to add the mul_mat test case 1000 times reproduced the failure consistently (at least a few of the 1000 cases would fail):
// Example of adding failing mul_mat case
    for (int i = 0; i < 1000; i++) {
        test_cases.emplace_back(new test_mul_mat(GGML_TYPE_Q5_1, GGML_TYPE_F32, 16, 1, 256, {1,  1}, {1, 1}));
    }
  • The test fails due to NMSE being over the maximum error threshold.
  • Example error output:
  MUL_MAT(type_a=q5_1,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.000508874 > 0.000500000     
    0  0.948417  1.035245, diff = -0.086828
    1 -2.924956 -2.844111, diff = -0.080845
    2 -1.777758 -1.695090, diff = -0.082667
    3  0.450649  0.537106, diff = -0.086457
    4 -4.114096 -4.030904, diff = -0.083191
    5 -0.682358 -0.596930, diff = -0.085428
    6 -8.252451 -8.167437, diff = -0.085014
    7 -0.692235 -0.606851, diff = -0.085384
    8 -5.382234 -5.304606, diff = -0.077628
    9  3.467584  3.552903, diff = -0.085320
   10 -7.941753 -7.861615, diff = -0.080138
   11  3.101702  3.186424, diff = -0.084722
   12  0.954475  1.037351, diff = -0.082876
   13  2.353770  2.437956, diff = -0.084186
   14 -1.223359 -1.139174, diff = -0.084185
   15  0.853322  0.939753, diff = -0.086431
  • The Nvidia backend seems to convert src1 to Q8_1 and then run mul_mat with Q5_1 and Q8_1 inputs. Could this be causing the precision issue?

  • The largest NMSE encountered across 20000 runs was 0.001409.

  • Is this degree of precision loss expected? The maximum error for the mul_mat tests is set to 5e-4; should this be modified?
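
For reference, NMSE here is the normalized mean squared error that test-backend-ops reports. Below is a minimal sketch of how such a value can be computed, assuming the usual definition sum((a-b)^2) / sum(b^2); it is an illustration, not a copy of the code in test-backend-ops.cpp.

    #include <cstddef>

    // Illustrative NMSE between a backend result `a` and a reference `b`,
    // assuming NMSE = sum((a-b)^2) / sum(b^2). The failing run above
    // reported NMSE = 0.000508874 against the 5e-4 threshold.
    static double nmse(const float * a, const float * b, size_t n) {
        double sum_err = 0.0; // sum of squared differences
        double sum_ref = 0.0; // sum of squared reference values
        for (size_t i = 0; i < n; i++) {
            const double d = (double) a[i] - (double) b[i];
            sum_err += d * d;
            sum_ref += (double) b[i] * (double) b[i];
        }
        return sum_err / sum_ref;
    }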

First Bad Commit

Due to the sporadic nature of the test failure, the first bad commit has not been identified; d2fe216 is the commit on which the failure was first encountered. The latest commit that was tested and on which the error was reproduced is 4806498.

Relevant log output

(Same as the example error output above.)
@JohannesGaessler
Collaborator

test-backend-ops uses a random seed for generating the test data. It is expected that the NMSE will sometimes exceed the threshold.

@ShanoToni
Author

Is there a reason for not increasing the threshold to prevent the random failures?

@JohannesGaessler
Collaborator

Generally speaking, for any classifier there is a tradeoff between false positives and false negatives. It's difficult to increase the thresholds further without risking not detecting some potential bug. Part of the problem is that the different backends have slightly different implementations, whose differences can be amplified by unlucky RNG. The tests could instead be run on larger sample sizes, where the likelihood of randomly diverging results is smaller, but then the tests would also take longer.

@slaren
Member

slaren commented Feb 27, 2025

Part of the problem is that the different backends have slightly different implementations, whose differences can be amplified by unlucky RNG.

I think a significant part of the problem here is that the CPU backend is being used as the reference implementation, and a reference implementation should attempt to minimize the error so that it can be used to accurately check the error of the other implementations. However, that's not the case in matrix multiplications with quantized types, since the CPU backend always quantizes the activations first, introducing additional error, and thus it is not really suitable as a reference implementation.

It may be better to use an implementation that dequantizes the weights to F32 and performs the matrix multiplication in F32. The BLAS backend does precisely that. This would allow us to accurately estimate the error and confidently set a higher bound for the maximum acceptable error.
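
As a rough sketch of what such an F32 reference path could look like (an illustration under assumptions, not the actual BLAS backend code; the per-type dequantization routine is supplied by the caller and its wiring is assumed):

    #include <cstdint>
    #include <functional>
    #include <vector>

    // Sketch of a higher-precision reference matmul: dequantize each quantized
    // weight row of src0 to F32 and accumulate the dot products in double,
    // instead of quantizing the F32 activations (src1) to Q8_1.
    // `dequant_row(row_index, out_f32)` is assumed to wrap the type-specific
    // ggml dequantization routine for src0's type.
    void ref_mul_mat_f32(const std::function<void(int64_t, float *)> & dequant_row,
                         const float * src1, float * dst,
                         int64_t m, int64_t n, int64_t k) {
        std::vector<float> row(k);          // one dequantized weight row
        for (int64_t i = 0; i < m; i++) {
            dequant_row(i, row.data());     // src0 row i -> F32
            for (int64_t j = 0; j < n; j++) {
                double acc = 0.0;           // double accumulation for the reference
                for (int64_t l = 0; l < k; l++) {
                    acc += (double) row[l] * (double) src1[j*k + l];
                }
                dst[j*m + i] = (float) acc; // dst is m x n: element (i, j)
            }
        }
    }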

@ShanoToni
Author

Would adding an option for a fixed seed to prevent the sporadic failures be viable? The thought behind this is that it might be preferable when the tests are run in CI.
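
Purely as an illustration of the proposal, a hypothetical sketch of an opt-in fixed seed (the GGML_TEST_SEED environment variable and the helper name are assumptions, not an existing test-backend-ops option):

    #include <cstdlib>
    #include <random>

    // Hypothetical opt-in fixed seed: if GGML_TEST_SEED is set, seed the RNG
    // from it so a failure can be reproduced deterministically; otherwise keep
    // the current behaviour of a fresh random seed on every run.
    static std::mt19937 make_test_rng() {
        if (const char * env = std::getenv("GGML_TEST_SEED")) {
            return std::mt19937((unsigned int) std::strtoul(env, nullptr, 10));
        }
        return std::mt19937(std::random_device{}());
    }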

@slaren
Member

slaren commented Mar 5, 2025

I think it is preferable to continue using random seeds, because this is not a spurious error that should be ignored.

@Rbiessy
Collaborator

Rbiessy commented Mar 7, 2025

this is not a spurious error that should be ignored.

@slaren can you clarify what you mean by that? To me this looks exactly like a spurious error that should be ignored, in the sense that it does not point to an issue with a backend. Concretely, our CI jobs testing the SYCL backend on Nvidia hardware sometimes fail with an error that we need to ignore. We reported the issue only for the CUDA backend, but we are looking for a solution that would also solve the issue with SYCL.

I understand that changing the reference for mul_mat could fix the issue, but I feel this brings more questions. Do you suggest using the BLAS backend only for mul_mat? Is it acceptable to add a dependency in order to run the tests? Are we sure this change would make sense for every backend, or do you suggest using this solution only for some backends? I don't think we need to answer these questions as part of this issue.

We were thinking that introducing an option to use a fixed seed would have the least impact. It would also be useful for CI and for debugging, to more easily reproduce any issue (not just those affecting mul_mat). Just to clarify, would anyone be opposed to a PR that introduces an option to use a fixed seed if we were to submit the change?
If that's not possible, we will look into other options to make our CI green.
Thanks for your input!

@slaren
Member

slaren commented Mar 7, 2025

test-backend-ops does not perform an exhaustive test; it takes a single random sample and compares it to a reference implementation. Therefore an incorrect implementation failing some tests only sporadically is completely expected, and is still indicative of an incorrect implementation, or at best of a flaw in the testing methodology. Using a fixed seed increases the possibility of an implementation error being hidden purely by chance, and IMO is not a good solution to this problem. My question to you is, why do you think it is acceptable for a backend to exceed the error threshold for some inputs?

If it is a bug due to the reference implementation used by test-backend-ops, then that should be fixed. If instead transparent quantization of the activations by backends causes unacceptable error, then that is still something that must be considered and eventually fixed. Ignoring the error does no good to anyone.

I do not think that using BLAS as a general reference implementation for test-backend-ops is acceptable; I only mentioned it because it could be useful to pinpoint the issue. Adding an option to use a fixed seed just to silence a CI seems completely wrong to me, but it may be interesting as a way to reliably reproduce an issue.

@Rbiessy
Collaborator

Rbiessy commented Mar 10, 2025

My question to you is, why do you think it is acceptable for a backend to exceed the error threshold for some inputs?

I think the issue is more likely to be a flaw in the testing methodology than an incorrect implementation. Updating the reference to match other backends would be good, but I think it is difficult to do in a project supporting multiple backends that are allowed to have different precision requirements.
I believe allowing the use of a fixed seed would be easier and would still be useful even if this issue were fixed with your suggestion. If there is not enough interest in this option then no matter. Thanks again.


This issue was closed because it has been inactive for 14 days since being marked as stale.
