Misc. bug: Sporadic MUL_MAT Failures in test-backend-ops for Nvidia backend #11972
Comments
Is there a reason for not increasing the threshold to prevent the random failures?
Generally speaking, for any classifier you have a tradeoff between false positives and false negatives. It's difficult to increase the thresholds further without the risk of not detecting some potential bug. Part of the problem is that the different backends have slightly different implementations, which can be amplified by unlucky RNG. The tests could instead be run on larger sample sizes, where the likelihood of randomly different results is smaller, but then the tests would also take longer.
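To illustrate the sample-size point with a standalone toy simulation (my own sketch, not project code; the noise level is an arbitrary stand-in for backend rounding differences), note how an NMSE-style ratio concentrates as the element count grows, so random crossings of the 5e-4 threshold discussed in this issue become rarer:

```cpp
#include <cstdio>
#include <random>

// Toy model: a reference tensor with unit-variance entries, and a backend output
// that differs from it by independent noise. Count how often the NMSE-style
// ratio sum(d^2)/sum(ref^2) exceeds the 5e-4 threshold purely by chance.
int main() {
    std::mt19937 rng(1234);
    std::normal_distribution<double> ref_dist(0.0, 1.0);
    std::normal_distribution<double> noise(0.0, 0.02);  // expected ratio ~4e-4, just under the threshold

    const int sizes[] = {16, 256, 4096};
    const int trials  = 2000;
    for (int n : sizes) {
        int fails = 0;
        for (int t = 0; t < trials; ++t) {
            double err = 0.0, energy = 0.0;
            for (int i = 0; i < n; ++i) {
                const double r = ref_dist(rng);
                const double d = noise(rng);
                err    += d * d;
                energy += r * r;
            }
            if (err / energy > 5e-4) {
                fails++;
            }
        }
        std::printf("n = %5d: %4d/%d random 'failures'\n", n, fails, trials);
    }
    return 0;
}
```

With larger n the count of purely random failures drops sharply, which is the tradeoff against longer test runtimes mentioned above.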
I think a significant part of the problem here is that the CPU backend is being used as the reference implementation, and a reference implementation should attempt to minimize the error so that it can be used to accurately check the error of the other implementations. However, that's not the case in matrix multiplications with quantized types, since the CPU backend always quantizes the activations first, introducing additional error, and thus it is not really suitable as a reference implementation. It may be better to use an implementation that dequantizes the weights to F32 and performs the matrix multiplication in F32. The BLAS backend does precisely that. This would allow us to accurately estimate the error and set a bound for the maximum acceptable error that we can enforce with confidence.
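A minimal sketch of that kind of reference (my own illustration; the dequantizer callback and the row-major layouts are placeholder assumptions, not ggml's actual conversion traits): dequantize each weight row to F32 once, then accumulate against the already-F32 activations, so the only quantization error in the reference comes from the weights themselves.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical row dequantizer: expands one quantized row of `k` values into floats.
// In ggml this role would be played by the type's to-float conversion, but the
// signature here is a placeholder for illustration.
using dequant_row_fn = void (*)(const void * q_row, float * out, size_t k);

// Reference dst = A * B^T with A (m x k) quantized and B (n x k) already in F32.
// All accumulation happens in F32; the layouts are assumptions for this sketch.
static void ref_mul_mat_f32(const void * a_quant, size_t a_row_bytes, dequant_row_fn dequant,
                            const float * b, float * dst, size_t m, size_t n, size_t k) {
    std::vector<float> a_row(k);
    for (size_t i = 0; i < m; ++i) {
        dequant((const char *) a_quant + i * a_row_bytes, a_row.data(), k);
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t l = 0; l < k; ++l) {
                acc += a_row[l] * b[j * k + l];
            }
            dst[i * n + j] = acc;
        }
    }
}
```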
Would it be viable to add an optional static seed to prevent the sporadic failures? The thought behind this is that in the case of CI running the tests, a fixed seed might be preferable.
I think it is preferable to continue using random seeds, because this is not a spurious error that should be ignored.
@slaren can you clarify what you mean by that? To me this looks exactly like a spurious error that should be ignored, in the sense that it does not point to an issue with a backend. Concretely, our CI jobs testing the SYCL backend on Nvidia HW sometimes fail with an error that we need to ignore. We reported the issue only for the CUDA backend, but we're looking for a solution that would also solve the issue with SYCL. I understand changing the reference for mul_mat could fix the issue, but I feel this brings more questions: do you suggest using the BLAS backend only for mul_mat? Is it acceptable to add a dependency in order to run the tests? Are we sure this change would make sense for every backend, or do you suggest using this solution only for some backends? I don't think we need to answer these questions as part of this issue. We were thinking that introducing an option to use a fixed seed would have the least impact. It would also be useful for CIs and for debugging purposes, to more easily reproduce any issues (not just those affecting mul_mat). Just to clarify, do you think anyone would be opposed to a PR that introduces an option to use a fixed seed if we were to submit the change?
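For illustration only, a hedged sketch of what such an opt-in fixed seed could look like (the --seed flag and the shared generator are hypothetical, not existing test-backend-ops behavior): keep a random seed by default, but let CI or a developer pass a value so a failing run can be reproduced exactly.

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <random>

// Shared RNG used for tensor initialization in this sketch.
static std::mt19937 & test_rng() {
    static std::mt19937 rng{std::random_device{}()};  // random by default
    return rng;
}

int main(int argc, char ** argv) {
    for (int i = 1; i < argc - 1; ++i) {
        if (std::strcmp(argv[i], "--seed") == 0) {
            // Re-seed the shared generator so every randomly initialized tensor
            // is identical across runs with the same seed value.
            test_rng().seed((uint32_t) std::strtoul(argv[i + 1], nullptr, 10));
        }
    }
    // ... register and run the test cases, drawing random data from test_rng() ...
    return 0;
}
```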
If it is a bug due to the reference implementation used by test-backend-ops […]. I do not think that using BLAS as a general reference implementation for test-backend-ops […].
I think the issue is more likely to be a flaw in the testing methodology than an incorrect implementation. Updating the reference to match other backends would be good but I think it is difficult to do in projects supporting multiple backends that are allowed to have different precision requirements. |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA A100-PCIE-40GB)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz)
version: 4667 (d2fe216)
built with gcc (GCC) 12.2.0 for x86_64-pc-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
Test code
Command line
`./bin/test-backend-ops`
Problem description & steps to reproduce
Test failure was encountered while running MUL_MAT through test-backend-ops:
MUL_MAT(type_a=q5_1,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3])
A test case was created here, but test-backend-ops did not reproduce the error. Modifying test-backend-ops.cpp by adding the mul_mat test case 1000 times reproduced the failing test consistently (at least a few out of the 1000 cases would fail).
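A hedged sketch of that modification (the test_mul_mat constructor arguments are inferred from the printed test name and may not match the exact signature in the current test-backend-ops.cpp):

```cpp
// In the test-case registration of test-backend-ops.cpp (sketch only): repeat the
// reported Q5_1 x F32 case so a single run samples many random inputs.
for (int i = 0; i < 1000; ++i) {
    test_cases.emplace_back(new test_mul_mat(GGML_TYPE_Q5_1, GGML_TYPE_F32,
                                             16, 1, 256, {1, 1}, {1, 1}));
}
```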
The Nvidia backend seems to convert src1 to the Q8_1 type and then run mul_mat with Q5_1 and Q8_1 inputs. Could this be causing the precision issue?
The largest NMSE encountered over 20000 runs was 0.001409.
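For context, a minimal standalone sketch of how an NMSE check of this form behaves (my own illustration; the exact normalization and bookkeeping in test-backend-ops.cpp may differ): the squared difference between the two outputs is normalized by the energy of the reference values and compared against the per-op threshold.

```cpp
#include <cstdio>
#include <vector>

// NMSE = sum((out - ref)^2) / sum(ref^2), compared against the op's max error.
static double nmse(const std::vector<double> & out, const std::vector<double> & ref) {
    double err = 0.0, energy = 0.0;
    for (size_t i = 0; i < out.size(); ++i) {
        const double d = out[i] - ref[i];
        err    += d * d;
        energy += ref[i] * ref[i];
    }
    return err / energy;
}

int main() {
    // First two value pairs from the log output below (which column is the backend
    // and which is the reference is not labeled in the log, so this is illustrative).
    const std::vector<double> out = {0.948417, -2.924956};
    const std::vector<double> ref = {1.035245, -2.844111};
    const double max_err = 5e-4;  // MUL_MAT threshold mentioned in this report
    const double e = nmse(out, ref);
    std::printf("NMSE = %g -> %s\n", e, e > max_err ? "FAIL" : "OK");
    return 0;
}
```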
Is a loss of precision to this degree expected? The max error for the mul_mat tests is set to 5e-4. Should this be modified?

First Bad Commit
Due to the sporadic nature of the test failure, the commit d2fe216 is only the first one where the failure was encountered; the actual origin has not been identified. The latest commit that was tested and where the error was reproduced is 4806498.
Relevant log output
MUL_MAT(type_a=q5_1,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.000508874 > 0.000500000
  0   0.948417   1.035245, diff = -0.086828
  1  -2.924956  -2.844111, diff = -0.080845
  2  -1.777758  -1.695090, diff = -0.082667
  3   0.450649   0.537106, diff = -0.086457
  4  -4.114096  -4.030904, diff = -0.083191
  5  -0.682358  -0.596930, diff = -0.085428
  6  -8.252451  -8.167437, diff = -0.085014
  7  -0.692235  -0.606851, diff = -0.085384
  8  -5.382234  -5.304606, diff = -0.077628
  9   3.467584   3.552903, diff = -0.085320
 10  -7.941753  -7.861615, diff = -0.080138
 11   3.101702   3.186424, diff = -0.084722
 12   0.954475   1.037351, diff = -0.082876
 13   2.353770   2.437956, diff = -0.084186
 14  -1.223359  -1.139174, diff = -0.084185
 15   0.853322   0.939753, diff = -0.086431