attribute functions lead to application segfault #10339
Comments
4.1.4v1 works, so the bug was introduced in the 5.0 branch. v5.0.0rc1 and v5.0.0rc2 work. v5.0.0rc6, v5.0.0rc5, v5.0.0rc4 and v5.0.0rc3 are broken. |
I speculate that #10070 is the culprit, based on the timing. That commit is from March 5, and the error first appears in the tag created on March 8. @bwbarrett can you take a look at this? Generally, can the developers of Open-MPI please add more tests so this doesn't happen again? It seems rather shocking that the Open-MPI test suite does not contain basic unit tests of object attribute functions. There are numerous tests in the MPICH test suite that you all can use, or you can just run the ARMCI-MPI test suite so I don't have to do it manually as a maintainer of a dependent project, since there is overwhelming evidence that you all have a problem when it comes to RMA QA.
|
1bcc6b1 is broken. Now to check the commit before it... |
I was able to successfully pass the test program with both FWIW, I configure'd with @jeffhammond can you please copy/paste your |
Also, you can see the configure used in ompi_info above. That's why I didn't include it separately. |
I tried again without which compiler are you using? BTW, the latest updates are in the |
I used main but also, you can see I have the failure with multiple 5.0.x release candidates. |
See ompi_info above for compiler information: C compiler absolute: /bin/gcc |
To rule out other causes, I've removed all MPI-related packages from Apt and built Open-MPI statically. It seems static is broken in that I have to manually remove I guess I'll rule out GCC 11 being the cause but I'm really skeptical that this is a C compiler bug.
|
Okay, this is some weird ****. GCC 7 is fine. I'll figure out which GCC versions are bad, but it's entirely possible this is a UB situation, in which case it's still an Open-MPI problem.
But the fact remains, when GCC 11 is used (and not GCC 10 or older), the following is true, which implies that some change in Open MPI between v5.0.0rc2 and v5.0.0rc3 breaks a trivial MPI utility function.
|
@jeffhammond I ran mpi4py testsuite on the commit from your original post under GitHub Actions https://github.com/mpi4py/mpi4py-testing/actions/workflows/openmpi.yml Full logs here: https://github.com/mpi4py/mpi4py-testing/runs/6227884365?check_suite_focus=true
|
I rebuilt with the same command and GCC 11.1.0 on RHEL 7 and was unable to reproduce the issue :-( Which Ubuntu flavor are you running on? |
Well, it is the ubuntu-20.04 image from GitHub Actions's runners. From the logs:
|
@ggouaillardet This looks like memory corruption. Have you run Jeff's reproducer under valgrind? |
@ggouaillardet I ran my CI over PS: Perhaps I should run the mpi4py tests daily rather than weekly. But it would be even better if you copied my CI and set up your own GitHub Actions to run the mpi4py test suite. Or at least join this repo (note, it is not the main mpi4py repo, but one used exclusively for running tests manually or on a schedule) https://github.com/mpi4py/mpi4py-testing to get notifications when the scheduled tests fail. |
I see the same issue with GCC 11 - but not with GCC 10 - on AArch64.
|
Ubuntu 20.04 on all my machines... |
I reproduced on the Ubuntu 18.04 AArch64 machine I have, with GCC 11:
@ggouaillardet what version of glibc do you have on RHEL 7? Maybe binutils and ld too, just to be thorough. |
I can reproduce it on Linux Mint with GCC 11.2.0 and Open MPI debugging disabled. Will take a closer look |
I can't reproduce on my Ubuntu 18.04 laptop with GCC 7.5 (the latest/default). RHEL 8.4 with GCC 8.4.1 also no dice. |
Please use GCC 11. I've tried 7-11 and only 11 triggers this. |
This does not only affect RMA window attributes but all attributes seem broken with GCC 11. I put up #10343 but it's more a bandaid than a fix for code that appears fishy to me. I hope someone who remembers the rationale behind it can comment. |
FWIW, I reported this behavior to the GCC folks at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105449 The issue shows up starting at -O2. I ran a few tests among the GCC versions I have:
|
And the code is just violating C aliasing rules. Pointer types cannot alias int. So you cannot read, through an "int" lvalue, something that was stored as a void*. (I made a mistake in the GCC bug report because I missed that the array was int* and not int**, but it was a minor mistake that does not change the aliasing issue.) |
Your attributes implementation is broken by GCC 11.
I speculate the code that GCC 11 has optimized into badness is not valid C, but I'm not enough of a language lawyer to know for sure.
The Bug
MCVE
ompi_info
It works with Open-MPI 3...
This code has worked for approximately 8 years with every other MPI implementation I've tried, including this one.