UCX OSC violates MPI standard with accumulate + fetch and op #4688
Comments
@artpol84 Please assign to the appropriate person.
I should point out that osc/rdma fails this test when the …
@hjelmn thanks for tracking this.
No problem. Sorry I didn't get this up before the new year. Got sucked into other projects :-/.
I have found the problem and am working on a PR; will push soon. @bwbarrett
Do we need to pull this to v3.0.x also?
No. The ucx osc component is only in v3.1 and master.
@hjelmn I've asked for your review; can you do that?
@artpol84 / @jladd-mlnx this is currently marked as a blocker on 3.1. Any thoughts?
@bwbarrett #4731 is merged. We can close this, I believe.
@artpol84 can you confirm and close?
Thank you for taking the time to submit an issue!
Background information
The UCX OSC component includes an optimization for MPI_Fetch_and_op(). Unfortunately, this optimization leads to incorrect results when mixing MPI_Fetch_and_op() with MPI_Accumulate().

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
master, v3.1.0
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Built from a git checkout
Please describe the system on which you are running
Details of the problem
See the following program. This program will be placed into MTT today:
https://gist.github.com/hjelmn/c8e54a8a6526b939703a6b894f186bab
The program is simple. Each rank performs an MPI_Accumulate() of 1024 int32_t's on its left neighbor and an MPI_Fetch_and_op() on its right neighbor. This is a valid MPI program, and it fails with osc/ucx. It passes with osc/rdma.

If this isn't fixed by v3.1.0, I recommend we software-disable the osc/ucx component until it is fixed, since it is a correctness issue.
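
For reference, here is a minimal sketch of the pattern described above. The linked gist is the authoritative reproducer; this version follows only the prose, so the window layout, variable names, and the choice to point the fetch-and-op at element 0 are illustrative assumptions, not code from the gist:

```c
/*
 * Sketch of the failure mode: each rank accumulates 1024 int32_t's into
 * its left neighbor's window while concurrently doing an
 * MPI_Fetch_and_op() on its right neighbor's window. Details beyond
 * that (offsets, names, the verification step) are assumptions.
 */
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

#define COUNT 1024

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int left  = (rank + size - 1) % size;
    const int right = (rank + 1) % size;

    int32_t *base;
    MPI_Win win;
    MPI_Win_allocate(COUNT * sizeof(int32_t), sizeof(int32_t),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);
    for (int i = 0; i < COUNT; ++i) base[i] = 0;

    int32_t ones[COUNT];
    for (int i = 0; i < COUNT; ++i) ones[i] = 1;
    int32_t one = 1, fetched = 0;

    MPI_Win_fence(0, win);
    /* Accumulate 1024 elements into the left neighbor's window... */
    MPI_Accumulate(ones, COUNT, MPI_INT32_T, left, 0, COUNT, MPI_INT32_T,
                   MPI_SUM, win);
    /* ...while concurrently doing a fetch-and-op on the right neighbor.
     * Both calls use MPI_SUM, so the standard requires element-wise
     * atomicity even though the two operations may take different code
     * paths inside the OSC component. */
    MPI_Fetch_and_op(&one, &fetched, MPI_INT32_T, right, 0, MPI_SUM, win);
    MPI_Win_fence(0, win);

    /* Element 0 of every window receives both the accumulate from the
     * right neighbor and the fetch-and-op from the left neighbor; the
     * remaining elements receive only the accumulate. */
    int errors = (base[0] != 2);
    for (int i = 1; i < COUNT; ++i)
        if (base[i] != 1) ++errors;

    if (errors)
        printf("rank %d: %d incorrect elements\n", rank, errors);

    MPI_Win_free(&win);
    MPI_Finalize();
    return errors != 0;
}
```

Running with, e.g., `mpirun -n 4 --mca osc ucx ./reproducer` should exit cleanly on a correct implementation; the bug described here manifests as incorrect element values after the closing fence.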