Skip to content

UCX OSC violates MPI standard with accumulate + fetch and op #4688

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
hjelmn opened this issue Jan 9, 2018 · 15 comments
Closed

UCX OSC violates MPI standard with accumulate + fetch and op #4688

hjelmn opened this issue Jan 9, 2018 · 15 comments

Comments

@hjelmn
Copy link
Member

hjelmn commented Jan 9, 2018

Thank you for taking the time to submit an issue!

Background information

The UCX OSC component includes an optimization for MPI_Fetch_and_op(). Unfortunately this optimization leads to incorrect results when mixing MPI_Fetch_and_op() with MPI_Accumulate().

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

master, v3.1.0

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Built from git checkout

Please describe the system on which you are running

  • Operating system/version: Linux nid00020 4.4.49-92.11.1_3.0-cray_ari_c BTL checkpoint friendly #1 SMP Mon Dec 11 23:32:19 UTC 2017 (3.0.99) x86_64 x86_64 x86_64 GNU/Linux
  • Computer hardware: Cray XC-40
  • Network type: Aries

Details of the problem

See the following program. This program will be placed into MTT today:

https://gist.github.com/hjelmn/c8e54a8a6526b939703a6b894f186bab

The program is simple. Each rank performs an MPI_Accumulate() of 1024 int32_t's on its left neighbor and an MPI_Fetch_and_op() on its right neighbor. This is a valid MPI program and it fails with osc/ucx. It passes with osc/rdma.

If this isn't fixed by v3.1.0 I recommend we software-disable the osc/ucx component until it is fixed since it is a correctness issue.

@hjelmn
Copy link
Member Author

hjelmn commented Jan 9, 2018

@artpol84 Please assign to the appropriate person.

@hjelmn
Copy link
Member Author

hjelmn commented Jan 9, 2018

I should point out that osc/rdma fails this test when the ompi_single_accumulate info key or the associated MCA variable is set. That is fine as we are allowed to weaken semantics with info keys. I suggest adding the same info key to osc/ucx to turn the optimization on.

@artpol84 artpol84 removed their assignment Jan 9, 2018
@artpol84
Copy link
Contributor

artpol84 commented Jan 9, 2018

@hjelmn thanks for tracking this.

@hjelmn
Copy link
Member Author

hjelmn commented Jan 9, 2018

No problem. Sorry I didn't get this up before the new year. Got sucked into other projects :-/.

@xinzhao3
Copy link
Contributor

I have found the problem and am working on a PR, will push soon. @bwbarrett

@xinzhao3
Copy link
Contributor

I created a PR to fix this issue: #4728 @hjelmn could you try to see if the bug is fixed?

@jsquyres
Copy link
Member

jsquyres commented Feb 6, 2018

Re-opening this issue because we noticed #4731 was closed without merging.

Per discussion on 6 Feb 2018, @artpol84 is going to investigate what happened here.

@gpaulsen
Copy link
Member

gpaulsen commented Feb 6, 2018

Do we need to Pull this to v3.0.x also?

@hjelmn
Copy link
Member Author

hjelmn commented Feb 6, 2018

No. The ucx osc component is only in v3.1 and master.

@artpol84
Copy link
Contributor

artpol84 commented Feb 6, 2018

@jsquyres #4731 was accidentally closed. I reopened it.

@artpol84
Copy link
Contributor

artpol84 commented Feb 6, 2018

@hjelmn I've asked for your review, can you do that?

@bwbarrett
Copy link
Member

@artpol84 / @jladd-mlnx this is currently marked as a blocker on 3.1. Any thoughts?

@artpol84
Copy link
Contributor

@bwbarrett #4731 is merged. We can close this I believe.

@bwbarrett
Copy link
Member

@artpol84 can you confirm and close?

@artpol84
Copy link
Contributor

@hjelmn reviewed and approved #4731. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants