Sync to PMIx v2.1.0 #4746
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
👍 Good to go.
👍
I am receiving consistent reports of "hangs" of PMIx-based programs using PMIx v2.1.0rc2 when direct launched against Slurm 17.11. I would advise not moving forward until that gets resolved as we have no info as to whether the problem is in the Slurm plugin, or in PMIx itself. Given reassignment of @artpol84 and @karasevb, I'm not sure when the Slurm problem will be investigated. Perhaps someone here can comment?
This PR looks like a fine sync of PMIx 2.1 to the v3.1 branch.
I saw @rhc54's concern in the comments, so I don't know whether we want to delay merging until that gets resolved.
@rhc54 can you share those reports with me and point anyone who has issues to me?
@rhc54 I'd appreciate it if, in the future, you would let us know ASAP about any issues related to Slurm/PMIx.
On our side, we run Slurm with PMIx v2.1 on a daily basis. @jjhursey I think we should merge it, and we will address the issues, if any.
@artpol84 Maybe @rhc54 is thinking about this PMIx user-reported issue:
@artpol84 I have been letting you know about these problems, but haven't been getting any response. In addition to the mailing list, I have emailed you directly about it. The reports are coming from both HPE and Intel. The behavior is the same in both cases. Note that HPE is not using OMPI, but rather a simple PMIx client test code. Same for Intel. In the Intel case, Slurm 17.11.2.1 is configured with PMIx v2.1.0rc2. A simple "srun --mpi=pmix_v2" of a PMIx test client that calls PMIx_Init/Finalize hangs until timeout occurs. I don't have a lot of diagnostic output at this time, but have requested more.
We will resolve this.
Looking into the reports, it appears that the fence may be broken. One possibility might explain the difference between your tests and what is being reported: are your tests always using the IB "accelerated" path to do communications? If so, I suspect the other code path is having problems.
We test all of the cases.
@bwbarrett @hppritcha:
Here are some links:
The official release is now available: https://github.com/pmix/pmix/releases/tag/v2.1.0 It contains a couple of required bug fixes beyond rc2, so I'd recommend updating before commit.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Yeah this is good to go. 👍 Thanks!
@karasevb, with PMIx 2.1.0 going GA, is it possible to refresh this patch?
@bwbarrett I already updated it - should be ready to go
Signed-off-by: Boris Karasev <karasev.b@gmail.com>