Comm_spawn/accept using inefficient (and frail) rendezvous mechanism #10110

Open
rhc54 opened this issue Mar 11, 2022 · 0 comments

rhc54 commented Mar 11, 2022

The current OMPI dpm (dynamic process management) code uses the PMIx publish/lookup mechanism for rendezvous during MPI comm_spawn and connect/accept operations. Over time this mechanism has proven fragile, and it does not scale well.
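As a rough illustration of the publish/lookup pattern, here is a toy Python model of a two-sided rendezvous. `KeyValueStore`, `publish`, `lookup` and `rendezvous` are hypothetical names. This is neither the PMIx API (`PMIx_Publish` / `PMIx_Lookup`) nor the actual dpm code. It is only a sketch of the shape of the exchange, under the assumption that each side publishes its contact info under a key and then blocks on a lookup of its peer's key.

```python
import threading
import time


class KeyValueStore:
    """Hypothetical stand-in for a publish/lookup data service (not PMIx)."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()

    def publish(self, key, value):
        # Keys are single-use in this model; republishing is an error.
        with self._cond:
            if key in self._data:
                raise KeyError(f"key already published: {key}")
            self._data[key] = value
            self._cond.notify_all()  # wake any blocked lookups

    def lookup(self, key, timeout):
        # Block until the key appears or the timeout expires.
        deadline = time.monotonic() + timeout
        with self._cond:
            while key not in self._data:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    raise TimeoutError(f"lookup of {key!r} timed out")
                self._cond.wait(remaining)
            return self._data[key]


def rendezvous(store, my_name, peer_name, my_info, timeout=1.0):
    """Hypothetical helper: publish my info, then wait for the peer's."""
    store.publish(my_name, my_info)
    return store.lookup(peer_name, timeout)
```

In this model a rendezvous succeeds only if both keys appear before the timeout expires, and every pairwise exchange goes through one shared store.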

A better mechanism would be to use the PMIx "group" functions, as these are designed to scale. We couldn't do this before now because earlier versions of PMIx and PRRTE did not include the "group" operations, but we now require versions recent enough to guarantee that this support is present.

It would therefore be advisable to update the dpm to take advantage of these faster and more robust operations. The required code would be identical to that used for creating an MPI "session": a simple group construct (called by all participants) that includes a request to assign a new CID would suffice, and would eliminate a good deal of complex code currently in the dpm.
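To make the proposed semantics concrete, here is a toy Python model of a collective group construct that also assigns a CID: every participant calls it, nobody returns until all have arrived, and all of them receive the same newly assigned CID. `GroupService`, `construct` and the integer CID counter are hypothetical. As we read the PMIx v4 standard, the real operation would be `PMIx_Group_construct` with the `PMIX_GROUP_ASSIGN_CONTEXT_ID` directive, but the sketch below does not use PMIx at all and is only a model of the behavior.

```python
import itertools
import threading


class GroupService:
    """Hypothetical model of a collective group construct with CID assignment."""

    def __init__(self):
        self._next_cid = itertools.count(1)  # hypothetical CID allocator
        self._lock = threading.Lock()
        self._groups = {}  # group name -> (barrier, result holder)

    def _get_group(self, name, nprocs):
        with self._lock:
            if name not in self._groups:
                holder = {}

                def assign_cid():
                    # Barrier action: runs once, in one thread, after all
                    # participants arrive and before any of them is released.
                    with self._lock:
                        holder["cid"] = next(self._next_cid)

                barrier = threading.Barrier(nprocs, action=assign_cid)
                self._groups[name] = (barrier, holder)
            return self._groups[name]

    def construct(self, name, nprocs, timeout=1.0):
        barrier, holder = self._get_group(name, nprocs)
        # Collective step: raises threading.BrokenBarrierError on timeout.
        barrier.wait(timeout)
        return holder["cid"]
```

The point the model illustrates is that rendezvous and CID agreement happen in a single collective call, instead of a key exchange followed by separate CID logic.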
