Fix tree spawn at scale #6714
thx @rhc54 👍
Hmmm... I thought we did not approve removing components of a framework in a release branch.
I can give you another option - in master, we switched back to a single active component for the routed framework. It would change the routed/base files and so the number of lines changed would be larger. Would you prefer that one?
@karasevb @janjust
@rhc54 yep I'd prefer the second option you mention.
The IBM CI (XL Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/82ef99604f60d30199a17c0908c27ff1
The IBM CI (GNU Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/61596b6f7926ac071049654685fc28f5
Remove the debruijn component as it changes the daemon's parent process ID, thus breaking the other routed components.
Signed-off-by: Ralph Castain <rhc@pmix.org>
@hppritcha @gpaulsen I have replaced this with the original change that removes the debruijn component per today's telecon.
Thanks.
Just so that we have this decision rationale recorded... We concluded on the webex today (2019-06-04) that the debruijn component has never worked in any 3.0.x, 3.1.x, or 4.0.x release (it did work in various 2.x releases). And since it never worked in any of the 3.x or 4.x releases, we can simply remove that component from those releases without breaking our backwards compatibility guarantees. This is, thankfully, the easiest path forward -- it's basically
Note that this was merged on the v3.0.x and v3.1.x branches already. CI seems to be borked somehow -- try again. bot:ompi:retest
The UH CI appears to be offline. Let's see if a new round of CI will fix the issue... bot:retest
The debruijn component is using an algorithm that doesn't respect the
previously assigned parent ID. This causes the other components to have
their routing trees broken whenever debruijn updates routes. This
happens whenever more than 256 nodes are involved, thus breaking tree
spawn for sizes >= 256.
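
To make the failure mode above concrete, here is a minimal sketch of one routing scheme overwriting parent IDs that another scheme already assigned. It is not the ORTE routed code: `radix_parent`, `shift_parent`, `update_routes`, and `broken_links` are hypothetical names, the two parent formulas are simplified stand-ins for a radix tree and a de Bruijn-style layout, and the 256-node threshold is not modeled.

```python
def radix_parent(rank, radix=2):
    """Parent of `rank` in a simple k-ary tree rooted at 0 (illustrative only)."""
    return None if rank == 0 else (rank - 1) // radix


def shift_parent(rank):
    """Parent under a shift-based, de Bruijn-like layout (illustrative only)."""
    return None if rank == 0 else rank >> 1


def update_routes(parents, num_daemons, parent_fn, respect_existing):
    """Assign a parent to every daemon rank.

    When `respect_existing` is False, any previously assigned parent is
    overwritten, which is the behavior described in the text above.
    """
    for rank in range(num_daemons):
        if respect_existing and rank in parents:
            continue
        parents[rank] = parent_fn(rank)


def broken_links(parents, parent_fn):
    """Ranks whose stored parent no longer matches what `parent_fn` expects."""
    return [r for r, p in parents.items() if p != parent_fn(r)]


num_daemons = 8

# The first scheme builds its tree; every stored parent matches it.
parents = {}
update_routes(parents, num_daemons, radix_parent, respect_existing=False)
assert broken_links(parents, radix_parent) == []

# A second scheme that ignores existing assignments clobbers some parents.
clobbered = dict(parents)
update_routes(clobbered, num_daemons, shift_parent, respect_existing=False)
broken = broken_links(clobbered, radix_parent)

# A second scheme that respects existing assignments leaves the tree intact.
preserved = dict(parents)
update_routes(preserved, num_daemons, shift_parent, respect_existing=True)
intact = broken_links(preserved, radix_parent)

print("broken after overwrite:", broken)   # [2, 4, 6]
print("broken when respected:", intact)    # []
```

In this toy setup, ranks 2, 4, and 6 end up with a parent that the first tree does not expect, so any relay that assumes the first tree would be sent to the wrong daemon.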
Thanks to @zrss for the report and diagnosis!
Closes #6713
Fixes #6691
Signed-off-by: Ralph Castain <rhc@pmix.org>