Skip to content

Fix tree spawn at scale #6714

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jun 7, 2019
Merged

Fix tree spawn at scale #6714

merged 2 commits into from
Jun 7, 2019

Conversation

rhc54
Copy link
Contributor

@rhc54 rhc54 commented May 30, 2019

The debrujin component is using an algorithm that doesn't respect the
previously assigned parent ID. This causes the other components to have
their routing trees broken whenever debrujin updates routes. This
happens whenever more than 256 nodes are involved, thus breaking tree
spawn for sizes >= 256

Thanks to @zrss for the report and diagnosis!

Closes #6713
Fixes #6691

Signed-off-by: Ralph Castain rhc@pmix.org

@rhc54 rhc54 added the bug label May 30, 2019
@rhc54 rhc54 added this to the v4.0.2 milestone May 30, 2019
@rhc54 rhc54 requested a review from jsquyres May 30, 2019 13:14
@rhc54 rhc54 self-assigned this May 30, 2019
@zrss
Copy link

zrss commented May 30, 2019

thx @rhc54 👍

@hppritcha
Copy link
Member

Hmmm... I thought we did not approve removing components of a framework in a release branch.
At least that's what I was told when I tried to remove the UCT BTL.

@rhc54
Copy link
Contributor Author

rhc54 commented May 30, 2019

I can give you another option - in master, we switched back to a single active component for the routed framework. It would change the routed/base files and so the number of lines changed would be larger. Would you prefer that one?

@artpol84
Copy link
Contributor

@karasevb @janjust
Please check if this fixes the issues we observe on some systems.
@jladd-mlnx, FYI.

@hppritcha
Copy link
Member

@rhc54 yep I'd prefer the second option you mention.

@ibm-ompi
Copy link

ibm-ompi commented Jun 3, 2019

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/82ef99604f60d30199a17c0908c27ff1

@ibm-ompi
Copy link

ibm-ompi commented Jun 3, 2019

The IBM CI (GNU Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/61596b6f7926ac071049654685fc28f5

rhc54 added 2 commits June 4, 2019 09:49
Remove the debruijn component as it changes the daemon's parent
process ID, thus breaking the other routed components

Signed-off-by: Ralph Castain <rhc@pmix.org>
Signed-off-by: Ralph Castain <rhc@pmix.org>
@rhc54
Copy link
Contributor Author

rhc54 commented Jun 4, 2019

@hppritcha @gpaulsen I have replaced this with the original change that removes the debruijn component per today's telecon.

@gpaulsen
Copy link
Member

gpaulsen commented Jun 4, 2019

Thanks.

@jsquyres
Copy link
Member

jsquyres commented Jun 4, 2019

Just so that we have this decision rationale recorded...

We concluded on the webex today (2019-06-04) that the debruijn component has never worked in any 3.0.x, 3.1.x, or 4.0.x release (it did work in various 2.x releases). And since it never worked in any of the 3.x or 4.x releases, we can simply remove that component from those releases without breaking our backwards compatibility guarantees.

This is, thankfully, the easiest path forward -- it's basically git rm -rf ... in each of the release branches, and no one has to spend any (further) time debugging or developing. 👍

@jsquyres
Copy link
Member

jsquyres commented Jun 4, 2019

Note that this was merged on the v3.0.x and v3.1.x branches already.

CI seems to be borked somehow -- try again.

bot:ompi:retest

@gpaulsen
Copy link
Member

gpaulsen commented Jun 5, 2019

The UH CI appears to be offline. Let's see if a new round of CI will fix the issue...

bot:retest

@gpaulsen gpaulsen requested a review from hppritcha June 7, 2019 19:11
@gpaulsen gpaulsen requested review from hppritcha and removed request for jsquyres June 7, 2019 19:12
@gpaulsen gpaulsen merged commit 0cd5a5a into open-mpi:v4.0.x Jun 7, 2019
@rhc54 rhc54 deleted the cmr40/routed branch November 27, 2019 19:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants