ORTE has lost communication with a remote daemon. #6618
Comments
How could I get more information about this error?
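One way to get more information is to raise the verbosity of the launch, out-of-band, and routing frameworks and keep the remote daemons in debug mode. A minimal sketch (the hostfile name, process count, and application are placeholders; exact output varies by Open MPI version):

```sh
# Re-run the job with the launcher (plm), out-of-band messaging (oob) and
# routing (routed) frameworks at maximum verbosity, and keep the remote
# orted daemons attached so their error output reaches the console.
mpirun -np 8 --hostfile hosts \
    --mca plm_base_verbose 10 \
    --mca oob_base_verbose 10 \
    --mca routed_base_verbose 10 \
    --debug-daemons \
    ./my_mpi_app
```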
I met a similar case, and found that my problem is related to the number of processes:
with a smaller np it works fine, but once I increase np to 200 the problem appears.
I can confirm that these nodes can connect to each other, since the `mpirun hostname` sample works when I switch the hostfile to select those nodes.
... hoping that someone knows how to debug this 👽
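For reference, the kind of minimal reproducer described above looks like this (hostfile name is a placeholder):

```sh
# Launch a trivial non-MPI command across the allocation. If this already
# fails with "ORTE has lost communication with a remote daemon", the
# problem is in daemon wireup, not in the MPI application itself.
mpirun -np 200 --hostfile hosts hostname
```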
Is it possible to configure Open MPI not to terminate the job when ORTE has lost communication with a remote daemon?
I added …
If Open MPI hasn't even finished launching yet and something goes wrong, that's a fatal error -- we don't have any other option except to abort. Your job is relatively small -- 8 nodes -- so you shouldn't be bumping up against any TCP socket or file descriptor limits. I'm assuming you're launching in Kubernetes -- is each one of these …
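If you do want to rule out descriptor limits, a quick check on each node (or inside each container) is enough; the raised value below is only an example:

```sh
# Show the per-process open-file limit; the daemons open one TCP socket per
# peer they communicate with, so a very low limit can matter at scale.
ulimit -n
# Raise it for the current shell if it looks small (example value).
ulimit -n 65536
```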
@tingweiwu @jsquyres Hi, regarding my problem (`mpirun hostname` failing with np 256), I found a workaround: run mpirun with …
I'm going to pick up this issue (instead of opening a new one), as I think it is the same issue I've been chasing today. I'm using a recent HEAD of v4.0.x (390e0bc) on a bare-metal system of 4 nodes, and the failure appears if I force the nodemap to be communicated instead of placed on the command line.
The problem goes away if I do either: …
There are a few problems here to work through on the v4.0.x branch (note that this works correctly on …).
Now we have a root cause: …
Repair options: …
@rhc54 You know this code probably the best. Do you foresee any issue with going with option (3)?
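For anyone following along who wants to see which `routed` components their own build provides (and their MCA parameters and priorities), `ompi_info` can list them; the output format differs between versions:

```sh
# List the routed components compiled into this Open MPI installation.
ompi_info | grep "MCA routed"
# Show the MCA parameters (including priorities) of all routed components.
ompi_info --param routed all --level 9
```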
Either two or three is fine - no need for multi-select anyway.
* Fix open-mpi#6618 - see the comments on Issue open-mpi#6618 for finer details.
* The `plm/rsh` component uses the highest-priority `routed` component to construct the launch tree. The remote orteds will activate all available `routed` components when updating routes. This leaves the opportunity for the parent vpid on a remote `orted` not to match the one expected in the tree launch. The result is that the remote orteds try to contact their parent with the wrong contact information, and orted wireup fails.
* This fix forces the orteds to use the same `routed` component that the HNP used when constructing the tree, if tree launch is enabled.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
Note: I checked the v3.0.x and v3.1.x branches and they do not show this specific problem. It must be unique to the v4.0.x branch.
Yes, it is - the reason is that the v4.0 branch came during a point in time where we were trying to use libfabric for ORTE collectives during launch. This meant that the mgmt Ethernet OOB connections required a different routing pattern from the libfabric ones - and hence we made the routed framework multi-select. We decided not to pursue that path after the v4.0 branch and offered to remove that code from the release branch, but it was deemed too big a change.

If it were me, I'd just make routed single-select and completely resolve the problem. Nothing will break because you cannot use the RML/OFI component unless you explicitly request it. However, there is nothing wrong with this approach as it effectively makes the routed framework single-select on the remote orteds by setting the MCA param to a single component.
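As a user-level way to get the same effect described above, the `routed` framework can be pinned to a single component on the mpirun command line. A sketch, assuming `radix` is available in your build (`binomial` and `direct` are other standard components), with hostfile and application as placeholders:

```sh
# Force every orted to select the same routed component as the HNP, so the
# launch tree and the routing tree agree.
mpirun -np 8 --hostfile hosts --mca routed radix ./my_mpi_app
```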
@tingweiwu This fix was just merged to v4.0.x for eventual inclusion in v4.0.2. Could you please verify that this is fixed on the v4.0.x branch?
mwheinz verified this appears to be fixed in master and v4.0.x (#7087).
(Same commit message as above; cherry picked from commit ca0f4d4d32bff55e04841dea5055147661866b83 onto the v4.0.x branch.)
Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Install Open MPI
Please describe the system on which you are running
Ubuntu 16.04
V100 GPU + InfiniBand
Docker CNI networking
Details of the problem
I get this error frequently, though not every time; it occurs both when the processes are starting and while they are running.
I have checked that the network between
i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd
and
i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0
is OK, and no OOM has occurred. Do you have any suggestions for finding the cause?
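For completeness, a connectivity check of the kind mentioned above might look like the following from the Kubernetes side; the pod names are the ones above, but the remote-shell agent (ssh here) and DNS behaviour depend on how the launcher image is set up, so treat this as an assumption:

```sh
# From the launcher pod, confirm the worker's hostname resolves and that a
# trivial remote command succeeds over the launcher's remote-shell agent.
kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd -- \
    getent hosts i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0
kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd -- \
    ssh i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0 hostname
```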