flux MCA plugin that implements fence_nb() as a blocking interface causes deadlock in UCX teardown #11938
@garlick Perhaps you can refresh my memory. Doesn't your flux pmix plugin use PMI-1 calls behind the PMIx calls to actually implement things? Or has that changed? Just trying to understand how using that PMI-1 based plugin would solve the problem.

I do agree, however, that having a non-blocking fence actually be a blocking function is probably not a good thing to do.
When you say "flux pmix plugin" are you talking about the flux-pmix plugin to Flux or the flux MCA plugin to the pmix framework in ompi, e.g. "MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.1.0)"? If the former, no that embeds your openpmix server and so ompi gets the full PMIx interface including non-blocking fence. When we use that, this problem does not manifest. If the latter, yes, exactly, it dlopens flux's In this issue I was mainly interested in seeing if there is any quick fix we could apply which would avoid this deadlock, which is annoying users. A nice human readable message explaining that UCX requires a non-blocking fence that is not provided by the currently loaded pmix plugin and an abort would be better. OR, if the UCX disconnects could be forced to complete before the barrier is called, that would be really nice. But I won't hold my breath on that one. |
Yes, thanks - that does help tickle the old brain cells. The problem is that there are a number of user-controlled behaviors that cause non-blocking fences to be executed - it isn't just a UCX finalize issue. Quite frankly, you've just been getting lucky that users haven't hit this more broadly 😄 Attempting to defensively program around the limitation everywhere one might encounter it is probably too big a change for OMPI to pursue. There are really only a couple of things you could do here.

HTH
Will it do that or will it start a bunch of singletons?
Hmmm... I'm not entirely sure. I guess to be safe you could replace the existing MCA plugin with one that looks for the
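Whatever the exact check turns out to be, the general idea (fail fast with a readable message when the selected plugin cannot provide a real non-blocking fence, rather than hang later in teardown) could look roughly like the sketch below. Everything in it, including the fence_caps struct, its capability flag, and require_nonblocking_fence(), is a hypothetical illustration rather than code from an actual OMPI component.

```c
/* Hypothetical guard: abort with a readable message instead of deadlocking
 * when a non-blocking fence is required but the loaded plugin only provides
 * a blocking one.  Names and the capability flag are illustrative only. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct fence_caps {
    bool has_nonblocking_fence;   /* would be filled in by the pmix component */
};

void require_nonblocking_fence(const struct fence_caps *caps,
                               const char *component_name)
{
    if (!caps->has_nonblocking_fence) {
        fprintf(stderr,
                "UCX teardown requires a non-blocking PMI fence, but the\n"
                "currently loaded pmix component (%s) only provides a\n"
                "blocking one.  Aborting instead of risking a hang in\n"
                "MPI_Finalize().\n", component_name);
        abort();
    }
}
```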
OTOH in v5.x there will be no flux schizo plugin, so maybe it is better to have consistent behavior going forward. I'll check in with our team and make sure nobody objects, but I'm inclined to remove it and let ompi work as designed without launcher-specific hacks. At least bunch-o-singletons is an MPI failure mode well understood by our support staff.
@garlick Is this resolved?
I need to double-check with the team this week to see if it's OK, but I was going to propose that we remove the flux plugins in the 4.x branches. (It's already gone in 5.x.)
@garlick Is Flux only used at LLNL?
No.
@garlick Then it would be difficult for us to remove the flux plugins in the middle of a release series.
Ah OK. Well that de-complicates things then. This can be closed if you like.
Problem: a 3-node MPI hello world (init/barrier/finalize) sometimes hangs in MPI_Finalize().

Environment:

Stack traces show opal_common_ucx_wait_all_requests() and opal_pmix.fence_nb().
It appears that UCX makes use of the non-blocking nature of fence_nb() and ucp_disconnect_nb() to allow a PMI barrier and the sending of UCX disconnect requests to progress in parallel. But the flux MCA plugin's implementation of fence_nb() is actually a blocking call. A theory is that sometimes disconnect messages are queued instead of sent directly, and the lack of progress once that rank enters fence_nb() prevents the messages from going out, so the fence never completes and the ranks deadlock.

Probably the right solution for users is to use the flux pmix plugin by running with -o pmi=pmix; this is confirmed to resolve the problem. However, #8380 effectively converted a segfault due to calling a NULL fence_nb() into a semi-reproducible hang, which is arguably not an improvement. Perhaps it would be better to revert it and have UCX treat the lack of a fence_nb() as a fatal runtime error.

Further details: flux-framework/flux-core#5460
Edit: however that was just a theory! Maybe someone who knows ompi/ucx code could confirm or deny?
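To make the theory above concrete, here is a rough sketch of the teardown pattern the issue describes: queue ucp_disconnect_nb() requests, start the PMI fence, then progress the worker until both complete. Only the ucp_* calls are real UCX API; teardown(), pmix_fence_nb(), and fence_state are hypothetical stand-ins for the actual opal_common_ucx and opal_pmix code, whose signatures differ.

```c
/* Rough sketch of the intended teardown pattern; illustrative only,
 * not the actual opal_common_ucx code. */
#include <stdbool.h>
#include <ucp/api/ucp.h>

struct fence_state { volatile bool done; };

static void fence_complete(int status, void *cbdata)
{
    (void)status;
    ((struct fence_state *)cbdata)->done = true;    /* fence finished */
}

/* Stand-in for opal_pmix.fence_nb(); the real signature differs. */
extern int pmix_fence_nb(void (*cbfunc)(int, void *), void *cbdata);

void teardown(ucp_worker_h worker, ucp_ep_h *eps, int nep)
{
    void *reqs[nep];
    struct fence_state fence = { .done = false };

    /* 1. Queue non-blocking disconnects to every peer. */
    for (int i = 0; i < nep; i++)
        reqs[i] = ucp_disconnect_nb(eps[i]);

    /* 2. Start the PMI fence.  If fence_nb() is truly non-blocking it returns
     *    immediately and we keep progressing UCX below, so the queued
     *    disconnect messages actually get sent.  If (as with the flux MCA
     *    plugin) it blocks until all ranks arrive, step 3 is never reached,
     *    our disconnects are never progressed, and peers waiting for them
     *    in opal_common_ucx_wait_all_requests() hang as well. */
    pmix_fence_nb(fence_complete, &fence);

    /* 3. Progress UCX until every disconnect request and the fence complete. */
    for (int i = 0; i < nep; i++) {
        if (!UCS_PTR_IS_PTR(reqs[i]))
            continue;                       /* completed immediately or error */
        while (ucp_request_check_status(reqs[i]) == UCS_INPROGRESS)
            ucp_worker_progress(worker);
        ucp_request_free(reqs[i]);
    }
    while (!fence.done)
        ucp_worker_progress(worker);
}
```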