flux MCA plugin that implements fence_nb() as a blocking interface causes deadlock in UCX teardown #11938
@garlick Perhaps you can refresh my memory. Doesn't your flux pmix plugin use PMI-1 calls behind the PMIx calls to actually implement things? Or has that changed? Just trying to understand how using that PMI-1 based plugin would solve the problem.

I do agree, however, that having a non-blocking fence actually be a blocking function is probably not a good thing to do.
When you say "flux pmix plugin" are you talking about the flux-pmix plugin to Flux or the flux MCA plugin to the pmix framework in ompi, e.g. "MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.1.0)"? If the former, no that embeds your openpmix server and so ompi gets the full PMIx interface including non-blocking fence. When we use that, this problem does not manifest. If the latter, yes, exactly, it dlopens flux's In this issue I was mainly interested in seeing if there is any quick fix we could apply which would avoid this deadlock, which is annoying users. A nice human readable message explaining that UCX requires a non-blocking fence that is not provided by the currently loaded pmix plugin and an abort would be better. OR, if the UCX disconnects could be forced to complete before the barrier is called, that would be really nice. But I won't hold my breath on that one. |
Yes, thanks - that does help tickle the old brain cells. The problem is that there are a number of user-controlled behaviors that cause non-blocking fences to be executed - it isn't just a UCX finalize issue. Quite frankly, you've just been getting lucky that users haven't hit this more broadly 😄 Attempting to defensively program around the limitation everywhere one might encounter it is probably too big a change for OMPI to pursue. There are really only a couple of things you could do here.

HTH
Will it do that or will it start a bunch of singletons?
Hmmm... I'm not entirely sure. I guess to be safe you could replace the existing MCA plugin with one that looks for the
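Whatever the exact check turns out to be, the general idea (fail fast with a readable message when the selected plugin cannot provide a real non-blocking fence, rather than hang later in teardown) could look roughly like the sketch below. Everything in it, including the fence_caps struct, its capability flag, and require_nonblocking_fence(), is a hypothetical illustration rather than code from an actual OMPI component.

```c
/* Hypothetical guard: abort with a readable message instead of deadlocking
 * when a non-blocking fence is required but the loaded plugin only provides
 * a blocking one.  Names and the capability flag are illustrative only. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct fence_caps {
    bool has_nonblocking_fence;   /* would be filled in by the pmix component */
};

void require_nonblocking_fence(const struct fence_caps *caps,
                               const char *component_name)
{
    if (!caps->has_nonblocking_fence) {
        fprintf(stderr,
                "UCX teardown requires a non-blocking PMI fence, but the\n"
                "currently loaded pmix component (%s) only provides a\n"
                "blocking one.  Aborting instead of risking a hang in\n"
                "MPI_Finalize().\n", component_name);
        abort();
    }
}
```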
OTOH in v5.x there will be no flux schizo plugin, so maybe it is better to have consistent behavior going forward. I'll check in with our team and make sure nobody objects, but I'm inclined to remove it and let ompi work as designed without launcher-specific hacks. At least bunch-o-singletons is an MPI failure mode well understood by our support staff.
@garlick Is this resolved?
I need to double-check with the team this week to see if it's OK, but I was going to propose that we remove the flux plugins in the 4.x branches. (It's already gone in 5.x.)
@garlick Is Flux only used at LLNL?
No.
@garlick Then it would be difficult for us to remove the flux plugins in the middle of a release series.
Ah OK. Well that de-complicates things then. This can be closed if you like.
Problem: a 3-node MPI hello world (init/barrier/finalize) sometimes hangs in MPI_Finalize().

Environment:

Stack traces show opal_common_ucx_wait_all_requests() and opal_pmix.fence_nb().
It appears that UCX makes use of the non-blocking nature of fence_nb() and ucp_disconnect_nb() to allow a PMI barrier and the sending of UCX disconnect requests to progress in parallel. But the flux MCA plugin's implementation of fence_nb() is actually a blocking call. A theory is that sometimes disconnect messages are queued instead of sent directly, and the lack of progress once that rank enters fence_nb() prevents the messages from going out, so the fence never completes and the ranks deadlock.

Probably the right solution for users is to use the flux pmix plugin by running with -o pmi=pmix; this is confirmed to resolve the problem. However, #8380 effectively converted a segfault due to calling a NULL fence_nb() into a semi-reproducible hang, which is arguably not an improvement. Perhaps it would be better to revert it and have UCX treat the lack of a fence_nb() as a fatal runtime error.

Further details: flux-framework/flux-core#5460
Edit: however that was just a theory! Maybe someone who knows ompi/ucx code could confirm or deny?
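To make the theory above concrete, here is a rough sketch of the teardown pattern the issue describes: queue ucp_disconnect_nb() requests, start the PMI fence, then progress the worker until both complete. Only the ucp_* calls are real UCX API; teardown(), pmix_fence_nb(), and fence_state are hypothetical stand-ins for the actual opal_common_ucx and opal_pmix code, whose signatures differ.

```c
/* Rough sketch of the intended teardown pattern; illustrative only,
 * not the actual opal_common_ucx code. */
#include <stdbool.h>
#include <ucp/api/ucp.h>

struct fence_state { volatile bool done; };

static void fence_complete(int status, void *cbdata)
{
    (void)status;
    ((struct fence_state *)cbdata)->done = true;    /* fence finished */
}

/* Stand-in for opal_pmix.fence_nb(); the real signature differs. */
extern int pmix_fence_nb(void (*cbfunc)(int, void *), void *cbdata);

void teardown(ucp_worker_h worker, ucp_ep_h *eps, int nep)
{
    void *reqs[nep];
    struct fence_state fence = { .done = false };

    /* 1. Queue non-blocking disconnects to every peer. */
    for (int i = 0; i < nep; i++)
        reqs[i] = ucp_disconnect_nb(eps[i]);

    /* 2. Start the PMI fence.  If fence_nb() is truly non-blocking it returns
     *    immediately and we keep progressing UCX below, so the queued
     *    disconnect messages actually get sent.  If (as with the flux MCA
     *    plugin) it blocks until all ranks arrive, step 3 is never reached,
     *    our disconnects are never progressed, and peers waiting for them
     *    in opal_common_ucx_wait_all_requests() hang as well. */
    pmix_fence_nb(fence_complete, &fence);

    /* 3. Progress UCX until every disconnect request and the fence complete. */
    for (int i = 0; i < nep; i++) {
        if (!UCS_PTR_IS_PTR(reqs[i]))
            continue;                       /* completed immediately or error */
        while (ucp_request_check_status(reqs[i]) == UCS_INPROGRESS)
            ucp_worker_progress(worker);
        ucp_request_free(reqs[i]);
    }
    while (!fence.done)
        ucp_worker_progress(worker);
}
```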