master: MPI_Gather BTL hang (ob1 problem) #4795
Comments
@jsquyres I can't replicate this on our machine, despite using multiple TCP links over multiple interfaces. However, while reviewing the code I might have spotted an issue. Can you try the following patch?

```diff
diff --git a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
index 83b7a44902..57e802c969 100644
--- a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
+++ b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
@@ -181,9 +181,8 @@ static inline mca_pml_ob1_recv_frag_t* check_cantmatch_for_match(mca_pml_ob1_com
     mca_pml_ob1_recv_frag_t *frag = NULL;
     frag = (mca_pml_ob1_recv_frag_t*)opal_list_get_first(&proc->frags_cant_match);
-    if( (opal_list_get_end(&proc->frags_cant_match) != (opal_list_item_t*)frag) &&
-        (frag->hdr.hdr_match.hdr_seq == proc->expected_sequence) ) {
-        opal_list_remove_item(&proc->frags_cant_match, (opal_list_item_t*)frag);
+    if( frag->hdr.hdr_match.hdr_seq == proc->expected_sequence ) {
+        (void)opal_list_remove_first(&proc->frags_cant_match);
         return frag;
     }
     return NULL;
```
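For readers outside the ob1 code: the check being removed guards against an empty list, because in a sentinel-based ("ghost") list, getting the first element of an empty list returns the sentinel itself. Below is a minimal, self-contained sketch of that pattern, assuming a generic intrusive list rather than the real opal_list API; the patch relies on this call site only being reached when the list is known to be non-empty.

```c
/* Hypothetical sketch of a sentinel-based ("ghost") intrusive list; the
 * names mimic the idea behind opal_list but this is not the real opal API. */
#include <stdio.h>

typedef struct item {
    struct item *next, *prev;
    int seq;                                  /* stand-in for hdr_seq */
} item_t;

typedef struct list {
    item_t ghost;                             /* sentinel: never holds data */
} list_t;

static void    list_init(list_t *l)      { l->ghost.next = l->ghost.prev = &l->ghost; }
static item_t *list_end(list_t *l)       { return &l->ghost; }
static item_t *list_get_first(list_t *l) { return l->ghost.next; }  /* ghost if empty */

static void list_append(list_t *l, item_t *it) {
    it->prev = l->ghost.prev;
    it->next = &l->ghost;
    l->ghost.prev->next = it;
    l->ghost.prev = it;
}

static item_t *list_remove_first(list_t *l) {
    item_t *it = l->ghost.next;
    if (it == &l->ghost) return NULL;         /* empty list */
    l->ghost.next = it->next;
    it->next->prev = &l->ghost;
    return it;
}

int main(void) {
    list_t frags_cant_match;
    list_init(&frags_cant_match);

    /* Empty list: "first" is the ghost, so reading frag->seq without the
     * end-check would dereference the sentinel. That is what the removed
     * comparison against opal_list_get_end() protected against. */
    item_t *frag = list_get_first(&frags_cant_match);
    printf("empty: first == end? %s\n",
           frag == list_end(&frags_cant_match) ? "yes" : "no");

    /* Non-empty list: the end-check is redundant, which is the assumption
     * the patch makes about this call site. */
    item_t pending = { .seq = 7 };
    int expected_sequence = 7;
    list_append(&frags_cant_match, &pending);
    frag = list_get_first(&frags_cant_match);
    if (frag->seq == expected_sequence) {
        (void)list_remove_first(&frags_cant_match);
        printf("matched and removed seq %d\n", frag->seq);
    }
    return 0;
}
```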
I can reproduce this on our machine. I tried the patch and it does not work. I will investigate.
I'm actually puzzled (and a little alarmed) that this issue is just showing up now. If possible, it would be good to understand why this was only just discovered.
So after further digging, here are my findings: according to the call stack, it is stuck in Allreduce -> sendrecv.
Also, I can reproduce this with just 2 processes.
It seems like it does not deadlock; it is just extremely slow. @jsquyres Can you confirm this?
@thananon asked me in an off-issue email to confirm:
Found the problem. @jsquyres Please try this patch; if it works, I will create a PR.

```diff
diff --git a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
index 83b7a44902..111c3f71e2 100644
--- a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
+++ b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
@@ -348,7 +348,7 @@ void mca_pml_ob1_recv_frag_callback_match(mca_btl_base_module_t* btl,
         mca_pml_ob1_recv_frag_match_proc(frag->btl, comm_ptr, proc,
                                          &frag->hdr.hdr_match,
                                          frag->segments, frag->num_segments,
-                                         hdr->hdr_common.hdr_type, frag);
+                                         frag->hdr.hdr_common.hdr_type, frag);
     } else {
         OB1_MATCHING_UNLOCK(&comm->matching_lock);
     }
```
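A hedged illustration of the bug class behind this one-character change (an assumption-based sketch, not the ob1 code): the frag being processed here was queued earlier with its own copy of the match header, while hdr belongs to the fragment whose arrival triggered the processing, so taking the type from hdr can apply the wrong protocol when the two differ.

```c
/* Hedged sketch (not the ob1 code): a previously queued fragment must be
 * handled according to its own saved header type, not the header of the
 * fragment whose arrival triggered the drain. */
#include <stdio.h>

enum frag_type { TYPE_MATCH, TYPE_RNDV };     /* eager match vs. rendezvous (illustrative) */

struct saved_frag {
    enum frag_type type;                      /* copied from the wire header when queued */
    unsigned seq;
};

static void process(const struct saved_frag *f, enum frag_type type_used) {
    /* Handling a rendezvous fragment as an eager match (or vice versa) skips
     * the protocol's next step, which is one way a hang "in change of
     * protocol" can show up. */
    printf("seq %u: stored type %d, processed as %d%s\n",
           f->seq, (int)f->type, (int)type_used,
           f->type != type_used ? "  <-- protocol mismatch" : "");
}

int main(void) {
    /* Two fragments queued out of order earlier; seq 3 switched to rendezvous. */
    struct saved_frag queued[] = { { TYPE_MATCH, 2 }, { TYPE_RNDV, 3 } };
    enum frag_type triggering_type = TYPE_MATCH;   /* header of the frag that just arrived */

    for (unsigned i = 0; i < 2; ++i) {
        process(&queued[i], triggering_type);      /* buggy: reuses the triggering header type */
        process(&queued[i], queued[i].type);       /* fixed: uses the queued frag's own type */
    }
    return 0;
}
```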
Works... sorta. Gather no longer hangs; Bcast now hangs:
Same hang at the same place with the TCP BTL.
I can't reproduce the hang with Bcast on our machine.
Nice catch @thananon.
@thananon and I are working off-issue and will post more info soon.
After a debugging session with @bosilca, we reverted our commits from #4419 and the problem still persists with multiple links. We are still not sure how/why we get the deadlock; it might be a totally unrelated problem. We can even reproduce it with just a pingpong and 5 TCP links. I will dig deeper, but it is certain that my patch above is needed.
Hey, so this is the situation.

```diff
diff --git a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
index 83b7a44902..111c3f71e2 100644
--- a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
+++ b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
@@ -348,7 +348,7 @@ void mca_pml_ob1_recv_frag_callback_match(mca_btl_base_module_t* btl,
         mca_pml_ob1_recv_frag_match_proc(frag->btl, comm_ptr, proc,
                                          &frag->hdr.hdr_match,
                                          frag->segments, frag->num_segments,
-                                         hdr->hdr_common.hdr_type, frag);
+                                         frag->hdr.hdr_common.hdr_type, frag);
     } else {
         OB1_MATCHING_UNLOCK(&comm->matching_lock);
     }
```
Just a heads up, there are definitely some bugs in the TCP multilink code. One of the Amazon devs has been working on #4247, which seems to make the code significantly happier...
The patch I sent can be added to the mix. It's an optimization that prevents the code from comparing the element to the ghost, an operation that is not necessary because we know there must be at least one element in the cant_match list.
@bwbarrett all the multilink issues I was aware of are in the TCP connection setup. Did you guys identify other issues?
@bosilca, they were all in connection setup, but because of the way the TCP BTL biases eager messages to a single BTL, it was showing up as a hang at the crossover from short to rendezvous messages. Not sure it's related, but wanted to avoid tail chasing if possible...
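A hypothetical sketch of why such a bias can hide secondary-link problems until the rendezvous threshold (the link-selection policy and eager limit below are illustrative assumptions, not the actual TCP BTL code):

```c
/* Hypothetical link-selection sketch (not the actual TCP BTL code): eager
 * messages are biased to one link, so the other links are only exercised
 * once large messages start being striped, which is why a bug on a
 * secondary link can first appear at the short->rendezvous crossover. */
#include <stdio.h>
#include <stddef.h>

#define NUM_LINKS   4
#define EAGER_LIMIT (64 * 1024)               /* illustrative threshold, not ompi's default */

static int pick_link(size_t msg_size, size_t chunk_index) {
    if (msg_size <= EAGER_LIMIT)
        return 0;                             /* eager/short: always the first link */
    return (int)(chunk_index % NUM_LINKS);    /* rendezvous: stripe across every link */
}

int main(void) {
    size_t sizes[] = { 4096, 65536, 262144 };
    for (size_t s = 0; s < sizeof(sizes) / sizeof(sizes[0]); ++s) {
        printf("%zu bytes -> links:", sizes[s]);
        for (size_t chunk = 0; chunk < 4; ++chunk)
            printf(" %d", pick_link(sizes[s], chunk));
        printf("\n");
    }
    return 0;
}
```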
master only, not in v3.1.x
This commit fixes open-mpi#4795
- Fixed typo that sometimes causes deadlock in change of protocol.
- Redesigned out of sequence ordering and address the overflow case of sequence number from uint16_t.
Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu>
This commit fixes open-mpi#4795
- Fixed typo that sometimes causes deadlock in change of protocol.
- Redesigned out of sequence ordering and address the overflow case of sequence number from uint16_t.
cherry-picked from: 09cba8b
Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu>
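The uint16_t overflow mentioned in the commit message is the classic sequence-number wraparound: once the counter wraps from 65535 back to 0, a plain integer comparison mis-orders fragments. A common remedy, shown here only as a hedged sketch and not necessarily the exact scheme the fix adopted, is serial-number-style comparison on the signed 16-bit difference.

```c
/* Hedged sketch of wraparound-safe comparison of uint16_t sequence numbers
 * (serial-number arithmetic); not necessarily the scheme used in the ob1 fix. */
#include <stdint.h>
#include <stdio.h>

/* Returns nonzero if sequence number a comes after b, treating the 16-bit
 * space as circular: the signed 16-bit difference is positive when a is
 * "ahead" of b by less than half the sequence space. */
static int seq_after(uint16_t a, uint16_t b) {
    return (int16_t)(uint16_t)(a - b) > 0;
}

int main(void) {
    /* Around the wrap: 2 logically follows 65534 even though 2 < 65534. */
    printf("naive  65534 < 2          : %d\n", 65534 < 2);           /* 0: mis-ordered */
    printf("serial seq_after(2, 65534): %d\n", seq_after(2, 65534)); /* 1: correct     */
    printf("serial seq_after(65534, 2): %d\n", seq_after(65534, 2)); /* 0              */
    return 0;
}
```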
Git bisect shows that 409638b from @bosilca and @thananon is the first bad commit (it changed how ob1 handles out-of-order receives) that is causing MPI_Gather() in IMB to hang for me with both the TCP and usNIC BTLs. This is 100% reproducible for me. When I run IMB across 2 servers (with ppn=16), it will hang in Gather -- I'm pretty sure it hangs once we transition into the long protocol (i.e., after the 64K results are shown for TCP and after the 16K results are shown for usNIC):
Note that I have 3 usNIC interfaces and 4 IP interfaces. Hence, receiving frags out of order is highly likely. This might be necessary to reproduce the issue...?
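For context on why multiple interfaces matter here: with several links, fragments can arrive out of sequence, and the receiver has to park anything that is not the next expected sequence number until the gap is filled, which is exactly the cant_match path touched by the bisected commit and the patches above. A simplified, self-contained sketch of that reordering idea (hypothetical structures, not ob1's):

```c
/* Simplified reordering sketch (hypothetical structures, not ob1's
 * cant_match machinery): fragments whose sequence number is not the next
 * expected one are parked until the gap is filled, then drained in order. */
#include <stdint.h>
#include <stdio.h>

#define WINDOW 8                              /* max out-of-order distance parked here */

static int      have[WINDOW];
static uint16_t parked[WINDOW];
static uint16_t expected = 0;

static void deliver(uint16_t seq) { printf("delivered seq %u\n", (unsigned)seq); }

static void on_arrival(uint16_t seq) {
    if (seq != expected) {                    /* out of order: park it */
        have[seq % WINDOW]   = 1;
        parked[seq % WINDOW] = seq;
        printf("parked    seq %u (expected %u)\n", (unsigned)seq, (unsigned)expected);
        return;
    }
    deliver(seq);
    expected++;
    /* Drain anything that became in-order now that the gap is filled. */
    while (have[expected % WINDOW] && parked[expected % WINDOW] == expected) {
        have[expected % WINDOW] = 0;
        deliver(expected);
        expected++;
    }
}

int main(void) {
    /* An arrival order a multi-link run could plausibly produce: 1 and 2 beat 0. */
    uint16_t order[] = { 1, 2, 0, 3 };
    for (int i = 0; i < 4; ++i)
        on_arrival(order[i]);
    return 0;
}
```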
Also note that this is only happening on master -- I checked the timeline: 409638b was committed to master after v3.1 branched, and was not PR'ed over.
@bosilca @thananon What additional information can I get to you to help diagnose what is going wrong?