master: MPI_Gather BTL hang (ob1 problem) #4795

Closed
jsquyres opened this issue Feb 6, 2018 · 23 comments · Fixed by #4852

jsquyres commented Feb 6, 2018

Git bisect shows that 409638b from @bosilca and @thananon (which changed how ob1 handles out-of-order receives) is the first bad commit; it causes MPI_Gather() in IMB to hang for me with both the TCP and usNIC BTLs.

This is 100% reproducible for me. When I run IMB across 2 servers (with ppn=16), it will hang in Gather -- I'm pretty sure it hangs once we transition into the long protocol (i.e., after the 64K results are shown for TCP and after the 16K results are shown for usNIC):

$ mpirun --mca btl usnic,vader,self IMB-MPI1 -npmin 32 Gather
 benchmarks to run Gather 
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V3.2.4, MPI-1 part    
#---------------------------------------------------
...
#----------------------------------------------------------------
# Benchmarking Gather 
# #processes = 32 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.04         0.04         0.04
            1         1000        14.78        14.83        14.80
            2         1000        14.96        15.01        14.99
            4         1000        15.05        15.12        15.09
            8         1000        15.32        15.38        15.35
           16         1000        15.65        15.71        15.68
           32         1000        16.18        16.24        16.21
           64         1000        18.18        18.24        18.21
          128         1000        20.81        20.87        20.84
          256         1000        24.71        24.80        24.76
          512         1000        34.46        34.62        34.51
         1024         1000        13.87        14.19        14.04
         2048         1000        17.38        17.83        17.62
         4096         1000        49.83        50.23        50.02
         8192         1000       269.86       270.38       270.16
        16384         1000       315.06       315.69       315.43
<hang>
$ mpirun --mca btl tcp,vader,self IMB-MPI1 -npmin 32 Gather
 benchmarks to run Gather 
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V3.2.4, MPI-1 part    
#---------------------------------------------------
...
#----------------------------------------------------------------
# Benchmarking Gather 
# #processes = 32 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.04         0.07         0.04
            1         1000        46.73        46.90        46.82
            2         1000        46.63        46.80        46.72
            4         1000        46.98        47.16        47.06
            8         1000        48.44        48.61        48.52
           16         1000        51.35        51.57        51.46
           32         1000        53.16        53.43        53.33
           64         1000        55.42        55.66        55.54
          128         1000        59.01        59.20        59.10
          256         1000        65.72        65.96        65.83
          512         1000        79.08        79.52        79.26
         1024         1000        67.23        68.24        67.73
         2048         1000        73.87        75.06        74.50
         4096         1000       113.16       114.54       114.16
         8192         1000      1018.50      1020.58      1019.66
        16384         1000      1039.11      1041.27      1040.34
        32768         1000      1285.78      1288.46      1287.30
        65536          640      1881.22      1884.78      1883.41
<hang>

Note that I have 3 usNIC interfaces and 4 IP interfaces. Hence, receiving frags out of order is highly likely. This might be necessary to reproduce the issue...?
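
For context: conceptually, ob1 stamps each match fragment with a per-peer sequence number and holds early arrivals until the expected fragment shows up. Here is a minimal sketch of that idea with hypothetical names and types; it is illustrative only, not the actual ob1 code (which lives in ompi/mca/pml/ob1/pml_ob1_recvfrag.c and uses opal_list_t for the cant_match queue):

/* Illustrative sketch of sequence-ordered fragment matching. */
#include <stdint.h>

typedef struct frag {
    uint16_t     seq;      /* hdr_seq stamped by the sender */
    struct frag *next;
    /* ... match header, segments ... */
} frag_t;

typedef struct peer {
    uint16_t expected_seq; /* next sequence we are allowed to match */
    frag_t  *cant_match;   /* frags that arrived early (out of order) */
} peer_t;

/* Called for every incoming match fragment from this peer. */
static void handle_frag(peer_t *peer, frag_t *frag)
{
    if (frag->seq != peer->expected_seq) {
        /* A faster interface delivered this frag early: stash it until
         * the missing, in-order frag arrives. */
        frag->next = peer->cant_match;
        peer->cant_match = frag;
        return;
    }

    /* In order: match it, then drain any stashed frags that became next.
     * (The uint16_t sequence wraps around; exact-match comparison still
     * works across the wrap.) */
    /* match_one(frag); */
    peer->expected_seq++;

    int progress = 1;
    while (progress) {
        progress = 0;
        for (frag_t **fp = &peer->cant_match; *fp != NULL; fp = &(*fp)->next) {
            if ((*fp)->seq == peer->expected_seq) {
                frag_t *f = *fp;
                *fp = f->next;
                /* match_one(f); */
                peer->expected_seq++;
                progress = 1;
                break;
            }
        }
    }
}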

Also note that this is only happening on master -- I checked the timeline: 409638b was committed to master after v3.1 branched, and was not PR'ed over.

@bosilca @thananon What additional information can I get to you to help diagnose what is going wrong?

bosilca commented Feb 7, 2018

@jsquyres I can't replicate this on our machine, despite using multiple TCP links over multiple interfaces. However, while reviewing the code I may have spotted an issue. Can you try the following patch:

diff --git a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
index 83b7a44902..57e802c969 100644
--- a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
+++ b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
@@ -181,9 +181,8 @@ static inline mca_pml_ob1_recv_frag_t* check_cantmatch_for_match(mca_pml_ob1_com
     mca_pml_ob1_recv_frag_t *frag = NULL;

     frag = (mca_pml_ob1_recv_frag_t*)opal_list_get_first(&proc->frags_cant_match);
-    if( (opal_list_get_end(&proc->frags_cant_match) != (opal_list_item_t*)frag) &&
-        (frag->hdr.hdr_match.hdr_seq == proc->expected_sequence) ) {
-        opal_list_remove_item(&proc->frags_cant_match, (opal_list_item_t*)frag);
+    if( frag->hdr.hdr_match.hdr_seq == proc->expected_sequence ) {
+        (void)opal_list_remove_first(&proc->frags_cant_match);
         return frag;
     }
     return NULL;

thananon commented Feb 7, 2018

I can reproduce this on our machine with
mpirun -np 32 --map-by node --bind-to core -mca btl tcp,self ./IMB-MPI1 -npmin 32 Gather

I tried the patch, but it does not fix the hang. I will investigate.

jsquyres commented Feb 7, 2018

I confirm what @thananon says: even with @bosilca's patch, I still get the same behavior (i.e., IMB Gather hangs).

jsquyres commented Feb 7, 2018

I'm actually puzzled (and a little alarmed) that this issue is just showing up now. If possible, it would be good to understand why this was only just discovered.

thananon commented Feb 7, 2018

So after further digging, here are my findings:

  • It does not happen with 1 interface. (If you select only one device, it's fine.)
  • It happens at the change of protocol, as @jsquyres reported.
  • @bosilca suggested that I do a PML dump; the call stack and dump are below.

Here is the call stack:

#0  0x00007ffff7538a3d in poll () from /usr/lib64/libc.so.6
#1  0x00007ffff6e87363 in poll_dispatch (base=0x662c30, tv=0x7fffffffb760) at poll.c:165
#2  0x00007ffff6e7e95b in opal_libevent2022_event_base_loop (base=0x662c30, flags=2) at event.c:1630
#3  0x00007ffff6e16329 in opal_progress_events () at runtime/opal_progress.c:189
#4  0x00007ffff6e163e2 in opal_progress () at runtime/opal_progress.c:244
#5  0x00007ffff7a85f42 in ompi_request_wait_completion (req=0x796880) at ../ompi/request/request.h:415
#6  0x00007ffff7a85f80 in ompi_request_default_wait (req_ptr=0x7fffffffb940, status=0x7fffffffb920) at request/req_wait.c:42
#7  0x00007ffff7b24f6d in ompi_coll_base_sendrecv_actual (sendbuf=0x7a1340, scount=1, sdatatype=0x613180 <ompi_mpi_double>, dest=1,
    stag=-12, recvbuf=0x7fffffffbba0, rcount=1, rdatatype=0x613180 <ompi_mpi_double>, source=1, rtag=-12, comm=0x7a13c0, status=0x0)
    at base/coll_base_util.c:59
#8  0x00007ffff7b259e1 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0x7fffffffbba8, rbuf=0x7fffffffbba0, count=1,
    dtype=0x613180 <ompi_mpi_double>, op=0x612580 <ompi_mpi_op_max>, comm=0x7a13c0, module=0x7abfa0) at base/coll_base_allreduce.c:217
#9  0x00007fffe3176a48 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x7fffffffbba8, rbuf=0x7fffffffbba0, count=1,
    dtype=0x613180 <ompi_mpi_double>, op=0x612580 <ompi_mpi_op_max>, comm=0x7a13c0, module=0x7abfa0) at coll_tuned_decision_fixed.c:66
#10 0x00007ffff7aa3469 in PMPI_Allreduce (sendbuf=0x7fffffffbba8, recvbuf=0x7fffffffbba0, count=1,
    datatype=0x613180 <ompi_mpi_double>, op=0x612580 <ompi_mpi_op_max>, comm=0x7a13c0) at pallreduce.c:110
#11 0x0000000000406661 in IMB_init_buffers_iter ()

So it is stuck in Allreduce -> sendrecv.
Here is what I get from the PML dump on the comm:

[a00.alembert:11577] Communicator MPI COMMUNICATOR 3 SPLIT FROM 0 [0x7a13c0](3) rank 0 recv_seq 24111 num_procs 2 last_probed 0
[a00.alembert:11577] [Rank 1] expected_seq 24111 ompi_proc 0x79c580 send_seq 24111
[a00.alembert:11577] expected specific receives
[a00.alembert:11577] req 0x796880 peer 1 tag -12 addr 0x7fffffffbba0 count 1 datatype MPI_DOUBLE [0x613180] [ ] req_seq 24110

Also, I can reproduce this with just 2 processes.
I will look further.
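
(Side note on reproducing this kind of dump: one way -- an assumption on my part about the mechanism, and the exact entry point may differ by version -- is to attach gdb to the stuck rank and call ob1's dump routine on the communicator visible in the backtrace, e.g. from frame 7 above:)

(gdb) frame 7                         # ompi_coll_base_sendrecv_actual; "comm" is in scope
(gdb) call mca_pml_ob1_dump(comm, 1)  # assumed signature: (ompi_communicator_t *comm, int verbose)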

thananon commented Feb 7, 2018

It seems like it is not actually a deadlock, just extremely slow. @jsquyres, can you confirm this?

# Gather

#----------------------------------------------------------------
# Benchmarking Gather
# #processes = 2
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.02         0.02         0.02
            1         1000         7.97         7.97         7.97
            2         1000         7.67         7.68         7.67
            4         1000         7.57         7.57         7.57
            8         1000         7.44         7.44         7.44
           16         1000         7.33         7.33         7.33
           32         1000         6.52         6.52         6.52
           64         1000         6.48         6.49         6.48
          128         1000         6.54         6.54         6.54
          256         1000         6.61         6.61         6.61
          512         1000         6.76         6.76         6.76
         1024         1000         7.04         7.04         7.04
         2048         1000         9.27         9.28         9.28
         4096         1000        10.47        10.47        10.47
         8192         1000        15.62        15.62        15.62
        16384         1000        20.99        21.00        20.99
        32768         1000        32.41        32.42        32.42
        65536          640        50.97        51.03        51.00
       131072          320       923.44       925.99       924.72
  • It does not happen with multiple links; it only happens with multiple interfaces.
  • It does not happen if the eager size is small enough (changed via an MCA parameter; 32k is fine -- see the example command after this list).
  • Again, it is stuck in the initialization phase of IMB Gather, inside MPI_Allreduce().
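
For reference, the eager-limit experiment mentioned above can be run like this (mirroring my earlier mpirun line; btl_tcp_eager_limit is the TCP BTL's eager-size parameter, and 32768 is the "32k" value that worked for me):

$ mpirun -np 32 --map-by node --bind-to core --mca btl tcp,self --mca btl_tcp_eager_limit 32768 ./IMB-MPI1 -npmin 32 Gather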

jsquyres commented Feb 7, 2018

@thananon asked me in an off-issue email to confirm:

  1. Does the bad behavior occur with only 1 IP/usNIC interface?
    • For the TCP BTL: I confirm that it works fine with mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include vic20 .... But it hangs if I add in a 2nd IP interface, e.g., ... --mca btl_tcp_if_include vic20,vic21 ... (I let it sit there for 15+ minutes before declaring it "hung"; full command lines are assembled after this list).
    • For the usNIC BTL: I confirm that it works fine with a single usNIC interface. It also seems to work with 2 usNIC interfaces. However, it hangs if I put in 3 interfaces; i.e., ... --mca btl_usnic_if_include usnic_0,usnic_1,usnic_2 ... (I let it sit there for 45+ minutes before declaring it "hung").
  2. What happens if we lower the eager limit?
    • For the TCP BTL: Lowering btl_tcp_eager_limit to 16384, 8192, and 4096 causes the hang to occur after IMB outputs the result for 16384, 8192, and 4096, respectively.
    • For the usNIC BTL: Lowering btl_usnic_eager_limit to 8192 and 4096 causes the hang to occur after IMB outputs the result for 8192 and 4096, respectively.
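
For completeness, here are the hanging configurations from item 1, assembled into full command lines (assuming the same IMB-MPI1 invocation as my original report; the interface names are from my setup):

$ mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include vic20,vic21 IMB-MPI1 -npmin 32 Gather
$ mpirun --mca btl usnic,vader,self --mca btl_usnic_if_include usnic_0,usnic_1,usnic_2 IMB-MPI1 -npmin 32 Gather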

thananon commented Feb 7, 2018

Thank you, @jsquyres.

  • Eager limit
    • I can reproduce the problem with the eager sizes you used, but oddly, if I set the limit to 56, 65, 2000, 32000, or 19000, it does not hang.

I will have a discussion with @bosilca later today about this.

thananon commented Feb 8, 2018

Found the problem.

@jsquyres Please try this patch. If it works I will create a PR.

diff --git a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
index 83b7a44902..111c3f71e2 100644
--- a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
+++ b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
@@ -348,7 +348,7 @@ void mca_pml_ob1_recv_frag_callback_match(mca_btl_base_module_t* btl,
             mca_pml_ob1_recv_frag_match_proc(frag->btl, comm_ptr, proc,
                                              &frag->hdr.hdr_match,
                                              frag->segments, frag->num_segments,
-                                             hdr->hdr_common.hdr_type, frag);
+                                             frag->hdr.hdr_common.hdr_type, frag);
         } else {
             OB1_MATCHING_UNLOCK(&comm->matching_lock);
         }
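
The one-character diff is easy to miss, so here is a hedged sketch (hypothetical names, not the actual ob1 structures) of the bug class it fixes: once a fragment has been queued with a copy of its header, later processing has to read that copy, not a pointer into the transient receive buffer that was only valid during the original callback.

#include <stdint.h>

typedef struct hdr  { uint8_t type; uint16_t seq; } hdr_t;
typedef struct frag { hdr_t hdr; /* header copied when the frag was queued */ } frag_t;

static void process_queued(frag_t *saved, const hdr_t *transient_hdr)
{
    /* Pre-patch pattern (bug): transient_hdr may describe a different,
     * earlier fragment -- or memory the BTL has already recycled -- so the
     * wrong hdr_type (e.g. MATCH vs. RNDV) drives the protocol. */
    /* uint8_t type = transient_hdr->type; */

    /* Post-patch pattern (fix): read the header stored with the frag. */
    uint8_t type = saved->hdr.type;

    (void)type;           /* protocol dispatch would go here */
    (void)transient_hdr;
}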

jsquyres commented Feb 8, 2018

Works... sorta. ☹️

Gather no longer hangs; Bcast now hangs:

$ mpirun --mca btl usnic,vader,self IMB-MPI1 -npmin 32 Bcast
...
#----------------------------------------------------------------
# Benchmarking Bcast 
# #processes = 32 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.03         0.03         0.03
            1         1000         7.60         7.63         7.61
            2         1000         7.59         7.69         7.61
            4         1000         7.58         7.60         7.59
            8         1000         7.55         7.58         7.57
           16         1000         7.65         7.68         7.66
           32         1000         7.75         7.78         7.76
           64         1000         7.73         7.75         7.74
          128         1000         7.81         7.84         7.83
          256         1000         7.90         7.92         7.91
          512         1000        10.37        10.39        10.38
         1024         1000        11.76        11.78        11.77
         2048         1000        19.29        19.33        19.32
         4096         1000        27.26        27.31        27.29
         8192         1000        50.04        50.07        50.06
        16384         1000        84.62        84.67        84.65
        32768         1000       153.25       153.32       153.29
        65536          640       296.51       296.59       296.55
       131072          320       564.00       564.23       564.17
<hang>

Same hang at the same place with the TCP BTL.

thananon commented Feb 9, 2018

I can't reproduce the hang with Bcast on our machine.

# Calling sequence was:

# ./IMB-MPI1 -npmin 32 Bcast

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# Bcast

#----------------------------------------------------------------
# Benchmarking Bcast
# #processes = 32
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.03         0.05         0.04
            1         1000        15.17        15.23        15.19
            2         1000        15.07        15.12        15.10
            4         1000        14.85        14.88        14.86
            8         1000        14.89        14.90        14.89
           16         1000        14.94        14.99        14.96
           32         1000        15.25        15.28        15.26
           64         1000        15.07        15.11        15.09
          128         1000        15.69        15.75        15.72
          256         1000        14.88        14.92        14.90
          512         1000        19.37        19.43        19.40
         1024         1000        20.12        20.18        20.15
         2048         1000        69.47        69.60        69.55
         4096         1000       116.52       116.72       116.63
         8192         1000       137.53       137.75       137.70
        16384         1000       191.18       191.46       191.37
        32768         1000       317.89       318.31       318.20
        65536          640      2727.14      2728.38      2728.02
       131072          320     12292.67     12336.91     12319.55
       262144          160     21459.86     21611.05     21554.05
       524288           80     10473.63     10869.19     10794.44
      1048576           40     30942.72     34036.88     33350.64
      2097152           20     43018.00     62538.02     59789.51
      4194304           10     87325.51    131872.12    121733.12


# All processes entering MPI_Finalize

bosilca commented Feb 9, 2018

Nice catch @thananon.

jsquyres commented Feb 9, 2018

@thananon and I are working off-issue and will post more info soon.

thananon commented Feb 9, 2018

@jsquyres

After a debugging session with @bosilca, we reverted our commits from #4419 and the problem still persists with multiple links. We are still not sure how/why we get the deadlock; it might be a totally unrelated problem. We can even reproduce it with just PingPong and 5 TCP links.

I will dig deeper but it is certain that my patch above is needed.

jsquyres commented Feb 9, 2018

@thananon @bosilca Do you want to create a PR with the patch we exchanged (off issue) earlier today? It seems like that patch is definitely needed, at a minimum.

@thananon

Hey, so this is the situation.

  • This patch is needed for sure.
diff --git a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
index 83b7a44902..111c3f71e2 100644
--- a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
+++ b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
@@ -348,7 +348,7 @@ void mca_pml_ob1_recv_frag_callback_match(mca_btl_base_module_t* btl,
             mca_pml_ob1_recv_frag_match_proc(frag->btl, comm_ptr, proc,
                                              &frag->hdr.hdr_match,
                                              frag->segments, frag->num_segments,
-                                             hdr->hdr_common.hdr_type, frag);
+                                             frag->hdr.hdr_common.hdr_type, frag);
         } else {
             OB1_MATCHING_UNLOCK(&comm->matching_lock);
         }

@bwbarrett

Just a heads up: there are definitely some bugs in the TCP multilink code. One of the Amazon devs has been working on #4247, which seems to make the code significantly happier...

bosilca commented Feb 12, 2018

The patch I sent can be added to the mix. It's an optimization that avoids comparing the element against the ghost, a check that is unnecessary here because we know there must be at least one element in the cant_match list.
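
In other words (a minimal sketch with hypothetical names, not the real opal_list implementation): opal_list-style lists use a sentinel ("ghost") element, so getting the first item of an empty list returns the ghost and general-purpose callers have to compare against the end marker; when the caller already knows the list is non-empty, that comparison is dead weight.

#include <stddef.h>

typedef struct item { struct item *next, *prev; } item_t;
typedef struct list { item_t ghost; } list_t;            /* ghost == sentinel */

static void    list_init(list_t *l)  { l->ghost.next = l->ghost.prev = &l->ghost; }
static item_t *list_end(list_t *l)   { return &l->ghost; }
static item_t *list_first(list_t *l) { return l->ghost.next; } /* == ghost when empty */

static item_t *list_remove_first(list_t *l)
{
    item_t *it = l->ghost.next;
    l->ghost.next = it->next;
    it->next->prev = &l->ghost;
    return it;
}

/* General-purpose caller: must guard against the ghost, because
 * list_first() returns it when the list is empty. */
static item_t *pop_if_any(list_t *l)
{
    return (list_first(l) == list_end(l)) ? NULL : list_remove_first(l);
}

/* The optimization: when the list is known to be non-empty (there is always
 * at least one frag on frags_cant_match at that call site), the ghost
 * comparison is unnecessary and the first element can be taken directly. */
static item_t *pop_known_nonempty(list_t *l)
{
    return list_remove_first(l);
}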

bosilca commented Feb 12, 2018

@bwbarrett, all the multilink issues I was aware of are in the TCP connection setup. Did you identify other issues?

@bwbarrett

@bosilca, they were all in connection setup, but because of the way the TCP BTL biases eager messages to a single BTL, it was showing up as a hang at the crossover from short to rendezvous messages. Not sure it's related, but I wanted to avoid tail chasing if possible...

@thananon

I can confirm that the TCP hang with multilink is unrelated to #4419.

@jsquyres I might need your help tracking down the problem from the usNIC side.

@gpaulsen

master only, not in v3.1.x

thananon pushed a commit to thananon/ompi that referenced this issue Feb 22, 2018
This commit fixes open-mpi#4795

- Fixed typo that sometimes causes deadlock in change of protocol.
- Redesigned out of sequence ordering and address the overflow case of
  sequence number from uint16_t.

Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu>

thananon commented Feb 22, 2018

PR #4852 passed all tests on our machine. @jsquyres, please try it on Cisco's.
If everything is good, I will create a PR for v3.1.x.

thananon pushed a commit to thananon/ompi that referenced this issue Feb 28, 2018
This commit fixes open-mpi#4795

- Fixed typo that sometimes causes deadlock in change of protocol.
- Redesigned out of sequence ordering and address the overflow case of
  sequence number from uint16_t.

cherry-picked from: 09cba8b
Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu>