master: MPI_Gather BTL hang (ob1 problem) #4795

Closed
jsquyres opened this issue Feb 6, 2018 · 23 comments · Fixed by #4852

jsquyres commented Feb 6, 2018

Git bisect shows that 409638b from @bosilca and @thananon (which changed how ob1 handles out-of-order receives) is the first bad commit; it causes MPI_Gather() in IMB to hang for me with both the TCP and usNIC BTLs.

This is 100% reproducible for me. When I run IMB across 2 servers (with ppn=16), it will hang in Gather -- I'm pretty sure it hangs once we transition into the long protocol (i.e., after the 64K results are shown for TCP and after the 16K results are shown for usNIC):

$ mpirun --mca btl usnic,vader,self IMB-MPI1 -npmin 32 Gather
 benchmarks to run Gather 
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V3.2.4, MPI-1 part    
#---------------------------------------------------
...
#----------------------------------------------------------------
# Benchmarking Gather 
# #processes = 32 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.04         0.04         0.04
            1         1000        14.78        14.83        14.80
            2         1000        14.96        15.01        14.99
            4         1000        15.05        15.12        15.09
            8         1000        15.32        15.38        15.35
           16         1000        15.65        15.71        15.68
           32         1000        16.18        16.24        16.21
           64         1000        18.18        18.24        18.21
          128         1000        20.81        20.87        20.84
          256         1000        24.71        24.80        24.76
          512         1000        34.46        34.62        34.51
         1024         1000        13.87        14.19        14.04
         2048         1000        17.38        17.83        17.62
         4096         1000        49.83        50.23        50.02
         8192         1000       269.86       270.38       270.16
        16384         1000       315.06       315.69       315.43
<hang>
$ mpirun --mca btl tcp,vader,self IMB-MPI1 -npmin 32 Gather
 benchmarks to run Gather 
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V3.2.4, MPI-1 part    
#---------------------------------------------------
...
#----------------------------------------------------------------
# Benchmarking Gather 
# #processes = 32 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.04         0.07         0.04
            1         1000        46.73        46.90        46.82
            2         1000        46.63        46.80        46.72
            4         1000        46.98        47.16        47.06
            8         1000        48.44        48.61        48.52
           16         1000        51.35        51.57        51.46
           32         1000        53.16        53.43        53.33
           64         1000        55.42        55.66        55.54
          128         1000        59.01        59.20        59.10
          256         1000        65.72        65.96        65.83
          512         1000        79.08        79.52        79.26
         1024         1000        67.23        68.24        67.73
         2048         1000        73.87        75.06        74.50
         4096         1000       113.16       114.54       114.16
         8192         1000      1018.50      1020.58      1019.66
        16384         1000      1039.11      1041.27      1040.34
        32768         1000      1285.78      1288.46      1287.30
        65536          640      1881.22      1884.78      1883.41
<hang>

Note that I have 3 usNIC interfaces and 4 IP interfaces. Hence, receiving frags out of order is highly likely. This might be necessary to reproduce the issue...?
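
For context: conceptually, ob1 stamps each match fragment with a per-peer sequence number and holds early arrivals until the expected fragment shows up. Here is a minimal sketch of that idea with hypothetical names and types; it is illustrative only, not the actual ob1 code (which lives in ompi/mca/pml/ob1/pml_ob1_recvfrag.c and uses opal_list_t for the cant_match queue):

/* Illustrative sketch of sequence-ordered fragment matching. */
#include <stdint.h>

typedef struct frag {
    uint16_t     seq;      /* hdr_seq stamped by the sender */
    struct frag *next;
    /* ... match header, segments ... */
} frag_t;

typedef struct peer {
    uint16_t expected_seq; /* next sequence we are allowed to match */
    frag_t  *cant_match;   /* frags that arrived early (out of order) */
} peer_t;

/* Called for every incoming match fragment from this peer. */
static void handle_frag(peer_t *peer, frag_t *frag)
{
    if (frag->seq != peer->expected_seq) {
        /* A faster interface delivered this frag early: stash it until
         * the missing, in-order frag arrives. */
        frag->next = peer->cant_match;
        peer->cant_match = frag;
        return;
    }

    /* In order: match it, then drain any stashed frags that became next.
     * (The uint16_t sequence wraps around; exact-match comparison still
     * works across the wrap.) */
    /* match_one(frag); */
    peer->expected_seq++;

    int progress = 1;
    while (progress) {
        progress = 0;
        for (frag_t **fp = &peer->cant_match; *fp != NULL; fp = &(*fp)->next) {
            if ((*fp)->seq == peer->expected_seq) {
                frag_t *f = *fp;
                *fp = f->next;
                /* match_one(f); */
                peer->expected_seq++;
                progress = 1;
                break;
            }
        }
    }
}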

Also note that this is only happening on master -- I checked the timeline: 409638b was committed to master after v3.1 branched, and was not PR'ed over.

@bosilca @thananon What additional information can I get to you to help diagnose what is going wrong?

bosilca commented Feb 7, 2018

@jsquyres I can't replicate this on our machine, despite using multiple TCP links over multiple interfaces. However, while reviewing the code I may have spotted an issue. Can you try the following patch:

diff --git a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
index 83b7a44902..57e802c969 100644
--- a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
+++ b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
@@ -181,9 +181,8 @@ static inline mca_pml_ob1_recv_frag_t* check_cantmatch_for_match(mca_pml_ob1_com
     mca_pml_ob1_recv_frag_t *frag = NULL;

     frag = (mca_pml_ob1_recv_frag_t*)opal_list_get_first(&proc->frags_cant_match);
-    if( (opal_list_get_end(&proc->frags_cant_match) != (opal_list_item_t*)frag) &&
-        (frag->hdr.hdr_match.hdr_seq == proc->expected_sequence) ) {
-        opal_list_remove_item(&proc->frags_cant_match, (opal_list_item_t*)frag);
+    if( frag->hdr.hdr_match.hdr_seq == proc->expected_sequence ) {
+        (void)opal_list_remove_first(&proc->frags_cant_match);
         return frag;
     }
     return NULL;

thananon commented Feb 7, 2018

I can reproduce this on our machine with
mpirun -np 32 --map-by node --bind-to core -mca btl tcp,self ./IMB-MPI1 -npmin 32 Gather

I tried the patch, but it does not fix the hang. I will investigate.

jsquyres commented Feb 7, 2018

I confirm what @thananon says: even with @bosilca's patch, I still get the same behavior (i.e., IMB Gather hangs).

jsquyres commented Feb 7, 2018

I'm actually puzzled (and a little alarmed) that this issue is just showing up now. If possible, it would be good to understand why this was only just discovered.

thananon commented Feb 7, 2018

So after further digging, here are my findings:

  • It does not happen with 1 interface. (If you select only one device, it's fine.)
  • It happens at the change of protocol, as @jsquyres reported.
  • @bosilca suggested that I do a PML dump; the call stack and dump are below.

Here is the call stack:

#0  0x00007ffff7538a3d in poll () from /usr/lib64/libc.so.6
#1  0x00007ffff6e87363 in poll_dispatch (base=0x662c30, tv=0x7fffffffb760) at poll.c:165
#2  0x00007ffff6e7e95b in opal_libevent2022_event_base_loop (base=0x662c30, flags=2) at event.c:1630
#3  0x00007ffff6e16329 in opal_progress_events () at runtime/opal_progress.c:189
#4  0x00007ffff6e163e2 in opal_progress () at runtime/opal_progress.c:244
#5  0x00007ffff7a85f42 in ompi_request_wait_completion (req=0x796880) at ../ompi/request/request.h:415
#6  0x00007ffff7a85f80 in ompi_request_default_wait (req_ptr=0x7fffffffb940, status=0x7fffffffb920) at request/req_wait.c:42
#7  0x00007ffff7b24f6d in ompi_coll_base_sendrecv_actual (sendbuf=0x7a1340, scount=1, sdatatype=0x613180 <ompi_mpi_double>, dest=1,
    stag=-12, recvbuf=0x7fffffffbba0, rcount=1, rdatatype=0x613180 <ompi_mpi_double>, source=1, rtag=-12, comm=0x7a13c0, status=0x0)
    at base/coll_base_util.c:59
#8  0x00007ffff7b259e1 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0x7fffffffbba8, rbuf=0x7fffffffbba0, count=1,
    dtype=0x613180 <ompi_mpi_double>, op=0x612580 <ompi_mpi_op_max>, comm=0x7a13c0, module=0x7abfa0) at base/coll_base_allreduce.c:217
#9  0x00007fffe3176a48 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x7fffffffbba8, rbuf=0x7fffffffbba0, count=1,
    dtype=0x613180 <ompi_mpi_double>, op=0x612580 <ompi_mpi_op_max>, comm=0x7a13c0, module=0x7abfa0) at coll_tuned_decision_fixed.c:66
#10 0x00007ffff7aa3469 in PMPI_Allreduce (sendbuf=0x7fffffffbba8, recvbuf=0x7fffffffbba0, count=1,
    datatype=0x613180 <ompi_mpi_double>, op=0x612580 <ompi_mpi_op_max>, comm=0x7a13c0) at pallreduce.c:110
#11 0x0000000000406661 in IMB_init_buffers_iter ()

So it is stuck in Allreduce -> sendrecv.
Here is what I get from the PML dump on the comm:

[a00.alembert:11577] Communicator MPI COMMUNICATOR 3 SPLIT FROM 0 [0x7a13c0](3) rank 0 recv_seq 24111 num_procs 2 last_probed 0
[a00.alembert:11577] [Rank 1] expected_seq 24111 ompi_proc 0x79c580 send_seq 24111
[a00.alembert:11577] expected specific receives
[a00.alembert:11577] req 0x796880 peer 1 tag -12 addr 0x7fffffffbba0 count 1 datatype MPI_DOUBLE [0x613180] [ ] req_seq 24110

Also, I can reproduce this with just 2 processes.
I will look further.
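
(Side note on reproducing this kind of dump: one way -- an assumption on my part about the mechanism, and the exact entry point may differ by version -- is to attach gdb to the stuck rank and call ob1's dump routine on the communicator visible in the backtrace, e.g. from frame 7 above:)

(gdb) frame 7                         # ompi_coll_base_sendrecv_actual; "comm" is in scope
(gdb) call mca_pml_ob1_dump(comm, 1)  # assumed signature: (ompi_communicator_t *comm, int verbose)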

thananon commented Feb 7, 2018

It seems like it is not actually a deadlock, just extremely slow. @jsquyres, can you confirm this?

# Gather

#----------------------------------------------------------------
# Benchmarking Gather
# #processes = 2
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.02         0.02         0.02
            1         1000         7.97         7.97         7.97
            2         1000         7.67         7.68         7.67
            4         1000         7.57         7.57         7.57
            8         1000         7.44         7.44         7.44
           16         1000         7.33         7.33         7.33
           32         1000         6.52         6.52         6.52
           64         1000         6.48         6.49         6.48
          128         1000         6.54         6.54         6.54
          256         1000         6.61         6.61         6.61
          512         1000         6.76         6.76         6.76
         1024         1000         7.04         7.04         7.04
         2048         1000         9.27         9.28         9.28
         4096         1000        10.47        10.47        10.47
         8192         1000        15.62        15.62        15.62
        16384         1000        20.99        21.00        20.99
        32768         1000        32.41        32.42        32.42
        65536          640        50.97        51.03        51.00
       131072          320       923.44       925.99       924.72
  • It does not happen with multiple links; it only happens with multiple interfaces.
  • It does not happen if the eager size is small enough (changed via an MCA parameter; 32k is fine -- see the example command after this list).
  • Again, it is stuck in the initialization phase of IMB Gather, inside MPI_Allreduce().
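
For reference, the eager-limit experiment mentioned above can be run like this (mirroring my earlier mpirun line; btl_tcp_eager_limit is the TCP BTL's eager-size parameter, and 32768 is the "32k" value that worked for me):

$ mpirun -np 32 --map-by node --bind-to core --mca btl tcp,self --mca btl_tcp_eager_limit 32768 ./IMB-MPI1 -npmin 32 Gather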

jsquyres commented Feb 7, 2018

@thananon asked me in an off-issue email to confirm:

  1. Does the bad behavior occur with only 1 IP/usNIC interface?
    • For the TCP BTL: I confirm that it works fine with mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include vic20 .... But it hangs if I add in a 2nd IP interface, e.g., ... --mca btl_tcp_if_include vic20,vic21 ... (I let it sit there for 15+ minutes before declaring it "hung"; full command lines are assembled after this list).
    • For the usNIC BTL: I confirm that it works fine with a single usNIC interface. It also seems to work with 2 usNIC interfaces. However, it hangs if I put in 3 interfaces; i.e., ... --mca btl_usnic_if_include usnic_0,usnic_1,usnic_2 ... (I let it sit there for 45+ minutes before declaring it "hung").
  2. What happens if we lower the eager limit?
    • For the TCP BTL: Lowering btl_tcp_eager_limit to 16384, 8192, and 4096 causes the hang to occur after IMB outputs the result for 16384, 8192, and 4096, respectively.
    • For the usNIC BTL: Lowering btl_usnic_eager_limit to 8192 and 4096 causes the hang to occur after IMB outputs the result for 8192 and 4096, respectively.
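
For completeness, here are the hanging configurations from item 1, assembled into full command lines (assuming the same IMB-MPI1 invocation as my original report; the interface names are from my setup):

$ mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include vic20,vic21 IMB-MPI1 -npmin 32 Gather
$ mpirun --mca btl usnic,vader,self --mca btl_usnic_if_include usnic_0,usnic_1,usnic_2 IMB-MPI1 -npmin 32 Gather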

thananon commented Feb 7, 2018

Thank you, @jsquyres.

  • Eager limit
    • I can reproduce the problem with the eager sizes you used, but oddly, if I set the limit to 56, 65, 2000, 32000, or 19000, it does not hang.

I will have a discussion with @bosilca later today about this.

thananon commented Feb 8, 2018

Found the problem.

@jsquyres Please try this patch. If it works I will create a PR.

diff --git a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
index 83b7a44902..111c3f71e2 100644
--- a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
+++ b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
@@ -348,7 +348,7 @@ void mca_pml_ob1_recv_frag_callback_match(mca_btl_base_module_t* btl,
             mca_pml_ob1_recv_frag_match_proc(frag->btl, comm_ptr, proc,
                                              &frag->hdr.hdr_match,
                                              frag->segments, frag->num_segments,
-                                             hdr->hdr_common.hdr_type, frag);
+                                             frag->hdr.hdr_common.hdr_type, frag);
         } else {
             OB1_MATCHING_UNLOCK(&comm->matching_lock);
         }
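
The one-character diff is easy to miss, so here is a hedged sketch (hypothetical names, not the actual ob1 structures) of the bug class it fixes: once a fragment has been queued with a copy of its header, later processing has to read that copy, not a pointer into the transient receive buffer that was only valid during the original callback.

#include <stdint.h>

typedef struct hdr  { uint8_t type; uint16_t seq; } hdr_t;
typedef struct frag { hdr_t hdr; /* header copied when the frag was queued */ } frag_t;

static void process_queued(frag_t *saved, const hdr_t *transient_hdr)
{
    /* Pre-patch pattern (bug): transient_hdr may describe a different,
     * earlier fragment -- or memory the BTL has already recycled -- so the
     * wrong hdr_type (e.g. MATCH vs. RNDV) drives the protocol. */
    /* uint8_t type = transient_hdr->type; */

    /* Post-patch pattern (fix): read the header stored with the frag. */
    uint8_t type = saved->hdr.type;

    (void)type;           /* protocol dispatch would go here */
    (void)transient_hdr;
}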

jsquyres commented Feb 8, 2018

Works... sorta. ☹️

Gather no longer hangs; Bcast now hangs:

$ mpirun --mca btl usnic,vader,self IMB-MPI1 -npmin 32 Bcast
...
#----------------------------------------------------------------
# Benchmarking Bcast 
# #processes = 32 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.03         0.03         0.03
            1         1000         7.60         7.63         7.61
            2         1000         7.59         7.69         7.61
            4         1000         7.58         7.60         7.59
            8         1000         7.55         7.58         7.57
           16         1000         7.65         7.68         7.66
           32         1000         7.75         7.78         7.76
           64         1000         7.73         7.75         7.74
          128         1000         7.81         7.84         7.83
          256         1000         7.90         7.92         7.91
          512         1000        10.37        10.39        10.38
         1024         1000        11.76        11.78        11.77
         2048         1000        19.29        19.33        19.32
         4096         1000        27.26        27.31        27.29
         8192         1000        50.04        50.07        50.06
        16384         1000        84.62        84.67        84.65
        32768         1000       153.25       153.32       153.29
        65536          640       296.51       296.59       296.55
       131072          320       564.00       564.23       564.17
<hang>

Same hang at the same place with the TCP BTL.

thananon commented Feb 9, 2018

I can't reproduce the hang with Bcast on our machine.

# Calling sequence was:

# ./IMB-MPI1 -npmin 32 Bcast

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# Bcast

#----------------------------------------------------------------
# Benchmarking Bcast
# #processes = 32
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.03         0.05         0.04
            1         1000        15.17        15.23        15.19
            2         1000        15.07        15.12        15.10
            4         1000        14.85        14.88        14.86
            8         1000        14.89        14.90        14.89
           16         1000        14.94        14.99        14.96
           32         1000        15.25        15.28        15.26
           64         1000        15.07        15.11        15.09
          128         1000        15.69        15.75        15.72
          256         1000        14.88        14.92        14.90
          512         1000        19.37        19.43        19.40
         1024         1000        20.12        20.18        20.15
         2048         1000        69.47        69.60        69.55
         4096         1000       116.52       116.72       116.63
         8192         1000       137.53       137.75       137.70
        16384         1000       191.18       191.46       191.37
        32768         1000       317.89       318.31       318.20
        65536          640      2727.14      2728.38      2728.02
       131072          320     12292.67     12336.91     12319.55
       262144          160     21459.86     21611.05     21554.05
       524288           80     10473.63     10869.19     10794.44
      1048576           40     30942.72     34036.88     33350.64
      2097152           20     43018.00     62538.02     59789.51
      4194304           10     87325.51    131872.12    121733.12


# All processes entering MPI_Finalize

bosilca commented Feb 9, 2018

Nice catch @thananon.

jsquyres commented Feb 9, 2018

@thananon and I are working off-issue and will post more info soon.

thananon commented Feb 9, 2018

@jsquyres

After a debugging session with @bosilca, we reverted our commits from #4419 and the problem still persists with multiple links. We are still not sure how/why we get the deadlock; it might be a totally unrelated problem. We can even reproduce it with just PingPong and 5 TCP links.

I will dig deeper but it is certain that my patch above is needed.

jsquyres commented Feb 9, 2018

@thananon @bosilca Do you want to create a PR with the patch we exchanged (off issue) earlier today? It seems like that patch is definitely needed, at a minimum.

@thananon

Hey, so this is the situation.

  • This patch is needed for sure.
diff --git a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
index 83b7a44902..111c3f71e2 100644
--- a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
+++ b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
@@ -348,7 +348,7 @@ void mca_pml_ob1_recv_frag_callback_match(mca_btl_base_module_t* btl,
             mca_pml_ob1_recv_frag_match_proc(frag->btl, comm_ptr, proc,
                                              &frag->hdr.hdr_match,
                                              frag->segments, frag->num_segments,
-                                             hdr->hdr_common.hdr_type, frag);
+                                             frag->hdr.hdr_common.hdr_type, frag);
         } else {
             OB1_MATCHING_UNLOCK(&comm->matching_lock);
         }

@bwbarrett

Just a heads up: there are definitely some bugs in the TCP multilink code. One of the Amazon devs has been working on #4247, which seems to make the code significantly happier...

bosilca commented Feb 12, 2018

The patch I sent can be added to the mix. It's an optimization that avoids comparing the element against the ghost, a check that is unnecessary here because we know there must be at least one element in the cant_match list.
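
In other words (a minimal sketch with hypothetical names, not the real opal_list implementation): opal_list-style lists use a sentinel ("ghost") element, so getting the first item of an empty list returns the ghost and general-purpose callers have to compare against the end marker; when the caller already knows the list is non-empty, that comparison is dead weight.

#include <stddef.h>

typedef struct item { struct item *next, *prev; } item_t;
typedef struct list { item_t ghost; } list_t;            /* ghost == sentinel */

static void    list_init(list_t *l)  { l->ghost.next = l->ghost.prev = &l->ghost; }
static item_t *list_end(list_t *l)   { return &l->ghost; }
static item_t *list_first(list_t *l) { return l->ghost.next; } /* == ghost when empty */

static item_t *list_remove_first(list_t *l)
{
    item_t *it = l->ghost.next;
    l->ghost.next = it->next;
    it->next->prev = &l->ghost;
    return it;
}

/* General-purpose caller: must guard against the ghost, because
 * list_first() returns it when the list is empty. */
static item_t *pop_if_any(list_t *l)
{
    return (list_first(l) == list_end(l)) ? NULL : list_remove_first(l);
}

/* The optimization: when the list is known to be non-empty (there is always
 * at least one frag on frags_cant_match at that call site), the ghost
 * comparison is unnecessary and the first element can be taken directly. */
static item_t *pop_known_nonempty(list_t *l)
{
    return list_remove_first(l);
}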

bosilca commented Feb 12, 2018

@bwbarrett, all the multilink issues I was aware of are in the TCP connection setup. Did you identify other issues?

@bwbarrett

@bosilca, they were all in connection setup, but because of the way the TCP BTL biases eager messages to a single BTL, it was showing up as a hang at the crossover from short to rendezvous messages. Not sure it's related, but I wanted to avoid tail chasing if possible...

@thananon

I can confirm that the TCP hang with multilink is unrelated to #4419.

@jsquyres I might need your help tracking down the problem from the usNIC side.

@gpaulsen

master only, not in v3.1.x

thananon pushed a commit to thananon/ompi that referenced this issue Feb 22, 2018
This commit fixes open-mpi#4795

- Fixed typo that sometimes causes deadlock in change of protocol.
- Redesigned out of sequence ordering and address the overflow case of
  sequence number from uint16_t.

Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu>

thananon commented Feb 22, 2018

PR #4852 passed all tests on our machine. @jsquyres, please try it on Cisco's.
If everything is good, I will create a PR for v3.1.x.

thananon pushed a commit to thananon/ompi that referenced this issue Feb 28, 2018
This commit fixes open-mpi#4795

- Fixed typo that sometimes causes deadlock in change of protocol.
- Redesigned out of sequence ordering and address the overflow case of
  sequence number from uint16_t.

cherry-picked from: 09cba8b
Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu>