Issues running with Open UCX 1.4 on Cray XC40 #6084

Closed

devreal opened this issue Nov 15, 2018 · 22 comments

@devreal
Contributor

devreal commented Nov 15, 2018

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

Open MPI git master (592e2cc)

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Configured with support for Open UCX 1.4 (downloaded from openucx.org) using the configure flags --with-cray-pmi --enable-debug --with-ucx

Please describe the system on which you are running

  • Operating system/version: Cray XC40
  • Computer hardware:
  • Network type:

Details of the problem

Trying to run a job on that machine leads to the following errors and, eventually, a crash:

$ mpirun -n 2 -N 2 --oversubscribe ./mpi/one-sided/osu_get_latency
Thu Nov 15 04:02:41 2018: [unset]:_pmi_alps_get_apid:alps_app_lli_put_request failed
Thu Nov 15 04:02:41 2018: [unset]:_pmi_init:_pmi_alps_init returned -1
Thu Nov 15 04:02:41 2018: [unset]:_pmi_alps_get_apid:alps_app_lli_put_request failed
Thu Nov 15 04:02:41 2018: [unset]:_pmi_init:_pmi_alps_init returned -1
[nid07057:31456:0:31456] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace ====
[nid07057:31455:0:31455] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace ====
[1542250961.930000] [nid07057:31456:0]    ugni_device.c:137  UCX  ERROR PMI_Init failed, Error status: -1
[1542250961.930037] [nid07057:31456:0]    ugni_device.c:182  UCX  ERROR Could not fetch PMI info.
[1542250961.930022] [nid07057:31455:0]    ugni_device.c:137  UCX  ERROR PMI_Init failed, Error status: -1
[1542250961.930055] [nid07057:31455:0]    ugni_device.c:182  UCX  ERROR Could not fetch PMI info.
    0  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucs.so.0(+0x211a0) [0x2aaabe3c71a0]
    1  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucs.so.0(+0x213f4) [0x2aaabe3c73f4]
    2  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(init_device_list+0x90) [0x2aaabdf36ba0]
    3  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(+0x18945) [0x2aaabdf36945]
    4  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(uct_md_open+0x69) [0x2aaabdf2f369]
    5  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucp.so.0(ucp_init_version+0x94b) [0x2aaabdcbde3b]
    6  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/openmpi/mca_pml_ucx.so(+0x44b4) [0x2aaabc2074b4]
    7  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_open+0xd1) [0x2aaabc207d20]
    8  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/openmpi/mca_pml_ucx.so(+0xa888) [0x2aaabc20d888]
    9  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(+0x7377f) [0x2aaaabcfa77f]
   10  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(mca_base_framework_components_open+0x5d) [0x2aaaabcfa69c]
   11  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(+0x132d5e) [0x2aaaab0fed5e]
    0  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucs.so.0(+0x211a0) [0x2aaabe3c71a0]
    1  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucs.so.0(+0x213f4) [0x2aaabe3c73f4]
    2  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(init_device_list+0x90) [0x2aaabdf36ba0]
    3  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(+0x18945) [0x2aaabdf36945]
    4  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(uct_md_open+0x69) [0x2aaabdf2f369]
    5  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucp.so.0(ucp_init_version+0x94b) [0x2aaabdcbde3b]
    6  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/openmpi/mca_pml_ucx.so(+0x44b4) [0x2aaabc2074b4]
    7  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_open+0xd1) [0x2aaabc207d20]
    8  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/openmpi/mca_pml_ucx.so(+0xa888) [0x2aaabc20d888]
    9  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(+0x7377f) [0x2aaaabcfa77f]
   10  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(mca_base_framework_components_open+0x5d) [0x2aaaabcfa69c]
   11  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(+0x132d5e) [0x2aaaab0fed5e]
   12  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(mca_base_framework_open+0xf3) [0x2aaaabd09e5d]
   13  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(ompi_mpi_init+0x99e) [0x2aaaab030e6e]
   14  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(MPI_Init+0x7f) [0x2aaaab07e58c]
   15  ./mpi/one-sided/osu_get_latency() [0x401450]
   16  /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaab5e6725]
   17  ./mpi/one-sided/osu_get_latency() [0x401699]
===================
   12  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(mca_base_framework_open+0xf3) [0x2aaaabd09e5d]
   13  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(ompi_mpi_init+0x99e) [0x2aaaab030e6e]
   14  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(MPI_Init+0x7f) [0x2aaaab07e58c]
   15  ./mpi/one-sided/osu_get_latency() [0x401450]
   16  /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaab5e6725]
   17  ./mpi/one-sided/osu_get_latency() [0x401699]
===================
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 31456 on node 7057 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

This issue could be related to #5973 (as it's the same machine).

Interestingly, I seem unable to disable the UCX pml using --mca pml_ucx_priority 0 (assuming that is the right way to do it); it does not change the outcome. With Open MPI configured without UCX, I am able to run Open MPI applications on that machine (using --oversubscribe).
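An editorial aside: the usual way to disable a component in Open MPI is the MCA exclude syntax rather than setting its priority to 0. A hedged sketch (benchmark path reused from the run above):

```shell
# Exclude the UCX PML so component selection falls back to ob1:
mpirun --mca pml ^ucx -n 2 -N 2 --oversubscribe ./mpi/one-sided/osu_get_latency

# Equivalent via the environment:
export OMPI_MCA_pml=^ucx
```

Note that excluding the PML prevents pml/ucx from being opened at all, whereas a priority of 0 would still let its open function run (which is where the crash occurs here).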

Another interesting observation: when running the Open MPI + UCX build under DDT, I get the following error:

[1542291969.934067] [nid02068:7234 :0]            sys.c:619  UCX  ERROR shmget(size=2097152 flags=0xfb0) for mm_recv_desc failed: Operation not permitted, please check shared memory limits by 'ipcs -l'
[1542291969.934601] [nid02068:7235 :0]            sys.c:619  UCX  ERROR shmget(size=2097152 flags=0xfb0) for mm_recv_desc failed: Operation not permitted, please check shared memory limits by 'ipcs -l'
# OSU MPI_Get latency Test v5.3.2
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_flush
# Size          Latency (us)
[1542291993.779432] [nid02068:7235 :1] ugni_udt_iface.c:119  UCX  ERROR GNI_PostDataProbeWaitById, Error status: GNI_RC_TIMEOUT 4

[1542291993.780001] [nid02068:7234 :0]            sys.c:619  UCX  ERROR shmget(size=2097152 flags=0xb80) for ucp_am_bufs failed: Operation not permitted, please check shared memory limits by 'ipcs -l'
[1542291993.780991] [nid02068:7234 :1] ugni_udt_iface.c:119  UCX  ERROR GNI_PostDataProbeWaitById, Error status: GNI_RC_TIMEOUT 4

As suggested, here is the output of ipcs -l on the node:

$ aprun -n 1 ipcs -l

------ Messages Limits --------
max queues system wide = 32000
max size of message (bytes) = 8192
default max size of queue (bytes) = 16384

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509481983
max total shared memory (kbytes) = 18014398442373116
min seg size (bytes) = 1

------ Semaphore Limits --------
max number of arrays = 32000
max semaphores per array = 32000
max semaphores system wide = 1024000000
max ops per semop call = 500
semaphore max value = 32767

I am able to debug Open MPI applications if Open MPI was built without support for UCX.

@jsquyres
Member

@bosilca @hppritcha Who should this Cray UCX issue be assigned to?

@hppritcha
Member

Does the job work okay if you use aprun as the job launcher rather than mpirun? Also, could you add your Open UCX configure options (if you built it yourself) and note which version of UCX you're using?

@hjelmn
Member

hjelmn commented Nov 17, 2018

I recommend not using UCX on a Cray. You will get better performance with the built-in uGNI support.
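A hedged sketch of forcing the built-in path (component names as reported by ompi_info on typical Open MPI 3.x/4.x builds; adjust to what your install actually provides):

```shell
# Force the ob1 PML with the native uGNI BTL (self/vader handle local traffic):
mpirun --mca pml ob1 --mca btl self,vader,ugni -n 2 ./a.out

# Check which PML/BTL components your build actually provides:
ompi_info | grep -E 'pml|btl'
```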

@devreal
Contributor Author

devreal commented Nov 19, 2018

@hppritcha I was using the official 1.4 release downloaded from the website and configured using:

../configure --with-ugni=/opt/cray//ugni/6.0.14-6.0.5.0_16.9__g19583bb.ari/
[...]
configure: Supported transports: ,cray-ugni,cma,xpmem
[...]

I can post the full config.log if that helps.

When running under mpirun, I see the segfault during component selection:

$ OMPI_MCA_pml_base_verbose=100 mpirun -n 2 -N 1 ./a.out
[nid00572:48382] mca: base: components_open: found loaded component monitoring
[nid00572:48382] mca: base: components_open: component monitoring open function successful
[nid00572:48382] mca: base: components_open: found loaded component ob1
[nid00573:48798] mca: base: components_open: found loaded component monitoring
[nid00573:48798] mca: base: components_open: component monitoring open function successful
[nid00573:48798] mca: base: components_open: found loaded component ob1
[nid00572:48382] mca: base: components_open: component ob1 open function successful
[nid00572:48382] mca: base: components_open: found loaded component ucx
[nid00573:48798] mca: base: components_open: component ob1 open function successful
[nid00573:48798] mca: base: components_open: found loaded component ucx
Mon Nov 19 16:18:06 2018: [unset]:_pmi_alps_get_apid:alps_app_lli_put_request failed
Mon Nov 19 16:18:06 2018: [unset]:_pmi_init:_pmi_alps_init returned -1
[1542640686.625814] [nid00572:48382:0]    ugni_device.c:137  UCX  ERROR PMI_Init failed, Error status: -1
[1542640686.625858] [nid00572:48382:0]    ugni_device.c:182  UCX  ERROR Could not fetch PMI info.
Mon Nov 19 16:18:06 2018: [unset]:_pmi_alps_get_apid:alps_app_lli_put_request failed
Mon Nov 19 16:18:06 2018: [unset]:_pmi_init:_pmi_alps_init returned -1
[nid00572:48382:0:48382] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace ====
[1542640686.625534] [nid00573:48798:0]    ugni_device.c:137  UCX  ERROR PMI_Init failed, Error status: -1
[1542640686.625583] [nid00573:48798:0]    ugni_device.c:182  UCX  ERROR Could not fetch PMI info.
[nid00573:48798:0:48798] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace ====
    0  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucs.so.0(+0x211a0) [0x2aaabcdfb1a0]
    1  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucs.so.0(+0x213f4) [0x2aaabcdfb3f4]
    2  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(init_device_list+0x90) [0x2aaabc767ba0]
    3  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(+0x18945) [0x2aaabc767945]
    4  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(uct_md_open+0x69) [0x2aaabc760369]
    5  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucp.so.0(ucp_init_version+0x94b) [0x2aaabc211e3b]
    6  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_open+0xc1) [0x2aaab7d57ec1]
    7  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(mca_base_framework_components_open+0x175) [0x2aaaab8d1b05]
    8  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(+0xd8aaf) [0x2aaaaada7aaf]
    9  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(mca_base_framework_open+0x81) [0x2aaaab8dd3e1]
   10  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(ompi_mpi_init+0x69b) [0x2aaaaad206bb]
   11  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(PMPI_Init_thread+0x55) [0x2aaaaad50dc5]
   12  ./mpi_progress() [0x401baf]
   13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaab23d725]
   14  ./mpi_progress() [0x400fd9]
===================
    0  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucs.so.0(+0x211a0) [0x2aaabcdfb1a0]
    1  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucs.so.0(+0x213f4) [0x2aaabcdfb3f4]
    2  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(init_device_list+0x90) [0x2aaabc767ba0]
    3  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(+0x18945) [0x2aaabc767945]
    4  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(uct_md_open+0x69) [0x2aaabc760369]
    5  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucp.so.0(ucp_init_version+0x94b) [0x2aaabc211e3b]
    6  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_open+0xc1) [0x2aaab7d57ec1]
    7  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(mca_base_framework_components_open+0x175) [0x2aaaab8d1b05]
    8  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(+0xd8aaf) [0x2aaaaada7aaf]
    9  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(mca_base_framework_open+0x81) [0x2aaaab8dd3e1]
   10  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(ompi_mpi_init+0x69b) [0x2aaaaad206bb]
   11  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(PMPI_Init_thread+0x55) [0x2aaaaad50dc5]
   12  ./mpi_progress() [0x401baf]
   13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaab23d725]
   14  ./mpi_progress() [0x400fd9]
===================

When running under aprun, I see an error printed during component selection, but Open MPI then falls back to ob1:

$ OMPI_MCA_pml_base_verbose=100 PMI_NO_FORK=1 aprun -n 2 -N 1 ./a.out
[...]
[nid00572:48664] mca: base: components_open: found loaded component monitoring
[nid00572:48664] mca: base: components_open: component monitoring open function successful
[nid00572:48664] mca: base: components_open: found loaded component ob1
[nid00572:48664] mca: base: components_open: component ob1 open function successful
[nid00572:48664] mca: base: components_open: found loaded component ucx
[nid00572:48664] mca: base: components_open: component ucx open function successful
[nid00572:48664] select: component v not in the include list
[nid00572:48664] select: component monitoring not in the include list
[nid00572:48664] select: initializing pml component ob1
[nid00572:48664] select: init returned priority 20
[nid00572:48664] select: initializing pml component ucx
[1542640902.398300] [nid00573:49080:0]            sys.c:619  UCX  ERROR shmget(size=2097152 flags=0xfb0) for mm_recv_desc failed: Operation not permitted, please check shared memory limits by 'ipcs -l'
[1542640902.398800] [nid00572:48664:0]            sys.c:619  UCX  ERROR shmget(size=2097152 flags=0xfb0) for mm_recv_desc failed: Operation not permitted, please check shared memory limits by 'ipcs -l'
[nid00572:48664] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:228 Error: UCP worker does not support MPI_THREAD_MULTIPLE
[nid00573:49080] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:228 Error: UCP worker does not support MPI_THREAD_MULTIPLE
[nid00573:49080] select: init returned failure for component ucx
[nid00573:49080] selected ob1 best priority 20
[nid00573:49080] select: component ob1 selected
[nid00573:49080] mca: base: close: component v closed
[nid00573:49080] mca: base: close: unloading component v
[nid00573:49080] mca: base: close: unloading component monitoring
[nid00572:48664] select: init returned failure for component ucx
[nid00572:48664] selected ob1 best priority 20
[nid00572:48664] select: component ob1 selected
[nid00572:48664] mca: base: close: component v closed
[nid00572:48664] mca: base: close: unloading component v
[nid00572:48664] mca: base: close: unloading component monitoring
[nid00573:49080] mca: base: close: component ucx closed
[nid00573:49080] mca: base: close: unloading component ucx
[nid00572:48664] mca: base: close: component ucx closed
[nid00572:48664] mca: base: close: unloading component ucx
[nid00572:48664] check:select: checking my pml ob1 against rank=0 pml ob1
[nid00573:49080] check:select: rank=0

@hppritcha
Member

I think you may have been zapped by 4eeb415. Could you try testing with an older Open MPI release such as 3.1.2?

@devreal
Contributor Author

devreal commented Nov 20, 2018

@hppritcha I tried with Open MPI 3.1.2 and Open UCX 1.4, but the error is the same. I did get a stack trace from a core dump, though:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  init_device_list () at ../../../src/uct/ugni/base/ugni_device.c:198
198	    if (-1 != inf->num_devices) {
[Current thread is 1 (Thread 0x2aaaaab41bc0 (LWP 29308))]
Missing separate debuginfos, use: zypper install glibc-debuginfo-2.22-62.13.2.x86_64 libgcc_s1-debuginfo-7.3.1+r258812-5.2.x86_64 libnl3-200-debuginfo-3.2.23-2.21.x86_64 libnuma1-debuginfo-2.0.9-9.1.x86_64 libpciaccess0-debuginfo-0.13.2-5.1.x86_64 libz1-debuginfo-1.2.8-12.3.1.x86_64
(gdb) bt
#0  init_device_list () at ../../../src/uct/ugni/base/ugni_device.c:198
#1  0x00002aaabd50f945 in uct_ugni_md_open (md_name=<optimized out>, md_config=<optimized out>, md_p=0x7fffffff4438) at ../../../src/uct/ugni/base/ugni_md.c:198
#2  0x00002aaabd508369 in uct_md_open (md_name=md_name@entry=0x7b53e0 "ugni", config=0x7c2930, md_p=md_p@entry=0x7c51c0) at ../../../src/uct/base/uct_md.c:124
#3  0x00002aaabd296e3b in ucp_fill_tl_md (tl_md=0x7c51c0, md_rsc=0x7b53e0) at ../../../src/ucp/core/ucp_context.c:742
#4  ucp_fill_resources (config=0x7b5150, context=0x7b5260) at ../../../src/ucp/core/ucp_context.c:924
#5  ucp_init_version (api_major_version=<optimized out>, api_minor_version=<optimized out>, params=<optimized out>, config=0x7b5150, context_p=0x2aaabd288318 <ompi_pml_ucx+184>) at ../../../src/ucp/core/ucp_context.c:1205
#6  0x00002aaabd083af1 in mca_pml_ucx_open () from /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-3.1.2-ucx/lib/openmpi/mca_pml_ucx.so
#7  0x00002aaaab895545 in mca_base_framework_components_open () from /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-3.1.2-ucx/lib/libopen-pal.so.40
#8  0x00002aaaaad7b457 in mca_pml_base_open () from /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-3.1.2-ucx/lib/libmpi.so.40
#9  0x00002aaaab89fe55 in mca_base_framework_open () from /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-3.1.2-ucx/lib/libopen-pal.so.40
#10 0x00002aaaaad1dd1e in ompi_mpi_init () from /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-3.1.2-ucx/lib/libmpi.so.40
#11 0x00002aaaaad44be5 in PMPI_Init_thread () from /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-3.1.2-ucx/lib/libmpi.so.40
#12 0x0000000000401cb7 in main ()

I assume inf is NULL (the fault address 0x10 would then be the offset of num_devices within the struct), but the value has been optimized out.

@hppritcha
Member

It still looks like UCX isn't getting the Aries network RDMA cookie info for some reason. Could you try again using aprun, but add the following to your environment:

export OMPI_MCA_btl=^openib
export PMI_DEBUG=1

@hppritcha
Member

Another thing to try: could you make sure to get an allocation with 2 or more nodes? Some sites configure ALPS so that no Aries network resources are allocated for single-node jobs.

@devreal
Contributor Author

devreal commented Dec 4, 2018

Running with 2 or more nodes:

$ export OMPI_MCA_btl=^openib
$ export PMI_DEBUG=1
$ mpirun -n 2 -N 1 ./mpi_progress
Tue Dec  4 15:21:53 2018: [unset]:_pmi_alps_get_apid:alps_app_lli_put_request failed
Tue Dec  4 15:21:53 2018: [unset]: _pmi_alps_init: _pmi_alps_get_apid returned with error: Bad file descriptor
Tue Dec  4 15:21:53 2018: [unset]:_pmi_init:_pmi_alps_init returned -1
Tue Dec  4 15:21:53 2018: [unset]: PMI Version: 5.0.14 git rev: 8e2393a9
Tue Dec  4 15:21:53 2018: [unset]: PMI rank = 0 pg id = 0 appnum = 0 pes_per_smp = 0 pes_this_smp = 0
Tue Dec  4 15:21:53 2018: [unset]: _pmi_initialized = 0 spawned = 0
[1543933313.343398] [nid05686:32560:0]    ugni_device.c:137  UCX  ERROR PMI_Init failed, Error status: -1
[1543933313.343455] [nid05686:32560:0]    ugni_device.c:182  UCX  ERROR Could not fetch PMI info.
[nid05686:32560:0:32560] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace ====
Tue Dec  4 15:21:53 2018: [unset]:_pmi_alps_get_apid:alps_app_lli_put_request failed
Tue Dec  4 15:21:53 2018: [unset]: _pmi_alps_init: _pmi_alps_get_apid returned with error: Bad file descriptor
Tue Dec  4 15:21:53 2018: [unset]:_pmi_init:_pmi_alps_init returned -1
Tue Dec  4 15:21:53 2018: [unset]: PMI Version: 5.0.14 git rev: 8e2393a9
Tue Dec  4 15:21:53 2018: [unset]: PMI rank = 0 pg id = 0 appnum = 0 pes_per_smp = 0 pes_this_smp = 0
Tue Dec  4 15:21:53 2018: [unset]: _pmi_initialized = 0 spawned = 0
[nid05685:32755:0:32755] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace ====
[1543933313.344548] [nid05685:32755:0]    ugni_device.c:137  UCX  ERROR PMI_Init failed, Error status: -1
[1543933313.344568] [nid05685:32755:0]    ugni_device.c:182  UCX  ERROR Could not fetch PMI info.
    0  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucs.so.0(+0x211a0) [0x2aaabcdfa1a0]
    1  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucs.so.0(+0x213f4) [0x2aaabcdfa3f4]
    2  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(init_device_list+0x90) [0x2aaabc766ba0]
    3  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(+0x18945) [0x2aaabc766945]
    4  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(uct_md_open+0x69) [0x2aaabc75f369]
    5  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucp.so.0(ucp_init_version+0x94b) [0x2aaabc211e3b]
    6  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_open+0xc1) [0x2aaab7d57ec1]
    7  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(mca_base_framework_components_open+0x175) [0x2aaaab8d1b05]
    8  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(+0xd8aaf) [0x2aaaaada7aaf]
    9  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(mca_base_framework_open+0x81) [0x2aaaab8dd3e1]
   10  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(ompi_mpi_init+0x69b) [0x2aaaaad206bb]
   11  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(PMPI_Init_thread+0x55) [0x2aaaaad50dc5]
   12  ./mpi_progress() [0x401baf]
   13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaab23d725]
   14  ./mpi_progress() [0x400fd9]
===================
    0  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucs.so.0(+0x211a0) [0x2aaabcdfa1a0]
    1  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucs.so.0(+0x213f4) [0x2aaabcdfa3f4]
    2  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(init_device_list+0x90) [0x2aaabc766ba0]
    3  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(+0x18945) [0x2aaabc766945]
    4  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(uct_md_open+0x69) [0x2aaabc75f369]
    5  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucp.so.0(ucp_init_version+0x94b) [0x2aaabc211e3b]
    6  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_open+0xc1) [0x2aaab7d57ec1]
    7  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(mca_base_framework_components_open+0x175) [0x2aaaab8d1b05]
    8  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(+0xd8aaf) [0x2aaaaada7aaf]
    9  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(mca_base_framework_open+0x81) [0x2aaaab8dd3e1]
   10  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(ompi_mpi_init+0x69b) [0x2aaaaad206bb]
   11  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(PMPI_Init_thread+0x55) [0x2aaaaad50dc5]
   12  ./mpi_progress() [0x401baf]
   13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaab23d725]
   14  ./mpi_progress() [0x400fd9]
===================
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 32560 on node 5686 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Running under aprun:

$ PMI_NO_FORK=1 aprun -n 2 -N 1 ./a.out
Tue Dec  4 15:17:50 2018: [unset]: ALPS returns apid 2202949
Tue Dec  4 15:17:50 2018: [unset]: _pmi_alps_init: pmi_rank=0, num_apps=1, my_appnum=0, my_local_apprank=0, apps_share_node=0, mynid=5685
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi2_kvs_hash_entries = 1
Tue Dec  4 15:17:50 2018: [PE_0]: mmap in a file for shared memory type 4 len 49216
Tue Dec  4 15:17:50 2018: [PE_0]: ALPS returns apid 2202949
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_control_net_init: num_nodes = 2
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_control_net_init: parent_id is -1
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_control_net_init:  num_targets 1
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_control_net_init: num_nodes = 2
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_control_net_init: parent_id is -1
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_control_net_init:  num_targets 1
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_inet_listen_socket_setup: setting up listening socket on addr 10.128.22.98 port 1371
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_inet_listen_socket_setup: open listen_sock 12
Tue Dec  4 15:17:50 2018: [PE_0]: ALPS returns apid 2202949
Tue Dec  4 15:17:50 2018: [PE_0]: PMI rank/pid filename : /var//opt/cray/alps/spool/2202949/pmi_attribs
Tue Dec  4 15:17:50 2018: [PE_0]: nid: 5685, appnum: 0, rank: 0, pid: 32273
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_alps_sync: Notifying ALPS of start
Tue Dec  4 15:17:50 2018: [unset]: ALPS returns apid 2202949
Tue Dec  4 15:17:50 2018: [unset]: _pmi_alps_init: pmi_rank=1, num_apps=1, my_appnum=0, my_local_apprank=0, apps_share_node=0, mynid=5686
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi2_kvs_hash_entries = 1
Tue Dec  4 15:17:50 2018: [PE_1]: mmap in a file for shared memory type 4 len 49216
Tue Dec  4 15:17:50 2018: [PE_1]: ALPS returns apid 2202949
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_control_net_init: num_nodes = 2
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_control_net_init: parent_id is 0
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_control_net_init:  controller_nid = 5685 controller hostname nid05685
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_control_net_init: num_nodes = 2
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_control_net_init: parent_id is 0
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_control_net_init:  controller_nid = 5685 controller hostname nid05685
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_inet_listen_socket_setup: setting up listening socket on addr 10.128.22.99 port 1371
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_inet_listen_socket_setup: open listen_sock 12
Tue Dec  4 15:17:50 2018: [PE_1]: ALPS returns apid 2202949
Tue Dec  4 15:17:50 2018: [PE_1]: PMI rank/pid filename : /var//opt/cray/alps/spool/2202949/pmi_attribs
Tue Dec  4 15:17:50 2018: [PE_1]: nid: 5686, appnum: 0, rank: 1, pid: 32078
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_alps_sync: Notifying ALPS of start
Tue Dec  4 15:17:50 2018: [PE_0]: completed PMI sync with launcher
Tue Dec  4 15:17:50 2018: [PE_0]: calling _pmi_inet_setup (full)
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_inet_setup: have controller = 0 controller hostname 
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_inet_setup: waiting on listening
Tue Dec  4 15:17:50 2018: [PE_1]: calling _pmi_inet_setup (full)
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_inet_setup: have controller = 1 controller hostname nid05685
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_inet_setup: got an accept complete from 10.128.22.99
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_inet_setup: inet_recv for accept completion done
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_inet_setup: target_id 5686 10.128.22.99 connected
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_inet_setup: open target fd 13
Tue Dec  4 15:17:50 2018: [PE_0]: PE 0 network_barrier:receiving message from target 0 nid 5686
Tue Dec  4 15:17:50 2018: [PE_0]: PE 0 network_barrier:received message from target 0 nid 5686 errno 0
Tue Dec  4 15:17:50 2018: [PE_0]: network_barrier:sending release packet to target 0
Tue Dec  4 15:17:50 2018: [PE_0]: calling _pmi_inet_setup (app)
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_inet_setup: have controller = 0 controller hostname 
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_inet_setup: waiting on listening
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_inet_setup: got an accept complete from 10.128.22.99
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_inet_setup: inet_recv for accept completion done
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_inet_setup: target_id 5686 10.128.22.99 connected
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_inet_setup: open target fd 14
Tue Dec  4 15:17:50 2018: [PE_0]: completed PMI TCP/IP inet setup
Tue Dec  4 15:17:50 2018: [PE_0]: PMI Version: 5.0.14 git rev: 8e2393a9
Tue Dec  4 15:17:50 2018: [PE_0]: PMI rank = 0 pg id = 2202949 appnum = 0 pes_per_smp = 1 pes_this_smp = 1
Tue Dec  4 15:17:50 2018: [PE_0]: base_pe[0] = 0 pes_in_app[0] = 2 pes_in_app_this_smp = 1
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_initialized = 0 spawned = 0
Tue Dec  4 15:17:50 2018: [PE_0]: PMI Version: 5.0.14 git rev: 8e2393a9
Tue Dec  4 15:17:50 2018: [PE_0]: PMI rank = 0 pg id = 2202949 appnum = 0 pes_per_smp = 1 pes_this_smp = 1
Tue Dec  4 15:17:50 2018: [PE_0]: base_pe[0] = 0 pes_in_app[0] = 2 pes_in_app_this_smp = 1
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_initialized = 0 spawned = 0
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi2_add_kvs added (GLOBAL=1): key=universeSize, value=2 (hash=0)
Tue Dec  4 15:17:50 2018: [PE_0]: vector-process-mapping str = [(vector,(0,2,1))]
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi2_add_kvs added (GLOBAL=1): key=PMI_process_mapping, value=(vector,(0,2,1)) (hash=0)
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi2_create_jobattrs - done
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_barrier: entering
Tue Dec  4 15:17:50 2018: [PE_0]: PE 0 network_barrier:receiving message from target 0 nid 5686
Tue Dec  4 15:17:50 2018: [PE_0]: PE 0 network_barrier:received message from target 0 nid 5686 errno 0
Tue Dec  4 15:17:50 2018: [PE_0]: network_barrier:sending release packet to target 0
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_barrier: exiting
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi2_info_get_jobattr: FOUND a match to name=PMI_process_mapping, (val=(vector,(0,2,1)))
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi2_info_get_jobattr: FOUND a match to name=universeSize, (val=2)
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_inet_setup: using port 1371 sleep 2 retry 300
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_inet_setup: open controller fd 13
Tue Dec  4 15:17:50 2018: [PE_1]: network_barrier:sending barrier packet to my controller
Tue Dec  4 15:17:50 2018: [PE_1]: calling _pmi_inet_setup (app)
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_inet_setup: have controller = 1 controller hostname nid05685
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_inet_setup: using port 1371 sleep 2 retry 300
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_inet_setup: open controller fd 14
Tue Dec  4 15:17:50 2018: [PE_1]: completed PMI TCP/IP inet setup
Tue Dec  4 15:17:50 2018: [PE_1]: PMI Version: 5.0.14 git rev: 8e2393a9
Tue Dec  4 15:17:50 2018: [PE_1]: PMI rank = 1 pg id = 2202949 appnum = 0 pes_per_smp = 1 pes_this_smp = 1
Tue Dec  4 15:17:50 2018: [PE_1]: base_pe[0] = 0 pes_in_app[0] = 2 pes_in_app_this_smp = 1
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_initialized = 0 spawned = 0
Tue Dec  4 15:17:50 2018: [PE_1]: PMI Version: 5.0.14 git rev: 8e2393a9
Tue Dec  4 15:17:50 2018: [PE_1]: PMI rank = 1 pg id = 2202949 appnum = 0 pes_per_smp = 1 pes_this_smp = 1
Tue Dec  4 15:17:50 2018: [PE_1]: base_pe[0] = 0 pes_in_app[0] = 2 pes_in_app_this_smp = 1
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_initialized = 0 spawned = 0
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi2_add_kvs added (GLOBAL=1): key=universeSize, value=2 (hash=0)
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi2_add_kvs added (GLOBAL=1): key=PMI_process_mapping, value=(vector,(0,2,1)) (hash=0)
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi2_create_jobattrs - done
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_barrier: entering
Tue Dec  4 15:17:50 2018: [PE_1]: network_barrier:sending barrier packet to my controller
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_barrier: exiting
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi2_info_get_jobattr: FOUND a match to name=PMI_process_mapping, (val=(vector,(0,2,1)))
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi2_info_get_jobattr: FOUND a match to name=universeSize, (val=2)
Tue Dec  4 15:17:50 2018: [PE_1]: PMI Version: 5.0.14 git rev: 8e2393a9
Tue Dec  4 15:17:50 2018: [PE_1]: PMI rank = 1 pg id = 2202949 appnum = 0 pes_per_smp = 1 pes_this_smp = 1
Tue Dec  4 15:17:50 2018: [PE_1]: base_pe[0] = 0 pes_in_app[0] = 2 pes_in_app_this_smp = 1
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_initialized = 1 spawned = 0
Tue Dec  4 15:17:50 2018: [PE_0]: PMI Version: 5.0.14 git rev: 8e2393a9
Tue Dec  4 15:17:50 2018: [PE_0]: PMI rank = 0 pg id = 2202949 appnum = 0 pes_per_smp = 1 pes_this_smp = 1
Tue Dec  4 15:17:50 2018: [PE_0]: base_pe[0] = 0 pes_in_app[0] = 2 pes_in_app_this_smp = 1
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_initialized = 1 spawned = 0
[1543933070.487171] [nid05685:32273:0]            sys.c:619  UCX  ERROR shmget(size=2097152 flags=0xfb0) for mm_recv_desc failed: Operation not permitted, please check shared memory limits by 'ipcs -l'
[1543933070.487711] [nid05686:32078:0]            sys.c:619  UCX  ERROR shmget(size=2097152 flags=0xfb0) for mm_recv_desc failed: Operation not permitted, please check shared memory limits by 'ipcs -l'
[nid05686:32078] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:228 Error: UCP worker does not support MPI_THREAD_MULTIPLE
[nid05685:32273] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:228 Error: UCP worker does not support MPI_THREAD_MULTIPLE
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_network_allgather: num_targets = 0 have_controller = 1 len_smp 16 len_global 32
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_network_allgather: num_targets = 1 have_controller = 0 len_smp 16 len_global 32
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_allgatherv calling _pmi_smp_gatherv: len=197 in=0x2aaab0000eb0
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_smp_gatherv: rank:0 len=197 maxloop=1 maxblocksize=197
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_network_allgather: num_targets = 1 have_controller = 0 len_smp 197 len_global 364
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_allgatherv calling _pmi_smp_bcast, len=364
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_allgatherv calling _pmi_smp_gatherv: len=167 in=0x2aaab0000eb0
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_smp_gatherv: rank:0 len=167 maxloop=1 maxblocksize=167
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_network_allgather: num_targets = 0 have_controller = 1 len_smp 167 len_global 364
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_allgatherv calling _pmi_smp_bcast, len=364
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_network_allgather: num_targets = 1 have_controller = 0 len_smp 16 len_global 32
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_allgatherv calling _pmi_smp_gatherv: len=197 in=0x2aaab00026f0
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_smp_gatherv: rank:0 len=197 maxloop=1 maxblocksize=197
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_network_allgather: num_targets = 1 have_controller = 0 len_smp 197 len_global 364
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_network_allgather: num_targets = 0 have_controller = 1 len_smp 16 len_global 32
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_allgatherv calling _pmi_smp_gatherv: len=167 in=0x2aaab00026b0
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_smp_gatherv: rank:0 len=167 maxloop=1 maxblocksize=167
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_network_allgather: num_targets = 0 have_controller = 1 len_smp 167 len_global 364
Tue Dec  4 15:17:50 2018: [PE_0]: _pmi_allgatherv calling _pmi_smp_bcast, len=364
Tue Dec  4 15:17:50 2018: [PE_1]: _pmi_allgatherv calling _pmi_smp_bcast, len=364
MPI_THREAD_MULTIPLE supported: yes
MPI_THREAD_MULTIPLE supported: yes

The last two lines are printed by the application right after MPI_Init_thread returns. The outcome under mpirun is the same if I replace MPI_Init_thread with MPI_Init.

Please let me know if I can provide anything else.

@hjelmn
Member

hjelmn commented Dec 4, 2018

@devreal Did you see this:

[nid05686:32078] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:228 Error: UCP worker does not support MPI_THREAD_MULTIPLE
[nid05685:32273] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:228 Error: UCP worker does not support MPI_THREAD_MULTIPLE

That would be why the aprun one failed with thread multiple. Out of curiosity, why are you trying to use UCX on a Cray?
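One common cause of the "UCP worker does not support MPI_THREAD_MULTIPLE" message is a UCX library built without multi-threading support. A sketch of rebuilding UCX with it enabled is below; the prefix path is a placeholder, and the exact summary wording may differ between UCX versions, so verify the available flags with `./configure --help`.

```shell
# Rebuild UCX with multi-thread support (prefix is a placeholder).
cd ucx
./contrib/configure-release --prefix=$HOME/ucx-mt --enable-mt
make -j8 install
# Check the configure summary / ucx_info -v output for multi-thread support.
```

Open MPI then needs to be configured with `--with-ucx` pointing at that install.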

@devreal
Contributor Author

devreal commented Dec 4, 2018

Ahh, so that is why Open MPI falls back to ob1.

Originally, I was trying to use UCX while working on #5915 to see what happens when hugepages are passed to UCX. I understand that the performance might not be as good as with ugni. This is not high priority, and I cannot rule out that it is caused by a weird configuration at this site (unfortunately, I don't have access to another Cray system to verify). However, Open MPI should imo not segfault if PMI initialization fails, so I thought I'd report the crashes in case someone else tries to use Open MPI with UCX on Aries.

@hjelmn
Member

hjelmn commented Dec 4, 2018

@devreal I agree. You might also test with the btl/uct path if you want to test hugepage stuff. You can enable it with `--mca btl self,vader,uct --mca btl_uct_memory_domains ugni`. It will also hit the PMI failure under mpirun, but it doesn't need thread-multiple support in UCX; it does its own thing to support multiple threads.

@hjelmn
Member

hjelmn commented Dec 4, 2018

(master only)

@hjelmn
Member

hjelmn commented Dec 4, 2018

FWIW, it crashes on an XC-40 with slurm as well. I think it is a UCX bug but I haven't had the time to figure it out.

@hjelmn
Member

hjelmn commented Dec 4, 2018

Yup. They unconditionally call into PMI (bad) instead of checking the environment first.

@hjelmn
Member

hjelmn commented Dec 4, 2018

Will fix this and PR to UCX 1.4.x

@hjelmn
Member

hjelmn commented Dec 4, 2018

openucx/ucx#3080

@hjelmn
Member

hjelmn commented Dec 4, 2018

That gets you past basic initialization. There is another UCX bug that is unrelated to this one.

@devreal
Contributor Author

devreal commented Jan 15, 2019

Thanks for the fix @hjelmn! I finally had time to test again. I'm getting past MPI_Init but get an error in MPI_Win_allocate:

[nid03575:15492:0:15492] ugni_udt_ep.c:178  UCX Assertion `msg_length <= 128' failed: msg_length=153

Backtrace:

#31 main () (at 0x000000000040b706)
#30 PMPI_Win_allocate (size=8, disp_unit=1, info=0xb141d0, comm=0x8cdde0, baseptr=0x7fffffff5b90, win=0x7fffffff5b88) at /zhome/academic/HLRS/hlrs/hpcjschu/src/openmpi/git/ompi/build/ompi/mpi/c/profile/pwin_allocate.c:81 (at 0x00000000004476f8)
#29 ompi_win_allocate (size=8, disp_unit=1, info=0xb141d0, comm=0x8cdde0, baseptr=0x7fffffff5b90, newwin=0x7fffffff5b88) at /zhome/academic/HLRS/hlrs/hpcjschu/src/openmpi/git/ompi/build/ompi/../../ompi/win/win.c:278 (at 0x000000000043c149)
#28 ompi_osc_base_select (win=0xb14260, base=0x7fffffff5ad0, size=8, disp_unit=1, comm=0x8cdde0, info=0xb141d0, flavor=2, model=0x7fffffff5adc) at /zhome/academic/HLRS/hlrs/hpcjschu/src/openmpi/git/ompi/build/ompi/mca/osc/../../../../ompi/mca/osc/base/osc_base_init.c:74 (at 0x000000000049d072)
#27 ompi_osc_rdma_component_select (win=0xb14260, base=0x7fffffff5ad0, size=8, disp_unit=1, comm=0x8cdde0, info=0xb141d0, flavor=2, model=0x7fffffff5adc) at /zhome/academic/HLRS/hlrs/hpcjschu/src/openmpi/git/ompi/build/ompi/mca/osc/rdma/../../../../../ompi/mca/osc/rdma/osc_rdma_component.c:1162 (at 0x00000000004a646d)
#26 ompi_comm_dup (comm=0x8cdde0, newcomm=0xb14c98) at /zhome/academic/HLRS/hlrs/hpcjschu/src/openmpi/git/ompi/build/ompi/../../ompi/communicator/comm.c:973 (at 0x000000000041529c)
#25 ompi_comm_dup_with_info (comm=0x8cdde0, info=0x0, newcomm=0xb14c98) at /zhome/academic/HLRS/hlrs/hpcjschu/src/openmpi/git/ompi/build/ompi/../../ompi/communicator/comm.c:1008 (at 0x0000000000415399)
#24 ompi_comm_nextcid (newcomm=0xafad20, comm=0x8cdde0, bridgecomm=0x0, arg0=0x0, arg1=0x0, send_first=false, mode=32) at /zhome/academic/HLRS/hlrs/hpcjschu/src/openmpi/git/ompi/build/ompi/../../ompi/communicator/comm_cid.c:295 (at 0x0000000000418a61)
#23 ompi_request_wait_completion (req=0xb16c18) at /zhome/academic/HLRS/hlrs/hpcjschu/src/openmpi/git/ompi/build/ompi/../../ompi/request/request.h:415 (at 0x00000000004182de)
#22 opal_progress () at /zhome/academic/HLRS/hlrs/hpcjschu/src/openmpi/git/ompi/build/opal/../../opal/runtime/opal_progress.c:230 (at 0x00002aaaabd7abd6)
#21 ompi_comm_request_progress () at /zhome/academic/HLRS/hlrs/hpcjschu/src/openmpi/git/ompi/build/ompi/../../ompi/communicator/comm_request.c:140 (at 0x000000000041c9a6)
#20 ompi_comm_allreduce_getnextcid (request=0xb16c18) at /zhome/academic/HLRS/hlrs/hpcjschu/src/openmpi/git/ompi/build/ompi/../../ompi/communicator/comm_cid.c:340 (at 0x0000000000418c10)
#19 ompi_comm_allreduce_intra_nb (inbuf=0xb16b74, outbuf=0xb16b70, count=1, op=0x8d6500, context=0xb16b20, req=0x7fffffff5720) at /zhome/academic/HLRS/hlrs/hpcjschu/src/openmpi/git/ompi/build/ompi/../../ompi/communicator/comm_cid.c:647 (at 0x00000000004197f8)
#18 ompi_coll_libnbc_iallreduce (sendbuf=0xb16b74, recvbuf=0xb16b70, count=1, datatype=0x8ae500, op=0x8d6500, comm=0x8cdde0, request=0x7fffffff5720, module=0xb10790) at /zhome/academic/HLRS/hlrs/hpcjschu/src/openmpi/git/ompi/build/ompi/mca/coll/libnbc/../../../../../ompi/mca/coll/libnbc/nbc_iallreduce.c:228 (at 0x0000000000475446)
#17 NBC_Start (handle=0xb17868) at /zhome/academic/HLRS/hlrs/hpcjschu/src/openmpi/git/ompi/build/ompi/mca/coll/libnbc/../../../../../ompi/mca/coll/libnbc/nbc.c:660 (at 0x00000000004714ed)
#16 NBC_Start_round (handle=0xb17868) at /zhome/academic/HLRS/hlrs/hpcjschu/src/openmpi/git/ompi/build/ompi/mca/coll/libnbc/../../../../../ompi/mca/coll/libnbc/nbc.c:461 (at 0x0000000000470cd2)
#15 mca_pml_ucx_isend (buf=0xb16b74, count=1, datatype=0x8ae500, dst=0, tag=-27, mode=MCA_PML_BASE_SEND_STANDARD, comm=0x8cdde0, request=0xb180c0) at /zhome/academic/HLRS/hlrs/hpcjschu/src/openmpi/git/ompi/build/ompi/mca/pml/ucx/../../../../../ompi/mca/pml/ucx/pml_ucx.c:697 (at 0x000000000054bad8)
#14 mca_pml_ucx_get_ep (comm=0x8cdde0, rank=0) at /zhome/academic/HLRS/hlrs/hpcjschu/src/openmpi/git/ompi/build/ompi/mca/pml/ucx/../../../../../ompi/mca/pml/ucx/pml_ucx.c:380 (at 0x000000000054a8e0)
#13 mca_pml_ucx_add_proc (comm=0x8cdde0, dst=0) at /zhome/academic/HLRS/hlrs/hpcjschu/src/openmpi/git/ompi/build/ompi/mca/pml/ucx/../../../../../ompi/mca/pml/ucx/pml_ucx.c:344 (at 0x000000000054a7e4)
#12 mca_pml_ucx_add_proc_common (proc=0xb180e0) at /zhome/academic/HLRS/hlrs/hpcjschu/src/openmpi/git/ompi/build/ompi/mca/pml/ucx/../../../../../ompi/mca/pml/ucx/pml_ucx.c:318 (at 0x000000000054a707)
#11 ucp_ep_create (worker=0x2aaaaac12010, params=0x7fffffff53a0, ep_p=0x7fffffff5388) at /zhome/academic/HLRS/hlrs/hpcjschu/src/openucx/git/build/src/ucp/../../../src/ucp/core/ucp_ep.c:593 (at 0x00002aaaac856818)
#10 ucp_ep_create_api_to_worker_addr (ep_p=<synthetic pointer>, params=0x7fffffff53a0, worker=0x2aaaaac12010) at /zhome/academic/HLRS/hlrs/hpcjschu/src/openucx/git/build/src/ucp/../../../src/ucp/core/ucp_ep.c:560 (at 0x00002aaaac856818)
#9 ucp_wireup_send_request (ep=0x2aaab08d9048) at /zhome/academic/HLRS/hlrs/hpcjschu/src/openucx/git/build/src/ucp/../../../src/ucp/wireup/wireup.c:844 (at 0x00002aaaac894c95)
#8 ucp_wireup_msg_send (ep=ep@entry=0x2aaab08d9048, type=type@entry=1 '\001', tl_bitmap=tl_bitmap@entry=56, rsc_tli=rsc_tli@entry=0x7fffffff528a "\005\377\377\377\377\377H\220\215\260\252*") at /zhome/academic/HLRS/hlrs/hpcjschu/src/openucx/git/build/src/ucp/../../../src/ucp/wireup/wireup.c:171 (at 0x00002aaaac8925cd)
#7 ucp_request_send (pending_flags=0, req=0xb3edd0) at /zhome/academic/HLRS/hlrs/hpcjschu/src/openucx/git/build/../src/ucp/core/ucp_request.inl:204 (at 0x00002aaaac8925cd)
#6 ucp_request_try_send (pending_flags=0, req_status=0x7fffffff5207, req=0xb3edd0) at /zhome/academic/HLRS/hlrs/hpcjschu/src/openucx/git/build/../src/ucp/core/ucp_request.inl:169 (at 0x00002aaaac8925cd)
#5 ucp_wireup_msg_progress (self=0xb3ee78) at /zhome/academic/HLRS/hlrs/hpcjschu/src/openucx/git/build/src/ucp/../../../src/ucp/wireup/wireup.c:66 (at 0x00002aaaac89219e)
#4 uct_ep_am_bcopy (flags=<optimized out>, arg=0xb3edd0, pack_cb=0x2aaaac8920e0 <ucp_wireup_msg_pack>, id=1 '\001', ep=<optimized out>) at /zhome/academic/HLRS/hlrs/hpcjschu/src/openucx/git/build/../src/uct/api/uct.h:1892 (at 0x00002aaaac89219e)
#3 uct_ugni_udt_ep_am_bcopy (tl_ep=0xb1aae0, id=<optimized out>, pack_cb=<optimized out>, arg=<optimized out>, flags=<optimized out>) at /zhome/academic/HLRS/hlrs/hpcjschu/src/openucx/git/build/src/uct/../../../src/uct/ugni/udt/ugni_udt_ep.c:227 (at 0x00002aaaacada0d4)
#2 uct_ugni_udt_ep_am_common_send (arg=<optimized out>, pack_cb=<optimized out>, payload=0x0, header=0, length=0, am_id=<optimized out>, iface=0xab5f60, ep=0xb1aae0, is_short=0) at /zhome/academic/HLRS/hlrs/hpcjschu/src/openucx/git/build/src/uct/../../../src/uct/ugni/udt/ugni_udt_ep.c:178 (at 0x00002aaaacada0d4)
#1 ucs_fatal_error (error_type=error_type@entry=0x2aaaacae826e "assertion failure", file=file@entry=0x2aaaacae8170 "../../../src/uct/ugni/udt/ugni_udt_ep.c", line=line@entry=178, function=function@entry=0x2aaaacae8420 "uct_ugni_udt_ep_am_common_send", format=format@entry=0x2aaaacae8340 "Assertion `%s' failed: msg_length=%u") at /zhome/academic/HLRS/hlrs/hpcjschu/src/openucx/git/build/src/ucs/../../../src/ucs/debug/assert.c:35 (at 0x00002aaaacf2f061)
#0 abort () from /lib64/libc.so.6 (at 0x00002aaaae98f200)

Strangely, I see that the uct BTL fails initialization when running with `--mca btl_base_verbose 100`; not sure if that is related:

[nid03574:01645] select: initializing btl component uct
[nid03574][[15215,1],0][../../../../../opal/mca/btl/uct/btl_uct_component.c:419:mca_btl_uct_component_init] initializing uct btl
[nid03574][[15215,1],0][../../../../../opal/mca/btl/uct/btl_uct_component.c:423:mca_btl_uct_component_init] no uct memory domains specified
[nid03574:01645] select: init of component uct returned failure
[nid03574:01645] mca: base: close: component uct closed
[nid03574:01645] mca: base: close: unloading component uct

Tested with Open MPI commit dc6eb5d1a2, OpenUCX commit f4cd8ee6.

@hjelmn
Member

hjelmn commented Jan 15, 2019 via email

@hjelmn
Member

hjelmn commented Jan 15, 2019

Also, to enable btl/uct you need to add `--mca btl_uct_memory_domains ugni` (the verbose output above shows it failing precisely because no memory domains were specified).
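Putting the flags from this thread together, a launch line for the btl/uct test might look like the following; the benchmark path and process counts are placeholders, and forcing `pml ob1` is my assumption about how to keep pml/ucx out of the way, not something stated above.

```shell
# Illustrative launch line; paths, counts, and the pml choice are placeholders.
mpirun -n 2 --mca pml ob1 \
       --mca btl self,vader,uct \
       --mca btl_uct_memory_domains ugni \
       ./mpi/one-sided/osu_get_latency
```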

@hppritcha
Member

Closing as won't fix.
