Issues running with Open UCX 1.4 on Cray XC40 #6084
Comments
@bosilca @hppritcha Who should this Cray UCX issue be assigned to?
Does the job work okay if you use aprun as the job launcher rather than mpirun? Also, could you add
I recommend not using UCX on a Cray. You will get better performance with the built-in uGNI support.
@hppritcha I was using the official 1.4 release downloaded from the website and configured using:
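Presumably this was a stock UCX 1.4 build into the prefix that appears in the backtraces further down, roughly:

# prefix taken from the backtraces below; any additional flags are unknown
$ ./configure --prefix=/zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4
$ make -j && make install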
I can post the full When running under
When running under
I think you may have been zapped by 4eeb415. Could you try testing with an older Open MPI release like 3.1.2?
@hppritcha I tried with Open MPI 3.1.2 and Open UCX 1.4 but the error is the same. I got a stack trace from a core dump though:
I assume
It still looks like for some reason UCX isn't getting the Aries network RDMA cookie info. Could you try again using aprun but add the following to your environment:
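Judging from the reply below, the suggested setting was most likely the Cray PMI debug switch, e.g.:

# assumed from the PMI diagnostics printed in the next comment
$ export PMI_DEBUG=1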
Another thing to try: could you make sure to get an allocation with 2 or more nodes? Some sites configure ALPS so that no Aries network resources are allocated for single-node jobs.
Running with 2 or more nodes:

$ export OMPI_MCA_btl=^openib
$ export PMI_DEBUG=1
$ mpirun -n 2 -N 1 ./mpi_progress
Tue Dec 4 15:21:53 2018: [unset]:_pmi_alps_get_apid:alps_app_lli_put_request failed
Tue Dec 4 15:21:53 2018: [unset]: _pmi_alps_init: _pmi_alps_get_apid returned with error: Bad file descriptor
Tue Dec 4 15:21:53 2018: [unset]:_pmi_init:_pmi_alps_init returned -1
Tue Dec 4 15:21:53 2018: [unset]: PMI Version: 5.0.14 git rev: 8e2393a9
Tue Dec 4 15:21:53 2018: [unset]: PMI rank = 0 pg id = 0 appnum = 0 pes_per_smp = 0 pes_this_smp = 0
Tue Dec 4 15:21:53 2018: [unset]: _pmi_initialized = 0 spawned = 0
[1543933313.343398] [nid05686:32560:0] ugni_device.c:137 UCX ERROR PMI_Init failed, Error status: -1
[1543933313.343455] [nid05686:32560:0] ugni_device.c:182 UCX ERROR Could not fetch PMI info.
[nid05686:32560:0:32560] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace ====
Tue Dec 4 15:21:53 2018: [unset]:_pmi_alps_get_apid:alps_app_lli_put_request failed
Tue Dec 4 15:21:53 2018: [unset]: _pmi_alps_init: _pmi_alps_get_apid returned with error: Bad file descriptor
Tue Dec 4 15:21:53 2018: [unset]:_pmi_init:_pmi_alps_init returned -1
Tue Dec 4 15:21:53 2018: [unset]: PMI Version: 5.0.14 git rev: 8e2393a9
Tue Dec 4 15:21:53 2018: [unset]: PMI rank = 0 pg id = 0 appnum = 0 pes_per_smp = 0 pes_this_smp = 0
Tue Dec 4 15:21:53 2018: [unset]: _pmi_initialized = 0 spawned = 0
[nid05685:32755:0:32755] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace ====
[1543933313.344548] [nid05685:32755:0] ugni_device.c:137 UCX ERROR PMI_Init failed, Error status: -1
[1543933313.344568] [nid05685:32755:0] ugni_device.c:182 UCX ERROR Could not fetch PMI info.
0 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucs.so.0(+0x211a0) [0x2aaabcdfa1a0]
1 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucs.so.0(+0x213f4) [0x2aaabcdfa3f4]
2 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(init_device_list+0x90) [0x2aaabc766ba0]
3 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(+0x18945) [0x2aaabc766945]
4 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(uct_md_open+0x69) [0x2aaabc75f369]
5 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucp.so.0(ucp_init_version+0x94b) [0x2aaabc211e3b]
6 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_open+0xc1) [0x2aaab7d57ec1]
7 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(mca_base_framework_components_open+0x175) [0x2aaaab8d1b05]
8 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(+0xd8aaf) [0x2aaaaada7aaf]
9 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(mca_base_framework_open+0x81) [0x2aaaab8dd3e1]
10 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(ompi_mpi_init+0x69b) [0x2aaaaad206bb]
11 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(PMPI_Init_thread+0x55) [0x2aaaaad50dc5]
12 ./mpi_progress() [0x401baf]
13 /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaab23d725]
14 ./mpi_progress() [0x400fd9]
===================
0 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucs.so.0(+0x211a0) [0x2aaabcdfa1a0]
1 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucs.so.0(+0x213f4) [0x2aaabcdfa3f4]
2 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(init_device_list+0x90) [0x2aaabc766ba0]
3 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(+0x18945) [0x2aaabc766945]
4 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(uct_md_open+0x69) [0x2aaabc75f369]
5 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucp.so.0(ucp_init_version+0x94b) [0x2aaabc211e3b]
6 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_open+0xc1) [0x2aaab7d57ec1]
7 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(mca_base_framework_components_open+0x175) [0x2aaaab8d1b05]
8 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(+0xd8aaf) [0x2aaaaada7aaf]
9 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(mca_base_framework_open+0x81) [0x2aaaab8dd3e1]
10 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(ompi_mpi_init+0x69b) [0x2aaaaad206bb]
11 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(PMPI_Init_thread+0x55) [0x2aaaaad50dc5]
12 ./mpi_progress() [0x401baf]
13 /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaab23d725]
14 ./mpi_progress() [0x400fd9]
===================
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 32560 on node 5686 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Running under aprun:

The last two lines are from the application right after
Please let me know if I can provide anything else.
@devreal Did you see this:
That would be why the aprun one failed with thread multiple. Out of curiosity, why are you trying to use UCX on a Cray?
Ahh, so that is why Open MPI falls back to
Originally, I was trying to use UCX when working on #5915 to see what happens when hugepages are passed to UCX. I understand that the performance might not be as good as with ugni. This is not of high priority, and I cannot rule out that it is caused by a weird configuration on the site (unfortunately, I don't have access to another Cray system to verify this). However, Open MPI should, in my opinion, not segfault if the PMI initialization fails, so I thought I would report the crashes in case someone else is trying to use Open MPI in combination with UCX on Aries.
@devreal I agree. You might also test with the btl/uct path if you want to test hugepage stuff. You can enable it with
(master only)
FWIW, it crashes on an XC-40 with Slurm as well. I think it is a UCX bug, but I haven't had the time to figure it out.
Yup. They unconditionally call into PMI (bad) instead of checking the environment first.
Will fix this and PR it to UCX 1.4.x.
That gets you past basic initialization. There is another UCX bug that is unrelated to this one.
Thanks for the fix @hjelmn! I finally had time to test again. I'm getting past
Backtrace:
Strangely, I see that the uct BTL fails initialization when running with
Tested with Open MPI commit
UCX is broken on Cray right now. The UCP init message is too big for UDT to handle. You can get past that with --mca pml ob1 --mca osc ^ucx. There is an open bug on the issue:
openucx/ucx#3084
Also, to enable btl/uct you need to add --mca btl_uct_memory_domains ugni |
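Putting the two suggestions together, a sketch of a run that sidesteps the broken UCX paths; the individual flags are the ones quoted above and the command line mirrors the reporter's earlier test, but the combination is a sketch, not a verified invocation:

# pml/osc flags and btl_uct_memory_domains as quoted above; combining them is a sketch
$ mpirun -n 2 -N 1 \
      --mca pml ob1 --mca osc ^ucx \
      --mca btl_uct_memory_domains ugni \
      ./mpi_progress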
closing as won't fix. |
Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
Open MPI git master (592e2cc)
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Configured with support for Open UCX 1.4 (downloaded from openucx.com) using the configure flags
--with-cray-pmi --enable-debug --with-ucx
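A reconstruction of the full configure invocation; the install prefixes are taken from the backtraces above, while the path given to --with-ucx is an assumption (the issue text shows only the bare flag):

# --with-ucx path is assumed; the prefixes come from the backtraces above
$ ./configure --prefix=/zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git \
      --with-cray-pmi --enable-debug \
      --with-ucx=/zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4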
Please describe the system on which you are running
Details of the problem
Trying to run a job on that machine leads to the following error and, eventually, a crash:
This issue could be related to #5973 (as it's the same machine).
Interestingly, I seem to be unable to disable the UCX PML using
--mca pml_ucx_priority 0
(assuming that is the right way to do it); it does not change the outcome. With Open MPI configured without UCX I am able to run Open MPI applications on that machine (using --oversubscribe). Another interesting observation is that when running Open MPI + UCX under DDT I get the following error:
As suggested, here is the output of ipcs -l on the node:

I am able to debug Open MPI applications if Open MPI was built without support for UCX.