Problems with cuda aware MPI and Omnipath networks #4899
Hi @edgargabriel, would you please share the message size at which you see this abort?
@edgargabriel https://github.com/intel/opa-psm2/blob/PSM2_10.2-235/ptl_am/ptl.c#L152
It is definitely homogeneous; it is the version taken from the OpenHPC roll. At a bare minimum it is identical on all GPU nodes, but I suspect it is in fact the same on all compute nodes. Would you recommend that I recompile libpsm2? And if yes, is it possible to have multiple versions of libpsm2 on the system?
I would like to start by confirming that your PSM2 has CUDA support; you can check this by adding the appropriate identification option to your mpirun command. Yes, you can have multiple versions of libpsm2 on the system. Just make sure to set LD_LIBRARY_PATH accordingly:
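For example, a minimal sketch (the install path here is hypothetical) of pointing a job at a privately built, CUDA-enabled libpsm2:

```sh
# Hypothetical install prefix for a CUDA-enabled libpsm2 build
export LD_LIBRARY_PATH=$HOME/opt/psm2-cuda/usr/lib64:$LD_LIBRARY_PATH
# mpirun will now resolve libpsm2 from that directory first
mpirun -np 2 ./my_app
```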
@matcabral I will try to get the information; my job is currently queued. I will also try to compile a new version of the psm2 library in parallel. Thanks for your help!
I think you are probably right, our PSM2 library does not have CUDA support built in. It is not entirely clear to me how any of the tests worked in that case. Anyway, I will try to compile a new version of psm2 with CUDA support and will let you know.
OMPI has native CUDA support, so it should work even with other transports (e.g. sockets, although I have not tested that). However, the PSM2 CUDA support in OMPI expects a libpsm2 built with CUDA support; you can get unexpected results if you mix them. Maybe some non-CUDA buffers are being sent? In any case, when you effectively use PSM2 CUDA in OMPI (an OMPI CUDA build) together with a CUDA build of libpsm2, you will get a significant performance boost.
This might be off topic for this item (and I would be happy to discuss it offline), but I have problems compiling psm2 with CUDA support. Without CUDA support the library compiles without any issues; the moment I set PSM_CUDA=1, however, I get error messages related to undefined symbols and structures, e.g.
I searched Google for solutions but could not find anything. I also could not find those symbols in the Linux kernel sources (e.g. kernel-source/include/uapi/rdma/hfi/ or similar). Any ideas/hints on what I am missing?
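For reference, a sketch of the build being discussed (assuming the opa-psm2 sources from GitHub; it fails exactly as described above when the installed hfi1 driver headers lack CUDA support):

```sh
git clone https://github.com/intel/opa-psm2.git
cd opa-psm2
# Build libpsm2 with CUDA support enabled
make PSM_CUDA=1
sudo make PSM_CUDA=1 install
```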
Quick answer: to achieve the zero-copy transfers, libpsm2 uses a special version of the hfi1 driver (the OPA HFI driver). The driver headers you have available most likely don't have CUDA support. As you noticed, you will need the hfi1 driver with CUDA support loaded on the system.
@edgargabriel are you using the Intel® Omni-Path Fabric Software package? This is in fact the simplest way to get this setup. I suspect your nodes already satisfy the NVIDIA software requirements (section 4.4). Then proceed to section 5.1.1, "./INSTALL -G" (Install GPUDirect* components). This will install the libpsm2 and hfi1 drivers with CUDA support, and in addition an OMPI built with CUDA support at /usr/mpi/gcc/openmpi-2.1.2-cuda-hfi/. However, if you still want to build the components yourself, the source RPMs for all of them are also included.
Hi @edgargabriel, any news?
@matcabral: our system administrators performed the update of the OPA stack on the cluster to include the CUDA-aware packages. It took a while since it is a production system, but it is finally done. I ran a couple of tests on Monday, but I still face some problems, although the error messages are now different. I will try to gather the precise cases and error messages.
@matcabral: before I post the error messages, I would like to clarify one point. The new software stack that is installed on the system does now have CUDA support compiled into it. I can verify that in two ways: a) I can successfully compile my psm2 library using PSM_CUDA=1 (which I could not before), and b) if I run
which it did not report before. However, if I use the first method that you suggested, I still get an error message:
Is that OK, or might this point to a problem?
Note that the hfi1 driver binary that is loaded must also be the CUDA one.
OK, this looks better, thanks :-)
First, the scenario I am working with right now is one node, two GPUs, and two MPI processes, with each MPI process using one GPU. I have three test cases (and once I figure out how to upload the code to GitHub I am happy to provide them). I am not excluding the possibility that something is wrong in my test cases.
Note that the length is the number of elements of type MPI_DOUBLE, not the number of bytes.
Both cases should work. You may confirm with the OSU MPI benchmarks, which have CUDA support: http://mvapich.cse.ohio-state.edu/benchmarks/ . NOTE that OMPI does NOT yet support CUDA on non-blocking collectives: https://www.open-mpi.org/faq/?category=runcuda#mpi-apis-no-cuda
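A sketch of how the OSU benchmarks are typically built with CUDA support and run with device (D) buffers; the CUDA path and the benchmark location are assumptions and may differ between versions:

```sh
# Build the OSU micro-benchmarks with CUDA support
./configure CC=mpicc --enable-cuda --with-cuda=/usr/local/cuda
make
# Point-to-point bandwidth with both send and receive buffers in GPU memory
mpirun -np 2 ./mpi/pt2pt/osu_bw D D
```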
Well, the situation is pretty much the same. If I run an OSU benchmark directly using psm2, I get the same error message; if I tell mpirun to switch to ob1, everything works, even from device memory.
I see that this is using the OFI MTL, which does not have CUDA support. You should use the PSM2 MTL (I'm surprised this is not selected by default...). I assume your OMPI does have CUDA support, right?
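For example, one way to check the CUDA build and to force the PSM2 MTL on the command line (a sketch; the benchmark binary is just a placeholder):

```sh
# Verify that Open MPI itself was built with CUDA support
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
# Force the cm PML with the PSM2 MTL instead of the OFI MTL
mpirun --mca pml cm --mca mtl psm2 -np 2 ./osu_latency D D
```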
@matcabral yes, it is compiled with CUDA support, and forcing the psm2 MTL made the OSU benchmark work! That is good news, thanks! Some of my own test cases are now also working, but a few still fail with a new error message:
I will try to follow up on that tomorrow. Thanks for your help! I will keep you posted.
Output of ompi_info:
Good news: I have a slightly modified version of my test cases working as well. I will try to find some time in the next couple of days to distill why precisely my original version didn't work (in my opinion it should), but for now I am glad we have it working. I will also still have to test multi-node cases, but not tonight. @matcabral thank you for your help!
This seems to be a GPU affinity issue. libpsm2 is initialized during MPI_Init() and sets GPU affinity to device 0 by default. Trying to change it after MPI_Init() will give the above error. The solution is to call cudaSetDevice() before MPI_Init().
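A minimal sketch of that ordering (not code from this thread; OMPI_COMM_WORLD_LOCAL_RANK is the variable Open MPI's launcher exports, other launchers use different names):

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* Pick the GPU *before* MPI_Init() so libpsm2 inherits the right affinity. */
    const char *lrank = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    int local_rank = lrank ? atoi(lrank) : 0;

    int ndev = 0;
    if (cudaGetDeviceCount(&ndev) == cudaSuccess && ndev > 0)
        cudaSetDevice(local_rank % ndev);

    MPI_Init(&argc, &argv);
    /* ... application code using CUDA-aware MPI ... */
    MPI_Finalize();
    return 0;
}
```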
@matcabral thank you. You were right, I called cudaSetDevice after MPI_Init (although the original user code did it before MPI_Init), and I can confirm that this resolved the last issue. To resolve the psm2-vs-ofi selection problem, I increased the psm2 priority in the mca-params.conf file; this seems to do the trick for now. I think the problem stems from this code snippet in ompi_mtl_psm2_component_register:
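For what it's worth, the workaround described above amounts to something like the following entry in the MCA parameter file; the parameter name and value shown here are assumptions, not taken from the thread:

```
# ~/.openmpi/mca-params.conf (sketch)
mtl_psm2_priority = 50
```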
I am still waiting to hear back from the user whether his application also ran successfully. I will close the ticket, however; we can always reopen it if there are other issues. Thanks!
Hi @edgargabriel, right, you are probably running all ranks locally. This piece of code was intended to favor the vader BTL over the libpsm2 shm device: doing the memcpy higher in the stack is more efficient ;). I will look into addressing this, thanks!
@matcabral I am afraid I have to reopen this issue. Our user is running into some new error messages. Just to recap, this is using Open MPI 3.1.0 and psm2-10.3-8 with CUDA 9.1.85. Basically, the job aborts after some time. He was able to boil it down to a test case with 2 nodes and 4 GPUs (two per node), and the error message is as follows:
Any ideas on how to debug it? I tried to install the newest psm2 library version to see whether the problem is solved by doing that, but unfortunately that version does not compile on our cluster because of some errors stemming from the new gdrcopy feature.
Hi @edgargabriel, from the error log you provided, it looks like the error itself came from a failing CUDA API call. This could be a CUDA runtime issue (Googling around for the error code seems to indicate as much). Could you please confirm that there is no discrepancy between the CUDA runtime version and the CUDA driver APIs? If there is a mismatch, it is likely you will see CUDA calls fail (nvidia-smi should give you info about driver versions etc.).

Beyond that, PSM2 did have some CUDA-related fixes in newer versions of the library, so using a newer version of libpsm2 and the hfi1 driver might resolve the problems. (The PSM2 version you are using is more than 6 months old and was originally tested with CUDA 8.)

Regarding the compile issues with the new libpsm2 due to the gdrcopy feature: you will need the latest hfi1 driver component as well, so the easiest way to get all the relevant updates would be through IFS-10.7. Link to the latest install guide: IFS install guide. The following command should upgrade the currently installed PSM2 CUDA components:
@aravindksg thank you for your feedback. By a discrepancy between the CUDA runtime version and the CUDA drivers, are you referring to a version mismatch between the libcudart.so file used and the client-side CUDA libraries? If yes, then this is clearly not the case. I double-checked our LD_LIBRARY_PATH, and there is no other directory that could accidentally be loaded from. In addition, we have a gazillion non-MPI jobs that run correctly in this setting; if it were a version mismatch, I would expect that some of them would also fail. Regarding the update of the software, I will trigger that with our administrators. Since this is a production cluster, that can however take a while (the hfi component cannot be updated by a regular user as far as I can see; libpsm2 would be possible, however).
Hi @edgargabriel, https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html lists cudaErrorUnknown = 30. Although this does not say much, it suggests something going wrong in the CUDA stack. So, it seems like the next logical step would be to scope and try to reproduce the issue:
An update on this item:
And I am not sure whether this is helpful, but here are all the RPMs that I found on the node that have either cuda or nvidia in the name:
Anyway, any suggestions on what precisely to look for would be appreciated; I am out of ideas at this point.
Hi @edgargabriel, all I can think of at this point is to find different workloads that use cudaIpcOpenMemHandle(), or even one not using MPI, to see how it behaves. In addition, if your workload is publicly available, we could try to reproduce it on our side.
@matcabral I ran the simpleIPC test from the cuda9.1.85/samples directory. This test uses cudaIpcOpenMemHandle(), but as far as I understand it is only designed to run on one node. It used both GPUs and, as far as I can tell, finished correctly. I am still looking for an example that we could run across multiple nodes. Regarding the code, the application is called bluebottle, and you can download it from GitHub. There are very good instructions for compiling the application (it requires HDF5 and CGNS in addition to MPI). I can also send you the smallest test case that the user could produce which reliably failed on our cluster; it requires 2 nodes with 4 GPUs in total. I would however prefer to send you the dataset by email or provide you the link separately, if that is OK.
So the application links against the libraries shown above (libcudart.so.9.1), but there is another NVIDIA library at play here; for example, from 9.1 on a SLES 12.3 machine (also running 9.1.85, FYI): /usr/lib64/libcuda.so.1 -> libcuda.so.387.26. Mismatches between the runtime and the drivers are typically what cause the unknown error. I see that the RPM list you have installed includes a lot of stock in-distro NVIDIA drivers. Could there perhaps be some conflict between the real NVIDIA drivers + software and the in-distro versions?
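A quick way to see which runtime and which driver-side library a binary actually resolves to (a sketch using standard tools; the binary name is the user's application mentioned above):

```sh
# CUDA runtime the application links against
ldd ./bluebottle | grep -i cuda
# Driver-side libcuda the loader picks up (the suffix is the driver version)
ls -l /usr/lib64/libcuda.so.1
# Loaded kernel driver version
nvidia-smi
```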
@rwmcguir: I went to the NVIDIA webpage; when you try to download the RHEL RPMs for CUDA 9.1, there is a big message saying: "Before installing the CUDA Toolkit on Linux, please ensure that you have the latest NVIDIA driver R390 installed. The latest NVIDIA R390 driver is available at: www.nvidia.com/drivers", see:
@rwmcguir I also went to the CUDA driver download pages and configured a download for the drivers and CUDA version that we use, and the recommended driver from NVIDIA was 390.46.
@edgargabriel, I noticed this is still open. I understand libpsm2 updated a substantial number of things in the 11.x versions. Do you still see this issue?
@matcabral I think we can close it for now. I am not sure whether we still see the issue or not; the last CUDA and PSM update that we did was in November. We did, however, find a configuration on some (new) hardware that worked for this user (a single node with 8 GPUs), and because of that we haven't looked into this lately.
Is this a duplicate of this issue? It looks like it never got to a conclusion, but it might be the same effect.
See open-mpi/ompi#4899 for details. The fix is in openmpi 4.0.1 but not 3.1.2 yet.
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
v3.1.0rc2 and master
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
git clone
Please describe the system on which you are running
Details of the problem
We have a user code that makes use of CUDA-aware MPI features for direct data transfer across multiple GPUs. The code has run successfully on fairly large InfiniBand clusters. We face, however, a problem when executing it on our Omnipath cluster.
@bosilca pointed out to me the following commit
2e83cf1
which is the reason we turned to the 3.1 release candidate, since this commit is part of this version.
The good news is that, using ompi 3.1.0rc2, the code runs correctly in a single-node/multi-GPU environment. Running the code on multiple nodes and with multiple GPUs still fails, however. A simple benchmark was able to identify that direct transfer from GPU memory across multiple nodes works correctly up to a certain message length, but fails if the message length exceeds a threshold. The error message comes directly from the psm2 library and is attached below.
My question now is whether there is a minimum psm2 library version required to make this feature work correctly. Our cluster currently uses libpsm2-10.2.235, and there are obviously newer versions out there (the newest being 10.3.37, I think).
As a side note, we did manage to make the code work by using the verbs API and disabling cuda_async_recv, e.g.
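Roughly along these lines; the BTL list and the parameter name are assumptions based on the openib BTL's CUDA options, not the exact command from the original report:

```sh
mpirun --mca pml ob1 --mca btl openib,vader,self \
       --mca btl_openib_cuda_async_recv 0 \
       -np 4 ./app
```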
but this slows down the communication performance quite a bit compared to using the psm2 library directly.