MPI_Info_dup: allocate info through ompi_info_allocate instead of OBJ_NEW #10349
…_NEW The call to ompi_info_allocate ensures that the ompi instance is properly retained (similar to MPI_Info_create). The instance is then released in MPI_Info_free. Thanks to Lisandro Dalcin for reporting and providing an easy reproducer. Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
@dalcinl FYI
I don't think this patch is complete. Basically, we should not mix info_allocate/info_free with OBJ_NEW/OBJ_RELEASE. But we do so all over the code, so there are many other places where things will fail (check all the uses of OBJ_NEW(ompi_info_t) in the IO part of the code).
After looking at the code, it seems that we should move everything from info_allocate/info_free into the constructor/destructor.
@bosilca My understanding is that the internally allocated info objects you mentioned are attached to another object (a file, window, or comm) that has already retained the ompi instance. Info objects created through MPI_Info_dup and MPI_Info_create are free-standing, so they have to retain the instance themselves.
In fact, if we retain the instance in every info object attached to predefined comms/windows/files, we will run into cyclic dependencies similar to those with predefined attributes: #10350
Let's take mca/io/romio341/src/io_romio341_file_open.c as an example. We allocate with OBJ_NEW, we populate from an opal_object, we then make a copy through MPI_File_open. At this point the local ompi_info object was not yet added to the instance. And then we release with ompi_info_free that will call |
OK, that clearly is a bug then :/
I don't agree. My reason is that we are able to correctly handle the f2c translation via OBJ_NEW/OBJ_RELEASE, so we should be able to handle the instance refcount.
Here is the cycle I'm talking about:
The romio instances of |
@devreal I think you are right: this file is a relic of a time long gone, and we forgot to remove it. I do not see any path that invokes this function. We should probably remove it; however, I no longer have access to a GPFS file system to ensure that removing it does not break anything.
@devreal I don't think there is a problematic cycle: the mutex protecting the instance is recursive, so it will let us go through the fast path on the second call.
@bosilca The mutex is not the problem. The problem is that the destruction of the ompi instance would depend on the destruction of the info object because the info object has retained the instance and would have to call |
No need to allocate it on the heap using OBJ_NEW. Also fixes a mismatch between OBJ_NEW and ompi_info_free that potentially leads to inconsistencies in ref-counting the ompi instance. Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
@bosilca and I had a quick chat offline about this, and the outcome was that neither of us knew why info objects have to retain the ompi instance in the first place. My guess is that it has to do with the environment info, but maybe that can be handled differently? @hjelmn Can you shed some light here?
To add to @devreal's comment, my main issue with the info objects is that while the info object is derived from opal_object_t, it should never be used as such, because using it with OBJ_NEW/OBJ_RELEASE will lead to inconsistencies with the instance object. We are basically creating two classes of objects derived from the same opal_object_t.
@devreal Any chance you could retarget this PR to branch v5.0.x? I really hope this fix gets in for the upcoming 5.0.0 release.
@hppritcha Can you have a look at this, since a question has come up on this PR that involves sessions?
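To make the mismatch discussed above concrete, here is a minimal toy model in plain C. It is only a sketch and not Open MPI code: every `toy_*` name is hypothetical, and it assumes (as the commit messages in this PR describe) that the allocate path retains the instance while the free path releases it, whereas plain OBJ_NEW/OBJ_RELEASE touch neither.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the ompi instance: just a reference count. */
typedef struct { int refcount; } toy_instance_t;

/* Hypothetical stand-in for an info object that knows its instance. */
typedef struct { toy_instance_t *instance; } toy_info_t;

/* Models OBJ_NEW: plain construction, the instance is NOT retained. */
static toy_info_t *toy_obj_new_info(toy_instance_t *inst)
{
    toy_info_t *info = malloc(sizeof(*info));
    if (info != NULL) {
        info->instance = inst;
    }
    return info;
}

/* Models OBJ_RELEASE: plain destruction, the instance is NOT released. */
static void toy_obj_release_info(toy_info_t *info)
{
    free(info);
}

/* Models the allocate path: construct, then retain the instance. */
static toy_info_t *toy_info_allocate(toy_instance_t *inst)
{
    toy_info_t *info = toy_obj_new_info(inst);
    if (info != NULL) {
        inst->refcount++;
    }
    return info;
}

/* Models the free path: release the instance, then destroy. */
static void toy_info_free(toy_info_t *info)
{
    info->instance->refcount--;
    toy_obj_release_info(info);
}

/* Matched pair (allocate + free): returns the net refcount change. */
static int toy_matched_delta(void)
{
    toy_instance_t inst = { 1 };
    toy_info_t *info = toy_info_allocate(&inst);
    if (info == NULL) {
        return 0;
    }
    toy_info_free(info);
    return inst.refcount - 1;
}

/* Mismatched pair (plain new + free): returns the net refcount change. */
static int toy_mismatched_delta(void)
{
    toy_instance_t inst = { 1 };
    toy_info_t *info = toy_obj_new_info(&inst);
    if (info == NULL) {
        return 0;
    }
    toy_info_free(info);
    return inst.refcount - 1;
}
```

In this toy model the matched pair leaves the instance count unchanged, while the mismatched pair releases a reference that was never taken. Whether moving the retain/release into the constructor/destructor is the right fix for the real code is the open question in this thread.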
The call to ompi_info_allocate ensures that the ompi instance is properly retained (similar to MPI_Info_create). The instance is then released in MPI_Info_free.
Thanks to Lisandro Dalcin for reporting and providing a simple reproducer (see #10344 (comment))
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>