Skip to content

MPI_Info_dup: allocate info through ompi_info_allocate instead of OBJ_NEW #10349

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 23, 2022

Conversation

devreal
Copy link
Contributor

@devreal devreal commented May 3, 2022

The call to ompi_info_allocate ensures that the ompi instance is properly retained (similar to MPI_Info_create). The instance is then released in MPI_Info_free.

Thanks to Lisandro Dalcin for reporting and providing a simple reproducer (see #10344 (comment))

Signed-off-by: Joseph Schuchart schuchart@icl.utk.edu

…_NEW

The call to ompi_info_allocate ensures that the ompi instance is properly
retained (similar to MPI_Info_create). The instance is then released in
MPI_Info_free.

Thanks to Lisandro Dalcin for reporting and providing an easy reproducer.

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
@devreal devreal requested review from jsquyres and hjelmn May 3, 2022 14:23
@jsquyres
Copy link
Member

jsquyres commented May 3, 2022

@dalcinl FYI

Copy link
Member

@bosilca bosilca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this patch is complete. Basically, we should not mix info_allocate/info_free with OBJ_NEW/OBJ_RELEASE. But we do all over the code, so there are many other places where things will fail (check all the uses of OBJ_NEW(ompi_info_t) in the IO part of the code).
After looking at the code, it seems that we should move everything from info_allocate/info_free into the constructor/destructor.

@devreal
Copy link
Contributor Author

devreal commented May 3, 2022

@bosilca My understanding is that the internally allocated info objects you mentioned are attached to another object (files, windows, comms) that have retained the ompi instance already. Info objects created through MPI_Info_dup and MPI_Info_create are free-standing so they have to retain the instance themselves.

@devreal
Copy link
Contributor Author

devreal commented May 3, 2022

In fact, if we retain the instance in every info object attached to predefined comms/windows/files we will run into similar cyclic dependencies as with predefined attributes: #10350

@bosilca
Copy link
Member

bosilca commented May 3, 2022

Let's take mca/io/romio341/src/io_romio341_file_open.c as an example. We allocate with OBJ_NEW, we populate from an opal_object, we then make a copy through MPI_File_open. At this point the local ompi_info object was not yet added to the instance. And then we release with ompi_info_free that will call ompi_mpi_instance_release. Same pattern can be observed around all the other usage of the info in the IO.

@devreal
Copy link
Contributor Author

devreal commented May 3, 2022

OK, that clearly is a bug then :/ OBJ_NEW should be matched with OBJ_RELEASE but I think we need both variants.

@bosilca
Copy link
Member

bosilca commented May 3, 2022

I don't agree. My reason is that we are able to correctly handle the f2c translation via OBJ_NEW/OBJ_RELEASE, so we should be able to handle the instance refcount.

@devreal
Copy link
Contributor Author

devreal commented May 3, 2022

Here is the cycle I'm talking about:

ompi_mpiinfo_init is called from ompi_mpi_instance_retain and in turn calls OBJ_CONSTRUCT(..., ompi_info_t). If we called ompi_mpi_instance_retain from within the ompi_info_t constructor we'd have a cycle and the instance will not be destroyed. Unless we want to restructure more code I don't see how to fix this.

The romio instances of OBJ_NEW(ompi_info_t) is easy to solve and I will push a fix in a minute into this PR. There is one more instance of OBJ_NEW(ompi_info_t) in fs_gpfs_file_get_info.c but I don't see that used anywhere, so it's hard to say what the intention is. Maybe @edgargabriel can say how the info object allocated there would be released?

@edgargabriel
Copy link
Member

@devreal I think you are right, this file is a relict of a time long gone, and we forgot to remove it. I do not see any path to invoke this function. We should probably remove it, I do not have however access to a gpfs file system anymore to ensure that we do not break something by removing it.

@bosilca
Copy link
Member

bosilca commented May 3, 2022

@devreal I don't think there is a problematic cycle because the mutex protecting the instance being a recursive mutex it will allow us to go through the fast path on the second call.

@devreal
Copy link
Contributor Author

devreal commented May 3, 2022

@bosilca The mutex is not the problem. The problem is that the destruction of the ompi instance would depend on the destruction of the info object because the info object has retained the instance and would have to call ompi_info_instance_release first. The destruction of the info object depends on the destruction of the instance (which is bound to happen in ompi_mpi_instance_release when the instance ref count reaches zero and it is destroyed). That will never happen because of the circular dependency of the destructors.

No need to allocate it on the heap using OBJ_NEW. Also fixes
a mismatch between OBJ_NEW and ompi_info_free that potentially
leads to inconsistencies in ref-counting the ompi instance.

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
@jsquyres
Copy link
Member

jsquyres commented May 4, 2022

@devreal's argument sound right to me, but @bosilca and @devreal have looked much more deeply into the relevant code than me; I defer to you guys...

@devreal
Copy link
Contributor Author

devreal commented May 10, 2022

@bosilca and I had a quick chat offline about this and the outcome was that we both didn't know why info objects have to retain the ompi instance in the first place. My guess is that it has to do with the environment info but maybe that can be handled differently? @hjelmn Can you shed some light here?

@bosilca
Copy link
Member

bosilca commented May 10, 2022

To add to @devreal comment, my main issue with the info objects is that while the info object is derived from opal_object_t it should never be used as such, because using them with OBJ_NEW/OBJ_RELEASE will lead to inconsistencies with the instance object. We are basically creating two classes of objects derived from the same opal_object_t.

@dalcinl
Copy link
Contributor

dalcinl commented May 16, 2022

@devreal Any chance you could retarget target this PR to branch v5.0.x? I really hope this fix gets in for the upcoming 5.0.0 release.

@jsquyres
Copy link
Member

jsquyres commented May 17, 2022

@hppritcha Can you have a look at this, since a question has come up on this PR that involves sessions?

@devreal
Copy link
Contributor Author

devreal commented May 23, 2022

@dalcinl the backport v5.0.x is at #10420 and should make it into the 5.0 release.

@devreal devreal deleted the info_dup_allocate branch October 3, 2022 15:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants