Attributes: replace custom void* union with C union #10344

Merged
merged 2 commits into from
May 3, 2022

Conversation

devreal
Contributor

@devreal devreal commented May 1, 2022

The current implementation uses a void* to store different types of attribute value integers and attempts to figure out proper offsets for storing smaller integers in that pointer. The required pointer aliasing is UB and causes issues with GCC 11.

The new implementation replaces the self-built pointer-based union with a C union and selects the (pointer to the) right field based on the av_set_from value.
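
As a rough illustration of the approach described above, here is a minimal sketch of a tagged C union. All `demo_*` names and the fixed-width typedefs are hypothetical stand-ins, not the actual Open MPI definitions; the real `MPI_Fint` and `MPI_Aint` widths depend on the build.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for MPI_Fint and MPI_Aint (assumed widths). */
typedef int32_t  demo_fint_t;
typedef intptr_t demo_aint_t;

/* Tag recording which API stored the value. */
typedef enum {
    DEMO_ATTR_INVALID = -1,
    DEMO_ATTR_C,     /* stored as a C pointer */
    DEMO_ATTR_INT,   /* stored as a C int */
    DEMO_ATTR_FINT,  /* stored as a Fortran INTEGER */
    DEMO_ATTR_AINT   /* stored as an address-sized integer */
} demo_set_from_t;

typedef struct {
    demo_set_from_t set_from;  /* says which union member is live */
    union {
        void       *ptr;
        int         i;
        demo_fint_t fint;
        demo_aint_t aint;
    } value;
} demo_attr_t;

/* Store an int: write the matching member and record the tag. */
static void demo_attr_set_int(demo_attr_t *a, int v)
{
    a->value.i = v;
    a->set_from = DEMO_ATTR_INT;
}

/* Store a C pointer: write the matching member and record the tag. */
static void demo_attr_set_ptr(demo_attr_t *a, void *p)
{
    a->value.ptr = p;
    a->set_from = DEMO_ATTR_C;
}

/* Read the value as an address-sized integer, whatever its origin.
 * Only the member named by the tag is ever read, so no storage is
 * reinterpreted through a pointer of another type. */
static demo_aint_t demo_attr_as_aint(const demo_attr_t *a)
{
    switch (a->set_from) {
    case DEMO_ATTR_C:    return (demo_aint_t) a->value.ptr;
    case DEMO_ATTR_INT:  return (demo_aint_t) a->value.i;
    case DEMO_ATTR_FINT: return (demo_aint_t) a->value.fint;
    case DEMO_ATTR_AINT: return a->value.aint;
    default:             assert(0 && "invalid set_from"); return 0;
    }
}
```

The point of the design is that every conversion is a value conversion on the member that was written, so neither byte offsets nor endianness enter the picture.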

This patch also fixes a bug where copied attributes always had the set_from field set to C pointer, which worked but is technically not correct.

Supersedes #10343
Fixes #10339

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Contributor

@ggouaillardet ggouaillardet left a comment

Have you tested this on a big-endian CPU with an 8-byte Fortran INTEGER?

I suspect that would not work when "converting" between (4 bytes) int and (8 bytes) INTEGER.

This is the reason I suggested we keep the *_pos logic but initialize it at compile time based on pointer sizes and endianness.
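
To make the endianness concern concrete, the following sketch shows why a position-based scheme has to know which half of an 8-byte slot holds the low-order bits. The `demo_*` helpers are hypothetical and use `memcpy` rather than pointer aliasing; they are not the `*_pos` code from Open MPI, which computes its offsets differently.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the index (0 or 1) of the 4-byte half of an 8-byte slot that
 * holds the low-order 32 bits on the host: 0 on little-endian hosts,
 * 1 on big-endian hosts. */
static int demo_low_half_index(void)
{
    uint64_t slot = 1;
    uint32_t halves[2];
    memcpy(halves, &slot, sizeof slot);
    return halves[0] == 1 ? 0 : 1;
}

/* Read the low 32 bits of a 64-bit slot by byte position.
 * memcpy keeps this free of aliasing problems, but the offset still
 * has to be chosen according to the host byte order. */
static int32_t demo_read_low_by_position(int64_t slot)
{
    int32_t out;
    const unsigned char *bytes = (const unsigned char *) &slot;
    memcpy(&out, bytes + 4 * demo_low_half_index(), sizeof out);
    return out;
}
```

Reading the other half of the slot would return the high-order bits instead, which is the failure mode a fixed offset would hit on the "wrong" byte order.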

@ggouaillardet
Contributor

FWIW, a quick (and likely thread-safe) workaround is to declare

    void * volatile bogus = (void*) 1;

(and keep bogus and p as local variables)

@devreal
Contributor Author

devreal commented May 2, 2022

> I suspect that would not work when "converting" between (4 bytes) int and (8 bytes) INTEGER.

I don't see where there would be an issue in this implementation. We would read the 4B int (av_int) and cast it to MPI_Fint (translate_to_fint) before passing it to the duplicate/delete callback (COPY_ATTR_CALLBACKS/DELETE_ATTR_CALLBACKS) or return it to the caller (ompi_attr_get_fint). Where would that fail? AFAICS, that is what the current implementation does, except that it crams everything into a void* and tries to be smart about the storage space. Obviously, GCC has become smarter, which is why we're spending hours on this. Really, that's what a C union is for.
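
A small sketch of the value-conversion argument made above, assuming a build in which the Fortran INTEGER is 8 bytes wide. The `demo_fint8_t` type and both helpers are hypothetical and only illustrate that widening by value does not depend on byte order, while narrowing needs an explicit range check.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical 8-byte Fortran INTEGER (an assumption for this sketch). */
typedef int64_t demo_fint8_t;

/* Widening a C int to the Fortran type is a plain value conversion:
 * the result is the same on little- and big-endian hosts. */
static demo_fint8_t demo_int_to_fint(int v)
{
    return (demo_fint8_t) v;
}

/* Narrowing can lose information, so report whether the value fits.
 * Returns 1 and stores the result on success, 0 otherwise. */
static int demo_fint_to_int(demo_fint8_t f, int *out)
{
    if (f < INT_MIN || f > INT_MAX) {
        return 0;
    }
    *out = (int) f;
    return 1;
}
```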

> void * volatile bogus = (void*) 1;

Sure, that's a bandaid that will work until it doesn't because some compiler will figure out that aliasing the storage of a void* as two ints is still UB even though you slap volatile on it. Plus, I don't see the benefit of the current implementation over my proposal.

Looking at the original code, I think the pointer arithmetic was broken too:

> item->av_int_pointer = (int *)&item->av_value + int_pos;

Taking the address of item->av_value (void**) and adding int_pos would point av_int_pointer past av_value in the structure in any case where int_pos != 0. Incidentally, with int_pos == 1 it would point av_int_pointer to itself. I have no idea how that could have ever worked 🤷‍♂️

Never mind that last comment, got the cast precedence wrong...
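
For readers tripped up by the same expression: a cast binds more tightly than binary `+`, so `(int *)&item->av_value + int_pos` steps in units of `int`, not `void *`. The sketch below compares the two readings by byte offset; `demo_item_t` is a hypothetical two-field struct, not the real attribute item type, and the pointers are only computed, never dereferenced.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical struct with the two fields discussed above. */
typedef struct {
    void *av_value;
    int  *av_int_pointer;
} demo_item_t;

/* Byte offset reached by the expression as written: the cast applies
 * first, so the addition advances by int_pos * sizeof(int). */
static ptrdiff_t demo_offset_as_written(demo_item_t *item, int int_pos)
{
    int *p = (int *) &item->av_value + int_pos;
    return (char *) p - (char *) &item->av_value;
}

/* Byte offset under the misreading, where the addition happens on the
 * void** before the cast: it advances by int_pos * sizeof(void *). */
static ptrdiff_t demo_offset_misread(demo_item_t *item, int int_pos)
{
    void **p = &item->av_value + int_pos;
    return (char *) p - (char *) &item->av_value;
}
```

With `int_pos == 1`, the expression as written stays inside `av_value` whenever `sizeof(void *)` is larger than `sizeof(int)`, whereas the misreading lands just past it.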

@hjelmn
Member

hjelmn commented May 2, 2022

I agree with @devreal . The existing code is not valid C and the compiler is now showing that. The correct fix is to use a union. Putting a bandaid on bad code is just going to lead to issues again in the future.

@@ -393,25 +391,30 @@ do { \
* Cases for attribute values
*/
typedef enum ompi_attribute_translate_t {
OMPI_ATTRIBUTE_INVALID,
Member

Doesn't this break internal ABI? Maybe do OMPI_ATTRIBUTE_INVALID = -1 to avoid shifting the existing values by one.

Member

Or move it to the end as OMPI_ATTRIBUTE_MAX_VALID.

Contributor Author

There is no internal ABI, it's all in the same file. Changed it to -1 nevertheless.
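
The numbering concern raised in this review thread can be checked with a small sketch. The `DEMO_*` enums are hypothetical copies that assume the original enum started at `OMPI_ATTRIBUTE_C` with the default value 0; they are not the actual Open MPI definitions.

```c
#include <assert.h>

/* Assumed original layout: four cases, numbered 0..3 by default. */
typedef enum {
    DEMO_OLD_C,
    DEMO_OLD_INT,
    DEMO_OLD_FINT,
    DEMO_OLD_AINT
} demo_old_t;

/* Prepending a new enumerator shifts every existing value by one. */
typedef enum {
    DEMO_SHIFTED_INVALID,
    DEMO_SHIFTED_C,
    DEMO_SHIFTED_INT,
    DEMO_SHIFTED_FINT,
    DEMO_SHIFTED_AINT
} demo_shifted_t;

/* Giving the new enumerator the value -1 keeps the others at 0..3,
 * because each enumerator without an initializer is the previous
 * one plus 1. */
typedef enum {
    DEMO_NEW_INVALID = -1,
    DEMO_NEW_C,
    DEMO_NEW_INT,
    DEMO_NEW_FINT,
    DEMO_NEW_AINT,
} demo_new_t;

/* Returns 1 if the -1 variant preserves all original values. */
static int demo_values_preserved(void)
{
    return (int) DEMO_NEW_C    == (int) DEMO_OLD_C
        && (int) DEMO_NEW_INT  == (int) DEMO_OLD_INT
        && (int) DEMO_NEW_FINT == (int) DEMO_OLD_FINT
        && (int) DEMO_NEW_AINT == (int) DEMO_OLD_AINT;
}
```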

@@ -393,25 +391,30 @@ do { \
* Cases for attribute values
*/
typedef enum ompi_attribute_translate_t {
OMPI_ATTRIBUTE_INVALID,
OMPI_ATTRIBUTE_C,
OMPI_ATTRIBUTE_INT,
OMPI_ATTRIBUTE_FINT,
OMPI_ATTRIBUTE_AINT
Member

Minor nit. While you are editing this file can you add a trailing comma?

@dalcinl
Contributor

dalcinl commented May 2, 2022

@devreal I reran mpi4py's testsuite against your clone and branch, but it is still failing. I thought that the issue was related to attributes, as the original report came from using Win attributes, but it looks like something else broke during the last week or so.

full log: https://github.com/mpi4py/mpi4py-testing/runs/6256108177?check_suite_focus=true

...
----------------------------------------------------------------------
Ran 1624 tests in 38.083s

OK (skipped=284)
python: win/win.c:463: ompi_win_destruct: Assertion `OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (win->error_handler))->obj_magic_id' failed.
[fv-az173-499:164497] *** Process received signal ***
[fv-az173-499:164497] Signal: Aborted (6)
[fv-az173-499:164497] Signal code:  (-6)
[fv-az173-499:164497] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x430c0)[0x7f3e5bd890c0]
[fv-az173-499:164497] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f3e5bd8903b]
[fv-az173-499:164497] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f3e5bd68859]
[fv-az173-499:164497] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x22729)[0x7f3e5bd68729]
[fv-az173-499:164497] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x34006)[0x7f3e5bd7a006]
[fv-az173-499:164497] [ 5] /usr/local/lib/libmpi.so.0(+0xc0363)[0x7f3e5a389363]
[fv-az173-499:164497] [ 6] /usr/local/lib/libmpi.so.0(+0xbdfc6)[0x7f3e5a386fc6]
[fv-az173-499:164497] [ 7] /usr/local/lib/libmpi.so.0(+0xbe41e)[0x7f3e5a38741e]
[fv-az173-499:164497] [ 8] /usr/local/lib/libopen-pal.so.0(opal_finalize_cleanup_domain+0x42)[0x7f3e5a1bad8b]
[fv-az173-499:164497] [ 9] /usr/local/lib/libopen-pal.so.0(opal_finalize+0x5a)[0x7f3e5a1bb00e]
[fv-az173-499:164497] [10] /usr/local/lib/libmpi.so.0(ompi_rte_finalize+0x33f)[0x7f3e5a38666a]
[fv-az173-499:164497] [11] /usr/local/lib/libmpi.so.0(+0xc9c64)[0x7f3e5a392c64]
[fv-az173-499:164497] [12] /usr/local/lib/libmpi.so.0(ompi_mpi_instance_finalize+0x135)[0x7f3e5a392fdd]
[fv-az173-499:164497] [13] /usr/local/lib/libmpi.so.0(ompi_mpi_finalize+0x5f9)[0x7f3e5a37f41d]
[fv-az173-499:164497] [14] /usr/local/lib/libmpi.so.0(PMPI_Finalize+0x71)[0x7f3e5a3d2393]
[fv-az173-499:164497] [15] /opt/hostedtoolcache/Python/3.10.4/x64/lib/python3.10/site-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so(+0x3db36)[0x7f3e5a7bab36]
[fv-az173-499:164497] [16] /opt/hostedtoolcache/Python/3.10.4/x64/lib/libpython3.10.so.1.0(+0x104420)[0x7f3e5c04b420]
[fv-az173-499:164497] [17] /opt/hostedtoolcache/Python/3.10.4/x64/lib/libpython3.10.so.1.0(Py_Exit+0xc)[0x7f3e5c04b955]
[fv-az173-499:164497] [18] /opt/hostedtoolcache/Python/3.10.4/x64/lib/libpython3.10.so.1.0(+0x10710e)[0x7f3e5c04e10e]
[fv-az173-499:164497] [19] /opt/hostedtoolcache/Python/3.10.4/x64/lib/libpython3.10.so.1.0(PyErr_PrintEx+0x1d)[0x7f3e5c1c903d]
[fv-az173-499:164497] [20] /opt/hostedtoolcache/Python/3.10.4/x64/lib/libpython3.10.so.1.0(_PyRun_SimpleFileObject+0x3a0)[0x7f3e5c04de5f]
[fv-az173-499:164497] [21] /opt/hostedtoolcache/Python/3.10.4/x64/lib/libpython3.10.so.1.0(_PyRun_AnyFileObject+0x88)[0x7f3e5c04ef36]
[fv-az173-499:164497] [22] /opt/hostedtoolcache/Python/3.10.4/x64/lib/libpython3.10.so.1.0(Py_RunMain+0x3cf)[0x7f3e5c1d206f]
[fv-az173-499:164497] [23] /opt/hostedtoolcache/Python/3.10.4/x64/lib/libpython3.10.so.1.0(Py_BytesMain+0x3d)[0x7f3e5c1d1b1d]
[fv-az173-499:164497] [24] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f3e5bd6a0b3]
[fv-az173-499:164497] [25] python(_start+0x2e)[0x559e007d909e]
[fv-az173-499:164497] *** End of error message ***
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 0 on node fv-az173-499 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Error: Process completed with exit code 134.

@devreal
Contributor Author

devreal commented May 2, 2022

@dalcinl Thanks, I will look into it.

Member

@jsquyres jsquyres left a comment

Using a union is a much better approach.

@jsquyres
Member

jsquyres commented May 2, 2022

@devreal and I chatted in Slack about this. I pushed a 2nd commit to this PR with a few minor comment updates to ompi/attribute/attribute.c.

@jsquyres
Member

jsquyres commented May 2, 2022

FWIW, the "test all 9 cases" attribute mini-test suite that we have (currently in the private test suite, but since I wrote 100% of the code in it, it would be straightforward to move this test suite to ompi-tests-public) would probably be useful here.

That being said:

  • When I run the attribute mini-test suite with everything built with gcc 11.2 and -O2 on RHEL 7 (on both OMPI main and v5.0.x branches), I cannot reproduce the error reported in "attribute functions lead to application segfault" (#10339).
  • 72cfbb6 added support for 3 more attribute cases to the OMPI code base. Unfortunately, the attribute mini-test suite was not extended to test those additional 3 test cases.
    • I did not look closely at the error reported by @dalcinl and @jeffhammond, but it's possible that their error is coming from the 3 new cases that aren't covered by this test suite.
    • It would probably be good if someone could extend ompi-tests/simple/attr to include these 3 additional cases.

I had hoped that our "simple" attribute test suite would show the same error as was reported in #10339, but it does not. ☹️ Next step will be to try with the original test suite/code that reported the error. That might take me a few days -- anyone else is welcome to try before me...

@devreal
Contributor Author

devreal commented May 3, 2022

I made a change to the COPY_ATTR_CALLBACKS macro and removed the set_from_aint/set_from_fint functions introduced earlier. This seems to be a corner case and I'm not 100% sure how to handle it correctly. Assume an attribute is created through either of the Fortran APIs and a value for that attribute is set through the C API, and assume that this is actually legal. (If it is not, the remainder is moot and there is no problem; it likely opens the gates of hell if the MPI-1 Fortran API is used, due to truncation of the pointer value.) The av_set_from field is set to OMPI_ATTRIBUTE_C when the value is set. Later, the attribute is duplicated, which invokes the respective Fortran copy function, copying the value as either MPI_Aint or MPI_Fint. Thus, in the new attribute the value should be marked as OMPI_ATTRIBUTE_AINT or OMPI_ATTRIBUTE_FINT, even though it originated from C. It doesn't really matter for OMPI_ATTRIBUTE_AINT but certainly matters for OMPI_ATTRIBUTE_FINT. @jsquyres do you have an opinion on that? The original code (by accident) had always marked the value in the new attribute as OMPI_ATTRIBUTE_C because that was the value set in the constructor...
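
One way to picture the rule proposed above is a function that derives the tag of the duplicated attribute from the kind of copy callback that produced it, rather than from the tag of the original value. The enums and the function are hypothetical, and the C branch in particular is an assumption; this is not the COPY_ATTR_CALLBACKS macro itself.

```c
#include <assert.h>

/* Hypothetical tag for the stored value (mirrors av_set_from). */
typedef enum {
    DEMO_ATTR_INVALID = -1,
    DEMO_ATTR_C,
    DEMO_ATTR_INT,
    DEMO_ATTR_FINT,
    DEMO_ATTR_AINT
} demo_set_from_t;

/* Hypothetical description of how the keyval's copy callback was
 * registered. */
typedef enum {
    DEMO_KEY_C,             /* C copy function, produces a void* */
    DEMO_KEY_FORTRAN_FINT,  /* MPI-1 Fortran, produces an MPI_Fint */
    DEMO_KEY_FORTRAN_AINT   /* MPI-2 Fortran, produces an MPI_Aint */
} demo_key_kind_t;

/* Tag for the value in the duplicated attribute. It follows the type
 * the copy callback wrote, regardless of how the original value was
 * set. */
static demo_set_from_t demo_copied_set_from(demo_key_kind_t kind)
{
    switch (kind) {
    case DEMO_KEY_FORTRAN_FINT: return DEMO_ATTR_FINT;
    case DEMO_KEY_FORTRAN_AINT: return DEMO_ATTR_AINT;
    case DEMO_KEY_C:            return DEMO_ATTR_C;   /* assumption */
    default:                    return DEMO_ATTR_INVALID;
    }
}
```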

I will squash the changes I pushed once approved, just wanted an easy path to restore these functions if need be.

@devreal
Contributor Author

devreal commented May 3, 2022

@dalcinl I was unable to reproduce the crash with mpi4py. I tried running with valgrind and found some spurious invalid reads like the one below but nothing that could corrupt the magic ID and no double free.

==3190251== Invalid read of size 4
==3190251==    at 0x58E553: ??? (in /usr/bin/python3.8)
==3190251==    by 0x5C658B: _PyModule_ClearDict (in /usr/bin/python3.8)
==3190251==    by 0x68485D: PyImport_Cleanup (in /usr/bin/python3.8)
==3190251==    by 0x67F8AE: Py_FinalizeEx (in /usr/bin/python3.8)
==3190251==    by 0x6B70FC: Py_RunMain (in /usr/bin/python3.8)
==3190251==    by 0x6B736C: Py_BytesMain (in /usr/bin/python3.8)
==3190251==    by 0x488E0B2: (below main) (libc-start.c:308)
==3190251==  Address 0x4ea0020 is 912 bytes inside a block of size 992 free'd
==3190251==    at 0x483CA3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==3190251==    by 0x5F26424: opal_hash_table_destruct (in /home/joseph/opt/openmpi-master-dbg/lib/libopen-pal.so.0.0.0)
==3190251==    by 0x5F9F6A8: opal_obj_run_destructors (in /home/joseph/opt/openmpi-master-dbg/lib/libopen-pal.so.0.0.0)
==3190251==    by 0x5F9FDC1: infosubscriber_destruct (in /home/joseph/opt/openmpi-master-dbg/lib/libopen-pal.so.0.0.0)
==3190251==    by 0x5B21BB5: opal_obj_run_destructors (in /home/joseph/opt/openmpi-master-dbg/lib/libmpi.so.0.0.0)
==3190251==    by 0x5B238C5: ompi_win_free (in /home/joseph/opt/openmpi-master-dbg/lib/libmpi.so.0.0.0)
==3190251==    by 0x5BC33B9: PMPI_Win_free (in /home/joseph/opt/openmpi-master-dbg/lib/libmpi.so.0.0.0)
==3190251==    by 0x59257A4: __pyx_pf_6mpi4py_3MPI_3Win_22Free (MPI.c:159636)
==3190251==    by 0x59257A4: __pyx_pw_6mpi4py_3MPI_3Win_23Free (MPI.c:159606)
==3190251==    by 0x503B98: ??? (in /usr/bin/python3.8)
==3190251==    by 0x56B1D9: _PyEval_EvalFrameDefault (in /usr/bin/python3.8)
==3190251==    by 0x5F6835: _PyFunction_Vectorcall (in /usr/bin/python3.8)
==3190251==    by 0x56B1D9: _PyEval_EvalFrameDefault (in /usr/bin/python3.8)
==3190251==  Block was alloc'd at
==3190251==    at 0x483DD99: calloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==3190251==    by 0x5F264B6: opal_hash_table_init2 (in /home/joseph/opt/openmpi-master-dbg/lib/libopen-pal.so.0.0.0)
==3190251==    by 0x5F26575: opal_hash_table_init (in /home/joseph/opt/openmpi-master-dbg/lib/libopen-pal.so.0.0.0)
==3190251==    by 0x5F9FAD9: infosubscriber_construct (in /home/joseph/opt/openmpi-master-dbg/lib/libopen-pal.so.0.0.0)
==3190251==    by 0x5B21B3D: opal_obj_run_constructors (in /home/joseph/opt/openmpi-master-dbg/lib/libmpi.so.0.0.0)
==3190251==    by 0x5B21C75: opal_obj_new (in /home/joseph/opt/openmpi-master-dbg/lib/libmpi.so.0.0.0)
==3190251==    by 0x5B21AA9: opal_obj_new_debug (in /home/joseph/opt/openmpi-master-dbg/lib/libmpi.so.0.0.0)
==3190251==    by 0x5B2264B: alloc_window (in /home/joseph/opt/openmpi-master-dbg/lib/libmpi.so.0.0.0)
==3190251==    by 0x5B22EA7: ompi_win_allocate (in /home/joseph/opt/openmpi-master-dbg/lib/libmpi.so.0.0.0)
==3190251==    by 0x5BBF114: PMPI_Win_allocate (in /home/joseph/opt/openmpi-master-dbg/lib/libmpi.so.0.0.0)
==3190251==    by 0x5976C7C: PyMPI_Win_allocate_c (largecnt.h:1826)
==3190251==    by 0x5976C7C: __pyx_pf_6mpi4py_3MPI_3Win_10Allocate (MPI.c:158444)
==3190251==    by 0x5976C7C: __pyx_pw_6mpi4py_3MPI_3Win_11Allocate (MPI.c:158378)
==3190251==    by 0x5F3988: PyCFunction_Call (in /usr/bin/python3.8)

@dalcinl
Contributor

dalcinl commented May 3, 2022

@devreal I found a trivial reproducer. I'm using the branch from this PR. However, at this point I'm not sure the issue is related to attributes.

from mpi4py import MPI

for i in range(14):
    MPI.INFO_ENV.Dup().Free()

Iterating 14 times (or more) as in the snippet above triggers the assertion, but with 13 or fewer iterations all is good and valgrind is clean.

$ python tmp.py 
python: win/win.c:463: ompi_win_destruct: Assertion `OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (win->error_handler))->obj_magic_id' failed.
[localhost:591040] *** Process received signal ***
[localhost:591040] Signal: Aborted (6)
[localhost:591040] Signal code:  (-6)
[localhost:591040] [ 0] /lib64/libc.so.6(+0x59da0)[0x7fbb32216da0]
...
[localhost:591040] [22] python(_start+0x25)[0x55a2b0cb5095]
[localhost:591040] *** End of error message ***
Aborted (core dumped)

@devreal devreal force-pushed the attribute-unions branch from d8950ce to 396166d Compare May 3, 2022 13:21
@jsquyres
Member

jsquyres commented May 3, 2022

@dalcinl Thanks for the reproducer. That led @devreal to find the fix -- coming in a different PR (because it's unrelated to attributes). We still want the attribute union update from this PR, but we'll keep these fixes separate.

Member

@jsquyres jsquyres left a comment

LGTM. Let's squash.

devreal and others added 2 commits May 3, 2022 11:44
The current implementation uses a void* to store different types of
attribute value integers and attempts to figure out proper offsets
for storing smaller integers in that pointer. The required pointer
aliasing is UB and causes issues with GCC 11.

The new implementation replaces the self-built pointer-based union
with a C union and selects the (pointer to the) right field based
on the av_set_from value.

This patch also fixes a bug where copied attributes always had the
set_from field set to C pointer, which worked but is technically not
correct.

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Correct a few minor mistakes in the large comment at the top of
ompi/attribute/attribute.c from when 72cfbb6 added several new
cases to attribute handling.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
@devreal devreal force-pushed the attribute-unions branch from 396166d to cce89e8 Compare May 3, 2022 15:44
@devreal
Contributor Author

devreal commented May 3, 2022

Squashed, will merge once tests have passed.

@ggouaillardet
Contributor

ggouaillardet commented Oct 11, 2022 via email

@jeffhammond
Contributor

@ggouaillardet A large number of things - including all parameters of root and count - in MPI implementations break when users force Fortran INTEGER to be incompatible with C int. No implementation I'm aware of supports this, and I'm not aware of any attempt to implement safe casting from 64b to 32b integers at the boundary between Fortran and C.

Nothing in the Fortran or MPI standard says this should work, and I don't think it's good for anyone for implementers to try.
