Skip to content

v4.0.x: -cpu-set as a constraint rather than as a binding #9298

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

awlauria
Copy link
Contributor

@awlauria awlauria commented Aug 19, 2021

The first category of issue I'm addressing is that recent code changes
seem to only consider -cpu-set as a binding option. Eg a command like
this
% mpirun -np 2 --report-bindings --use-hwthread-cpus
--bind-to cpulist:ordered --map-by hwthread --cpu-set 6,7 hostname
which just round robins over the --cpu-set list.

Example output which seems fine to me:

MCW rank 0: [..../..B./..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
MCW rank 1: [..../...B/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

It should also be possible though to pass a --cpu-set to most other
map/bind options and have it be a constraint on that binding. Eg
% mpirun -np 2 --report-bindings
--bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname
% mpirun -np 2 --report-bindings
--bind-to hwthread --map-by ppr:2:node,pe=2 --cpu-set 6,7,12,13 hostname

The first command above errors that

Conflicting directives for mapping policy are causing the policy
to be redefined:
New policy: RANK_FILE
Prior policy: BYHWTHREAD

The error check in orte_rmaps_rank_file_open() is likely too aggressive.
The intent seems to be that any option like "--map-by whatever" will
check to see if a rankfile is in use, and report that mapping via rmaps
and using an explicit rankfile is a conflict.

But the check has been expanded to not just check
NULL != orte_rankfile
but also errors out if
(NULL != opal_hwloc_base_cpu_list &&
!OPAL_BIND_ORDERED_REQUESTED(opal_hwloc_binding_policy))
which seems to be only recognizing -cpu-set as a binding option and
ignoring -cpu-set as a constraint on other binding policies.

For now I've changed the
NULL != opal_hwloc_base_cpu_list
to
OPAL_BIND_TO_CPUSET == OPAL_GET_BINDING_POLICY(opal_hwloc_binding_policy)
so it hopefully only errors out if -cpu-set is being used as a binding
policy. Whether I did that right or not it's enough to get to the next
stage of testing the example commands I have above.

Another place similar logic is used is hwloc_base_frame.c where it has
/* did the user provide a slot list? */
if (NULL != opal_hwloc_base_cpu_list) {
OPAL_SET_BINDING_POLICY(opal_hwloc_binding_policy, OPAL_BIND_TO_CPUSET);
}
where it used to (long ago) only do that if
!OPAL_BINDING_POLICY_IS_SET(opal_hwloc_binding_policy)
I think the new code is making it impossible to use --cpu-set as anything
other than a binding policy.

That brings us past the error detection and into the real functionality, some of
which has been stripped out, probably in moving to hwloc-2:
% mpirun -np 2 --report-bindings
--bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname

MCW rank 0: [B.../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
MCW rank 1: [.B../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

The rank_by() function in rmaps_base_ranking.c makes an array out of objects
returned from
opal_hwloc_base_get_obj_by_type(,,,i,)
which uses df_search(). That function changed quite a bit from hwloc-1 to 2
but it used to include a check for
available = opal_hwloc_base_get_available_cpus(topo, start)
which is where the bitmask from --cpu-set goes. And it used to skip objs that
had hwloc_bitmap_iszero(available).

So I restored that behavior in ds_search() by adding a "constrained_cpuset" to
replace start->cpuset that it was otherwise processing. With that change in
place the first command works:
% mpirun -np 2 --report-bindings
--bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname

MCW rank 0: [..../..B./..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
MCW rank 1: [..../...B/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

The other command uses a different path though that still ignored the
available mask:
% mpirun -np 2 --report-bindings
--bind-to hwthread --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname

MCW rank 0: [BB../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
MCW rank 1: [..BB/..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
In bind_generic() the code used to call
opal_hwloc_base_find_min_bound_target_under_obj() which used
opal_hwloc_base_get_ncpus(), and that's where it would
intersect objects with the available cpuset and skip over ones
that were't available. To match the old behavior I added a few
lines in bind_generic() to skip over objects that don't intersect
the available mask. After that we get
% mpirun -np 2 --report-bindings
--bind-to hwthread --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname
MCW rank 0: [..../..BB/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
MCW rank 1: [..../..../..../BB../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

I think the above changes are improvements, but I don't feel like they're
comprehensive. I only traced through enough code to fix the two specific
bugs I was dealing with.

Fixes #6540

Signed-off-by: Mark Allen markalle@us.ibm.com
(cherry picked from commit bdd92a7)

The first category of issue I'm addressing is that recent code changes
seem to only consider -cpu-set as a binding option. Eg a command like
this
  % mpirun -np 2 --report-bindings --use-hwthread-cpus \
      --bind-to cpulist:ordered --map-by hwthread --cpu-set 6,7 hostname
which just round robins over the --cpu-set list.

Example output which seems fine to me:
> MCW rank 0: [..../..B./..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [..../...B/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

It should also be possible though to pass a --cpu-set to most other
map/bind options and have it be a constraint on that binding. Eg
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by ppr:2:node,pe=2 --cpu-set 6,7,12,13 hostname

The first command above errors that
> Conflicting directives for mapping policy are causing the policy
> to be redefined:
>   New policy:   RANK_FILE
>   Prior policy:  BYHWTHREAD

The error check in orte_rmaps_rank_file_open() is likely too aggressive.
The intent seems to be that any option like "--map-by whatever" will
check to see if a rankfile is in use, and report that mapping via rmaps
and using an explicit rankfile is a conflict.

But the check has been expanded to not just check
    NULL != orte_rankfile
but also errors out if
    (NULL != opal_hwloc_base_cpu_list &&
    !OPAL_BIND_ORDERED_REQUESTED(opal_hwloc_binding_policy))
which seems to be only recognizing -cpu-set as a binding option and
ignoring -cpu-set as a constraint on other binding policies.

For now I've changed the
    NULL != opal_hwloc_base_cpu_list
to
    OPAL_BIND_TO_CPUSET == OPAL_GET_BINDING_POLICY(opal_hwloc_binding_policy)
so it hopefully only errors out if -cpu-set is being used as a binding
policy.  Whether I did that right or not it's enough to get to the next
stage of testing the example commands I have above.

Another place similar logic is used is hwloc_base_frame.c where it has
    /* did the user provide a slot list? */
    if (NULL != opal_hwloc_base_cpu_list) {
        OPAL_SET_BINDING_POLICY(opal_hwloc_binding_policy, OPAL_BIND_TO_CPUSET);
    }
where it used to (long ago) only do that if
    !OPAL_BINDING_POLICY_IS_SET(opal_hwloc_binding_policy)
I think the new code is making it impossible to use --cpu-set as anything
other than a binding policy.

That brings us past the error detection and into the real functionality, some of
which has been stripped out, probably in moving to hwloc-2:
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname
> MCW rank 0: [B.../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [.B../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

The rank_by() function in rmaps_base_ranking.c makes an array out of objects
returned from
    opal_hwloc_base_get_obj_by_type(,,,i,)
which uses df_search().  That function changed quite a bit from hwloc-1 to 2
but it used to include a check for
    available = opal_hwloc_base_get_available_cpus(topo, start)
which is where the bitmask from --cpu-set goes.  And it used to skip objs that
had hwloc_bitmap_iszero(available).

So I restored that behavior in ds_search() by adding a "constrained_cpuset" to
replace start->cpuset that it was otherwise processing.  With that change in
place the first command works:
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname
> MCW rank 0: [..../..B./..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [..../...B/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

The other command uses a different path though that still ignored the
available mask:
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname
> MCW rank 0: [BB../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [..BB/..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
In bind_generic() the code used to call
opal_hwloc_base_find_min_bound_target_under_obj() which used
opal_hwloc_base_get_ncpus(), and that's where it would
intersect objects with the available cpuset and skip over ones
that were't available. To match the old behavior I added a few
lines in bind_generic() to skip over objects that don't intersect
the available mask. After that we get
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname
> MCW rank 0: [..../..BB/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [..../..../..../BB../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

I think the above changes are improvements, but I don't feel like they're
comprehensive.  I only traced through enough code to fix the two specific
bugs I was dealing with.

Signed-off-by: Mark Allen <markalle@us.ibm.com>
(cherry picked from commit bdd92a7)
@awlauria
Copy link
Contributor Author

awlauria commented Aug 19, 2021

This is a long forgotten commit that never got ported to a release branch.

@awlauria awlauria added this to the v4.0.7 milestone Aug 19, 2021
@awlauria
Copy link
Contributor Author

Note - there was another commit in the PR from this cherry-pick (bf3980d), but from my testing that doesn't need to go back as it was resolved in another patch.

@awlauria
Copy link
Contributor Author

awlauria commented Aug 23, 2021

I hand verified that this fixes the referenced issue.

Copy link
Member

@gpaulsen gpaulsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@awlauria Do we need both commits that came in on master #6584 ?

@awlauria
Copy link
Contributor Author

Note - there was another commit in the PR from this cherry-pick (bf3980d), but from my testing that doesn't need to go back as it was resolved in another patch.

@gpaulsen

@rhc54
Copy link
Contributor

rhc54 commented Aug 29, 2021

It has been a long time, but IIRC, this PR may have fixed the cited problems, but introduced breakage in other use-cases. I seem to recall we wound up fixing this another way on the master branch that avoided the collateral damage.

🤷‍♂️ Your choice - not sure how much trouble it would be to backport the master fix, or how important this fix is to the release branch.

@awlauria
Copy link
Contributor Author

awlauria commented Sep 2, 2021

I brought in these changes one-by-one, and can confirm that all changes are needed to make this option, --cpu-list, work. Even including the changes made to the hwloc files.

If this patch is to not be taken in to fix this bug - I wonder if instead we should disable --cpu-list, since as is on the v4 series it seems broken. Thoughts?

@awlauria
Copy link
Contributor Author

As of 9/21 ompi call, it was decided this is too risky to take this late in the v4 series. Instead we will print a warning saying that this option is broken, as well as update the man page.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants