restoring more hwloc --cpu-set behavior from OMPI 3.x
My rough understanding of where some important masks come from is:
rdata->available = OMPI's allowed PUs, possibly from --cpu-set cmdline
hwloc_topology_get_allowed_cpuset() = hwloc's allowed PUs, possibly with
offline cpus and/or cpus not in our cgroup removed
The old hwloc had the fields
obj->online_cpuset
obj->allowed_cpuset
which I'm guessing are similar to hwloc_topology_get_allowed_cpuset().
Here's an example of the behavior this checkin changes:
Suppose a machine's hardware has
[ 0 1 2 3 / 4 5 6 7 / 8 9 10 11 / 12 13 14 15 ]
and we run with
--cpu-set 4,5,8,9,12,13,14,15
--map-by ppr:2:node:pe=3
--bind-to hwthread:overload-allowed
The allowed hardware threads would be
[..../xx../xx../xxxx]
and the 3.x code would place the ranks on
MCW 0 : [..../BB../B.../....]
MCW 1 : [..../..../.B../BB..]
which at least uses only the allowed resources. The 4.x code doesn't pay as
much attention to rdata->available and was placing the ranks on
MCW 0 : [..../BBB./..../....]
MCW 1 : [..../..../BBB./....]
which binds to hardware threads 6 and 10, neither of which is in the --cpu-set.
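As a rough sanity check of the example above, here is a minimal sketch in
plain C (not OMPI or hwloc code) that treats each binding as a 16-bit mask,
bit i standing for hardware thread i. The helper names outside_allowed() and
render_mask() and the EXAMPLE_ALLOWED constant are made up for illustration.

```c
#include <stdint.h>

/* --cpu-set 4,5,8,9,12,13,14,15 from the example, as a bitmask
 * (bit i = hardware thread i). Hypothetical name, illustration only. */
#define EXAMPLE_ALLOWED ((uint16_t)0xF330u)

/* Return the hardware threads of a binding that fall outside the allowed set.
 * A result of 0 means the binding uses only allowed resources. */
static uint16_t outside_allowed(uint16_t binding, uint16_t allowed)
{
    return (uint16_t)(binding & (uint16_t)~allowed);
}

/* Render a 16-thread mask in the [..../..../..../....] notation used above,
 * in groups of four; 'mark' is written for every set bit.
 * 'out' must hold at least 22 bytes (21 characters plus the terminator). */
static void render_mask(uint16_t mask, char mark, char out[22])
{
    int k = 0;
    out[k++] = '[';
    for (int pu = 0; pu < 16; pu++) {
        if (pu > 0 && pu % 4 == 0) {
            out[k++] = '/';
        }
        out[k++] = ((mask >> pu) & 1u) ? mark : '.';
    }
    out[k++] = ']';
    out[k] = '\0';
}
```

With these masks, the 3.x bindings (threads 4,5,8 and 9,12,13) leave nothing
outside the allowed set, while the 4.x bindings (threads 4,5,6 and 8,9,10)
each spill onto one disallowed thread.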
This part of the old code's behavior came from bind_downward(), which would
cycle through a sequence of potential target objects, checking
opal_hwloc_base_get_npus() under each potential object and also using
opal_hwloc_base_get_available_cpus() for a selected object. Those
functions used to take rdata->available into account.
The changes I've made restore the old behavior of
opal_hwloc_base_get_npus(topo, obj)
and add back the opal_hwloc_base_get_available_cpus() function
that had been removed.
The old behavior of opal_hwloc_base_get_npus() had two paths:
1. df_search_cores, which counts how many cores under obj intersect
   the hwloc_topology_get_allowed_cpuset().
   (I think it likely should also intersect with rdata->available, but
   it didn't before and I didn't change that aspect.)
   Here the OMPI 4.x code just used hwloc_get_nbobjs_inside_cpuset_by_type(,CORE),
   so I restored df_search_cores. It could likely be replaced by
   intersecting with the available/allowed masks and then using the
   get_nbobjs_inside.. call, but I went with the old code.
2. opal_hwloc_base_get_available_cpus(topo,obj), which intersected the
   allowed_cpuset as well as rdata->available.
   Here the OMPI 4.x code just used cpuset = obj->cpuset, so I restored
   the call to intersect it with the allowed_cpuset from hwloc and
   rdata->available from OMPI.
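
The two paths can be modelled roughly as below. This is a toy sketch in plain
C, not the OMPI or hwloc implementation: cpusets are plain 32-bit masks, the
names model_count_cores(), model_available_cpus(), count_pus() and
example_cores are hypothetical, and the layout of four hardware threads per
core is an assumption made to match the example machine at the top.

```c
#include <stdint.h>

/* Toy cpuset: bit i = PU i. Stands in for an hwloc bitmap. */
typedef uint32_t cpuset_t;

/* Assumed layout for the example machine: 4 cores with 4 PUs each. */
static const cpuset_t example_cores[4] = {
    0x000Fu, 0x00F0u, 0x0F00u, 0xF000u
};

/* Path 1 model: count the cores that lie under obj and have at least one
 * PU in the allowed set. As in the old code described above, OMPI's
 * available set is deliberately not consulted here. */
static int model_count_cores(const cpuset_t *cores, int ncores,
                             cpuset_t obj, cpuset_t allowed)
{
    int n = 0;
    for (int i = 0; i < ncores; i++) {
        int under_obj = (cores[i] & ~obj) == 0;
        int usable = (cores[i] & allowed) != 0;
        if (under_obj && usable) {
            n++;
        }
    }
    return n;
}

/* Path 2 model: the PUs under obj that are usable, i.e. obj's cpuset
 * intersected with both the allowed set and the available set. */
static cpuset_t model_available_cpus(cpuset_t obj, cpuset_t allowed,
                                     cpuset_t available)
{
    return obj & allowed & available;
}

/* Count the set bits in a mask (number of PUs). */
static int count_pus(cpuset_t m)
{
    int n = 0;
    while (m != 0) {
        m &= m - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

For instance, taking the second core (mask 0x00F0) with an available set of
0xF330 leaves mask 0x0030, i.e. two usable PUs rather than the four that
obj->cpuset alone would report.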
For the use-case at the top of this PR these are all the changes needed.
I did look a little further to see where else OMPI 3.x used
opal_hwloc_base_get_available_cpus(), as I believe every place where
that function was removed is a place that used to account for
rdata->available and no longer does.
So I also added opal_hwloc_base_get_available_cpus back into bind_in_place
even though I wasn't tracing that codepath.
I didn't restore all the uses I found in 3.x of the previously removed
opal_hwloc_base_get_available_cpus() function, instead only touching
code that seemed similar/related to code I was already touching.