Skip to content

Commit bdd92a7

Browse files
committed
-cpu-set as a constraint rather than as a binding
The first category of issue I'm addressing is that recent code changes seem to only consider -cpu-set as a binding option. Eg a command like this % mpirun -np 2 --report-bindings --use-hwthread-cpus \ --bind-to cpulist:ordered --map-by hwthread --cpu-set 6,7 hostname which just round robins over the --cpu-set list. Example output which seems fine to me: > MCW rank 0: [..../..B./..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] > MCW rank 1: [..../...B/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] It should also be possible though to pass a --cpu-set to most other map/bind options and have it be a constraint on that binding. Eg % mpirun -np 2 --report-bindings \ --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname % mpirun -np 2 --report-bindings \ --bind-to hwthread --map-by ppr:2:node,pe=2 --cpu-set 6,7,12,13 hostname The first command above errors that > Conflicting directives for mapping policy are causing the policy > to be redefined: > New policy: RANK_FILE > Prior policy: BYHWTHREAD The error check in orte_rmaps_rank_file_open() is likely too aggressive. The intent seems to be that any option like "--map-by whatever" will check to see if a rankfile is in use, and report that mapping via rmaps and using an explicit rankfile is a conflict. But the check has been expanded to not just check NULL != orte_rankfile but also errors out if (NULL != opal_hwloc_base_cpu_list && !OPAL_BIND_ORDERED_REQUESTED(opal_hwloc_binding_policy)) which seems to be only recognizing -cpu-set as a binding option and ignoring -cpu-set as a constraint on other binding policies. For now I've changed the NULL != opal_hwloc_base_cpu_list to OPAL_BIND_TO_CPUSET == OPAL_GET_BINDING_POLICY(opal_hwloc_binding_policy) so it hopefully only errors out if -cpu-set is being used as a binding policy. Whether I did that right or not it's enough to get to the next stage of testing the example commands I have above. Another place similar logic is used is hwloc_base_frame.c where it has /* did the user provide a slot list? */ if (NULL != opal_hwloc_base_cpu_list) { OPAL_SET_BINDING_POLICY(opal_hwloc_binding_policy, OPAL_BIND_TO_CPUSET); } where it used to (long ago) only do that if !OPAL_BINDING_POLICY_IS_SET(opal_hwloc_binding_policy) I think the new code is making it impossible to use --cpu-set as anything other than a binding policy. That brings us past the error detection and into the real functionality, some of which has been stripped out, probably in moving to hwloc-2: % mpirun -np 2 --report-bindings \ --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname > MCW rank 0: [B.../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] > MCW rank 1: [.B../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] The rank_by() function in rmaps_base_ranking.c makes an array out of objects returned from opal_hwloc_base_get_obj_by_type(,,,i,) which uses df_search(). That function changed quite a bit from hwloc-1 to 2 but it used to include a check for available = opal_hwloc_base_get_available_cpus(topo, start) which is where the bitmask from --cpu-set goes. And it used to skip objs that had hwloc_bitmap_iszero(available). So I restored that behavior in ds_search() by adding a "constrained_cpuset" to replace start->cpuset that it was otherwise processing. With that change in place the first command works: % mpirun -np 2 --report-bindings \ --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname > MCW rank 0: [..../..B./..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] > MCW rank 1: [..../...B/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] The other command uses a different path though that still ignored the available mask: % mpirun -np 2 --report-bindings \ --bind-to hwthread --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname > MCW rank 0: [BB../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] > MCW rank 1: [..BB/..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] In bind_generic() the code used to call opal_hwloc_base_find_min_bound_target_under_obj() which used opal_hwloc_base_get_ncpus(), and that's where it would intersect objects with the available cpuset and skip over ones that were't available. To match the old behavior I added a few lines in bind_generic() to skip over objects that don't intersect the available mask. After that we get % mpirun -np 2 --report-bindings \ --bind-to hwthread --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname > MCW rank 0: [..../..BB/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] > MCW rank 1: [..../..../..../BB../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] I think the above changes are improvements, but I don't feel like they're comprehensive. I only traced through enough code to fix the two specific bugs I was dealing with. Signed-off-by: Mark Allen <markalle@us.ibm.com>
1 parent 9ce8d7b commit bdd92a7

File tree

5 files changed

+47
-5
lines changed

5 files changed

+47
-5
lines changed

opal/mca/hwloc/base/hwloc_base_frame.c

+9-1
Original file line numberDiff line numberDiff line change
@@ -3,6 +3,7 @@
33
* Copyright (c) 2013-2018 Intel, Inc. All rights reserved.
44
* Copyright (c) 2016-2017 Research Organization for Information Science
55
* and Technology (RIST). All rights reserved.
6+
* Copyright (c) 2019 IBM Corporation. All rights reserved.
67
* $COPYRIGHT$
78
*
89
* Additional copyrights may follow
@@ -217,7 +218,14 @@ static int opal_hwloc_base_open(mca_base_open_flag_t flags)
217218
* we do bind to the given cpus if provided, otherwise this would be
218219
* ignored if someone didn't also specify a binding policy
219220
*/
220-
OPAL_SET_BINDING_POLICY(opal_hwloc_binding_policy, OPAL_BIND_TO_CPUSET);
221+
// Restoring pre ef86707fbe3392c8ed15f79cc4892f0313b409af behavior.
222+
// Formerly -cpu-set #,#,# along with -use_hwthread-cpus resulted
223+
// in the binding policy staying OPAL_BIND_TO_HWTHREAD
224+
// I think that should be right because I thought -cpu-set was a contraint you put
225+
// on another binding policy, not a binding policy in itself.
226+
if (!OPAL_BINDING_POLICY_IS_SET(opal_hwloc_binding_policy)) {
227+
OPAL_SET_BINDING_POLICY(opal_hwloc_binding_policy, OPAL_BIND_TO_CPUSET);
228+
}
221229
}
222230

223231
/* if we are binding to hwthreads, then we must use hwthreads as cpus */

opal/mca/hwloc/base/hwloc_base_util.c

+22-3
Original file line numberDiff line numberDiff line change
@@ -765,15 +765,34 @@ static hwloc_obj_t df_search(hwloc_topology_t topo,
765765
return found;
766766
}
767767
if (OPAL_HWLOC_AVAILABLE == rtype) {
768+
// The previous (3.x) code included a check for
769+
// available = opal_hwloc_base_get_available_cpus(topo, start)
770+
// and skipped objs that had hwloc_bitmap_iszero(available)
771+
hwloc_obj_t root;
772+
opal_hwloc_topo_data_t *rdata;
773+
root = hwloc_get_root_obj(topo);
774+
rdata = (opal_hwloc_topo_data_t*)root->userdata;
775+
hwloc_cpuset_t constrained_cpuset;
776+
777+
constrained_cpuset = hwloc_bitmap_alloc();
778+
if (rdata && rdata->available) {
779+
hwloc_bitmap_and(constrained_cpuset, start->cpuset, rdata->available);
780+
} else {
781+
hwloc_bitmap_copy(constrained_cpuset, start->cpuset);
782+
}
783+
768784
unsigned idx = 0;
769785
if (num_objs)
770-
*num_objs = hwloc_get_nbobjs_inside_cpuset_by_depth(topo, start->cpuset, search_depth);
786+
*num_objs = hwloc_get_nbobjs_inside_cpuset_by_depth(topo, constrained_cpuset, search_depth);
771787
obj = NULL;
772-
while ((obj = hwloc_get_next_obj_inside_cpuset_by_depth(topo, start->cpuset, search_depth, obj)) != NULL) {
773-
if (idx == nobj)
788+
while ((obj = hwloc_get_next_obj_inside_cpuset_by_depth(topo, constrained_cpuset, search_depth, obj)) != NULL) {
789+
if (idx == nobj) {
790+
hwloc_bitmap_free(constrained_cpuset);
774791
return obj;
792+
}
775793
idx++;
776794
}
795+
hwloc_bitmap_free(constrained_cpuset);
777796
return NULL;
778797
}
779798
return NULL;

orte/mca/rmaps/base/rmaps_base_binding.c

+12
Original file line numberDiff line numberDiff line change
@@ -16,6 +16,7 @@
1616
* Copyright (c) 2015-2017 Research Organization for Information Science
1717
* and Technology (RIST). All rights reserved.
1818
* Copyright (c) 2018 Inria. All rights reserved.
19+
* Copyright (c) 2019 IBM Corporation. All rights reserved.
1920
* $COPYRIGHT$
2021
*
2122
* Additional copyrights may follow
@@ -168,8 +169,19 @@ static int bind_generic(orte_job_t *jdata,
168169
trg_obj = NULL;
169170
min_bound = UINT_MAX;
170171
while (NULL != (tmp_obj = hwloc_get_next_obj_by_depth(node->topology->topo, target_depth, tmp_obj))) {
172+
hwloc_obj_t root;
173+
opal_hwloc_topo_data_t *rdata;
174+
root = hwloc_get_root_obj(node->topology->topo);
175+
rdata = (opal_hwloc_topo_data_t*)root->userdata;
176+
171177
if (!hwloc_bitmap_intersects(locale->cpuset, tmp_obj->cpuset))
172178
continue;
179+
// From the old 3.x code trg_obj was picked via a call to
180+
// opal_hwloc_base_find_min_bound_target_under_obj() which
181+
// skiped over unavailable objects (via opal_hwloc_base_get_npus).
182+
if (rdata && rdata->available && !hwloc_bitmap_intersects(rdata->available, tmp_obj->cpuset))
183+
continue;
184+
173185
data = (opal_hwloc_obj_data_t*)tmp_obj->userdata;
174186
if (NULL == data) {
175187
data = OBJ_NEW(opal_hwloc_obj_data_t);

orte/mca/rmaps/base/rmaps_base_ranking.c

+1
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,7 @@
1313
* Copyright (c) 2014-2018 Intel, Inc. All rights reserved.
1414
* Copyright (c) 2017 Research Organization for Information Science
1515
* and Technology (RIST). All rights reserved.
16+
* Copyright (c) 2019 IBM Corporation. All rights reserved.
1617
* $COPYRIGHT$
1718
*
1819
* Additional copyrights may follow

orte/mca/rmaps/rank_file/rmaps_rank_file_component.c

+3-1
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,7 @@
1515
* Copyright (c) 2014-2018 Intel, Inc. All rights reserved.
1616
* Copyright (c) 2015 Los Alamos National Security, LLC. All rights
1717
* reserved.
18+
* Copyright (c) 2019 IBM Corporation. All rights reserved.
1819
* $COPYRIGHT$
1920
*
2021
* Additional copyrights may follow
@@ -106,7 +107,8 @@ static int orte_rmaps_rank_file_register(void)
106107
static int orte_rmaps_rank_file_open(void)
107108
{
108109
/* ensure we flag mapping by user */
109-
if ((NULL != opal_hwloc_base_cpu_list && !OPAL_BIND_ORDERED_REQUESTED(opal_hwloc_binding_policy)) ||
110+
if ((OPAL_BIND_TO_CPUSET == OPAL_GET_BINDING_POLICY(opal_hwloc_binding_policy) &&
111+
!OPAL_BIND_ORDERED_REQUESTED(opal_hwloc_binding_policy)) ||
110112
NULL != orte_rankfile) {
111113
if (ORTE_MAPPING_GIVEN & ORTE_GET_MAPPING_DIRECTIVE(orte_rmaps_base.mapping)) {
112114
/* if a non-default mapping is already specified, then we

0 commit comments

Comments
 (0)