-cpu-set as a constraint rather than as a binding

markalle · markalle · commit bdd92a7a643f · 2019-04-12T15:33:56.000-04:00
The first category of issue I'm addressing is that recent code changes
seem to only consider -cpu-set as a binding option. Eg a command like
this
  % mpirun -np 2 --report-bindings --use-hwthread-cpus \
      --bind-to cpulist:ordered --map-by hwthread --cpu-set 6,7 hostname
which just round robins over the --cpu-set list.

Example output which seems fine to me:
&gt; MCW rank 0: [..../..B./..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
&gt; MCW rank 1: [..../...B/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

It should also be possible though to pass a --cpu-set to most other
map/bind options and have it be a constraint on that binding. Eg
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by ppr:2:node,pe=2 --cpu-set 6,7,12,13 hostname

The first command above errors that
&gt; Conflicting directives for mapping policy are causing the policy
&gt; to be redefined:
&gt;   New policy:   RANK_FILE
&gt;   Prior policy:  BYHWTHREAD

The error check in orte_rmaps_rank_file_open() is likely too aggressive.
The intent seems to be that any option like "--map-by whatever" will
check to see if a rankfile is in use, and report that mapping via rmaps
and using an explicit rankfile is a conflict.

But the check has been expanded to not just check
    NULL != orte_rankfile
but also errors out if
    (NULL != opal_hwloc_base_cpu_list &amp;&amp;
    !OPAL_BIND_ORDERED_REQUESTED(opal_hwloc_binding_policy))
which seems to be only recognizing -cpu-set as a binding option and
ignoring -cpu-set as a constraint on other binding policies.

For now I've changed the
    NULL != opal_hwloc_base_cpu_list
to
    OPAL_BIND_TO_CPUSET == OPAL_GET_BINDING_POLICY(opal_hwloc_binding_policy)
so it hopefully only errors out if -cpu-set is being used as a binding
policy.  Whether I did that right or not it's enough to get to the next
stage of testing the example commands I have above.

Another place similar logic is used is hwloc_base_frame.c where it has
    /* did the user provide a slot list? */
    if (NULL != opal_hwloc_base_cpu_list) {
        OPAL_SET_BINDING_POLICY(opal_hwloc_binding_policy, OPAL_BIND_TO_CPUSET);
    }
where it used to (long ago) only do that if
    !OPAL_BINDING_POLICY_IS_SET(opal_hwloc_binding_policy)
I think the new code is making it impossible to use --cpu-set as anything
other than a binding policy.

That brings us past the error detection and into the real functionality, some of
which has been stripped out, probably in moving to hwloc-2:
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname
&gt; MCW rank 0: [B.../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
&gt; MCW rank 1: [.B../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

The rank_by() function in rmaps_base_ranking.c makes an array out of objects
returned from
    opal_hwloc_base_get_obj_by_type(,,,i,)
which uses df_search().  That function changed quite a bit from hwloc-1 to 2
but it used to include a check for
    available = opal_hwloc_base_get_available_cpus(topo, start)
which is where the bitmask from --cpu-set goes.  And it used to skip objs that
had hwloc_bitmap_iszero(available).

So I restored that behavior in ds_search() by adding a "constrained_cpuset" to
replace start-&gt;cpuset that it was otherwise processing.  With that change in
place the first command works:
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname
&gt; MCW rank 0: [..../..B./..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
&gt; MCW rank 1: [..../...B/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

The other command uses a different path though that still ignored the
available mask:
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname
&gt; MCW rank 0: [BB../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
&gt; MCW rank 1: [..BB/..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
In bind_generic() the code used to call
opal_hwloc_base_find_min_bound_target_under_obj() which used
opal_hwloc_base_get_ncpus(), and that's where it would
intersect objects with the available cpuset and skip over ones
that were't available. To match the old behavior I added a few
lines in bind_generic() to skip over objects that don't intersect
the available mask. After that we get
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname
&gt; MCW rank 0: [..../..BB/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
&gt; MCW rank 1: [..../..../..../BB../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

I think the above changes are improvements, but I don't feel like they're
comprehensive.  I only traced through enough code to fix the two specific
bugs I was dealing with.

Signed-off-by: Mark Allen &lt;markalle@us.ibm.com&gt;
diff --git a/opal/mca/hwloc/base/hwloc_base_frame.c b/opal/mca/hwloc/base/hwloc_base_frame.c
@@ -3,6 +3,7 @@
  * Copyright (c) 2013-2018 Intel, Inc. All rights reserved.
  * Copyright (c) 2016-2017 Research Organization for Information Science
  *                         and Technology (RIST). All rights reserved.
+ * Copyright (c) 2019 IBM Corporation. All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -217,7 +218,14 @@ static int opal_hwloc_base_open(mca_base_open_flag_t flags)
          * we do bind to the given cpus if provided, otherwise this would be
          * ignored if someone didn't also specify a binding policy
          */
-        OPAL_SET_BINDING_POLICY(opal_hwloc_binding_policy, OPAL_BIND_TO_CPUSET);
+// Restoring pre ef86707fbe3392c8ed15f79cc4892f0313b409af behavior.
+// Formerly -cpu-set #,#,# along with -use_hwthread-cpus resulted
+// in the binding policy staying OPAL_BIND_TO_HWTHREAD
+// I think that should be right because I thought -cpu-set was a contraint you put
+// on another binding policy, not a binding policy in itself.
+        if (!OPAL_BINDING_POLICY_IS_SET(opal_hwloc_binding_policy)) {
+            OPAL_SET_BINDING_POLICY(opal_hwloc_binding_policy, OPAL_BIND_TO_CPUSET);
+        }
     }
 
     /* if we are binding to hwthreads, then we must use hwthreads as cpus */
diff --git a/opal/mca/hwloc/base/hwloc_base_util.c b/opal/mca/hwloc/base/hwloc_base_util.c
@@ -765,15 +765,34 @@ static hwloc_obj_t df_search(hwloc_topology_t topo,
         return found;
     }
     if (OPAL_HWLOC_AVAILABLE == rtype) {
+// The previous (3.x) code included a check for
+// available = opal_hwloc_base_get_available_cpus(topo, start)
+// and skipped objs that had hwloc_bitmap_iszero(available)
+        hwloc_obj_t root;
+        opal_hwloc_topo_data_t *rdata;
+        root = hwloc_get_root_obj(topo);
+        rdata = (opal_hwloc_topo_data_t*)root->userdata;
+        hwloc_cpuset_t constrained_cpuset;
+
+        constrained_cpuset = hwloc_bitmap_alloc();
+        if (rdata && rdata->available) {
+            hwloc_bitmap_and(constrained_cpuset, start->cpuset, rdata->available);
+        } else {
+            hwloc_bitmap_copy(constrained_cpuset, start->cpuset);
+        }
+
         unsigned idx = 0;
         if (num_objs)
-            *num_objs = hwloc_get_nbobjs_inside_cpuset_by_depth(topo, start->cpuset, search_depth);
+            *num_objs = hwloc_get_nbobjs_inside_cpuset_by_depth(topo, constrained_cpuset, search_depth);
         obj = NULL;
-        while ((obj = hwloc_get_next_obj_inside_cpuset_by_depth(topo, start->cpuset, search_depth, obj)) != NULL) {
-            if (idx == nobj)
+        while ((obj = hwloc_get_next_obj_inside_cpuset_by_depth(topo, constrained_cpuset, search_depth, obj)) != NULL) {
+            if (idx == nobj) {
+                hwloc_bitmap_free(constrained_cpuset);
                 return obj;
+            }
             idx++;
         }
+        hwloc_bitmap_free(constrained_cpuset);
         return NULL;
     }
     return NULL;
diff --git a/orte/mca/rmaps/base/rmaps_base_binding.c b/orte/mca/rmaps/base/rmaps_base_binding.c
@@ -16,6 +16,7 @@
  * Copyright (c) 2015-2017 Research Organization for Information Science
  *                         and Technology (RIST). All rights reserved.
  * Copyright (c) 2018      Inria.  All rights reserved.
+ * Copyright (c) 2019 IBM Corporation. All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -168,8 +169,19 @@ static int bind_generic(orte_job_t *jdata,
         trg_obj = NULL;
         min_bound = UINT_MAX;
         while (NULL != (tmp_obj = hwloc_get_next_obj_by_depth(node->topology->topo, target_depth, tmp_obj))) {
+            hwloc_obj_t root;
+            opal_hwloc_topo_data_t *rdata;
+            root = hwloc_get_root_obj(node->topology->topo);
+            rdata = (opal_hwloc_topo_data_t*)root->userdata;
+
             if (!hwloc_bitmap_intersects(locale->cpuset, tmp_obj->cpuset))
                 continue;
+// From the old 3.x code trg_obj was picked via a call to
+// opal_hwloc_base_find_min_bound_target_under_obj() which
+// skiped over unavailable objects (via opal_hwloc_base_get_npus).
+            if (rdata && rdata->available && !hwloc_bitmap_intersects(rdata->available, tmp_obj->cpuset))
+                continue;
+
             data = (opal_hwloc_obj_data_t*)tmp_obj->userdata;
             if (NULL == data) {
                 data = OBJ_NEW(opal_hwloc_obj_data_t);
diff --git a/orte/mca/rmaps/base/rmaps_base_ranking.c b/orte/mca/rmaps/base/rmaps_base_ranking.c
@@ -13,6 +13,7 @@
  * Copyright (c) 2014-2018 Intel, Inc.  All rights reserved.
  * Copyright (c) 2017      Research Organization for Information Science
  *                         and Technology (RIST). All rights reserved.
+ * Copyright (c) 2019 IBM Corporation. All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
diff --git a/orte/mca/rmaps/rank_file/rmaps_rank_file_component.c b/orte/mca/rmaps/rank_file/rmaps_rank_file_component.c
@@ -15,6 +15,7 @@
  * Copyright (c) 2014-2018 Intel, Inc. All rights reserved.
  * Copyright (c) 2015      Los Alamos National Security, LLC. All rights
  *                         reserved.
+ * Copyright (c) 2019 IBM Corporation. All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -106,7 +107,8 @@ static int orte_rmaps_rank_file_register(void)
 static int orte_rmaps_rank_file_open(void)
 {
     /* ensure we flag mapping by user */
-    if ((NULL != opal_hwloc_base_cpu_list && !OPAL_BIND_ORDERED_REQUESTED(opal_hwloc_binding_policy)) ||
+    if ((OPAL_BIND_TO_CPUSET == OPAL_GET_BINDING_POLICY(opal_hwloc_binding_policy) &&
+        !OPAL_BIND_ORDERED_REQUESTED(opal_hwloc_binding_policy)) ||
         NULL != orte_rankfile) {
         if (ORTE_MAPPING_GIVEN & ORTE_GET_MAPPING_DIRECTIVE(orte_rmaps_base.mapping)) {
             /* if a non-default mapping is already specified, then we

Original file line number	Diff line number	Diff line change
`@@ -13,6 +13,7 @@`
`13`	`13`	`* Copyright (c) 2014-2018 Intel, Inc. All rights reserved.`
`14`	`14`	`* Copyright (c) 2017 Research Organization for Information Science`
`15`	`15`	`* and Technology (RIST). All rights reserved.`
	`16`	`+ * Copyright (c) 2019 IBM Corporation. All rights reserved.`
`16`	`17`	`* $COPYRIGHT$`
`17`	`18`	`*`
`18`	`19`	`* Additional copyrights may follow`