
Commit b2338de

removing an overly aggressive error check in binding
In bind_generic() there's a loop that picks a starting trg_obj and then walks
through next = trg_obj->next_cousin until it has made total_cpus assignments.
But the code doesn't accept that those assignments might not land on adjacent
objects.

Example:

    % mpirun -np 2 --report-bindings --map-by ppr:2:node:pe=3 \
        --cpu-set 4,5,7,8,9,11 -bind-to hwthread:overload-allowed
    > MCW 0 : [..../BB.B/..../....]
    > MCW 1 : [..../..../BB.B/....]

It will want to assign 3 cpus and will loop through

    trg_obj 00001 (with ncpus 1)
    trg_obj 000001 (with ncpus 1)
    trg_obj 0000001 (with ncpus 0)
    trg_obj 000000011 (with ncpus 1)

On the third entry, the original code would see num_bound for the object
become too high for its ncpus and think oversubscription was happening.

I changed it to only ++num_bound, i.e. to use that object, if the object still
has cpus in its cpuset after it is intersected with the allowed/available
masks.

The error message from the original code (if you remove the overload-allowed)
would be

    > A request was made to bind to that would result in binding more
    > processes than cpus on a resource:
    >    Bind to:     HWTHREAD
    >    Node:        ...
    >    #processes:  1
    >    #cpus:       0
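The walk described above can be modelled in a few lines of standalone C. This is only a sketch, not the ORTE code: `sim_obj_t`, `sim_bind`, and the `skip_empty` flag are hypothetical names invented for illustration, and a plain array stands in for the `next_cousin` chain. It assumes the overload check is simply `num_bound > ncpus` on each visited object.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an hwloc object plus its per-object binding data. */
typedef struct {
    unsigned ncpus;      /* usable cpus left after masking with allowed/available */
    unsigned num_bound;  /* procs counted as bound to this object */
} sim_obj_t;

/*
 * Walk objs[start..n-1] in order (a stand-in for following next_cousin)
 * until total_cpus cpus have been collected.
 *
 * skip_empty == false: bump num_bound on every visited object (old behavior).
 * skip_empty == true:  bump num_bound only when ncpus > 0 (behavior after
 *                      this commit).
 *
 * Returns how many visited objects ended up with num_bound > ncpus, i.e.
 * how many times the overload check would have fired.
 */
static unsigned sim_bind(sim_obj_t *objs, size_t n, size_t start,
                         unsigned total_cpus, bool skip_empty)
{
    unsigned assigned = 0;
    unsigned overloaded = 0;

    assert(objs != NULL && start < n);

    for (size_t i = start; i < n && assigned < total_cpus; i++) {
        if (!skip_empty || objs[i].ncpus > 0) {
            objs[i].num_bound++;
        }
        if (objs[i].num_bound > objs[i].ncpus) {
            overloaded++;   /* would be reported as oversubscription */
        }
        assigned += objs[i].ncpus;
    }
    return overloaded;
}
```

Calling `sim_bind` on four objects with `ncpus` of 1, 1, 0, 1 and `total_cpus` of 3 reproduces the case from the message: with `skip_empty` false the third object is flagged as overloaded, and with `skip_empty` true no object is flagged.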
1 parent 072cf8d commit b2338de

1 file changed: +3 -1 lines changed

orte/mca/rmaps/base/rmaps_base_binding.c

@@ -224,7 +224,9 @@ static int bind_generic(orte_job_t *jdata,
                 data = OBJ_NEW(opal_hwloc_obj_data_t);
                 trg_obj->userdata = data;
             }
-            data->num_bound++;
+            if (ncpus) {
+                data->num_bound++;
+            }
             /* error out if adding a proc would cause overload and that wasn't allowed,
              * and it wasn't a default binding policy (i.e., the user requested it)
              */
