
--host, binding and cpuset does not seem to work #6966


Closed
zerothi opened this issue Sep 9, 2019 · 14 comments

zerothi commented Sep 9, 2019

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

ompi: 3.1.4

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Manual installation.

Please describe the system on which you are running

  • Operating system/version:
    Scientific Linux 7.3
  • Computer hardware:
    EPYC 7551 (currently just testing single node)
  • Network type:
    not important (single node)

Details of the problem

I would like to fully control MPI rank placement and binding from the command line, optionally in combination with --host assignments.

In particular I would like to run the OSU micro-benchmarks, but I see the same problem with simple codes.

In the following I will use this small test program (for brevity):

program test
use mpi
character(len=MPI_MAX_PROCESSOR_NAME) :: name
integer :: i, n, namel
call MPI_Init(i)
call MPI_Comm_Rank(MPI_COMM_WORLD, n, i)
call MPI_Get_Processor_Name(name, namel, i)
print *, n, name(1:namel)
call MPI_Finalize(i)
end
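
For reference, a minimal build/launch sketch (assuming the snippet is saved as test.f90; mpif90 is Open MPI's Fortran compiler wrapper, and ./run is the binary name used in the scripts below):

mpif90 -o run test.f90
mpirun -np 2 --report-bindings ./run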

Cluster script:

I am requesting 64 cores on a 2-socket EPYC 7551 machine (64 cores in total).

#!/bin/bash
#BSUB -n 64
#BSUB -R "select[model == EPYC7551]"
#BSUB -R "rusage[mem=100MB]"
#BSUB -q epyc
#BSUB -W 1:00

# Read array of affinity settings (one "host cpu-id" line per slot; see the example after this script)
readarray affinity < affinity.$LSB_JOBID
# See #6631 which I created some time ago (and is fixed in master)
unset LSB_AFFINITY_HOSTFILE
size=${#affinity[@]}

for i in $(seq 0 $((size-1)))
do
    hosti=$(echo ${affinity[$i]} | awk '{print $1}')
    cpuseti=$(echo ${affinity[$i]} | awk '{print $2}')
    for j in $(seq $((i+1)) $((size-1)))
    do
	hostj=$(echo ${affinity[$j]} | awk '{print $1}')
	cpusetj=$(echo ${affinity[$j]} | awk '{print $2}')

	# 1. Direct CPU-set
	mpirun --report-bindings \
	    -np 2 --cpu-set $cpuseti,$cpusetj ./run \
	    > direct.$hosti.$cpuseti-$hostj.$cpusetj
	
	# 2. Explicit sub (report-bindings first)
	mpirun --report-bindings \
	    -np 1 --host $hosti --cpu-set $cpuseti ./run \
	    : \
	    -np 1 --host $hostj --cpu-set $cpusetj ./run \
	    > sub-0.$hosti.$cpuseti-$hostj.$cpusetj

	# 3. Explicit sub (report-bindings 1)
	mpirun --report-bindings \
	    -np 1 --host $hosti --cpu-set $cpuseti --report-bindings ./run \
	    : \
	    -np 1 --host $hostj --cpu-set $cpusetj ./run \
	    > sub-1.$hosti.$cpuseti-$hostj.$cpusetj

	# 4. Explicit sub (report-bindings 2)
	mpirun --report-bindings \
	    -np 1 --host $hosti --cpu-set $cpuseti ./run \
	    : \
	    -np 1 --host $hostj --cpu-set $cpusetj --report-bindings ./run \
	    > sub-2.$hosti.$cpuseti-$hostj.$cpusetj

	# 5. Explicit sub (report-bindings 1 and 2)
	mpirun --report-bindings \
	    -np 1 --host $hosti --cpu-set $cpuseti --report-bindings ./run \
	    : \
	    -np 1 --host $hostj --cpu-set $cpusetj --report-bindings ./run \
	    > sub-1-2.$hosti.$cpuseti-$hostj.$cpusetj

	# 6. Explicit affinity setting via env-var
	{
	    echo ${affinity[$i]}
	    echo ${affinity[$j]}
	} > test.affinity
	cat test.affinity
	export LSB_AFFINITY_HOSTFILE=$(pwd)/test.affinity  # exported so mpirun sees it
	mpirun -np 2 --report-bindings $OSU_HOME/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency \
	    > affinity.$hosti.$cpuseti-$hostj.$cpusetj
	unset LSB_AFFINITY_HOSTFILE
	
    done
done
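
For reference, the affinity.$LSB_JOBID file is expected to contain one line per slot with a host name and a CPU id, as implied by the awk parsing above (values made up):

n-62-27-29 0
n-62-27-29 1
n-62-27-29 2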

Explaining the script

Although I am allocating entire nodes and not using all of the cores, I would still expect Open MPI to obey my requested bindings.

  1. A single host only requires the cpu-set; a comma-separated list should be enough.
    This gives me:
[n-62-27-29:01122] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././.]
[n-62-27-29:01122] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././.]

regardless of the --cpu-set <args> given.

Cases 2-6 all yield exactly the same output:

[n-62-27-29:04404] MCW rank 0 is not bound (or bound to all available processors)
[n-62-27-29:04404] MCW rank 1 is not bound (or bound to all available processors)

I have also tried adding --bind-to core with the same output.

Possibly related issues:

zerothi commented Sep 10, 2019

Let me note that I tried the different --report-bindings placements because of what I see on my local machine (Debian 10.1; otherwise Open MPI and GCC are the same versions).

There I get:

$> mpirun --report-bindings --bind-to core -np 1 --cpu-set 0 ./a.out : -np 1 --cpu-set 1 ./a.out 
[nicpa-dtu:10600] MCW rank 0 is not bound (or bound to all available processors)
[nicpa-dtu:10600] MCW rank 1 is not bound (or bound to all available processors)
$> mpirun --bind-to core -np 1 --cpu-set 0 --report-bindings ./a.out : -np 1 --cpu-set 1 ./a.out 
[nicpa-dtu:10611] MCW rank 0 is not bound (or bound to all available processors)
[nicpa-dtu:10611] MCW rank 1 is not bound (or bound to all available processors)
$> mpirun --bind-to core -np 1 --cpu-set 0  ./a.out : -np 1 --cpu-set 1 --report-bindings ./a.out 
[nicpa-dtu:10628] MCW rank 1 bound to SK0:L30:L20:L10:CR0:HT0-1
$> mpirun --bind-to core -np 1 --cpu-set 0  --report-bindings ./a.out : -np 1 --cpu-set 1 --report-bindings ./a.out 
[nicpa-dtu:10633] MCW rank 0 is not bound (or bound to all available processors)
[nicpa-dtu:10633] MCW rank 1 is not bound (or bound to all available processors)

Note the output of the third command!

rhc54 commented Sep 10, 2019

I'll check - it is possible that the confusion lies in the output. If the cpu-set is a single processor and you tell us to bind-to core, then the proc sees that "all available processors" is just the one that it is executing upon - which it interprets as being "bound to all available processors" as the message says.
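
One way to disambiguate the message, if useful, is to have each process print the affinity mask the kernel actually applied; a quick sketch using standard Linux procfs (an equivalent check such as taskset -cp would also do):

mpirun --bind-to core -np 1 --cpu-set 0 grep Cpus_allowed_list /proc/self/status \
    : -np 1 --cpu-set 1 grep Cpus_allowed_list /proc/self/status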

zerothi commented Sep 11, 2019

@rhc54 OK, but I would have expected something like:

$> mpirun --report-bindings --bind-to core -np 2 ./a.out
[nicpa-dtu:07101] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/..]
[nicpa-dtu:07101] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB]

Also, doing:

$> mpirun --report-bindings --bind-to core -np 1 --cpu-set 0 ./a.out : --bind-to core -np 1 --cpu-set 2 ./a.out
[nicpa-dtu:07197] MCW rank 0 is not bound (or bound to all available processors)
[nicpa-dtu:07197] MCW rank 1 is not bound (or bound to all available processors)

However, if I do (with an application file):

$> cat appfile
--report-bindings --bind-to core -np 1 numactl --physcpubind=0 ./a.out
--report-bindings --bind-to core -np 1 numactl --physcpubind=1 ./a.out

I get

$> mpirun --app appfile
[nicpa-dtu:08893] MCW rank 0 bound to SK0:L30:L20:L10:CR0:HT0-1
[nicpa-dtu:08894] MCW rank 1 bound to SK0:L30:L21:L11:CR1:HT2-3

which looks correct (although I am not familiar with the SK*... notation).

rhc54 commented Sep 11, 2019

What do you get if you run:

mpirun --report-bindings --bind-to core -np 1 --cpu-set 0 numactl --show : --bind-to core -np 1 --cpu-set 2 numactl --show

zerothi commented Sep 11, 2019

On my local machine I get:

$> mpirun --report-bindings --bind-to core -np 1 --cpu-set 0 numactl --show : --bind-to core -np 1 --cpu-set 2 numactl --show                                                                                                               

policy: default
preferred node: current
physcpubind: 0 2 
cpubind: 0 
nodebind: 0 
membind: 0 
[nicpa-dtu:12015] MCW rank 0 is not bound (or bound to all available processors)
[nicpa-dtu:12015] MCW rank 1 is not bound (or bound to all available processors)
policy: default
preferred node: current
physcpubind: 0 2 
cpubind: 0 
nodebind: 0 
membind: 0 

zerothi commented Sep 11, 2019

With the --app I get:

$> cat appfile                                                                                                                                                                                                                              
--report-bindings --bind-to core -np 1 numactl --physcpubind=0 --show
--report-bindings --bind-to core -np 1 numactl --physcpubind=1 --show
$> mpirun --app appfile                                                                                                                                                                                                                     
policy: default
preferred node: current
physcpubind: 0 
cpubind: 0 
nodebind: 0 
membind: 0 
policy: default
preferred node: current
physcpubind: 1 
cpubind: 0 
nodebind: 0 
membind: 0 

zerothi commented Sep 11, 2019

I added two more tests to the cluster script:

    # 7. Create an appfile (bind-to none)
    {
        echo --report-bindings --bind-to none -np 1 --host $hosti numactl --physcpubind=$cpuseti ./run
        echo --report-bindings --bind-to none -np 1 --host $hostj numactl --physcpubind=$cpusetj ./run
    } > appfile
    mpirun --app appfile > appfile-none.$hosti.$cpuseti-$hostj.$cpusetj

    # 8. Create an appfile (bind-to core)
    {
        echo --report-bindings --bind-to core -np 1 --host $hosti numactl --physcpubind=$cpuseti ./run
        echo --report-bindings --bind-to core -np 1 --host $hostj numactl --physcpubind=$cpusetj ./run
    } > appfile
    mpirun --app appfile > appfile-core.$hosti.$cpuseti-$hostj.$cpusetj

In this case I find that 7. works correctly (YAY!): for cpuseti=0, cpusetj=3 I get:

[n-62-27-29:47440] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././.]
[n-62-27-29:47441] MCW rank 1 bound to socket 0[core 3[hwt 0]]: [./././B/./././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././.]

However, for 8. I get:

libnuma: Warning: cpu argument 3 is out of range

which suggests that the MPI process's binding (and thus its cpuset) is restricted before the executable runs; however, specifying --physcpubind=0 does not work either.

I hope this can be used to dig out what is happening and how to use Open MPI to control this. Simpler command-line arguments would be nice for end users, since numactl is a bit magic to regular users (and sometimes also to me ;)).
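
To see which cpuset the processes inherit from mpirun before numactl tries to narrow it, one could wrap the real command in a small script; a sketch (the wrapper name show-then-run.sh is made up):

#!/bin/bash
# show-then-run.sh: print the affinity mask inherited from the launcher, then run the real command
grep Cpus_allowed_list /proc/self/status
exec "$@"

and then use "./show-then-run.sh numactl --physcpubind=$cpuseti ./run" in the appfile lines above.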

rhc54 commented Sep 12, 2019

A couple of us looked into this and found that:

  • cpu-set appears to be actually binding things, but not the way one would naturally expect
  • there is another option (cpu-list) which appears to have a somewhat overlapping purpose (at least, from the man page output) and results in identical behavior
  • neither of us could fully describe what we wanted to have happen with either option, nor why two options were required

We concluded that backporting a fix to the v3.x series was unlikely to be either easy or small, which made the release manager for that series reluctant to consider accepting it. Fixing it for v5.x is definitely something we will do, and it might (depending on the solution) be acceptable to the release managers to backport that fix to v4.x.

Meantime, a workaround that appeared to work for us was:

mpirun --map-by core --bind-to core -np 1 --cpu-set 0,3 app1 : -np 1 app2

This tells mpirun to utilize an envelope of cpus 0 and 3, and to map/bind the procs by core within that envelope. So the first proc in the job will be bound to cpu0 and the second proc in the job (which is the first proc of app2) will be bound to cpu3.
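
To check the result, the same envelope form can be combined with the numactl --show diagnostic from earlier (just a sketch of the verification):

mpirun --map-by core --bind-to core -np 1 --cpu-set 0,3 numactl --show : -np 1 numactl --show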

Give that a try and see if it works for you.

zerothi commented Sep 13, 2019

Thanks! I am ok with having this for 5, but it would be really nice to have in 4. ;)
FYI, I also played with cpu-list, to no avail as you've found.

If the options are indeed not working, it should probably be noted in the documentation for the next 3.x release so that people are aware of it (not that it is much trouble for me :)).

I have also tried your suggestion. However, I can't get it to work:

$> mpirun --map-by core --bind-to core -np 1 --cpu-set $cpuseti,$cpusetj ./run : -np 1 ./run 
--------------------------------------------------------------------------
Conflicting directives for mapping policy are causing the policy
to be redefined:

  New policy:   RANK_FILE
  Prior policy:  BYCORE

Please check that only one policy is defined.
--------------------------------------------------------------------------
$> mpirun --map-by core -np 1 --cpu-set $cpuseti,$cpusetj ./run : -np 1 ./run 
--------------------------------------------------------------------------
Conflicting directives for mapping policy are causing the policy
to be redefined:

  New policy:   RANK_FILE
  Prior policy:  BYCORE

Please check that only one policy is defined.
--------------------------------------------------------------------------
$> mpirun --bind-to core -np 1 --cpu-set $cpuseti,$cpusetj ./run : -np 1 ./run 
[n-62-27-29:05179] MCW rank 0 is not bound (or bound to all available processors)
[n-62-27-29:05179] MCW rank 1 is not bound (or bound to all available processors)

It doesn't seem to work. :(

Also, the above won't work on 2 different hosts.

Well, it seems that the app-file solution is sufficient and solves all problems.

If you want me to test more, please do not hesitate to contact me.

awlauria commented:

This should be retested with master/v5.0.x with prrte.

awlauria commented Jun 29, 2021

Adding the blocker label for now - but it may not be.

awlauria commented Jul 1, 2021

This might be fixed in v5.0.x, just needs retesting.

awlauria commented Aug 3, 2021

Re-testing this on current master, it does appear to be fixed.

Here are some example runs with master/prrte (v5.0):

Note that --cpu-set has changed to --map-by :PE-LIST=... and --report-bindings to --display bind.

Local host:

$ ./exports/bin/mpirun -np 2 --map-by :PE-LIST=3,4 --display bind --host "hostA:2" hostname
[hostA:3738581] MCW rank 0 bound to package[0][core:3]
[hostA:3738581] MCW rank 1 bound to package[0][core:4]
hostA
hostA

Remote host:

$ ./exports/bin/mpirun -np 2 --map-by :PE-LIST=3,4 --display bind --host "hostB:2" hostname
[hostB:733813] MCW rank 0 bound to package[0][core:3]
[hostB:733813] MCW rank 1 bound to package[0][core:4]
hostB
hostB

I think we can remove the v5 label, and possibly close this if there is no intention of bringing any required fixes to v4.

awlauria commented Aug 3, 2021

Confirmed also works with the current v5.0.x branch.

awlauria closed this as completed Aug 4, 2021