
Attempt to free memory that is still in use by an ongoing MPI communication #3268

Closed
artpol84 opened this issue Apr 1, 2017 · 18 comments

artpol84 (Contributor) commented Apr 1, 2017

We have observed this error only once in our MTT. We are running v2.x with SLURM/PMIx there, so it may be related to this configuration, though I doubt it.
Here is the error message:

export OMPI_MCA_btl_openib_if_include=mlx4_0:1
OMPI_MCA_btl_openib_if_include=mlx4_0:1
OMPI_MCA_mpi_add_procs_cutoff=0
OMPI_MCA_pmix_base_async_modex=1
OMPI_MCA_pmix_base_collect_data=0
/tmp/mtt_116453_slurm/bin/srun -N 8 -n 64 --mpi=pmix_v1 -p pmellanox <mtt-base>/installs/T8JL/tests/mpich_tests/mpich-mellanox.git/test/mpi/coll/allgather2
[boo13:10605] Attempt to free memory that is still in use by an ongoing MPI communication (buffer 0xa89000, size 9302016).  MPI job will now abort.
srun: error: boo13: task 9: Exited with exit code 1
srun: Terminating job step 1592.0
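
For context: this abort appears to fire when a buffer still tied to an in-flight MPI operation is handed back to the allocator before that operation completes. As a minimal, hypothetical user-level sketch (buffer size, tag, and ranks are illustrative, not taken from allgather2), the same message can be produced by freeing a send buffer before waiting on the request:

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical sketch, not the allgather2 test: rank 0 frees its send
 * buffer while the nonblocking send is still pending, which is exactly
 * the condition the error message describes. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    char *buf = malloc(n);
    MPI_Request req;

    if (rank == 0) {
        MPI_Isend(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
        free(buf);                          /* WRONG: MPI still owns buf */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Irecv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* correct order on this side */
        free(buf);
    }

    MPI_Finalize();
    return 0;
}

(If the failure is a library-side regression, as suspected later in this thread, the premature free would be happening inside Open MPI itself; the sketch only illustrates what the message guards against.)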
artpol84 (Contributor, Author) commented Apr 3, 2017

Update: this triggered again today.
The first time I noticed it was on Thursday, 03/30/2017.

artpol84 (Contributor, Author) commented Apr 3, 2017

I always observe this with the allgather2 test (though it's not 100% reproducible). Given when it first appeared and the commit history, it may be related to #3159.

artpol84 (Contributor, Author) commented Apr 3, 2017

@vspetrov

hppritcha (Member) commented:

These MTT runs were done with vanilla Open MPI, no UCX, MXM, etc.

gpaulsen (Member) commented:

From the web-ex on 04/11: not 100% reproducible, so let's keep this issue open a bit longer.

artpol84 (Contributor, Author) commented Apr 12, 2017

Still reproducible; see the recent MTT report.

jsquyres (Member) commented:

Artem -- don't forget that you can click on the "Absolute date range" link on the right in MTT and get a short link: https://mtt.open-mpi.org/index.php?do_redir=2412

artpol84 (Contributor, Author) commented:

Thanks, I didn't know that: https://mtt.open-mpi.org/index.php?do_redir=2412

hppritcha modified the milestones: v2.1.2, v2.1.1 on Apr 24, 2017
jsquyres (Member) commented:

@artpol84 Where are we on this issue?

artpol84 (Contributor, Author) commented:

@jsquyres I don't see this error anymore in our MTT.

jsquyres (Member) commented:

Sweet! I'll close.

lcebaman commented:

I am getting exactly this error with Open MPI 3.0.0, using multithreaded RMA in my code. It is reproducible, and it does not happen with MPICH. Any ideas?
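
Not the actual application code, but a hypothetical multithreaded-RMA sketch of the kind of pattern that can produce this message: the origin buffer is freed before the access epoch that uses it is closed, so the memory is still in use by an ongoing communication when free() runs (a single thread is shown for brevity; window size, datatype, and target choice are illustrative):

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical RMA sketch: the MPI_Put is only guaranteed complete at
 * MPI_Win_unlock, so freeing the origin buffer before the unlock means
 * the memory is still in use by an ongoing MPI communication. */
int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;
    double *win_buf;
    MPI_Win win;
    MPI_Win_allocate((MPI_Aint)n * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &win_buf, &win);

    double *origin = malloc(n * sizeof(double));
    int target = (rank + 1) % size;

    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Put(origin, n, MPI_DOUBLE, target, 0, n, MPI_DOUBLE, win);
    free(origin);                   /* WRONG: Put completes only at unlock */
    MPI_Win_unlock(target, win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

Moving the free(origin) after MPI_Win_unlock (or after an MPI_Win_flush on the target) is the correct ordering.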

jsquyres (Member) commented:

@lcebaman Can you open a new issue with a small reproducer?

chchang6 commented:

Hi,
We have a user running into this at the moment on our system, with Open MPI 1.10.7 built against GCC 7.3.0. (We are on the 1.10 series because we hit startup/hang issues with newer versions at larger scales.) Was this thread ever continued?

jsquyres (Member) commented:

I'm afraid not. Have you tried Open MPI v4.x?

chchang6 commented:

Hi Jeff, no, but I got the same advice on another ticket for a different issue, so I will try that shortly. Thanks!

jsquyres (Member) commented:

Yeah, sorry, 1.10.x is ancient and not really supported any more.

chchang6 commented:

No worries; the only reason folks are running 1.10.x is that v2 and v3 have scalability problems in our environment. I suspect it has something to do with the startup mechanism, Slurm config, and our network, but I hope v4 works out of the box.
