[gmx-users] Simulation freezes but the job keeps on running
Searle Duay
2018-01-25 17:39:11 UTC
Good day!

I am running a 10 ns peptide-membrane simulation using GPUs on PSC
Bridges. The simulation starts properly, but it does not finish by the time
that mdrun estimates. The job keeps running, yet the simulation appears
frozen: no simulation time has been added even after the job has been
running for an hour.

I have submitted the following SLURM code:

#!/bin/bash
#SBATCH -J k80_1n_4g
#SBATCH -o %j.out
#SBATCH -N 1
#SBATCH -n 28
#SBATCH --ntasks-per-node=28
#SBATCH -p GPU
#SBATCH --gres=gpu:k80:4
#SBATCH -t 48:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=***@uconn.edu

set echo
set -x

module load gromacs/2016_gpu

echo SLURM_NPROCS= $SLURM_NPROCS

cd $SCRATCH/prot_umbrella/gromacs/conv

mpirun -np $SLURM_NPROCS gmx_mpi mdrun -deffnm umbrella8 -pf pullf-umbrella8.xvg -px pullx-umbrella8.xvg -v -ntomp 2

exit

I am not sure if the error is from the hardware or from my simulation
setup. I have already run similar simulations (I just varied the number of
nodes that I am using, but the same system), and some of them were successful.
There are just some which seem to freeze in the middle of the run.
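
(A minimal sketch of how the stall can be confirmed, assuming the default
-deffnm umbrella8 log name; <nodename> is a placeholder taken from squeue:)

# find which node the job is running on
squeue -u $USER -o "%.10i %.20j %.8T %.10M %R"

# watch whether the step counter in the mdrun log still advances
tail -n 5 umbrella8.log     # repeat a few minutes later and compare

# check whether the GPUs on that node are still doing any work
ssh <nodename> nvidia-smi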

Thank you!
--
Searle Aichelle S. Duay
Ph.D. Student
Chemistry Department, University of Connecticut
***@uconn.edu
Åke Sandgren
2018-01-25 18:08:53 UTC
Is that build using OpenMPI?

We've seen cases where GROMACS built with OpenMPI hangs repeatedly, while
the same build using Intel MPI works.

We still haven't figured out why.
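
A quick way to check, sketched below; this assumes the gromacs/2016_gpu
module puts a real gmx_mpi binary (not a wrapper script) on your PATH.

module load gromacs/2016_gpu
# the MPI library the binary is linked against usually reveals the flavor in its path
ldd $(which gmx_mpi) | grep -i mpi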
--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: ***@hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
Searle Duay
2018-01-25 18:23:05 UTC
Hi Ake,

I am not sure, and I don't know how to check the build, but I see the
following in the output log file whenever I run GROMACS on PSC Bridges:

GROMACS version: 2016
Precision: single
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 32)
GPU support: CUDA
SIMD instructions: AVX2_256
FFT library: fftw-3.3.4-sse2-avx
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: hwloc-1.7.0
Tracing support: disabled
Built on: Fri Oct 7 15:06:50 EDT 2016
Built by: ***@gpu012.pvt.bridges.psc.edu [CMAKE]
Build OS/arch: Linux 3.10.0-327.4.5.el7.x86_64 x86_64
Build CPU vendor: Intel
Build CPU brand: Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz
Build CPU family: 6 Model: 63 Stepping: 2
Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf
mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2
sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler: /usr/lib64/ccache/cc GNU 4.8.5
C compiler flags: -march=core-avx2 -O3 -DNDEBUG -funroll-all-loops
-fexcess-precision=fast
C++ compiler: /usr/lib64/ccache/c++ GNU 4.8.5
C++ compiler flags: -march=core-avx2 -std=c++0x -O3 -DNDEBUG
-funroll-all-loops -fexcess-precision=fast
CUDA compiler: /opt/packages/cuda/8.0RC/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on Wed_May__4_21:01:56_CDT_2016;Cuda compilation tools, release 8.0, V8.0.26
CUDA compiler flags: -gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_60,code=compute_60;-gencode;arch=compute_61,code=compute_61;-use_fast_math;;;-Xcompiler;,-march=core-avx2,,,,,,;-Xcompiler;-O3,-DNDEBUG,-funroll-all-loops,-fexcess-precision=fast,,;
CUDA driver: 9.0
CUDA runtime: 8.0

Would that be built using OpenMPI?

Thanks!

Searle
Szilárd Páll
2018-01-25 19:41:29 UTC
Post by Searle Duay
Would that be built using OpenMPI?
Based on that it's hard to say. We don't detect MPI flavors; the only hint
in the version header would be the path to the compiler wrapper, which might
indicate which MPI library was used. However, in this case whoever compiled
GROMACS used ccache, so we can't see the full path to an mpicc binary.

I suggest that you consult your admins, and perhaps try a different MPI
library or version.
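
A rough sketch of how to see what the current environment provides and what
else is installed (module names on Bridges may differ, so treat these as
examples only):

# identify the MPI flavor behind the loaded module:
# Open MPI prints "mpirun (Open MPI) x.y.z", Intel MPI prints "Intel(R) MPI Library ..."
mpirun --version

# list alternative GROMACS and MPI builds offered by the site
module avail gromacs
module avail mpi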


Åke, do you have any other data from your investigation (e.g. the version/range
that reproduced the hangs, frequency of hangs, size of the runs, etc.)?

--
Szilárd
Åke Sandgren
2018-01-25 20:14:04 UTC
Post by Szilárd Páll
Åke, do you have any other data from your investigation (e.g. the version/range
that reproduced the hangs, frequency of hangs, size of the runs, etc.)?
No hard data, but multiple versions of OpenMPI 2.x and various user cases
of different sizes; it usually takes a couple of hours before it hangs.

I basically gave up after determining that Intel MPI solved the problem.
I just have too many other, more pressing issues to deal with at the
moment.

With a bit of luck I'll be able to revisit this some time later. I do
have a specific case that always hangs, though.
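
For reference, a rough sketch of how one can see where the hung ranks are
stuck; <jobid>, <nodename> and <pid> are placeholders, and gstack ships with
the gdb package on most RHEL-like systems.

# from the login node, find the node(s) the hung job is running on
squeue -j <jobid> -o "%R"

# on that node, grab the PIDs of the mdrun ranks and dump their stacks
ssh <nodename>
pgrep -u $USER gmx_mpi
gstack <pid>    # or: gdb -p <pid> -batch -ex 'thread apply all bt'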