p buscemi
2018-12-10 14:24:43 UTC
Thank you, Mark, for the prompt response. I realize the limitations of the system (it's over 8 years old), but I did not expect the speed to decrease by 50% with 12 available threads! No combination of -ntomp and -ntmpi could raise ns/day above 4 with two GPUs, vs. 6 with one GPU.
This is actually a learning/practice run for a new build: an AMD 4.2 GHz 32-core Threadripper with 64 GB RAM. For that machine I am trying to decide between one RTX 2080 Ti and two GTX 1080 Tis. I'd prefer the two 1080 Tis for their ~7000 CUDA cores vs. the ~4500 of the 2080 Ti. The model systems will have ~1 million particles and need the speed. But this is a major expense, so I need to get it right.
I'll do as you suggest and report the results for both systems. I really appreciate the assistance.
Paul
UMN, BICB
Dear Users,
I have had good luck using a single GPU with the basic setup. However, on going from one GTX 1060 to two, the rate for a 50,000-atom system decreased from 10 ns/day to 5 or worse. The system models a ligand, solvent (water), and a lipid membrane.
The CPU is a 6-core Intel i7 970 (12 threads), with a 750 W PSU and 16 GB RAM.
Back Off! I just backed up SR.sys.nvt.log to ./#SR.sys.nvt.log.10#
Reading file SR.sys.nvt.tpr, VERSION 2018.3 (single precision)
Changing nstlist from 10 to 100, rlist from 1 to 1
Using 2 MPI threads
Using 6 OpenMP threads per tMPI thread
On host I7 2 GPUs auto-selected for this run.
PP:0,PP:1
Back Off! I just backed up SR.sys.nvt.trr to ./#SR.sys.nvt.trr.10#
Back Off! I just backed up SR.sys.nvt.edr to ./#SR.sys.nvt.edr.10#
NOTE: DLB will not turn on during the first phase of PME tuning
starting mdrun 'SR-TA'
100000 steps, 100.0 ps.
and ending with ^C
Received the INT signal, stopping within 200 steps
DLB was locked at the end of the run due to unfinished PP-PME balancing.
Average load imbalance: 0.7%.
The balanceable part of the MD step is 46%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 0.3%.
Core t (s) Wall t (s) (%)
Time: 543.475 45.290 1200.0
(ns/day) (hour/ns)
Performance: 1.719 13.963 (before DLB is turned on)
Very poor performance. I have been following, or trying to follow, "Performance Tuning and Optimization of GROMACS" by M. Abraham and R. Apostolov (2016), but have not yet cracked the code.
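For completeness, the rank/thread combinations I have been trying can be swept in a script. This is only a sketch: the -nsteps cap, -resethway (reset the timers halfway, after PME tuning settles), and the DRY_RUN guard are my additions for short trials, not flags from the runs below.

```shell
# Sweep rank/thread splits that use all 12 hardware threads with both GPUs.
# DRY_RUN=1 only prints the commands instead of running them.
DRY_RUN=1
cmds=""
for ntmpi in 1 2 4 6; do
  ntomp=$((12 / ntmpi))   # keep ntmpi * ntomp = 12
  cmd="gmx mdrun -deffnm SR.sys.nvt -ntmpi $ntmpi -ntomp $ntomp -gpu_id 01 -pin on -nsteps 10000 -resethway"
  cmds="$cmds$cmd
"
  if [ "$DRY_RUN" -eq 1 ]; then echo "$cmd"; else $cmd; fi
done
```

Comparing the ns/day line of each short trial should show which split the hardware prefers before committing to a long run.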
----------------
gmx mdrun -deffnm SR.sys.nvt -ntmpi 2 -ntomp 3 -gpu_id 01 -pin on
Back Off! I just backed up SR.sys.nvt.log to ./#SR.sys.nvt.log.13#
Reading file SR.sys.nvt.tpr, VERSION 2018.3 (single precision)
Changing nstlist from 10 to 100, rlist from 1 to 1
Using 2 MPI threads
Using 3 OpenMP threads per tMPI thread
On host I7 2 GPUs auto-selected for this run.
PP:0,PP:1
Back Off! I just backed up SR.sys.nvt.trr to ./#SR.sys.nvt.trr.13#
Back Off! I just backed up SR.sys.nvt.edr to ./#SR.sys.nvt.edr.13#
NOTE: DLB will not turn on during the first phase of PME tuning
starting mdrun 'SR-TA'
100000 steps, 100.0 ps.
NOTE: DLB can now turn on, when beneficial
^C
Received the INT signal, stopping within 200 steps
DLB was off during the run due to low measured imbalance.
Average load imbalance: 0.7%.
The balanceable part of the MD step is 46%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 0.3%.
Core t (s) Wall t (s) (%)
Time: 953.837 158.973 600.0
(ns/day) (hour/ns)
Performance: 2.935 8.176
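As a sanity check on that throughput line: ns/day is just simulated time over wall-clock time. The 2700-step count here is back-calculated assuming a 2 fs timestep, since the run above was interrupted early.

```shell
# Reproduce mdrun's ns/day figure: (steps * dt) in ns, divided by wall days.
steps=2700
dt_fs=2.0
wall_s=158.973
nsday=$(awk -v s="$steps" -v dt="$dt_fs" -v w="$wall_s" \
  'BEGIN { printf "%.3f", s * dt * 1e-6 / (w / 86400) }')
echo "$nsday ns/day"   # matches the 2.935 ns/day reported above
```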
====================
The beginning of the log file is:
GROMACS version: 2018.3
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: SSE4.1
FFT library: fftw-3.3.8-sse2
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
Built on: 2018-10-19 21:26:38
Build OS/arch: Linux 4.15.0-20-generic x86_64
Build CPU vendor: Intel
Build CPU family: 6 Model: 44 Stepping: 2
Build CPU features: aes apic clfsh cmov cx8 cx16 htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
C compiler: /usr/bin/gcc-6 GNU 6.4.0
C compiler flags: -msse4.1 -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler: /usr/bin/g++-6 GNU 6.4.0
C++ compiler flags: -msse4.1 -std=c++11 -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
CUDA compiler: /usr/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2017 NVIDIA Corporation;Built on Fri_Nov__3_21:07:56_CDT_2017;Cuda compilation tools, release 9.1, V9.1.85
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_70,code=compute_70;-use_fast_math;-D_FORCE_INLINES;; ;-msse4.1;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver: 9.10
CUDA runtime: 9.10
Running on 1 node with total 12 cores, 12 logical cores, 2 compatible GPUs
Vendor: Intel
Family: 6 Model: 44 Stepping: 2
Features: aes apic clfsh cmov cx8 cx16 htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
Hardware topology: Only logical processor count
Number of GPUs detected: 2
#0: NVIDIA GeForce GTX 1060 6GB, compute cap.: 6.1, ECC: no, stat: compatible
#1: NVIDIA GeForce GTX 1060 6GB, compute cap.: 6.1, ECC: no, stat: compatible
There were no errors encountered during the runs. Suggestions would be appreciated.
Regards
Paul
--
Gromacs Users mailing list
* Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a mail to gmx-users-***@gromacs.org.