HPC/Applications/lammps



== Benchmark ==

Using a sample workload from Sanket ("run9"), I tested various OpenMPI options on both node types.

LAMMPS performs best on gen2 nodes without extra options, and surprisingly well on gen1 nodes over Ethernet.

{| class="wikitable"
! Job tag !! Node type !! Interconnect !! Additional OpenMPI options !! Relative speed<br />(1000 steps/3 hours) !! Notes
|-
| gen1 || gen1 || IB || (none) || 36 ||
|-
| gen1srqpin || gen1 || IB || <code>-mca btl_openib_use_srq 1 -mca mpi_paffinity_alone 1</code> || 39 ||
|-
| gen1eth || gen1 || Ethernet || <code>-mca btl self,tcp</code> || 44 || fastest for gen1
|-
| gen2eth || gen2 || Ethernet || <code>-mca btl self,tcp</code> || 49 ||
|-
| gen2srq || gen2 || IB || <code>-mca btl_openib_use_srq 1</code> || 59 ||
|-
| gen2 || gen2 || IB || (none) || 59 || fastest for gen2
|}
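The relative speeds above come from the LAMMPS timing summary in each run's output. A rough way to collect those lines from several runs is sketched below; the <code>*/lammps.out</code> layout (one directory per job tag) is only an assumption about how the runs were organized, not part of the benchmark itself.

<syntaxhighlight lang="bash">
#!/bin/bash
# Sketch: collect the "Loop time of ..." summary line from each run's output.
# Assumes one directory per job tag, each holding a lammps.out (assumption).
for out in */lammps.out; do
    printf '%-12s ' "$(dirname "$out")"
    grep -m1 '^Loop time of' "$out" || echo "no timing line found"
done
</syntaxhighlight>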

=== Sample job file gen1 ===

<syntaxhighlight lang="bash">
#!/bin/bash
#PBS -l nodes=10:ppn=8:gen1
#PBS -l walltime=1:00:00:00
#PBS -N <jobname>
#PBS -A <account>
#
#PBS -o job.out
#PBS -e job.err
#PBS -m ea

# change into the directory from which qsub was executed
cd $PBS_O_WORKDIR

# "-mca btl self,tcp" forces TCP (Ethernet) transport, the fastest choice for gen1 in the benchmark above
mpirun  -machinefile  $PBS_NODEFILE \
        -np $(wc -l < $PBS_NODEFILE) \
        -mca btl self,tcp \
        lmp_openmpi < lammps.in > lammps.out 2> lammps.err
</syntaxhighlight>

=== Sample job file gen2 ===

<syntaxhighlight lang="bash">
#!/bin/bash
#PBS -l nodes=10:ppn=8:gen2
#PBS -l walltime=1:00:00:00
#PBS -N <jobname>
#PBS -A <account>
#
#PBS -o job.out
#PBS -e job.err
#PBS -m ea

# change into the directory from which qsub was executed
cd $PBS_O_WORKDIR

mpirun  -machinefile  $PBS_NODEFILE \
        -np $(wc -l < $PBS_NODEFILE) \
        lmp_openmpi < lammps.in > lammps.out 2> lammps.err
</syntaxhighlight>
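Either job file is submitted with the usual PBS commands; the file name <code>job_gen2.sh</code> below is merely a placeholder for whatever the script above was saved as.

<syntaxhighlight lang="bash">
qsub job_gen2.sh      # submit; prints the job ID
qstat -u $USER        # check its state (Q = queued, R = running)
</syntaxhighlight>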

== MPI/OpenMP hybrid parallel runs ==

LAMMPS modules since 2012 are compiled with the USER-OMP package enabled (<code>yes-user-omp</code>), permitting multi-threaded runs of selected pair styles and, in particular, MPI/OpenMP hybrid parallel runs.
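These builds are provided as environment modules. The module name used below is only a placeholder; check <code>module avail</code> for the builds actually installed on Carbon.

<syntaxhighlight lang="bash">
module avail lammps    # list the LAMMPS builds installed on Carbon
module load lammps     # placeholder name; load a specific 2012-or-later build
</syntaxhighlight>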

Be careful how you allocate CPU cores on compute nodes. Note the following:
* The number of cores on a node reserved for your use is determined by the qsub <code>ppn=...</code> parameter.
* The number of MPI tasks (call it <code>ppn_mpi</code>) running on a node is determined by options to mpirun.
* The number of threads that each MPI task runs with is determined by the environment variable <code>OMP_NUM_THREADS</code>, which is 1 by default on Carbon.
* The number of physical cores per node for gen1 and gen2 nodes is 8.
* gen2 nodes have [http://en.wikipedia.org/wiki/Hyperthreading hyperthreading] active, meaning there are 16 ''logical'' cores per node. However:
** The method shown below cannot consistently use hyperthreading, since PBS is told that nodes have exactly 8 cores; ppn requests higher than that cannot be fulfilled.
** My ([[User:Stern|stern]]) own [[HPC/Benchmarks/Generation_1_vs_2#Hyperthreading_and_node-sharing | benchmarks for a memory-intensive DFT program]] were underwhelming.
** The LAMMPS OpenMP author [http://lammps.sandia.gov/doc/Section_accelerate.html#acc_2 reports the same] (near the end of the section):
<blockquote>
Using threads on hyper-threading enabled cores is usually counterproductive, as the cost in additional memory bandwidth requirements is not offset by the gain in CPU utilization through hyper-threading.
</blockquote>
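To make the core accounting concrete, here is a small illustration using the same numbers as the sample job below (2 nodes, <code>ppn=8</code> reserved cores, 2 MPI tasks per node); the variable names simply mirror the job script.

<syntaxhighlight lang="bash">
#!/bin/bash
# Illustration only: core accounting for the hybrid sample job below.
nodes=2          # from "#PBS -l nodes=2:ppn=8"
ppn_pbs=8        # cores reserved per node (qsub ppn=...)
ppn_mpi=2        # MPI tasks started per node (mpirun --npernode)

omp_threads=$(( ppn_pbs / ppn_mpi ))         # 4 threads per MPI task
total=$(( nodes * ppn_mpi * omp_threads ))   # 16 cores busy in total

echo "$(( nodes * ppn_mpi )) MPI tasks x $omp_threads OpenMP threads = $total procs"
</syntaxhighlight>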

=== Sample job script for hybrid parallel runs ===

In summary, the job script's essential parts are:

<syntaxhighlight lang="bash">
#!/bin/bash
#PBS -l nodes=2:ppn=8
#PBS -l walltime=1:00:00
...

ppn_mpi=2	# user choice: MPI tasks to run per node
ppn_pbs=$( uniq -c $PBS_NODEFILE | awk '{print $1; exit}' )	# grab first (and usually only) ppn value of the job

# threads available per MPI task (integer arithmetic!); export so that "mpirun -x" can forward it
export OMP_NUM_THREADS=$(( ppn_pbs / ppn_mpi ))

mpirun -x OMP_NUM_THREADS \
    -machinefile  $PBS_NODEFILE \
    --npernode $ppn_mpi \
    lmp_openmpi \
	-sf omp \
	-in in.script
</syntaxhighlight>
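Since <code>OMP_NUM_THREADS</code> comes from integer division, a <code>ppn_mpi</code> that does not divide the reserved cores evenly silently leaves cores idle. A small guard that could be added to the script above (illustrative only):

<syntaxhighlight lang="bash">
# Guard: warn if the reserved cores do not split evenly among the MPI tasks.
if (( ppn_pbs % ppn_mpi != 0 )); then
    echo "warning: ppn=$ppn_pbs is not a multiple of ppn_mpi=$ppn_mpi;" \
         "$(( ppn_pbs % ppn_mpi )) core(s) per node will sit idle" >&2
fi
echo "running $ppn_mpi MPI task(s) per node with $OMP_NUM_THREADS thread(s) each"
</syntaxhighlight>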

=== Diagnostic for hybrid parallel runs ===

* LAMMPS echoes its parallelization scheme first thing in the output:
  LAMMPS (10 Feb 2012)
   '''using 4 OpenMP thread(s) per MPI task'''
  ...
   1 by 2 by 2 MPI processor grid
  104 atoms
  ...
and near the end:
  Loop time of 124.809 on 16 procs ('''4 MPI x 4 OpenMP''') for 30000 steps with 104 atoms
* To see if OpenMP is really active, log into a compute node while a job is running and run <code>top</code> or <code>psuser</code> – the <code>%CPU</code> field should be about <code>OMP_NUM_THREADS × 100%</code>:
    PID USER      PR  NI  VIRT  RES  SHR S '''%CPU''' %MEM    TIME+  COMMAND
   8047 stern     25   0 4017m  33m 7540 R '''401.8'''  0.1   1:41.60 lmp_openmpi
   8044 stern     25   0 4017m  33m 7540 R '''399.9'''  0.1   1:43.50 lmp_openmpi
   4822 root      34  19     0    0    0 S   2.0  0.0 115:34.98 kipmi0
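If the thread count still looks wrong, another check (a sketch; the PID is whatever <code>top</code> or <code>psuser</code> reported for <code>lmp_openmpi</code>) is to confirm that <code>OMP_NUM_THREADS</code> actually reached the running process's environment:

<syntaxhighlight lang="bash">
pid=8047   # example PID taken from the top output above
# Dump the process environment; expect a line like OMP_NUM_THREADS=4
strings -a /proc/$pid/environ | grep OMP_NUM_THREADS
</syntaxhighlight>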

=== References ===