HPC/Applications/lammps

== Benchmark ==

Using a sample workload from Sanket ("run9"), I tested various Open MPI options on both node types.

LAMMPS performs best on gen2 nodes without extra options; on gen1 nodes it is, perhaps surprisingly, fastest over Ethernet rather than InfiniBand.

{| class="wikitable"
! Job tag !! Node type !! Interconnect !! Additional Open MPI options !! Relative speed<br />(1000 steps / 3 hours) !! Notes
|-
| gen1 || gen1 || IB || (none) || 36 ||
|-
| gen1srqpin || gen1 || IB || <code>-mca btl_openib_use_srq 1 -mca mpi_paffinity_alone 1</code> || 39 ||
|-
| gen1eth || gen1 || Ethernet || <code>-mca btl self,tcp</code> || 44 || fastest for gen1
|-
| gen2eth || gen2 || Ethernet || <code>-mca btl self,tcp</code> || 49 ||
|-
| gen2srq || gen2 || IB || <code>-mca btl_openib_use_srq 1</code> || 59 ||
|-
| gen2 || gen2 || IB || (none) || 59 || fastest for gen2
|}
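A quick way to compare several completed benchmark runs is to pull the <code>Loop time</code> summary line (shown further below) out of each LAMMPS output file. The file names in this sketch are placeholders; substitute whatever your jobs actually wrote:

 # Print the LAMMPS timing summary of each benchmark run for a side-by-side comparison.
 # The *.out names are assumptions; adjust them to the real output files.
 for log in run9_gen1.out run9_gen1eth.out run9_gen2.out; do
     printf '%s: ' "$log"
     grep 'Loop time' "$log"
 done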

=== Sample job file gen1 ===

 #!/bin/bash
 #PBS -l nodes=10:ppn=8:gen1
 #PBS -l walltime=1:00:00:00
 #PBS -N <jobname>
 #PBS -A <account>
 #
 #PBS -o job.out
 #PBS -e job.err
 #PBS -m ea
 
 # change into the directory from which the job was submitted (where qsub was run)
 cd $PBS_O_WORKDIR
 
 mpirun  -machinefile  $PBS_NODEFILE \
         -np $(wc -l < $PBS_NODEFILE) \
         -mca btl self,tcp \
         lmp_openmpi < lammps.in > lammps.out 2> lammps.err
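Assuming the script is saved as <code>job_gen1.sh</code> (the name is arbitrary and used here only for illustration), it is submitted and monitored with the usual PBS commands:

 qsub job_gen1.sh      # submit; prints the job ID
 qstat -u $USER        # check the job's place and state in the queue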

=== Sample job file gen2 ===

 #!/bin/bash
 #PBS -l nodes=10:ppn=8:gen2
 #PBS -l walltime=1:00:00:00
 #PBS -N <jobname>
 #PBS -A <account>
 #
 #PBS -o job.out
 #PBS -e job.err
 #PBS -m ea
 
 # change into the directory from which the job was submitted (where qsub was run)
 cd $PBS_O_WORKDIR
 
 mpirun  -machinefile  $PBS_NODEFILE \
         -np $(wc -l < $PBS_NODEFILE) \
         lmp_openmpi < lammps.in > lammps.out 2> lammps.err
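The gen2 job file passes no <code>-mca btl</code> options, so Open MPI chooses the interconnect itself (InfiniBand, per the benchmark above). If in doubt which transport components a given Open MPI installation provides, they can be listed with <code>ompi_info</code>; this is only a diagnostic sketch, not part of the job file:

 # List the byte-transfer-layer (btl) components and parameters known to this
 # Open MPI build; "openib" should appear on an InfiniBand-capable installation.
 ompi_info | grep btl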

== MPI/OpenMP hybrid parallel runs ==

LAMMPS modules since 2012 are compiled with [http://lammps.sandia.gov/doc/Section_accelerate.html#acc_2 <code>yes-user-omp</code>], permitting multi-threaded runs of selected pair styles, and in particular MPI/OpenMP hybrid parallel runs.

Be careful how you allocate CPU cores on compute nodes. Note the following:

* The number of MPI tasks running on a node is determined by the qsub <code>ppn=...</code> parameter.
* The number of threads that each MPI task runs with is determined by the environment variable <code>OMP_NUM_THREADS</code>, which is 1 by default on Carbon.
* The number of physical cores per node for gen1 and gen2 nodes is 8.
* gen2 nodes have [http://en.wikipedia.org/wiki/Hyperthreading hyperthreading] active, meaning there are 16 ''logical'' cores per node. However:
** My ([[User:Stern|stern]]) own benchmarks for memory-intensive programs were underwhelming.
** The LAMMPS OpenMP author [http://lammps.sandia.gov/doc/Section_accelerate.html#acc_2 reports the same] (near the end of the section):
<blockquote>
Using threads on hyper-threading enabled cores is usually counterproductive, as the cost in additional memory bandwidth requirements is not offset by the gain in CPU utilization through hyper-threading.
</blockquote>
* Reserve ''entire'' nodes by adding near the top of the job file:
 #PBS -l naccesspolicy=SINGLEJOB
A complete hybrid job file combining these settings might look like this:

 #!/bin/bash
 #PBS -l nodes=2:ppn=4
 #PBS -l walltime=1:00:00
 #PBS -l naccesspolicy=SINGLEJOB
 ...
 
 ppn_mpi=$( uniq -c $PBS_NODEFILE | awk '{print $1; exit}' )   # grab the first (and usually only) ppn value of the job
 ppn_phys=8                                                    # number of physical cores on the first execution node
 #ppn_phys=12                                                  # experimental: 3-to-2 oversubscription of the physical cores (hyperthreading)
 
 export OMP_NUM_THREADS=$(( ppn_phys / ppn_mpi ))              # number of OpenMP threads available per MPI task
 
 mpirun -x OMP_NUM_THREADS \
     -machinefile  $PBS_NODEFILE \
     -np $(wc -l < $PBS_NODEFILE) \
     lmp_openmpi \
         -sf omp \
         -in in.script
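As a quick sanity check (not part of the recommended job file, just an illustrative one-liner), the exported thread count can be echoed from every MPI rank before the real run:

 # Print each rank's hostname and the OMP_NUM_THREADS value it actually received.
 mpirun -x OMP_NUM_THREADS -machinefile $PBS_NODEFILE -np $(wc -l < $PBS_NODEFILE) \
     sh -c 'echo "$(hostname): OMP_NUM_THREADS=$OMP_NUM_THREADS"'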

LAMMPS echoes its parallelization scheme at the beginning of the output:

 LAMMPS (10 Feb 2012)
   using 3 OpenMP thread(s) per MPI task
 ...
   2 by 2 by 2 MPI processor grid
 Lattice spacing in x,y,z = 3.52 4.97803 4.97803
 ...

and near the end:

 Loop time of 11.473 on 24 procs (8 MPI x 3 OpenMP) for 100 steps with 32000 atoms

This is consistent with the job parameters: <code>nodes=2:ppn=4</code> gives 2 × 4 = 8 MPI tasks, and the 3 threads per task evidently come from a run with the experimental <code>ppn_phys=12</code> line enabled (12 / 4 = 3), for 8 × 3 = 24 processes in total.


To learn more: