HPC/Applications/lammps
Benchmark
Using a sample workload from Sanket ("run9"), I tested various OpenMPI options on both node types.
LAMMPS performs best on gen2 nodes without extra options, and pretty well on gen1 nodes over Ethernet(!).
| Job tag | Node type | Interconnect | Additional OpenMPI options | Relative speed (1000 steps / 3 hours) | Notes |
|---|---|---|---|---|---|
| gen1 | gen1 | IB | (none) | 36 | |
| gen1srqpin | gen1 | IB | -mca btl_openib_use_srq 1 -mca mpi_paffinity_alone 1 | 39 | |
| gen1eth | gen1 | Ethernet | -mca btl self,tcp | 44 | fastest for gen1 |
| gen2eth | gen2 | Ethernet | -mca btl self,tcp | 49 | |
| gen2srq | gen2 | IB | -mca btl_openib_use_srq 1 | 59 | |
| gen2 | gen2 | IB | (none) | 59 | fastest for gen2 |
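The relative speed is expressed as thousands of MD steps completed per 3 hours of wall time. One way to derive such a figure from a finished run is to read the "Loop time" line that LAMMPS prints at the end (see the Diagnostic section below); the following is only a sketch, assuming the output file is named lammps.out as in the job files below:

# Estimate speed in "1000-step units per 3 hours" from the final LAMMPS timing line,
# e.g. "Loop time of 124.809 on 16 procs ... for 30000 steps ..."
awk '/^Loop time of/ {
       for (i = 1; i <= NF; i++) {
           if ($i == "of")  sec   = $(i+1)   # wall-clock seconds of the run loop
           if ($i == "for") steps = $(i+1)   # number of MD steps in the loop
       }
       printf "%.1f\n", (steps / sec) * 10800 / 1000   # 10800 s = 3 hours
     }' lammps.out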
Sample job file gen1
#!/bin/bash
#PBS -l nodes=10:ppn=8:gen1
#PBS -l walltime=1:00:00:00
#PBS -N <jobname>
#PBS -A <account>
#
#PBS -o job.out
#PBS -e job.err
#PBS -m ea
# change into the directory from which qsub was run
cd $PBS_O_WORKDIR
mpirun -machinefile $PBS_NODEFILE \
-np $(wc -l < $PBS_NODEFILE) \
-mca btl self,tcp \
lmp_openmpi < lammps.in > lammps.out 2> lammps.err
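A typical submit-and-monitor sequence, assuming the script above is saved as job.gen1 (the file name is just an example):

qsub job.gen1          # submit; qsub prints the job ID
qstat -u $USER         # list your jobs and their states (Q = queued, R = running)
qstat -f <job_id>      # full details for a single job, using the ID printed by qsub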
Sample job file gen2
#!/bin/bash
#PBS -l nodes=10:ppn=8:gen2
#PBS -l walltime=1:00:00:00
#PBS -N <jobname>
#PBS -A <account>
#
#PBS -o job.out
#PBS -e job.err
#PBS -m ea
# change into the directory from which qsub was run
cd $PBS_O_WORKDIR
mpirun -machinefile $PBS_NODEFILE \
-np $(wc -l < $PBS_NODEFILE) \
lmp_openmpi < lammps.in > lammps.out 2> lammps.err
MPI/OpenMP hybrid parallel runs
LAMMPS modules since 2012 are compiled with `yes-user-omp`, permitting multi-threaded runs of selected pair styles, and in particular MPI/OpenMP hybrid parallel runs.
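For a quick sanity check of the threaded styles before writing a full job script, something along these lines can be run inside an interactive job (qsub -I); the executable name, input file, and task/thread counts below mirror the sample script further down and are only an illustration:

export OMP_NUM_THREADS=4                  # threads per MPI task
mpirun -np 2 -x OMP_NUM_THREADS \
       lmp_openmpi -sf omp -in in.script  # -sf omp selects the .../omp variants of supported styles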
Be careful how you allocate CPU cores on compute nodes. Note the following:

- The number of cores on a node reserved for your use is determined by the qsub `ppn=...` parameter.
- The number of MPI tasks running on a node (call it `ppn_mpi`) is determined by options to mpirun.
- The number of threads that each MPI task runs with is determined by the environment variable `OMP_NUM_THREADS`, which is 1 by default on Carbon.
- The number of physical cores per node for gen1 and gen2 nodes is 8.
- gen2 nodes have hyperthreading active, meaning there are 16 logical cores per node (a quick way to check is sketched after this list). However:
  - The method shown below cannot consistently use hyperthreading, since PBS is told that nodes have exactly 8 cores; ppn requests higher than that cannot be fulfilled.
  - My (stern) own benchmarks for a memory-intensive DFT program were underwhelming.
  - The LAMMPS OpenMP author reports the same (near the end of the section):

    Using threads on hyper-threading enabled cores is usually counterproductive, as the cost in additional memory bandwidth requirements is not offset by the gain in CPU utilization through hyper-threading.
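To see the logical-versus-physical core distinction for yourself, the following generic Linux checks can be run on a compute node (nothing here is Carbon-specific):

grep -c '^processor' /proc/cpuinfo                     # logical cores seen by the OS: 16 on gen2 (hyperthreading), 8 on gen1
grep '^physical id' /proc/cpuinfo | sort -u | wc -l    # populated CPU sockets
grep '^core id'     /proc/cpuinfo | sort -u | wc -l    # distinct core ids per socket (physical cores = sockets x this)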
Sample job script for hybrid parallel runs
In summary, the job script's essential parts are:
#!/bin/bash
#PBS -l nodes=2:ppn=8
#PBS -l walltime=1:00:00
...
ppn_mpi=2                                                    # MPI tasks per node (user choice)
ppn_pbs=$( uniq -c $PBS_NODEFILE | awk '{print $1; exit}' )  # grab first (and usually only) ppn value of the job
export OMP_NUM_THREADS=$(( ppn_pbs / ppn_mpi ))              # threads available per MPI task (integer arithmetic!)
mpirun -x OMP_NUM_THREADS \
-machinefile $PBS_NODEFILE \
--npernode $ppn_mpi \
lmp_openmpi \
-sf omp \
-in in.script
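To make the arithmetic concrete, here is what the script above works out to for the request nodes=2:ppn=8 with ppn_mpi=2; the same numbers appear in the diagnostic output in the next section:

# nodes=2:ppn=8   ->  ppn_pbs = 8 cores reserved per node
# ppn_mpi = 2     ->  mpirun --npernode 2 starts 2 MPI tasks per node, 4 tasks in total
# OMP_NUM_THREADS = ppn_pbs / ppn_mpi = 8 / 2 = 4 threads per MPI task
# total cores used = 4 MPI tasks x 4 threads = 16, i.e. "16 procs (4 MPI x 4 OpenMP)"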
Diagnostic for hybrid parallel runs
- LAMMPS echoes its parallelization scheme first thing in the output:
LAMMPS (10 Feb 2012)
  using 4 OpenMP thread(s) per MPI task
...
  1 by 2 by 2 MPI processor grid
  104 atoms
...
and near the end:
Loop time of 124.809 on 16 procs (4 MPI x 4 OpenMP) for 30000 steps with 104 atoms
- To see if OpenMP is really active, log into a compute node while a job is running and run `top` or `psuser` – the `%CPU` field should be about `OMP_NUM_THREADS × 100%` (see also the thread-count check after the listing below):
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
 8047 stern     25   0 4017m  33m 7540 R 401.8  0.1   1:41.60 lmp_openmpi
 8044 stern     25   0 4017m  33m 7540 R 399.9  0.1   1:43.50 lmp_openmpi
 4822 root      34  19     0    0    0 S   2.0  0.0 115:34.98 kipmi0
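Since top folds all threads of a process into a single %CPU figure, another check is to ask ps for the per-process thread count (NLWP); this is a generic Linux command, not Carbon-specific:

ps -C lmp_openmpi -o pid,nlwp,pcpu,comm    # NLWP should be about OMP_NUM_THREADS (plus a few MPI helper threads)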
References
- HPC/Submitting_Jobs/Advanced node selection#Multithreading (OpenMP)
- LAMMPS documentation for the OMP package
- Command-line options (explanation for -sf style or -suffix style)