HPC/Applications/lammps
Binaries
As of module version lammps/2012-10-10-3, several LAMMPS binaries are provided within one module. Binaries compiled with GPU support will not run on nodes without a GPU (CUDA libraries are deliberately installed only on GPU nodes). Moreover, a binary built with the USER-CUDA package will attempt to access the GPU by default [1].
| Binary name | Description |
|---|---|
| lmp_openmpi-main | The baseline binary, containing the packages shown by module help lammps. |
| lmp_openmpi | The distribution's default name; a synonym for lmp_openmpi-main. |
| lmp_openmpi-gpu | The package "gpu" and all packages from main. |
| lmp_openmpi-user-cuda | The package "user-cuda" and all packages from main. |
| lmp_openmpi-jr | A custom build for user J.R. |
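The module can be inspected and loaded interactively before submitting jobs. This is a minimal sketch assuming the standard Environment Modules commands; the exact version string listed may differ from the one named above:

module avail lammps       # list installed LAMMPS module versions
module load lammps        # load the default version (or give an explicit version)
module help lammps        # show the packages compiled into the main binary
which lmp_openmpi-main    # confirm the binaries are now on $PATH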
Simply name the appropriate binary in the job file, typically as an argument to mpirun:

…
mpirun -machinefile $PBS_NODEFILE -np $PBS_NP \
    lmp_openmpi -in in.script
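For orientation, a complete minimal job script might look as follows. This is a sketch, not site policy: the resource requests, file names, and the unversioned module load lammps are illustrative assumptions.

#!/bin/bash
#PBS -l nodes=1:ppn=8          # illustrative resource request
#PBS -l walltime=1:00:00
#PBS -j oe

module load lammps             # provides the lmp_openmpi* binaries
cd $PBS_O_WORKDIR              # run in the directory the job was submitted from

mpirun -machinefile $PBS_NODEFILE -np $PBS_NP \
    lmp_openmpi -in in.script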
GPU support
LAMMPS offers two different packages for using GPUs. These are fully documented in the LAMMPS manual, Section 5. Accelerating LAMMPS performance. To use LAMMPS with GPUs on Carbon you must read and understand the relevant sections of that page.
GPU package
- provides multi-threaded versions of most pair styles, all dihedral styles and a few fixes in LAMMPS.
- restricted to one physical GPU per LAMMPS process.
- multiple MPI processes (CPU cores) can share a single GPU, and in many cases it will be more efficient to run this way.
mpirun … lmp_openmpi-gpu -in infile
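As an illustration of letting several MPI ranks share one GPU, a job fragment could look like the sketch below. The GPU-node request syntax (:gpus=1) is a hypothetical placeholder for whatever node property Carbon actually uses, and the input script still has to activate the GPU package as described in the manual.

#PBS -l nodes=1:ppn=8:gpus=1   # hypothetical GPU-node request; substitute Carbon's actual node property
...
mpirun -machinefile $PBS_NODEFILE -np $PBS_NP \
    lmp_openmpi-gpu -in in.script
# in.script must enable the GPU package per the manual's GPU package section
# (package gpu command and/or gpu-suffixed styles).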
USER-CUDA package
- provides GPU versions of several pair styles and for long-range Coulombics via the PPPM command.
- only supports use of a single CPU (core) with each GPU.
lmp_openmpi-user-cuda -suffix cuda -in infile
Comparison of GPU and USER-CUDA packages
Benchmark
Using a sample workload from Sanket ("run9"), I tested various OpenMPI options on both node types. LAMMPS performs best on gen2 nodes without extra options, and pretty well on gen1 nodes over Ethernet(!).
| Job tag | Node type | Interconnect | Additional OpenMPI options | Relative speed (1000 steps/3 hours) | Notes |
|---|---|---|---|---|---|
| gen1 | gen1 | IB | (none) | 36 | |
| gen1srqpin | gen1 | IB | -mca btl_openib_use_srq 1 -mca mpi_paffinity_alone 1 | 39 | |
| gen1eth | gen1 | Ethernet | -mca btl self,tcp | 44 | fastest for gen1 |
| gen2eth | gen2 | Ethernet | -mca btl self,tcp | 49 | |
| gen2srq | gen2 | IB | -mca btl_openib_use_srq 1 | 59 | |
| gen2 | gen2 | IB | (none) | 59 | fastest for gen2 |
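To use one of these settings in a job, the extra options simply go on the mpirun command line ahead of the binary name; for example, for the Ethernet variant from the table (a sketch, with the input file name as a placeholder):

mpirun -mca btl self,tcp \
    -machinefile $PBS_NODEFILE -np $PBS_NP \
    lmp_openmpi -in in.script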
MPI/OpenMP hybrid parallel runs
LAMMPS modules since 2012 are compiled with yes-user-omp, permitting multi-threaded runs of selected pair styles, and in particular MPI/OpenMP hybrid parallel runs.
Be careful how you allocate CPU cores on compute nodes. Note the following:
- The number of cores on a node reserved for your use is determined by the qsub ppn=... parameter.
- The number of MPI tasks (call it ppn_mpi) running on a node is determined by options to mpirun.
- The number of threads that each MPI task runs with is determined by the environment variable OMP_NUM_THREADS, which is 1 by default on Carbon.
- The number of physical cores per node for gen1 and gen2 nodes is 8.
- gen2 nodes have hyperthreading active, meaning there are 16 logical cores per node. However:
  - The method shown below cannot consistently use hyperthreading, since PBS is told that nodes have exactly 8 cores; ppn requests higher than that cannot be fulfilled.
  - My (stern) own benchmarks for a memory-intensive DFT program were underwhelming.
  - The LAMMPS OpenMP author reports the same (near the end of the section): "Using threads on hyper-threading enabled cores is usually counterproductive, as the cost in additional memory bandwidth requirements is not offset by the gain in CPU utilization through hyper-threading."
Sample job script for hybrid parallel runs
In summary, the job script's essential parts are:
#!/bin/bash
#PBS -l nodes=2:ppn=8
#PBS -l walltime=1:00:00
...
ppn_mpi=2                                                      # user choice: MPI tasks per node
ppn_pbs=$( uniq -c $PBS_NODEFILE | awk '{print $1; exit}' )    # grab first (and usually only) ppn value of the job
export OMP_NUM_THREADS=$(( ppn_pbs / ppn_mpi ))                # threads available per MPI task (integer arithmetic!);
                                                               # export so "mpirun -x" can pass it on
mpirun -x OMP_NUM_THREADS \
    -machinefile $PBS_NODEFILE \
    --npernode $ppn_mpi \
    lmp_openmpi \
    -sf omp \
    -in in.script
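As a quick sanity check of the arithmetic (the numbers assume the nodes=2:ppn=8 request and ppn_mpi=2 chosen above), the derived values can be echoed just before the mpirun line:

echo "ppn_pbs=$ppn_pbs  ppn_mpi=$ppn_mpi  OMP_NUM_THREADS=$OMP_NUM_THREADS"
# expected: ppn_pbs=8  ppn_mpi=2  OMP_NUM_THREADS=4
# i.e. 2 nodes x 2 MPI tasks per node x 4 threads per task = 16 cores in total,
# which matches the "(4 MPI x 4 OpenMP)" line in the diagnostic output below.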
Diagnostic for hybrid parallel runs
- LAMMPS echoes its parallelization scheme first thing in the output:

      LAMMPS (10 Feb 2012)
        using 4 OpenMP thread(s) per MPI task
      ...
        1 by 2 by 2 MPI processor grid
        104 atoms
      ...

  and near the end:

      Loop time of 124.809 on 16 procs (4 MPI x 4 OpenMP) for 30000 steps with 104 atoms
- To see if OpenMP is really active, log into a compute node while a job is running and run top or psuser. The %CPU field should be about OMP_NUM_THREADS × 100%:

        PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+ COMMAND
       8047 stern     25   0 4017m  33m 7540 R 401.8  0.1   1:41.60 lmp_openmpi
       8044 stern     25   0 4017m  33m 7540 R 399.9  0.1   1:43.50 lmp_openmpi
       4822 root      34  19     0    0    0 S   2.0  0.0 115:34.98 kipmi0
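A non-interactive alternative, sketched here on the assumption that the processes are named lmp_openmpi as in the listing above, is to ask ps for each task's thread count:

ps -C lmp_openmpi -o pid,nlwp,pcpu,comm
# nlwp = number of threads per process; it should be about OMP_NUM_THREADS.
# Note that ps averages %CPU over the whole run, so top gives the better instantaneous reading.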
References
- HPC/Submitting_Jobs/Advanced node selection#Multithreading (OpenMP)
- LAMMPS documentation for the OMP package
- Command-line options (explanation for -sf style or -suffix style)