HPC/Applications/lammps: Difference between revisions

From CNM Wiki

Revision as of 16:25, November 2, 2012

Binaries

As of module version lammps/2012-10-10-3, several LAMMPS binaries are provided within one module. Binaries compiled with GPU support will not run on nodes without a GPU (CUDA libraries are deliberately installed only on GPU nodes). Moreover, a binary built with the USER-CUDA package will attempt to access the GPU by default [1].

Binary name            Description
lmp_openmpi-main       The baseline binary, containing the packages shown by module help lammps.
lmp_openmpi            The distribution's default name; synonym for lmp_openmpi-main.
lmp_openmpi-gpu        The package "gpu" and all packages from main.
lmp_openmpi-user-cuda  The package "user-cuda" and all packages from main.
lmp_openmpi-jr         A custom build for user J.R.

Simply name the appropriate binary in the job file; full paths are neither necessary nor recommended.
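
The table above amounts to a simple lookup from package to binary name. The following is a minimal sketch only: the helper name lammps_binary and its keywords (main, gpu, user-cuda) are hypothetical and not part of the module; the binary names are those listed in the table.

```shell
#!/bin/bash
# Hypothetical helper (not provided by the lammps module): map a package
# keyword to the binary name listed in the table above.
# The keywords main, gpu and user-cuda are assumptions made for this sketch.
lammps_binary() {
    case "${1:-main}" in
        main)      echo "lmp_openmpi-main" ;;
        gpu)       echo "lmp_openmpi-gpu" ;;
        user-cuda) echo "lmp_openmpi-user-cuda" ;;
        *)         echo "unknown package: $1" >&2; return 1 ;;
    esac
}

# Example: pick the GPU-package binary by name only; no full path is needed.
lmp=$(lammps_binary gpu)
echo "$lmp"
```

Remember that the gpu and user-cuda binaries only run on GPU nodes, so the keyword should match the node type requested for the job.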

GPU support

LAMMPS offers two different packages for using GPUs, one official, the other user-contributed. Only one of these packages can be used for a run. The packages are fully documented in the following sections of the LAMMPS manual:

To use LAMMPS with GPUs on Carbon you must read and understand these sections. A summary and Carbon-specific details are given in the following two sections.

Package GPU

  • Provides multi-threaded versions of most pair styles, all dihedral styles and a few fixes in LAMMPS; for the full list:
    1. In your browser, open http://lammps.sandia.gov/doc/Section_commands.html#comm
    2. Search for the string /gpu.
  • Supports one physical GPU per LAMMPS MPI process (CPU core).
  • Multiple MPI processes (CPU cores) can share a single GPU, and in many cases it will be more efficient to run this way.

Usage

  1. Use the command package gpu near the beginning of your LAMMPS control script. Since all Carbon GPU nodes have just one GPU per node, the first two arguments (called first and last) must always be zero; the split argument is not restricted.
  2. Do one of the following:
  3. Call the lmp_openmpi-gpu binary.

Input file examples

package gpu force 0 0 1.0
package gpu force 0 0 0.75
package gpu force/neigh 0 0 1.0
package gpu force/neigh 0 1 -1.0
…
pair_style      lj/charmm/coul/long/gpu 8.0 10.0

Job file example

mpirun … lmp_openmpi-gpu -in infile

Package USER-CUDA

  • Provides GPU versions of several pair styles and for long-range Coulombics via the PPPM command.
  • Only supports a single CPU (core) with each GPU [That should mean multiple nodes are possible; feasibility and efficiency to be determined --stern ]

Usage

  1. Optional: Use the command package cuda near the beginning of your LAMMPS control script to finely control settings. This is optional since a LAMMPS binary with USER-CUDA always detects and uses a GPU by default.
  2. Do one of the following:
  3. Optional: The kspace_style pppm/cuda command has to be requested explicitly. [I am not sure whether that means that other k-space styles implicitly use the GPU; --stern ].
  4. Call the lmp_openmpi-user-cuda binary.

Input file example

package cuda gpu/node/special 2 0 2
package cuda test 3948
…
kspace_style    pppm/cuda 1e-5

Job file example

lmp_openmpi-user-cuda -suffix cuda -in infile

Benchmark

Using a sample workload from Sanket ("run9"), I tested various OpenMPI options on both node types.

LAMMPS performs best on gen2 nodes over IB without extra options (relative speed 59). On gen1 nodes, Ethernet (44) is, surprisingly, faster than IB (36 to 39).

Job tag     Node type  Interconnect  Additional OpenMPI options                            Relative speed (1000 steps/3 hours)  Notes
gen1        gen1       IB            (none)                                                36
gen1srqpin  gen1       IB            -mca btl_openib_use_srq 1 -mca mpi_paffinity_alone 1  39
gen1eth     gen1       Ethernet      -mca btl self,tcp                                     44                                   fastest for gen1
gen2eth     gen2       Ethernet      -mca btl self,tcp                                     49
gen2srq     gen2       IB            -mca btl_openib_use_srq 1                             59
gen2        gen2       IB            (none)                                                59                                   fastest for gen2
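
To put the relative speeds in proportion, the ratios between runs can be worked out directly from the table. The sketch below does only that arithmetic; the helper name speedup_vs_baseline is hypothetical, and the numbers are copied from the table above.

```shell
#!/bin/bash
# Hypothetical helper for this sketch: ratio of two relative speeds from the
# benchmark table (units: 1000 steps per 3 hours), printed with two decimals.
speedup_vs_baseline() {
    # $1 = speed of the run of interest, $2 = baseline speed
    awk -v s="$1" -v b="$2" 'BEGIN { printf "%.2f\n", s / b }'
}

speedup_vs_baseline 44 36   # gen1eth vs. gen1 (IB, no options)
speedup_vs_baseline 59 36   # gen2 (IB, no options) vs. gen1 (IB, no options)
speedup_vs_baseline 59 49   # gen2 (IB, no options) vs. gen2eth
```

These ratios describe this one workload ("run9") only; other inputs may behave differently.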


MPI/OpenMP hybrid parallel runs

LAMMPS modules since 2012 are compiled with yes-user-omp, permitting multi-threaded runs of selected pair styles, and in particular MPI/OpenMP hybrid parallel runs.

Take care when allocating CPU cores on compute nodes. Note the following:

  • The number of cores on a node reserved for your use is determined by the qsub ppn=... parameter.
  • The number of MPI tasks (call it ppn_mpi) running on a node is determined by options to mpirun.
  • The number of threads that each MPI task runs with is determined by the environment variable OMP_NUM_THREADS, which is 1 by default on Carbon.
  • The number of physical cores per node for gen1 and gen2 nodes is 8.
  • gen2 nodes have hyperthreading active, meaning there are 16 logical cores per node. However:

Using threads on hyper-threading enabled cores is usually counterproductive, as the cost in additional memory bandwidth requirements is not offset by the gain in CPU utilization through hyper-threading.
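
The relationship between these quantities is plain integer arithmetic: threads per MPI task = ppn / ppn_mpi. The following is a minimal sketch of that calculation with two sanity checks. The function name threads_per_task and the variable PHYS_CORES are hypothetical; the value 8 is the physical core count quoted above.

```shell
#!/bin/bash
# Hypothetical helper: compute the number of OpenMP threads per MPI task on one node.
# PHYS_CORES=8 is the physical core count quoted above for gen1 and gen2 nodes.
PHYS_CORES=8

threads_per_task() {
    # $1 = cores reserved per node (qsub ppn), $2 = MPI tasks per node (ppn_mpi)
    if [ "$2" -lt 1 ] || [ "$1" -lt "$2" ]; then
        echo "need 1 <= ppn_mpi <= ppn" >&2
        return 1
    fi
    if [ "$1" -gt "$PHYS_CORES" ]; then
        # More than 8 cores per node means threads would land on hyperthreads.
        echo "warning: ppn=$1 exceeds $PHYS_CORES physical cores" >&2
    fi
    echo $(( $1 / $2 ))   # integer division; remainder cores stay idle
}

threads_per_task 8 2   # 2 MPI tasks per node
threads_per_task 8 3   # 8 is not divisible by 3, so 2 cores stay idle
```

Choosing ppn_mpi as a divisor of ppn (1, 2, 4 or 8 for ppn=8) avoids leaving reserved cores idle.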

Sample job script for hybrid parallel runs

In summary, the job script's essential parts are:

#!/bin/bash
#PBS -l nodes=2:ppn=8
#PBS -l walltime=1:00:00
...

ppn_mpi=2		# user choice
ppn_pbs=$( uniq -c "$PBS_NODEFILE" | awk '{print $1; exit}' )	# grab first (and usually only) ppn value of the job
export OMP_NUM_THREADS=$(( ppn_pbs / ppn_mpi ))			# calculate number of threads available per MPI process (integer arithmetic!)

mpirun -x OMP_NUM_THREADS \
    -machinefile  $PBS_NODEFILE \
    --npernode $ppn_mpi \
    lmp_openmpi \
	-sf omp \
	-in in.script

Diagnostic for hybrid parallel runs

  • LAMMPS echoes its parallelization scheme first thing in the output:
LAMMPS (10 Feb 2012)
  using 4 OpenMP thread(s) per MPI task
...
  1 by 2 by 2 MPI processor grid
  104 atoms
...

and near the end:

Loop time of 124.809 on 16 procs (4 MPI x 4 OpenMP) for 30000 steps with 104 atoms
  • To see if OpenMP is really active, log into a compute node while a job is running and run top or psuser. The %CPU field should be about OMP_NUM_THREADS × 100%:
 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                             
8047 stern     25   0 4017m  33m 7540 R 401.8  0.1   1:41.60 lmp_openmpi                                                                                         
8044 stern     25   0 4017m  33m 7540 R 399.9  0.1   1:43.50 lmp_openmpi                                                                                         
4822 root      34  19     0    0    0 S  2.0  0.0 115:34.98 kipmi0
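
The "Loop time" line shown above can also be checked mechanically. The sketch below extracts the MPI task count and the OpenMP thread count from such a line. The helper name loop_layout is hypothetical, and the pattern is based only on the sample line above, so other LAMMPS versions may need a different pattern.

```shell
#!/bin/bash
# Hypothetical helper: read LAMMPS output on stdin and print
# "<MPI tasks> <OpenMP threads>" for each "Loop time" line.
# The pattern is derived solely from the sample output shown above.
loop_layout() {
    sed -n 's/^Loop time of .* on [0-9]* procs (\([0-9]*\) MPI x \([0-9]*\) OpenMP).*/\1 \2/p'
}

line='Loop time of 124.809 on 16 procs (4 MPI x 4 OpenMP) for 30000 steps with 104 atoms'
echo "$line" | loop_layout
```

For the sample line this prints 4 4. If the second number is not the OMP_NUM_THREADS you intended, the variable most likely did not reach the MPI processes.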

References