HPC/Applications/lammps: Difference between revisions

Revision as of 22:00, November 5, 2012

Binaries

As of module version lammps/2012-10-10-3 (currently the default), several LAMMPS binaries are provided within one module. Binaries compiled with GPU support will not run on nodes without a GPU, because the CUDA libraries are deliberately installed only on GPU nodes. Moreover, a binary built with the USER-CUDA package will attempt to access the GPU by default [1].

Binary name	Description
`lmp_openmpi-main`	The baseline binary, containing the packages shown by `module help lammps`.
`lmp_openmpi`	The distribution's default name; synonym for `lmp_openmpi-main`.
`lmp_openmpi-gpu`	The package "gpu" and all packages from main.
`lmp_openmpi-user-cuda`	The package "user-cuda" and all packages from main.
`lmp_openmpi-jr`	A custom build for user J.R.

To use the *-gpu and *-user-cuda binaries, load the cuda module in addition to lammps:

module load cuda lammps

To use any one of the binaries, simply name the appropriate one in the job file; full paths are neither necessary nor recommended.

GPU support

LAMMPS offers two different packages for using GPUs, one official, the other user-contributed. Only one of these packages can be used in a given run. The packages are fully documented in the following section of the LAMMPS manual:

5. Accelerating LAMMPS performance

To use LAMMPS with GPUs on Carbon you must read and understand these sections. A summary and Carbon-specific details are given in the following two sections.

General note on GPU jobs

To run your job on a GPU node, request one in the job file:

#PBS -l nodes=…:gpus=1

At the moment this is synonymous with, but preferable to:

#PBS -l nodes=…:gen3

Each GPU node has 12 cores; if you submit jobs with :ppn < 12 and :gpus=1, the node may be shared with purely CPU jobs. Whether, and how much, this interferes with either job has yet to be tested. See Advanced node selection to reserve entire nodes while controlling ppn for MPI or OpenMP.

Package GPU

Provides multi-threaded versions of most pair styles, all dihedral styles and a few fixes in LAMMPS; for the full list:
1. In your browser, open http://lammps.sandia.gov/doc/Section_commands.html#comm
2. Search for the string /gpu.
Supports one physical GPU per LAMMPS MPI process (CPU core).
Multiple MPI processes (CPU cores) can share a single GPU, and in many cases it will be more efficient to run this way.

Usage

Use the command package gpu near the beginning of your LAMMPS control script. Since all Carbon GPU nodes have just one GPU per node, the first two arguments (called first and last) must always be zero; the split argument is not restricted.
Do one of the following:
- Append /gpu to the style name (e.g. pair_style lj/cut/gpu).
- Use the suffix gpu command.
- On the command line, use the -suffix gpu switch.
In the job file or on the qsub command line, request a GPU with #PBS -l nodes=...:gpus=1 (the value is the number of GPUs per node).
Call the lmp_openmpi-gpu binary.

Input file examples

package gpu force 0 0 1.0
package gpu force 0 0 0.75
package gpu force/neigh 0 0 1.0
package gpu force/neigh 0 1 -1.0

…
pair_style      lj/charmm/coul/long/gpu 8.0 10.0

Job file example

#PBS -l nodes=...:gpus=1
…
mpirun … lmp_openmpi-gpu -in infile

Package USER-CUDA

Provides GPU versions of several pair styles and for long-range Coulombics via the PPPM command.
Only supports a single CPU (core) with each GPU [That should mean multiple nodes are possible; feasibility and efficiency to be determined --stern ]

Usage

Optional: Use the command package cuda near the beginning of your LAMMPS control script to finely control settings. This is optional since a LAMMPS binary with USER-CUDA always detects and uses a GPU by default.
Do one of the following:
- Append /cuda to the style name (e.g. pair_style lj/cut/cuda)
- Use the suffix cuda command.
- On the command line, use the -suffix cuda switch.
Optional: To run long-range Coulombics on the GPU, request the kspace_style pppm/cuda command explicitly. [I am not sure if that means that other k-space styles implicitly use the GPU --stern. ]
In the job file or on the qsub command line, request a GPU with #PBS -l nodes=...:gpus=1.
Call the lmp_openmpi-user-cuda binary.

Input file example

Examples:

package cuda gpu/node/special 2 0 2
package cuda test 3948

…
kspace_style    pppm/cuda 1e-5

Job file example

Serial job:

#PBS -l nodes=1:ppn=1:gpus=1
…
lmp_openmpi-user-cuda -suffix cuda -in infile

Parallel job; note that ppn must still be 1 as only one LAMMPS process (core) per node can use the sole GPU.

#PBS -l nodes=3:ppn=1:gpus=1
…
mpirun -machinefile $PBS_NODEFILE -np $PBS_NP lmp_openmpi-user-cuda -suffix cuda -in infile
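
The binary and suffix pairings described for the two GPU packages can be collected in a small shell helper. This is only a sketch: the function name lammps_cmd and the flavor labels (main, gpu, user-cuda) are hypothetical conveniences, not part of the lammps module. It merely prints the command line, which you would then pass to mpirun or run directly in a serial job.

```shell
#!/bin/bash
# Hypothetical helper: print the LAMMPS command line for a given build flavor.
# $1 = flavor (main | gpu | user-cuda), $2 = LAMMPS input file.
# Binary names and -suffix values are the ones listed on this page.
lammps_cmd() {
    case "$1" in
        main)      echo "lmp_openmpi -in $2" ;;
        gpu)       echo "lmp_openmpi-gpu -suffix gpu -in $2" ;;
        user-cuda) echo "lmp_openmpi-user-cuda -suffix cuda -in $2" ;;
        *)         echo "unknown flavor: $1" >&2; return 1 ;;
    esac
}

# Example: show the command for a USER-CUDA run on the input file "infile"
lammps_cmd user-cuda infile
```

Keeping the pairing in one place avoids calling a GPU binary with the wrong suffix, or a GPU suffix with the baseline binary.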

MPI/OpenMP hybrid parallel runs

LAMMPS modules since 2012 are compiled with yes-user-omp, permitting multi-threaded runs of selected pair styles, and in particular MPI/OpenMP hybrid parallel runs. To set up such runs, see HPC/Submitting and Managing Jobs/Advanced node selection.

For sample PBS scripts, consult these files:

$LAMMPS_HOME/sample.job
$LAMMPS_HOME/sample-hybrid.job
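
As a rough illustration of the arithmetic behind a hybrid run, the sketch below derives the OpenMP thread count from the cores granted per node and the number of MPI processes you choose per node. The function name threads_per_rank and the example values (8 cores, 2 MPI processes) are assumptions made for illustration; the sample files above remain the reference.

```shell
#!/bin/bash
# Sketch: derive OMP_NUM_THREADS from the cores PBS granted per node ($1)
# and the number of MPI processes wanted per node ($2, a user choice).
threads_per_rank() {
    ppn_pbs=$1
    ppn_mpi=$2
    if [ "$ppn_mpi" -le 0 ] || [ "$ppn_mpi" -gt "$ppn_pbs" ]; then
        echo "ppn_mpi must be between 1 and $ppn_pbs" >&2
        return 1
    fi
    # Integer division: any remainder cores stay idle.
    echo $(( ppn_pbs / ppn_mpi ))
}

# Example with hypothetical values: 8 cores per node, 2 MPI processes per node
OMP_NUM_THREADS=$(threads_per_rank 8 2)
export OMP_NUM_THREADS
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"

# In a real job the result would feed mpirun roughly like this (not executed here):
#   mpirun -x OMP_NUM_THREADS -machinefile $PBS_NODEFILE --npernode 2 \
#       lmp_openmpi -sf omp -in in.script
```

Choosing a ppn_mpi that divides the per-node core count evenly keeps every core busy.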

Benchmark (pre-GPU version)

Using a sample workload from Sanket ("run9"), I tested various OpenMPI options on node types gen1 and gen2.

LAMMPS performs best on gen2 nodes without extra options, and pretty well on gen1 nodes over ethernet(!).

Job tag	Node type	Interconnect	Additional OpenMPI options	Relative speed (1000 steps/3 hours)	Notes
gen1	gen1	IB	(none)	36
gen1srqpin	gen1	IB	-mca btl_openib_use_srq 1 -mca mpi_paffinity_alone 1	39
gen1eth	gen1	Ethernet	-mca btl self,tcp	44	fastest for gen1
gen2eth	gen2	Ethernet	-mca btl self,tcp	49
gen2srq	gen2	IB	-mca btl_openib_use_srq 1	59
gen2	gen2	IB	(none)	59	fastest for gen2
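
To put the table in perspective, the relative-speed figures can be converted to percent changes against the default (no extra options) run on the same node type. The helper name percent_change is hypothetical; the input numbers are taken from the table.

```shell
#!/bin/bash
# Percent change of a relative-speed figure ($2) against a baseline ($1),
# rounded to a whole number.
percent_change() {
    awk -v base="$1" -v val="$2" 'BEGIN { printf "%.0f\n", (val - base) / base * 100 }'
}

percent_change 36 44   # gen1: Ethernet (44) vs. default IB (36)
percent_change 59 49   # gen2: Ethernet (49) vs. default IB (59)
```

For this workload, Ethernet comes out about 22% faster than the default InfiniBand settings on gen1, and about 17% slower on gen2.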

Diagnostic for hybrid parallel runs

LAMMPS echoes its parallelization scheme first thing in the output:

LAMMPS (10 Feb 2012)
  using 4 OpenMP thread(s) per MPI task
...
  1 by 2 by 2 MPI processor grid
  104 atoms
...

and near the end:

Loop time of 124.809 on 16 procs (4 MPI x 4 OpenMP) for 30000 steps with 104 atoms

To see if OpenMP is really active, log into a compute node while a job is running and run top or psuser. The %CPU field should be about OMP_NUM_THREADS × 100%:

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
8047 stern     25   0 4017m  33m 7540 R 401.8  0.1   1:41.60 lmp_openmpi
8044 stern     25   0 4017m  33m 7540 R 399.9  0.1   1:43.50 lmp_openmpi
4822 root      34  19     0    0    0 S   2.0  0.0 115:34.98 kipmi0
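
The same check can be scripted against a finished log. The sketch below assumes the "Loop time" line has exactly the form shown above for the 10 Feb 2012 version; other LAMMPS versions may word it differently. The function names hybrid_layout and expected_cpu are hypothetical.

```shell
#!/bin/bash
# Sketch: extract "<MPI tasks> <OpenMP threads>" from the "Loop time" summary
# line of a LAMMPS log read on standard input. Prints nothing if no such line exists.
hybrid_layout() {
    sed -n 's/^Loop time of .* on [0-9][0-9]* procs (\([0-9][0-9]*\) MPI x \([0-9][0-9]*\) OpenMP).*/\1 \2/p'
}

# %CPU that top should show per LAMMPS process: threads x 100
expected_cpu() {
    echo $(( $1 * 100 ))
}

# Example using the summary line quoted above
line='Loop time of 124.809 on 16 procs (4 MPI x 4 OpenMP) for 30000 steps with 104 atoms'
layout=$(printf '%s\n' "$line" | hybrid_layout)
threads=${layout#* }     # second field: OpenMP threads per MPI task
echo "MPI x OpenMP: $layout, expected %CPU per process: $(expected_cpu "$threads")"
```

In practice you would feed the real log to the function, e.g. hybrid_layout < lammps.out, where lammps.out stands for whatever file your job writes its output to.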

References

HPC/Submitting_Jobs/Advanced node selection#Multithreading (OpenMP)
LAMMPS documentation for the OMP package
Command-line options (explanation for -sf style or -suffix style)
