HPC/Applications/lammps
From CNM Wiki (latest revision as of 21:58, August 11, 2015)

== Binaries ==
Several LAMMPS binaries are provided by the LAMMPS module, giving you options to run with MPI and/or a couple of GPU packages.
Earlier modules only contained one MPI binary.
Binaries compiled with GPU support will not run on nodes without a GPU (CUDA libraries are deliberately installed only on GPU nodes).
Moreover, a binary built with the USER-CUDA package ''will'' attempt to access the GPU by default [http://lammps.sandia.gov/doc/Section_start.html#start_7].
{| class="wikitable" cellpadding="5" style="text-align:left;"
|-
! Binary name !! Description
|-
| <code>'''lmp_openmpi-main'''</code> || The baseline binary, containing the packages shown by [[../#lammps | <code>module help lammps</code>]].
|-
| <code>'''lmp_openmpi'''</code> || The distribution's default name; synonym for <code>lmp_openmpi-main</code>.
|-
| <code>'''lmp_openmpi-gpu'''</code> || The package "gpu" and all packages from main.
|-
| <code>'''lmp_openmpi-user-cuda'''</code> || The package "user-cuda" and all packages from main.
| <code>'''lmp_openmpi-jr'''</code> || A custom build for user J.R.
|}
To use the *-gpu and *-user-cuda binaries, load the [[../#cuda|<code>cuda</code>]] module in addition to lammps.
 
 module load '''cuda'''
 module load lammps
To use any one of the binaries, simply name the appropriate one in the job file; full paths are neither necessary nor recommended.


== Library linking ==
* Consult the [http://lammps.sandia.gov/doc/Section_howto.html#library-interface-to-lammps LAMMPS documentation]
* Carbon-specifics: To point your compiler and linker to the installed LAMMPS module, always use the environment variable <code>$LAMMPS_HOME</code>, never full path names. Edit the makefile of your application and add settings similar to the following:
<source lang="make">
CPPFLAGS += -I${LAMMPS_HOME}/include
FPPFLAGS += -I${LAMMPS_HOME}/include
LDFLAGS += -L$(FFTW3_HOME) -L${LAMMPS_HOME}/lib -llammps -lfftw3 -limf
</source>
: These settings refer to variables customarily used in makefiles for GNU Make. Your package might use different variables. Adapt as needed.
* The library created with the <code>-llammps</code> link option provides the same LAMMPS package set as the main binary, and supports MPI and OpenMP. It is actually equivalent to using:
-llammps_mpi-main
: This variant is available for both ''static'' (*.a) and ''dynamic'' (*.so) linkage but supports no GPUs.
* To use GPU nodes, link with one of the following instead:
-llammps_mpi-gpu
-llammps_mpi-user-cuda
: These variants are only available for ''static'' linkage.
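: For a GPU-capable executable, the same makefile pattern applies; a sketch (same variables as in the fragment above, with the GPU library variant swapped in):
<source lang="make">
# Sketch: link statically against the GPU variant instead of -llammps
LDFLAGS += -L${LAMMPS_HOME}/lib -llammps_mpi-gpu -lfftw3 -limf
</source>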


Inspect a sample code that uses LAMMPS as a library at:
<source lang="bash">
cd ${LAMMPS_HOME}/src/examples/COUPLE/simple
less README Makefile
</source>


== GPU support ==
LAMMPS offers ''two different'' packages for using GPUs, one official, the other user-contributed.
Only one of these packages can be used for a run.
The packages are fully documented in the following sections of the [http://lammps.sandia.gov/doc/Manual.html LAMMPS manual]:
* [http://lammps.sandia.gov/doc/Section_accelerate.html 5. Accelerating LAMMPS performance]
<!-- ** [http://lammps.sandia.gov/doc/Section_accelerate.html#acc_5 5.5 USER-OMP package] -->
** [http://lammps.sandia.gov/doc/Section_accelerate.html#acc_6 5.6 GPU package]
** [http://lammps.sandia.gov/doc/Section_accelerate.html#acc_7 5.7 USER-CUDA package]
** [http://lammps.sandia.gov/doc/Section_accelerate.html#acc_8 5.8 Comparison of GPU and USER-CUDA packages]

To use LAMMPS with GPUs on Carbon you must read and understand these sections. A summary and Carbon-specific details are given in the next section.


=== Using GPU packages ===
* [[../lammps/Package GPU]]
* [[../lammps/Package USER-CUDA]]
* [[../lammps/Package OMP]] – if you really want to.

=== Jobs on Carbon ===
For sample PBS scripts, consult these files:
 $LAMMPS_HOME/sample.job
 $LAMMPS_HOME/sample-hybrid.job
* See also [[HPC/Submitting and Managing Jobs/Example Job Script#GPU nodes]]


== Benchmark (pre-GPU version) ==
Using a sample workload from Sanket ("run9"), I tested various OpenMPI options on node types gen1 and gen2.

LAMMPS performs best on gen2 nodes without extra options, and pretty well on gen1 nodes over ethernet(!).
{| class="wikitable" cellpadding="5" style="text-align:left;"
|-
! Job tag !! Node type !! Interconnect !! Additional OpenMPI options !! Relative speed (1000 steps/3 hours) !! Notes
|-
| gen1 || gen1 || IB || (none) || 36 ||
|-
| gen1srqpin || gen1 || IB || -mca btl_openib_use_srq 1 -mca mpi_paffinity_alone 1 || 39 ||
|-
| gen1eth || gen1 || Ethernet || -mca btl self,tcp || 44 || fastest for gen1
|-
| gen2eth || gen2 || Ethernet || -mca btl self,tcp || 49 ||
|-
| gen2srq || gen2 || IB || -mca btl_openib_use_srq 1 || 59 ||
|-
| gen2          || '''gen2''' || IB          || '''(none)'''                      || '''59'''    || fastest for gen2
|}
<!--
=== Sample job file gen1 ===
<syntaxhighlight lang="bash">
#!/bin/bash
#PBS -l nodes=10:ppn=8:gen1
#PBS -l walltime=1:00:00:00
#PBS -N <jobname>
#PBS -A <account>
#
#PBS -o job.out
#PBS -e job.err
#PBS -m ea
# change into the directory where qsub will be executed
cd $PBS_O_WORKDIR
mpirun  -machinefile  $PBS_NODEFILE -np $PBS_NP \
        -mca btl self,tcp \
        lmp_openmpi < lammps.in > lammps.out 2> lammps.err
</syntaxhighlight>
=== Sample job file gen2 ===
<syntaxhighlight lang="bash">
#!/bin/bash
#PBS -l nodes=10:ppn=8:gen2
#PBS -l walltime=1:00:00:00
#PBS -N <jobname>
#PBS -A <account>
#
#PBS -o job.out
#PBS -e job.err
#PBS -m ea
# change into the directory where qsub will be executed
cd $PBS_O_WORKDIR
mpirun  -machinefile  $PBS_NODEFILE -np $PBS_NP \
        lmp_openmpi < lammps.in > lammps.out 2> lammps.err
</syntaxhighlight>
//-->
== MPI/OpenMP hybrid parallel runs ==
LAMMPS modules since 2012 are compiled with [http://lammps.sandia.gov/doc/Section_accelerate.html#acc_2 <code>yes-user-omp</code>], permitting multi-threaded runs of selected pair styles, and in particular MPI/OpenMP hybrid parallel runs.
Be careful how you allocate CPU cores on compute nodes. Note the following:
* The number of cores on a node reserved for your use  is determined by the qsub <code>ppn=...</code> parameter.
* The number of MPI tasks (call it  <code>ppn_mpi</code>) running on a node is determined by options to mpirun.
* The number of threads that each MPI task runs with is determined by the environment variable <code>OMP_NUM_THREADS</code>, which is 1 by default on Carbon.
* The number of physical cores per node for gen1 and gen2 nodes is 8.
* gen2 nodes have [http://en.wikipedia.org/wiki/Hyperthreading hyperthreading] active, meaning there are 16 ''logical'' cores per node. However:
** The method shown below cannot consistently use hyperthreading, since PBS is told that nodes have exactly 8 cores; ppn requests higher than that cannot be fulfilled.
** My ([[User:Stern|stern]]) own [[HPC/Benchmarks/Generation_1_vs_2#Hyperthreading_and_node-sharing | benchmarks for a memory-intensive DFT program]] were underwhelming.
** The LAMMPS OpenMP author [http://lammps.sandia.gov/doc/Section_accelerate.html#acc_2 reports the same] (near the end of the section):
<blockquote>
Using threads on hyper-threading enabled cores is usually counterproductive, as the cost in additional memory bandwidth requirements is not offset by the gain in CPU utilization through hyper-threading.
</blockquote>
<!-- ** A mild benefit may be conferred for partial hyperthreading. Choose <code>ppn_active</code> such that
ppn_physical = 8 ≤ ppn_active ≤ ppn_logical = 16
* Reserve ''entire'' nodes by adding near the top of the job file:
#PBS -l naccesspolicy=SINGLEJOB
-->
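The core thread-count arithmetic can be sketched with hypothetical values (8 reserved cores per node, 2 MPI tasks per node):

```bash
# Hypothetical values: 8 cores reserved per node (ppn=8), 2 MPI tasks per node
ppn_pbs=8
ppn_mpi=2
# Threads per MPI task; bash integer division truncates any remainder
OMP_NUM_THREADS=$(( ppn_pbs / ppn_mpi ))
echo "$OMP_NUM_THREADS"   # → 4
```

If ppn_mpi does not divide ppn_pbs evenly, some cores go unused, so choose ppn_mpi as a divisor of the ppn request.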
=== Sample job script for hybrid parallel runs ===
In summary, the job script's essential parts are:
<syntaxhighlight lang="bash">
#!/bin/bash
#PBS -l nodes=2:ppn=8
#PBS -l walltime=1:00:00
...
ppn_mpi=2 # user choice
ppn_pbs=$( uniq -c $PBS_NODEFILE | awk '{print $1; exit}' ) # grab first (and usually only) ppn value of the job
export OMP_NUM_THREADS=$(( ppn_pbs / ppn_mpi )) # number of threads available per MPI process (integer arithmetic!)
mpirun -x OMP_NUM_THREADS \
    -machinefile  $PBS_NODEFILE \
    --npernode $ppn_mpi \
    lmp_openmpi \
    -sf omp \
    -in in.script
</syntaxhighlight>
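The <code>uniq -c</code> pipeline above relies on $PBS_NODEFILE listing each node name once per reserved core. A self-contained illustration with a simulated nodefile (the file name and node names here are made up):

```bash
# Simulated $PBS_NODEFILE for nodes=2:ppn=4 (each node listed once per core)
printf 'n001\nn001\nn001\nn001\nn002\nn002\nn002\nn002\n' > nodefile.tmp
# uniq -c prefixes each node name with its repeat count; awk prints the first count
ppn_pbs=$( uniq -c nodefile.tmp | awk '{print $1; exit}' )
rm -f nodefile.tmp
echo "$ppn_pbs"   # → 4
```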


=== Diagnostic for hybrid parallel runs ===

* LAMMPS echoes its parallelization scheme first thing in the output:
 LAMMPS (10 Feb 2012)
   using 4 OpenMP thread(s) per MPI task
 ...
   1 by 2 by 2 MPI processor grid
   104 atoms
 ...
: and near the end:
 Loop time of 124.809 on 16 procs (4 MPI x 4 OpenMP) for 30000 steps with 104 atoms
* To see if OpenMP is really active, log into a compute node while a job is running and run <code>top</code> or <code>psuser</code> – the %CPU field should be about OMP_NUM_THREADS × 100%:
   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  8047 stern     25   0 4017m  33m 7540 R 401.8  0.1   1:41.60 lmp_openmpi
  8044 stern     25   0 4017m  33m 7540 R 399.9  0.1   1:43.50 lmp_openmpi
  4822 root      34  19     0    0    0 S  2.0  0.0 115:34.98 kipmi0
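The OpenMP line of the LAMMPS header can also be checked mechanically; a sketch using simulated output (the header text follows the format shown above):

```bash
# Simulated first lines of a LAMMPS run (format as in the example above)
out='LAMMPS (10 Feb 2012)
  using 4 OpenMP thread(s) per MPI task'
# Field 2 of the "OpenMP thread" line is the thread count
threads=$( echo "$out" | awk '/OpenMP thread/ {print $2}' )
echo "$threads"   # → 4
```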

References