== Package GPU ==
* Provides multi-threaded versions of most pair styles, all dihedral styles, and a few fixes in LAMMPS; for the full list:
*# In your browser, open http://lammps.sandia.gov/doc/Section_commands.html#comm
*# Search for the string '''/gpu'''.
* Supports one physical GPU per LAMMPS MPI process (CPU core).
* Multiple MPI processes (CPU cores) can share a single GPU, and in many cases it will be more efficient to run this way.

=== Usage ===
# Use the command [http://lammps.sandia.gov/doc/package.html '''<code>package gpu</code>'''] near the beginning of your LAMMPS control script. Since all Carbon GPU nodes have just one GPU per node, the first two arguments (called ''first'' and ''last'') must always be zero; the ''split'' argument is not restricted.
# Do one of the following:
#:* Append '''/gpu''' to the style name (e.g. <code>pair_style lj/cut/gpu</code>).
#:* Use the [http://lammps.sandia.gov/doc/suffix.html '''suffix gpu''' command].
#:* On the command line, use the [http://lammps.sandia.gov/doc/Section_start.html#start_7 '''-suffix gpu''' switch].
# In the job file or qsub command line, [http://www.clusterresources.com/torquedocs21/2.1jobsubmission.shtml#resources request a GPU]: <code>#PBS -l nodes=...:gpus=1</code> (the number refers to GPUs per node).
# Call the <code>lmp_openmpi'''-gpu'''</code> binary.

=== Input file examples ===
 package gpu force 0 0 1.0
 package gpu force 0 0 0.75
 package gpu force/neigh 0 0 1.0
 package gpu force/neigh 0 1 -1.0
 …
 pair_style lj/charmm/coul/long'''/gpu''' 8.0 10.0

=== Job file example ===
 #PBS -l nodes=...''':gpus=1'''
 …
 mpirun … lmp_openmpi'''-gpu''' -in ''infile''
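
Putting these pieces together, a complete GPU job file might look roughly like the sketch below. This is an illustration only: the resource requests, walltime, job name, and input file name <code>in.lmp</code> are placeholders, and the <code>module load cuda lammps</code> step is explained under ''Binaries'' below.

 #!/bin/bash
 #PBS -l nodes=1:ppn=12:gpus=1
 #PBS -l walltime=4:00:00
 #PBS -N lammps-gpu
 # CUDA libraries are installed only on GPU nodes; load them alongside LAMMPS
 module load cuda lammps
 cd $PBS_O_WORKDIR
 # all MPI ranks on the node share its single GPU, as configured by the "package gpu" command in the input script
 mpirun -machinefile $PBS_NODEFILE -np $PBS_NP lmp_openmpi-gpu -suffix gpu -in in.lmp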

== Package USER-CUDA ==
* Provides GPU versions of several pair styles, and of long-range Coulombics via the PPPM command.
* Supports only a single CPU (core) per GPU. [That should mean multiple nodes are possible; feasibility and efficiency are to be determined. --[[User:Stern|stern]]]

=== Usage ===
# Optional: Use the command [http://lammps.sandia.gov/doc/package.html '''<code>package cuda</code>'''] near the beginning of your LAMMPS control script to fine-tune settings. This is optional because a LAMMPS binary built with USER-CUDA detects and uses a GPU by default.
# Do one of the following:
#:* Append '''/cuda''' to the style name (e.g. <code>pair_style lj/cut/cuda</code>).
#:* Use the [http://lammps.sandia.gov/doc/suffix.html '''suffix cuda''' command].
#:* On the command line, use the [http://lammps.sandia.gov/doc/Section_start.html#start_7 '''-suffix cuda''' switch].
# Optional: The [http://lammps.sandia.gov/doc/kspace_style.html kspace_style pppm'''/cuda'''] command must be requested explicitly. [I am not sure whether that means other k-space styles ''implicitly'' use the GPU. --[[User:Stern|stern]]]
# In the job file or qsub command line, [http://www.clusterresources.com/torquedocs21/2.1jobsubmission.shtml#resources request a GPU]: <code>#PBS -l nodes=...:gpus=1</code>.
# Call the <code>lmp_openmpi'''-user-cuda'''</code> binary.

=== Input file examples ===
 package cuda gpu/node/special 2 0 2
 package cuda test 3948
 …
 kspace_style pppm'''/cuda''' 1e-5

=== Job file example ===
* Serial job:
 #PBS -l nodes=1:ppn=1''':gpus=1'''
 …
 lmp_openmpi'''-user-cuda''' -suffix cuda -in ''infile''
* Parallel job; note that <code>ppn</code> must still be 1, because only one LAMMPS process (core) per node can use that node's sole GPU:
 #PBS -l nodes=3:ppn=1''':gpus=1'''
 …
 mpirun -machinefile $PBS_NODEFILE -np $PBS_NP lmp_openmpi'''-user-cuda''' -suffix cuda -in ''infile''

== MPI/OpenMP hybrid parallel runs ==
LAMMPS modules since 2012 are compiled with [http://lammps.sandia.gov/doc/Section_accelerate.html#acc_2 <code>yes-user-omp</code>], permitting multi-threaded runs of selected pair styles and, in particular, MPI/OpenMP hybrid parallel runs. To set up such runs, see [[HPC/Submitting and Managing Jobs/Advanced node selection]]; a rough sketch of a hybrid job file is shown below.
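
For orientation only, a hybrid job file could be laid out as in the following sketch. The node count, thread count, and input file name are placeholders, <code>-npernode</code> is an OpenMPI-specific option, the multi-threaded styles are assumed to be selected with the <code>-suffix omp</code> switch, and the ''Advanced node selection'' page remains the authoritative recipe.

 #PBS -l nodes=2:ppn=8
 #PBS -l walltime=8:00:00
 module load lammps
 cd $PBS_O_WORKDIR
 # one MPI rank per node, 8 OpenMP threads per rank (8 physical cores on gen1/gen2 nodes)
 export OMP_NUM_THREADS=8
 mpirun -machinefile $PBS_NODEFILE -npernode 1 lmp_openmpi -suffix omp -in ''infile''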

<!--
Be careful how you allocate CPU cores on compute nodes.
* The number of cores on a node reserved for your use is determined by the qsub <code>ppn=...</code> parameter.
* The number of MPI tasks (call it <code>ppn_mpi</code>) running on a node is determined by options to mpirun.
* The number of threads that each MPI task runs with is determined by the environment variable <code>OMP_NUM_THREADS</code>, which is 1 by default on Carbon.
* The number of physical cores per node is 8 for gen1 and gen2 nodes, and 12 for gen3 nodes.
* gen2 nodes have [http://en.wikipedia.org/wiki/Hyperthreading hyperthreading] active, meaning there are 16 ''logical'' cores per node. However:
** The method shown below cannot consistently use hyperthreading, since PBS is told that nodes have exactly 8 cores; ppn requests higher than that cannot be fulfilled.
** My ([[User:Stern|stern]]) own [[HPC/Benchmarks/Generation_1_vs_2#Hyperthreading_and_node-sharing | benchmarks for a memory-intensive DFT program]] were underwhelming.
** The LAMMPS OpenMP author [http://lammps.sandia.gov/doc/Section_accelerate.html#acc_2 reports the same] (near the end of the section):
<blockquote>
Using threads on hyper-threading enabled cores is usually counterproductive, as the cost in additional memory bandwidth requirements is not offset by the gain in CPU utilization through hyper-threading.
</blockquote>
** A mild benefit may be conferred by partial hyperthreading. Choose <code>ppn_active</code> such that
 ppn_physical = 8 ≤ ppn_active ≤ ppn_logical = 16
* Reserve ''entire'' nodes by adding near the top of the job file:
 #PBS -l naccesspolicy=SINGLEJOB
-->

== Binaries ==
As of module version <code>lammps/2012-10-10-3</code> (currently the default), several LAMMPS binaries are provided within one module.
Binaries compiled with GPU support will not run on nodes without a GPU (CUDA libraries are deliberately installed only on GPU nodes).
Moreover, a binary built with the USER-CUDA package will attempt to access the GPU by default [1].

{| class="wikitable"
! Binary name !! Description
|-
| <code>lmp_openmpi-main</code> || The baseline binary, containing the packages shown by <code>module help lammps</code>.
|-
| <code>lmp_openmpi</code> || The distribution's default name; a synonym for <code>lmp_openmpi-main</code>.
|-
| <code>lmp_openmpi-gpu</code> || The "gpu" package and all packages from ''main''.
|-
| <code>lmp_openmpi-user-cuda</code> || The "user-cuda" package and all packages from ''main''.
|-
| <code>lmp_openmpi-jr</code> || A custom build for user J.R.
|}

To use the <code>*-gpu</code> and <code>*-user-cuda</code> binaries, load the <code>cuda</code> module in addition to <code>lammps</code>:
 module load cuda lammps
To use any one of the binaries, simply name the appropriate one in the job file; full paths are neither necessary nor recommended.
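
For example, a job file for a plain CPU run needs nothing more than the module load and the bare binary name (the input file name is a placeholder):

 module load lammps
 mpirun -machinefile $PBS_NODEFILE -np $PBS_NP lmp_openmpi-main -in ''infile''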

== GPU support ==
LAMMPS offers two different packages for using GPUs, one official, the other user-contributed.
Only one of these packages can be used in a given run.
Both packages are fully documented in the LAMMPS manual's [http://lammps.sandia.gov/doc/Section_accelerate.html section on accelerating LAMMPS].
To use LAMMPS with GPUs on Carbon you must read and understand that documentation. A summary and Carbon-specific details are given in the two sections below.

=== General note on GPU jobs ===
* To request that your job run on a GPU node, use in the job file (or on the <code>qsub</code> command line, as sketched after this list):
 #PBS -l nodes=…:gpus=1
: At the moment this is synonymous with, but preferable to:
 #PBS -l nodes=…:gen3
* Each GPU node has 12 cores; if you submit jobs with <code>:ppn < 12</code> and <code>:gpus=1</code>, the node may be shared with purely CPU jobs. It remains to be tested whether, and how much, interference this causes for either job. See [[HPC/Submitting and Managing Jobs/Advanced node selection|Advanced node selection]] to reserve entire nodes while controlling <code>ppn</code> for MPI or OpenMP.
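
If interactive jobs are permitted for your queue, the same resources can be requested directly on the <code>qsub</code> command line, e.g. for a short test session on a GPU node (a sketch; adjust <code>ppn</code> and the walltime to your needs):

 qsub -I -l nodes=1:ppn=12:gpus=1,walltime=1:00:00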

=== Package-specific hints ===
* [[HPC/Applications/lammps/Package GPU]]
* [[HPC/Applications/lammps/Package USER-CUDA]]
* [[HPC/Applications/lammps/Package OMP]]

For sample PBS scripts, consult these files:
 $LAMMPS_HOME/sample.job
 $LAMMPS_HOME/sample-hybrid.job

== Benchmark (pre-GPU version) ==
Using a sample workload from Sanket ("run9"), I tested various OpenMPI options on node types gen1 and gen2.
LAMMPS performs best on gen2 nodes without extra options, and pretty well on gen1 nodes over ethernet(!).
An example of passing these options to <code>mpirun</code> is shown after the table.

{| class="wikitable"
! Job tag !! Node type !! Interconnect !! Additional OpenMPI options !! Relative speed (1000 steps / 3 hours) !! Notes
|-
| gen1 || gen1 || IB || (none) || 36 ||
|-
| gen1srqpin || gen1 || IB || <code>-mca btl_openib_use_srq 1 -mca mpi_paffinity_alone 1</code> || 39 ||
|-
| gen1eth || gen1 || Ethernet || <code>-mca btl self,tcp</code> || 44 || fastest for gen1
|-
| gen2eth || gen2 || Ethernet || <code>-mca btl self,tcp</code> || 49 ||
|-
| gen2srq || gen2 || IB || <code>-mca btl_openib_use_srq 1</code> || 59 ||
|-
| gen2 || gen2 || IB || (none) || 59 || fastest for gen2
|}
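
The options in the ''Additional OpenMPI options'' column are passed verbatim on the <code>mpirun</code> line of a job file. For instance, to force ethernet-only transport as in the ''gen1eth'' and ''gen2eth'' runs (a sketch; the input file name is a placeholder):

 mpirun -machinefile $PBS_NODEFILE -np $PBS_NP -mca btl self,tcp lmp_openmpi -in ''infile''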

== Diagnostic for hybrid parallel runs ==
* LAMMPS echoes its parallelization scheme at the very beginning of the output:
 LAMMPS (10 Feb 2012)
   using 4 OpenMP thread(s) per MPI task
 ...
   1 by 2 by 2 MPI processor grid
   104 atoms
 ...
: and near the end:
 Loop time of 124.809 on 16 procs (4 MPI x 4 OpenMP) for 30000 steps with 104 atoms
* To see if OpenMP is really active, log into a compute node while a job is running and run <code>top</code> or <code>psuser</code>; the <code>%CPU</code> field should be about <code>OMP_NUM_THREADS</code> × 100%:
   PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
  8047 stern     25   0 4017m  33m 7540 R 401.8  0.1   1:41.60 lmp_openmpi
  8044 stern     25   0 4017m  33m 7540 R 399.9  0.1   1:43.50 lmp_openmpi
  4822 root      34  19     0    0    0 S   2.0  0.0 115:34.98 kipmi0

== References ==