HPC/Benchmarks/Generation 1 vs 2

From CNM Wiki

Introduction

Earlier this year, we received 200 additional nodes with E5540 processors. Each node has 8 cores and supports Hyperthreading (HT), a feature which allows 2 threads per core. This benchmark investigates the benefit of HT and suggests optimal values for the nodes and processors per node (ppn) parameters in PBS.

Objective variables

tmax
The wallclock runtime of a job in seconds. This is the time metric used for calculating the job charge.
charge
The charge for the job, expressed in core-hours per application. Since the present test may use more than one application in a job, the value is scaled by the number of applications run within the job. This reflects the situation that the user will be interested in getting things done for minimal charge.

Note: In the near future, a charge factor will be introduced for gen2 nodes. It will scale actual node-hours to effective node-hours in a manner that levels the performance difference between the node generations. At present, no charge factor is applied.

Miscellaneous

charge_cores
The number of cores requested from the queuing system, i.e., those blocked from use by other users.
perf
A measure for the performance per core. It should be proportional to the FLoating point Operations Per Second (FLOPS) achieved. Since the actual number of operations for the chosen workload is unknown, the value here is calculated using an arbitrary constant to produce convenient values:
   perf := 100000 * napps / tmax / charge_cores
HT
hyperthreading is in effect for this run
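
The two derived quantities can be reproduced from tmax with a few lines of Python. The sketch below is illustrative and is not the script used for the benchmark: the function names are hypothetical, and it assumes that charge_cores counts all 8 cores of every allocated node, which is consistent with the charge values listed under Recommendations.

```python
def charge(tmax, charge_cores, napps=1):
    """Core-hours per application: blocked cores times wallclock hours, divided by napps."""
    return charge_cores * tmax / 3600.0 / napps


def perf(tmax, charge_cores, napps=1):
    """Per-core performance figure, using the arbitrary constant from the definition."""
    return 100000.0 * napps / tmax / charge_cores


# run=22 (gen2, nodes=1:ppn=8): one node, assumed to block 8 cores
print(round(charge(472.16, 8), 2))  # 1.05, matching the listed charge
print(round(perf(472.16, 8), 2))
```

Because perf is inversely proportional to tmax × charge_cores, a configuration with a lower charge per application always shows a higher perf value.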

Parameters

gen
The node hardware generation
  • gen1 = Intel Xeon X5355, 2.66GHz, 8 cores per node, 16 GB RAM per node (2 GB/core)
  • gen2 = Intel Xeon E5540, 2.53GHz, 8 cores per node, 24 GB RAM per node (3 GB/core), Hyperthreading enabled
nodes
Number of nodes used for the PBS job.
ppn
Processors per node used in the PBS job.
napps
Number of applications run in parallel within the job. Normally, a user will run only one application. This test allows running more than one application in parallel, dividing the available cores on each participating node equally among them via the OpenMPI --npernode n flag.
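
How the cores might be split among applications can be sketched as follows. This is an assumption about the mechanics, not the actual test harness: the helper name mpirun_commands is hypothetical, and the sketch only builds the command lines, using the --npernode flag mentioned above and the VASP path given in the test description.

```python
VASP = "/opt/soft/vasp-4.6.35-mkl-8/bin/vasp"  # path quoted in the test description


def mpirun_commands(ppn, napps, exe=VASP):
    """Build one mpirun command per application, each using ppn/napps cores per node."""
    if napps < 1 or ppn % napps != 0:
        raise ValueError("ppn must be a positive multiple of napps for an equal split")
    per_app = ppn // napps  # processes each application starts on every node
    return [["mpirun", "--npernode", str(per_app), exe] for _ in range(napps)]


# ppn=8 shared by two applications: each gets 4 processes per node
for cmd in mpirun_commands(ppn=8, napps=2):
    print(" ".join(cmd))
```

In practice, each application would also need its own working directory, since VASP reads INCAR, KPOINTS, POSCAR and POTCAR from the current directory.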


Test description

The test runs /opt/soft/vasp-4.6.35-mkl-8/bin/vasp with the following workload (Credit: D. Shin, Northwestern Univ.).

INCAR

SYSTEM = Al12Mg17
ISTART = 0
ISMEAR = 1
SIGMA  = 0.1
ISIF   = 3
PREC   = HIGH
IBRION = 2
LWAVE  = .FALSE.
LCHARG = .FALSE.
LREAL  = .TRUE.
ENCUT  = 346

KPOINTS

KPOINTS file
0
Monkhorst-Pack
10 10 10
0 0 0

POSCAR

Al12Mg17
1.0000000000
-5.2719000000 5.2719000000 5.2719000000
5.2719000000 -5.2719000000 5.2719000000
5.2719000000 5.2719000000 -5.2719000000
12 17
Direct
0.3679000000 0.3679000000 0.1908000000    Al
0.1771000000 0.1771000000 0.8092000000    Al
0.6321000000 0.8229000000 0.0000000000    Al
0.8229000000 0.6321000000 0.0000000000    Al
0.1908000000 0.3679000000 0.3679000000    Al
0.3679000000 0.1908000000 0.3679000000    Al
0.0000000000 0.8229000000 0.6321000000    Al
0.1771000000 0.8092000000 0.1771000000    Al
0.8092000000 0.1771000000 0.1771000000    Al
0.8229000000 0.0000000000 0.6321000000    Al
0.0000000000 0.6321000000 0.8229000000    Al
0.6321000000 0.0000000000 0.8229000000    Al
0.3975000000 0.3975000000 0.7164000000    Mg
0.6811000000 0.6811000000 0.2836000000    Mg
0.6025000000 0.3189000000 0.0000000000    Mg
0.3189000000 0.6025000000 0.0000000000    Mg
0.7164000000 0.3975000000 0.3975000000    Mg
0.3975000000 0.7164000000 0.3975000000    Mg
0.0000000000 0.3189000000 0.6025000000    Mg
0.6811000000 0.2836000000 0.6811000000    Mg
0.2836000000 0.6811000000 0.6811000000    Mg
0.3189000000 0.0000000000 0.6025000000    Mg
0.0000000000 0.6025000000 0.3189000000    Mg
0.6025000000 0.0000000000 0.3189000000    Mg
0.6480000000 0.6480000000 0.6480000000    Mg
0.0000000000 0.0000000000 0.3520000000    Mg
0.3520000000 0.0000000000 0.0000000000    Mg
0.0000000000 0.3520000000 0.0000000000    Mg
0.0000000000 0.0000000000 0.0000000000    Mg

POTCAR

PAW_GGA Al 05Jan2001
3.00000000000000000
parameters from PSCTR are:
VRHFIN =Al: s2p1
LEXCH  = 91
EATOM  =    53.6910 eV,    3.9462 Ry
...

PAW_GGA Mg 05Jan2001
2.00000000000000000
parameters from PSCTR are:
VRHFIN =Mg: s2p0
LEXCH  = 91
EATOM  =    23.0823 eV,    1.6965 Ry
...

(abbreviated)

Results

Raw data

Observations

  • 4-core runs give a high numerical throughput in each node type (run=01 to 04)
  • gen2 nodes are fine for VASP with nodes=1:ppn=8; gen1 nodes are not (run=22 vs. 21)
  • Adding more nodes allows either the fastest run (run=54) or a run that is about 40% slower but has a better charge rate (run=52)
  • Running two apps in a single job is mostly not worth the effort of managing them (run=04 vs. 22)
  • HT allows for slightly better charge rates, but usually only with non-MPI jobs (or unsynced MPI jobs) (run=15, 25, 40, 55), and runtimes are nearly proportionately longer, making HT largely unattractive. This also holds for the only case tested for HT and napps=1 (run=50).
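
The trade-off between run=54 and run=52 can be checked with simple arithmetic on the tmax and charge values listed under Recommendations. The snippet below is only a worked example; the variable and function names are hypothetical.

```python
# tmax (s) and charge (core-hours) as listed under Recommendations
fastest = {"run": 54, "tmax": 237.59, "charge": 2.11}
cheaper = {"run": 52, "tmax": 329.10, "charge": 1.46}


def relative_change(new, old):
    """Fractional change of new with respect to old."""
    return (new - old) / old


slowdown = relative_change(cheaper["tmax"], fastest["tmax"])
saving = -relative_change(cheaper["charge"], fastest["charge"])
print(f"run=52 vs run=54: {slowdown:.0%} longer runtime, {saving:.0%} lower charge")
# prints: run=52 vs run=54: 39% longer runtime, 31% lower charge
```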

Recommendations

For the given workload, the following values for optimal performance with respect to the given objective can be recommended:

Node type   Objective
            time → min       charge → min     time × charge → min
gen1        nodes=4:ppn=3    nodes=1:ppn=4    nodes=3:ppn=4
            run=35           run=01           run=33
            tmax=503.05      tmax=1138.83     tmax=544.74
            charge=4.47      charge=2.53      charge=3.63
gen2        nodes=4:ppn=4    nodes=1:ppn=8    nodes=2:ppn=8
            run=54           run=22           run=52
            tmax=237.59      tmax=472.16      tmax=329.10
            charge=2.11      charge=1.05      charge=1.46
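
How the three objectives pick different runs can be illustrated with the six runs listed above. This is a minimal sketch, not the original analysis: it covers only these six runs (the full raw data is not reproduced here), and the names RUNS, OBJECTIVES and best are hypothetical.

```python
RUNS = [
    # (gen, run, PBS request, tmax in s, charge in core-hours)
    ("gen1", 35, "nodes=4:ppn=3", 503.05, 4.47),
    ("gen1", 1, "nodes=1:ppn=4", 1138.83, 2.53),
    ("gen1", 33, "nodes=3:ppn=4", 544.74, 3.63),
    ("gen2", 54, "nodes=4:ppn=4", 237.59, 2.11),
    ("gen2", 22, "nodes=1:ppn=8", 472.16, 1.05),
    ("gen2", 52, "nodes=2:ppn=8", 329.10, 1.46),
]

OBJECTIVES = {
    "time": lambda r: r[3],
    "charge": lambda r: r[4],
    "time x charge": lambda r: r[3] * r[4],
}


def best(gen, objective):
    """Return the listed run of the given generation that minimizes the objective."""
    candidates = [r for r in RUNS if r[0] == gen]
    return min(candidates, key=OBJECTIVES[objective])


for gen in ("gen1", "gen2"):
    for name in OBJECTIVES:
        _, run, request, tmax, chg = best(gen, name)
        print(f"{gen}  {name:14s} run={run:02d}  {request}  tmax={tmax}  charge={chg}")
```

The product time × charge favors the middle configurations (run=33 and run=52), which give up some speed in exchange for a noticeably lower charge.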

--stern