HPC/Benchmarks/Generation 1 vs 2: Difference between revisions

From CNM Wiki
(95 intermediate revisions by the same user not shown)
Line 1: Line 1:
{| align="right"
| __TOC__
|}
== Introduction ==
== Introduction ==
Earlier this year, we received 200 additional nodes with E5540 processors.  The processors have 8 cores, and support Hyperthreading, a feature which allows 2 threads per core.  This benchmark investigates the benefit of hyperthreading (HT), and suggests optimal values for the ''nodes'' and ''processors per node (ppn)'' parameters in PBS.
In early 2010, we received 200 additional nodes with two E5540 processors each.  The processors have 4 cores each, and support hyperthreading, a feature which allows 2 threads per core.  This benchmark investigates the benefit of hyperthreading (HT), and suggests optimal values for the ''nodes'' and ''ppn'' (processors per node) parameters in PBS.  The choice is not trivial and involves tradeoffs between various metrics such as execution time or compute-hours charged.


== Test description ==
== Test description ==
The test runs <code>/opt/soft/vasp-4.6.35-mkl-8/bin/vasp</code> with the following workload (Credit: D. Shin, Northwestern Univ.).
This test runs <code>/opt/soft/vasp-4.6.35-mkl-8/bin/vasp</code>.
* [[HPC/Benchmarks/Generation 1 vs 2/vasp/input | Full input data]] (Credit: D. Shin, Northwestern University).


=== INCAR ===
== Explanation of data columns ==
<pre>
=== Parameters ===
SYSTEM = Al12Mg17
; cores/app: The number of cores that a single application workload is executed on. Typically, this value is used in studies of the ''parallel scaling'' (parallelization efficiency) of an application.
ISTART = 0
; napps: The number of applications run in parallel within the job. Typically, a user will run only one (MPI) application at a time. This benchmark allows running more than one application in parallel, equally ''subdividing'' the available cores on each participating node via the OpenMPI <code>--npernode ''n'' </code> flag. Motivation: As a hypothesis, I considered it possibly beneficial to run several (related or unrelated) applications on a processor, but not within the same MPI job.  Different workloads would minimize the chance of congestion of the processors' pipelining architecture.
ISMEAR = 1
; gen: The node hardware generation
SIGMA  = 0.1
:* gen1 = Intel Xeon X5355, 2.66GHz, 8 cores per node, 16 GB RAM per node (2 GB/core)
ISIF  = 3
:* gen2 = Intel Xeon E5540, 2.53GHz, 8 cores per node, 24 GB RAM per node (3 GB/core), hyperthreading ''enabled'' in hardware (BIOS)
PREC  = HIGH
; nodes: Number of nodes requested from the queuing system.
IBRION = 2
; ppn: Processors per node requested from the queuing system.
LWAVE = .FALSE.
LCHARG = .FALSE.
LREAL = .TRUE.
ENCUT  = 346
</pre>


=== KPOINTS ===
=== Objective variables ===
<pre>
KPOINTS file
0
Monkhorst-Pack
10 10 10
0 0 0
</pre>


=== POSCAR ===
; tmax → min: The wallclock runtime of a job in seconds. When multiple applications are run, the longest runtime is used. This is the time metric used for calculating the job charge.
<pre>
; charge → min: The charge for the job, expressed in core-hours per application. Since the present test may use more than one application in a job, the value is normalized by the number of applications run within the job. This reflects the situation that the user will be interested in getting things done for minimal charge.
Al12Mg17
: '''Note:''' In the near future, a ''charge factor'' will be introduced for gen2 nodes which will scale ''actual'' node-hours to ''effective'' node hours, in a manner that levels the performance difference between the node generations. At present, no charge factor is applied.
1.0000000000
; combined objective tmax × charge → min:  An empirical objective to minimize ''both'' time and charge. Since the charge is proportional to time, the time effectively enters quadratically.
-5.2719000000 5.2719000000 5.2719000000
; perf  → max: A measure for the performance per core. It should be proportional to the [http://en.wikipedia.org/wiki/FLOPS FLoating point Operations Per Second (FLOPS)] achieved. Since the actual number of operations for the chosen workload is unknown, the value here is calculated using an arbitrary constant to produce convenient values:
5.2719000000 -5.2719000000 5.2719000000
    perf := 100000 * napps / tmax / charge_cores
5.2719000000 5.2719000000 -5.2719000000
12 17
Direct
0.3679000000 0.3679000000 0.1908000000    Al
0.1771000000 0.1771000000 0.8092000000    Al
0.6321000000 0.8229000000 0.0000000000    Al
0.8229000000 0.6321000000 0.0000000000    Al
0.1908000000 0.3679000000 0.3679000000    Al
0.3679000000 0.1908000000 0.3679000000    Al
0.0000000000 0.8229000000 0.6321000000    Al
0.1771000000 0.8092000000 0.1771000000    Al
0.8092000000 0.1771000000 0.1771000000    Al
0.8229000000 0.0000000000 0.6321000000    Al
0.0000000000 0.6321000000 0.8229000000    Al
0.6321000000 0.0000000000 0.8229000000    Al
0.3975000000 0.3975000000 0.7164000000    Mg
0.6811000000 0.6811000000 0.2836000000    Mg
0.6025000000 0.3189000000 0.0000000000    Mg
0.3189000000 0.6025000000 0.0000000000    Mg
0.7164000000 0.3975000000 0.3975000000    Mg
0.3975000000 0.7164000000 0.3975000000    Mg
0.0000000000 0.3189000000 0.6025000000    Mg
0.6811000000 0.2836000000 0.6811000000    Mg
0.2836000000 0.6811000000 0.6811000000    Mg
0.3189000000 0.0000000000 0.6025000000    Mg
0.0000000000 0.6025000000 0.3189000000    Mg
0.6025000000 0.0000000000 0.3189000000    Mg
0.6480000000 0.6480000000 0.6480000000    Mg
0.0000000000 0.0000000000 0.3520000000    Mg
0.3520000000 0.0000000000 0.0000000000    Mg
0.0000000000 0.3520000000 0.0000000000    Mg
0.0000000000 0.0000000000 0.0000000000    Mg
</pre>


=== POTCAR ===
=== Miscellaneous ===
<pre>
PAW_GGA Al 05Jan2001
3.00000000000000000
parameters from PSCTR are:
VRHFIN =Al: s2p1
LEXCH  = 91
EATOM  =    53.6910 eV,    3.9462 Ry
...


PAW_GGA Mg 05Jan2001
; run: a sequence number, used to identify the run and its various files and directories.
2.00000000000000000
; charge_cores: the number of cores requested from the queuing system, i.e., those ''blocked'' from use by other users. The tests within the benchmark ran with the qsub option  (lowercase "ell") <code>-l naccesspolicy=singlejob</code>.
parameters from PSCTR are:
; HT: hyperthreading is in effect for this run
VRHFIN =Mg: s2p0
LEXCH  = 91
EATOM  =    23.0823 eV,   1.6965 Ry
...
</pre>
(abbreviated)


== Data ==
== Results ==
* [[HPC/Generation-2 nodes/vasp/vasp.lst]], grep-able
=== Data ===
* [[media:Vasp.txt]], tab-separated CSV
==== Files ====
* [[media:Vasp.pdf]], PDF
* [[HPC/Generation-2 nodes/vasp/vasp.lst | Raw data, grep-able]]
* [[media:Vasp.txt | Raw data, tab-separated CSV]]
* [[media:Vasp.pdf | Analysis (PDF)]], with comparisons made by tmax, charge, and both.  The last page contains a direct comparison of the performance of gen1 vs. gen2 nodes.


== Observations ==
==== Gen1 nodes ====
* 4-core runs give a high numerical throughput in each node type  (run=01 to 04)
{| class="wikitable sortable" cellpadding="4" style="text-align:center;  margin: 1em auto 1em auto;"
* gen2 nodes are fine for VASP with nodes=1:ppn=8; gen1 nodes are not (run=22 vs. 21)
|- style="background:#eee;"
* Adding more nodes allows for the fastest run (run=54) or 40% slower and a better charge rate (run=52)
! cores/app<br><br> !! run<br><br> !! nodes<br><br> !! ppn<br><br> !! napps<br><br> !! tmax<br><br> !! charge-<br>cores<br> !! charge raw<br>(core-h/app)<br> !! perf<br>(ops/s/core)<br> !! combined<br>objective<br> !! HT<br><br>
* Running two apps in a single job is mostly not worth the effort of managing them (run=04 vs. 22)
|-
* HT allows for slightly better charge rates, but usually only with non-MPI jobs (or unsynced MPI jobs) (run=15, 25, 40, 55), and runtimes are nearly proportionately longer, making HT largely unattractive.  This also holds for the only case tested for HT and napps=1 (run=50).
| 4 || 1 || <font color="#0a0">'''1'''</font> || <font color="#0a0">'''4'''</font> || 1 || 1138.83 || 8 || <font color="#0a0">'''2.53'''</font> || <font color="#0a0">'''11.0'''</font> || 28.82 || 
|- style="color:#ccc;"
| 4 || 3 || 1 || 8 || 2 || 2488.86 || 8 || 2.77 || 10.0 || 68.83 || 
|-
| 6 || 11 || 1 || 6 || 1 || 1566.07 || 8 || 3.48 || 8.0 || 54.50 || 
|-
| 6 || 13 || 2 || 3 || 1 || 816.61 || 16 || 3.63 || 7.7 || 29.64 || 
|- style="color:#ccc;"
| 6 || 16 || 2 || 6 || 2 || 1401.48 || 16 || 3.11 || 8.9 || 43.65 || 
|-
| 8 || 21 || 1 || 8 || 1 || 1488.72 || 8 || 3.31 || 8.4 || 49.25 || 
|-
| 8 || 23 || 2 || 4 || 1 || 791.37 || 16 || 3.52 || 7.9 || 27.83 || 
|- style="color:#ccc;"
| 8 || 26 || 2 || 8 || 2 || 1494.81 || 16 || 3.32 || 8.4 || 49.65 || 
|-
| 12 || 31 || 2 || 6 || 1 || 838.08 || 16 || 3.72 || 7.5 || 31.22 || 
|-
| 12 || 33 || <font color="#0a0">'''3'''</font> || <font color="#0a0">'''4'''</font> || 1 || 544.74 || 24 || 3.63 || 7.6 || <font color="#0a0">'''19.78'''</font> || 
|-
| 12 || 35 || <font color="#0a0">'''4'''</font> || <font color="#0a0">'''3'''</font> || 1 || <font color="#0a0">'''503.05'''</font> || 32 || 4.47 || 6.2 || 22.49 || 
|- style="color:#ccc;"
| 12 || 41 || 3 || 8 || 2 || 1117.20 || 24 || 3.72 || 7.5 || 41.60 || 
|- style="color:#ccc;"
| 12 || 43 || 4 || 6 || 2 || 838.51 || 32 || 3.73 || 7.5 || 31.25 || 
|-
| 16 || 51 || 2 || 8 || 1 || 998.73 || 16 || 4.44 || 6.3 || 44.33 || 
|-
| 16 || 53 || 4 || 4 || 1 || 522.50 || 32 || 4.64 || 6.0 || 24.27 || 
|- style="color:#ccc;"
| 16 || 56 || 4 || 8 || 2 || 1626.13 || 32 || 7.23 || 3.8 || 117.52 || 
|}
 
==== Gen2 nodes ====
{| class="wikitable sortable" cellpadding="4" style="text-align:center;  margin: 1em auto 1em auto;"
|- style="background:#eee;"
! cores/app<br><br> !! run<br><br> !! nodes<br><br> !! ppn<br><br> !! napps<br><br> !! tmax<br><br> !! charge-<br>cores<br> !! charge raw<br>(core-h/app)<br> !! perf<br>(ops/s/core)<br> !! combined<br>objective<br> !! HT<br><br>
|-
| 4 || 2 || 1 || 4 || 1 || 516.69 || 8 || 1.15 || 24.2 || 5.93 || 
|- style="color:#ccc;"
| 4 || 4 || 1 || 8 || 2 || 767.26 || 8 || 0.85 || 32.6 || 6.54 || 
|-
| 6 || 12 || <font color="#0a0">'''1'''</font> || <font color="#0a0">'''6'''</font> || 1 || 470.10 || 8 || <font color="#0a0">'''1.04'''</font> || <font color="#0a0">'''26.6'''</font> || 4.91 || 
|-
| 6 || 14 || 2 || 3 || 1 || 447.01 || 16 || 1.99 || 14.0 || 8.88 || 
|- style="color:#ccc;"
| 6 || 15 || 1 || 12 || 2 || 867.12 || 8 || 0.96 || 28.8 || 8.35 || HT
|- style="color:#ccc;"
| 6 || 17 || 2 || 6 || 2 || 587.55 || 16 || 1.31 || 21.3 || 7.67 || 
|-
| 8 || 22 || <font color="#0a0">'''1'''</font> || <font color="#0a0">'''8'''</font> || 1 || 472.16 || 8 || <font color="#0a0">'''1.05'''</font> || <font color="#0a0">'''26.5'''</font> || 4.95 || 
|-
| 8 || 24 || 2 || 4 || 1 || 426.55 || 16 || 1.90 || 14.7 || 8.09 || 
|- style="color:#ccc;"
| 8 || 25 || 1 || 16 || 2 || 927.22 || 8 || 1.03 || 27.0 || 9.55 || HT
|- style="color:#ccc;"
| 8 || 27 || 2 || 8 || 2 || 596.30 || 16 || 1.33 || 21.0 || 7.90 || 
|-
| 12 || 30 || 1 || 12 || 1 || 565.44 || 8 || 1.26 || 22.1 || 7.10 || HT
|-
| 12 || 32 || 2 || 6 || 1 || 330.88 || 16 || 1.47 || 18.9 || 4.87 || 
|-
| 12 || 34 || 3 || 4 || 1 || 270.06 || 24 || 1.80 || 15.4 || 4.86 || 
|-
| 12 || 36 || 4 || 3 || 1 || 267.81 || 32 || 2.38 || 11.7 || 6.38 || 
|- style="color:#ccc;"
| 12 || 40 || 2 || 12 || 2 || 582.08 || 16 || 1.29 || 21.5 || 7.53 || HT
|- style="color:#ccc;"
| 12 || 42 || 3 || 8 || 2 || 383.44 || 24 || 1.28 || 21.7 || 4.90 || 
|- style="color:#ccc;"
| 12 || 44 || 4 || 6 || 2 || 321.16 || 32 || 1.43 || 19.5 || 4.58 || 
|-
| 16 || 50 || 1 || 16 || 1 || 592.26 || 8 || 1.32 || 21.1 || 7.79 || HT
|-
| 16 || 52 || <font color="#0a0">'''2'''</font> || <font color="#0a0">'''8'''</font> || 1 || 329.10 || 16 || 1.46 || 19.0 || <font color="#0a0">'''4.81'''</font> || 
|-
| 16 || 54 || <font color="#0a0">'''4'''</font> || <font color="#0a0">'''4'''</font> || 1 || <font color="#0a0">'''237.59'''</font> || 32 || 2.11 || 13.2 || 5.02 || 
|- style="color:#ccc;"
| 16 || 55 || 2 || 16 || 2 || 601.78 || 16 || 1.34 || 20.8 || 8.05 || HT
|- style="color:#ccc;"
| 16 || 57 || 4 || 8 || 2 || 327.90 || 32 || 1.46 || 19.1 || 4.78 || 
|}
 
=== Observations ===
* For the same core constellation in a workload, '''gen2 nodes are 2...3 times faster than gen1 nodes''' – see last page in the [https://wiki.anl.gov/wiki_cnm/images/e/e7/Vasp.pdf analysis (PDF)].
* 4-core runs give the highest numerical throughput in each node type  (run=01 to 04).
* gen2 nodes are fine for VASP with nodes=1:ppn=8; gen1 nodes are not (run=22 vs. 21).
 
=== Hyperthreading and node-sharing ===
* Naïve application of hyperthreading (HT) is detrimental - it leads to ''increased'' runtimes and thus ''increased'' charges.
** run=50 (ppn=16), charge 26% higher than run=22 (ppn=8)
** run=30 (ppn=12), charge 20% higher than run=22 (ppn=8)
* However, when non-MPI jobs or two ''unsynced'' MPI jobs are running (run=15, 25, 17), HT yields 10...20% ''lower'' charge rates.  This is a pittance and makes HT largely unattractive.  Running unsynced MPI jobs (i.e., sharing nodes) is mildly beneficial in all cases, whether hyperthreading is used or not (runs=04, 15, 25; see pg. 4 in the PDF).  The best-case scenario gives a charge savings of 24% (run=15 vs. 30, at nodes=1:ppn=12).  This holds even on gen1 nodes (e.g., runs=26 vs. 51, at nodes=2:ppn=8,  charge savings = 25%).
* Running two apps in a single job is mostly not worth the effort of managing them.
* Sharing nodes confers a mild benefit and should be rewarded in a charge discount.


== Recommendations ==
== Recommendations ==
'''For the given workload,''' I recommend the following values for optimal performance with respect to the given objective.
''For the given workload,'' the following values for optimal performance with respect to the given objective can be recommended:


{| class="wikitable" cellpadding="5" style="text-align:center;  margin: 1em auto 1em auto;"
{| class="wikitable" cellpadding="5" style="text-align:center;  margin: 1em auto 1em auto;"
Line 114: Line 155:
| width="200px" | time  × charge → min
| width="200px" | time  × charge → min
|-
|-
|  gen1  ||  <font color="blue"><code>nodes=4:ppn=3</code></font>  ||  <font color="blue"><code>nodes=1:ppn=4</code></font>  ||  <font color="blue"><code>nodes=3:ppn=4</code></font>
|  gen1  ||  <font color="#0a0">'''<code>nodes=4:ppn=3:gen1</code>'''</font>  ||  <font color="#0a0">'''<code>nodes=1:ppn=4:gen1</code>'''</font>  ||  <font color="#0a0">'''<code>nodes=3:ppn=4:gen1</code>'''</font>
|-
|-
|
|
Line 121: Line 162:
| ''run=33<br>tmax=544.74<br>charge=3.63''
| ''run=33<br>tmax=544.74<br>charge=3.63''
|-
|-
|  gen2  ||  <font color="blue"><code>nodes=4:ppn=4</code></font>  ||  <font color="blue"><code>nodes=1:ppn=8</code></font>  ||  <font color="blue"><code>nodes=2:ppn=8</code></font>
|  gen2  ||  <font color="#0a0">'''<code>nodes=4:ppn=4:gen2</code>'''</font>  ||  <font color="#0a0">'''<code>nodes=1:ppn=8:gen2</code>'''</font>  ||  <font color="#0a0">'''<code>nodes=2:ppn=8:gen2</code>'''</font>
|-
|-
|
|
Line 129: Line 170:
|-
|-
|}
|}
<pre>
The last column is an empirically formulated objective to minimize both time and charge.  Compared to the minimum-charge runs (second Objective column, runs 01 and 22, respectively), adding nodes will reduce the runtime and only slightly increase the charge. The fastest runs (first Objective column) use the same number of cores for calculation, but since the number of nodes is higher, so is charge_cores, and thus the job charge is also higher.
 
== Update ==
=== Adjusting VASP input ===
The [http://cms.mpi.univie.ac.at/vasp/vasp/vasp.html VASP manual] recommends setting [http://cms.mpi.univie.ac.at/vasp/vasp/node146.html NPAR] to the ''number of nodes'' (understood to be the number of cores, since at the time single-core CPUs were all the rage).  Nowadays, it would be better to set
NPAR = number of nodes
meaning the ''number of SMP nodes'', which is the current state of the art.
 
(Credit: M. Chan).
 
=== Charge rates ===
Given the above benchmark results, the charge rate on gen1 nodes was set to 0.5 early on (soon after the gen2 nodes were put into production). This means that a task will incur approximately the same charge when run on either gen1 or gen2 nodes. Short tasks even have a certain advantage to run on gen1 nodes, since often the wait time there is lower, while the longer execution time matters less for the user (being small to begin with).
 
Charges in the benchmark above are raw, without this factor, i.e., ''charge cores <math>\times</math> wall time'', where ''charge cores'' is the number of cores reserved by Moab.
 
--[[User:Stern|stern]]

Latest revision as of 21:55, May 24, 2011

Introduction

In early 2010, we received 200 additional nodes with two E5540 processors each. The processors have 4 cores each, and support hyperthreading, a feature which allows 2 threads per core. This benchmark investigates the benefit of hyperthreading (HT), and suggests optimal values for the nodes and ppn (processors per node) parameters in PBS. The choice is not trivial and involves tradeoffs between various metrics such as execution time or compute-hours charged.

Test description

This test runs /opt/soft/vasp-4.6.35-mkl-8/bin/vasp.

Explanation of data columns

Parameters

cores/app
The number of cores that a single application workload is executed on. Typically, this value is used in studies of the parallel scaling (parallelization efficiency) of an application.
napps
The number of applications run in parallel within the job. Typically, a user will run only one (MPI) application at a time. This benchmark allows running more than one application in parallel, equally subdividing the available cores on each participating node via the OpenMPI --npernode n flag. Motivation: As a hypothesis, I considered it possibly beneficial to run several (related or unrelated) applications on a processor, but not within the same MPI job. Different workloads would minimize the chance of congestion of the processors' pipelining architecture.
gen
The node hardware generation
  • gen1 = Intel Xeon X5355, 2.66GHz, 8 cores per node, 16 GB RAM per node (2 GB/core)
  • gen2 = Intel Xeon E5540, 2.53GHz, 8 cores per node, 24 GB RAM per node (3 GB/core), hyperthreading enabled in hardware (BIOS)
nodes
Number of nodes requested from the queuing system.
ppn
Processors per node requested from the queuing system.

Objective variables

tmax → min
The wallclock runtime of a job in seconds. When multiple applications are run, the longest runtime is used. This is the time metric used for calculating the job charge.
charge → min
The charge for the job, expressed in core-hours per application. Since the present test may use more than one application in a job, the value is normalized by the number of applications run within the job. This reflects the situation that the user will be interested in getting things done for minimal charge.
Note: In the near future, a charge factor will be introduced for gen2 nodes which will scale actual node-hours to effective node hours, in a manner that levels the performance difference between the node generations. At present, no charge factor is applied.
combined objective tmax × charge → min
An empirical objective to minimize both time and charge. Since the charge is proportional to time, the time effectively enters quadratically.
perf → max
A measure for the performance per core. It should be proportional to the FLoating point Operations Per Second (FLOPS) achieved. Since the actual number of operations for the chosen workload is unknown, the value here is calculated using an arbitrary constant to produce convenient values:
   perf := 100000 * napps / tmax / charge_cores
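The objective variables above can be recomputed from the raw columns. The following is a minimal Python sketch, not the script used for the benchmark: the function names are made up for illustration, and the 1/100 scaling of the combined objective is an assumption inferred from the tabulated values rather than something stated on this page.

```python
def charge_per_app(tmax_s, charge_cores, napps=1):
    """Raw charge in core-hours per application."""
    return tmax_s * charge_cores / 3600.0 / napps


def perf_per_core(tmax_s, charge_cores, napps=1):
    """perf := 100000 * napps / tmax / charge_cores (arbitrary units)."""
    return 100000.0 * napps / tmax_s / charge_cores


def combined_objective(tmax_s, charge_cores, napps=1):
    """tmax x charge; the division by 100 is inferred from the tables."""
    return tmax_s * charge_per_app(tmax_s, charge_cores, napps) / 100.0


# run=01 on gen1: tmax=1138.83 s, charge_cores=8, napps=1
print(round(charge_per_app(1138.83, 8), 2))      # 2.53
print(round(perf_per_core(1138.83, 8), 1))       # 11.0
print(round(combined_objective(1138.83, 8), 2))  # 28.82
```

Because the charge is itself proportional to tmax, the combined objective grows with the square of the runtime, which is why it penalizes slow runs more strongly than the charge alone.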

Miscellaneous

run
a sequence number, used to identify the run and its various files and directories.
charge_cores
the number of cores requested from the queuing system, i.e., those blocked from use by other users. The tests within the benchmark ran with the qsub option (lowercase "ell") -l naccesspolicy=singlejob.
HT
hyperthreading is in effect for this run

Results

Data

Files

Gen1 nodes

cores/app  run  nodes  ppn  napps  tmax (s)  charge-cores  charge raw (core-h/app)  perf (ops/s/core)  combined objective  HT
4          1    1      4    1      1138.83   8             2.53                     11.0               28.82
4          3    1      8    2      2488.86   8             2.77                     10.0               68.83
6          11   1      6    1      1566.07   8             3.48                     8.0                54.50
6          13   2      3    1      816.61    16            3.63                     7.7                29.64
6          16   2      6    2      1401.48   16            3.11                     8.9                43.65
8          21   1      8    1      1488.72   8             3.31                     8.4                49.25
8          23   2      4    1      791.37    16            3.52                     7.9                27.83
8          26   2      8    2      1494.81   16            3.32                     8.4                49.65
12         31   2      6    1      838.08    16            3.72                     7.5                31.22
12         33   3      4    1      544.74    24            3.63                     7.6                19.78
12         35   4      3    1      503.05    32            4.47                     6.2                22.49
12         41   3      8    2      1117.20   24            3.72                     7.5                41.60
12         43   4      6    2      838.51    32            3.73                     7.5                31.25
16         51   2      8    1      998.73    16            4.44                     6.3                44.33
16         53   4      4    1      522.50    32            4.64                     6.0                24.27
16         56   4      8    2      1626.13   32            7.23                     3.8                117.52

Gen2 nodes

cores/app  run  nodes  ppn  napps  tmax (s)  charge-cores  charge raw (core-h/app)  perf (ops/s/core)  combined objective  HT
4          2    1      4    1      516.69    8             1.15                     24.2               5.93
4          4    1      8    2      767.26    8             0.85                     32.6               6.54
6          12   1      6    1      470.10    8             1.04                     26.6               4.91
6          14   2      3    1      447.01    16            1.99                     14.0               8.88
6          15   1      12   2      867.12    8             0.96                     28.8               8.35                HT
6          17   2      6    2      587.55    16            1.31                     21.3               7.67
8          22   1      8    1      472.16    8             1.05                     26.5               4.95
8          24   2      4    1      426.55    16            1.90                     14.7               8.09
8          25   1      16   2      927.22    8             1.03                     27.0               9.55                HT
8          27   2      8    2      596.30    16            1.33                     21.0               7.90
12         30   1      12   1      565.44    8             1.26                     22.1               7.10                HT
12         32   2      6    1      330.88    16            1.47                     18.9               4.87
12         34   3      4    1      270.06    24            1.80                     15.4               4.86
12         36   4      3    1      267.81    32            2.38                     11.7               6.38
12         40   2      12   2      582.08    16            1.29                     21.5               7.53                HT
12         42   3      8    2      383.44    24            1.28                     21.7               4.90
12         44   4      6    2      321.16    32            1.43                     19.5               4.58
16         50   1      16   1      592.26    8             1.32                     21.1               7.79                HT
16         52   2      8    1      329.10    16            1.46                     19.0               4.81
16         54   4      4    1      237.59    32            2.11                     13.2               5.02
16         55   2      16   2      601.78    16            1.34                     20.8               8.05                HT
16         57   4      8    2      327.90    32            1.46                     19.1               4.78

Observations

  • For the same core constellation in a workload, gen2 nodes are 2...3 times faster than gen1 nodes – see last page in the analysis (PDF).
  • 4-core runs give the highest numerical throughput in each node type (run=01 to 04).
  • gen2 nodes are fine for VASP with nodes=1:ppn=8; gen1 nodes are not (run=22 vs. 21).

Hyperthreading and node-sharing

  • Naïve application of hyperthreading (HT) is detrimental - it leads to increased runtimes and thus increased charges.
    • run=50 (ppn=16), charge 26% higher than run=22 (ppn=8)
    • run=30 (ppn=12), charge 20% higher than run=22 (ppn=8)
  • However, when non-MPI jobs or two unsynced MPI jobs are running (run=15, 25, 17), HT yields 10...20% lower charge rates. This is a pittance and makes HT largely unattractive. Running unsynced MPI jobs (i.e., sharing nodes) is mildly beneficial in all cases, whether hyperthreading is used or not (runs=04, 15, 25; see pg. 4 in the PDF). The best-case scenario gives a charge savings of 24% (run=15 vs. 30, at nodes=1:ppn=12). This holds even on gen1 nodes (e.g., runs=26 vs. 51, at nodes=2:ppn=8, charge savings = 25%).
  • Running two apps in a single job is mostly not worth the effort of managing them.
  • Sharing nodes confers a mild benefit and should be rewarded in a charge discount.
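The percentages quoted in this list follow directly from the raw charges in the tables. Here is a small Python sketch of that arithmetic; the helper name is illustrative, and the charge values are copied from the gen1 and gen2 tables. A saving shows up as a negative relative change.

```python
def relative_change(value, reference):
    """Percent change of value relative to reference."""
    return 100.0 * (value - reference) / reference


# Raw charges (core-h/app), keyed by run number.
gen2_charge = {15: 0.96, 22: 1.05, 30: 1.26, 50: 1.32}
gen1_charge = {26: 3.32, 51: 4.44}

# Naive hyperthreading versus nodes=1:ppn=8 on gen2
print(round(relative_change(gen2_charge[50], gen2_charge[22])))  # 26
print(round(relative_change(gen2_charge[30], gen2_charge[22])))  # 20

# Two unsynced applications versus one application on the same cores
print(round(relative_change(gen2_charge[15], gen2_charge[30])))  # -24
print(round(relative_change(gen1_charge[26], gen1_charge[51])))  # -25
```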

Recommendations

For the given workload, the following values for optimal performance with respect to the given objective can be recommended:

Node type  Objective
           time → min           charge → min         time × charge → min
gen1       nodes=4:ppn=3:gen1   nodes=1:ppn=4:gen1   nodes=3:ppn=4:gen1
           run=35               run=01               run=33
           tmax=503.05          tmax=1138.83         tmax=544.74
           charge=4.47          charge=2.53          charge=3.63
gen2       nodes=4:ppn=4:gen2   nodes=1:ppn=8:gen2   nodes=2:ppn=8:gen2
           run=54               run=22               run=52
           tmax=237.59          tmax=472.16          tmax=329.10
           charge=2.11          charge=1.05          charge=1.46

The last column is an empirically formulated objective to minimize both time and charge. Compared to the minimum-charge runs (second Objective column, runs 01 and 22, respectively), adding nodes will reduce the runtime and only slightly increase the charge. The fastest runs (first Objective column) use the same number of cores for calculation, but since the number of nodes is higher, so is charge_cores, and thus the job charge is also higher.
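As an illustration of how the gen1 row of the recommendation table can be derived, the Python sketch below picks the minimum of each objective over the gen1 runs with napps=1. The data are copied from the gen1 table; the `Run` type and the `best` helper are hypothetical names introduced only for this example.

```python
from collections import namedtuple

Run = namedtuple("Run", "run nodes ppn tmax charge combined")

# gen1 runs with a single application per job (napps=1)
GEN1 = [
    Run(1, 1, 4, 1138.83, 2.53, 28.82),
    Run(11, 1, 6, 1566.07, 3.48, 54.50),
    Run(13, 2, 3, 816.61, 3.63, 29.64),
    Run(21, 1, 8, 1488.72, 3.31, 49.25),
    Run(23, 2, 4, 791.37, 3.52, 27.83),
    Run(31, 2, 6, 838.08, 3.72, 31.22),
    Run(33, 3, 4, 544.74, 3.63, 19.78),
    Run(35, 4, 3, 503.05, 4.47, 22.49),
    Run(51, 2, 8, 998.73, 4.44, 44.33),
    Run(53, 4, 4, 522.50, 4.64, 24.27),
]


def best(runs, objective, gen):
    """Return the PBS resource string of the run minimizing the objective."""
    r = min(runs, key=lambda run: getattr(run, objective))
    return f"nodes={r.nodes}:ppn={r.ppn}:{gen}"


for objective in ("tmax", "charge", "combined"):
    print(objective, best(GEN1, objective, "gen1"))
```

The three printed strings correspond to runs 35, 01 and 33. On gen2 the analogous selection is less clear-cut, since some runs differ by only about 1% in charge (run 12 at 1.04 versus run 22 at 1.05).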

Update

Adjusting VASP input

The VASP manual recommends setting NPAR to the number of nodes (understood to be the number of cores, since at the time single-core CPUs were all the rage). Nowadays, it would be better to set

NPAR = number of nodes

meaning the number of SMP nodes, which is the current state of the art.

(Credit: M. Chan).
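One possible way to obtain the NPAR value inside a job is to count the distinct hosts assigned to it. The Python sketch below assumes a PBS/Torque-style node file (the file named by the PBS_NODEFILE environment variable), which typically lists each host once per requested processor; the host names and helper names are made up for illustration.

```python
def count_nodes(nodefile_lines):
    """Count distinct host names, ignoring blank lines."""
    return len({line.strip() for line in nodefile_lines if line.strip()})


def npar_line(nodefile_lines):
    """Format the INCAR line NPAR = <number of SMP nodes>."""
    return f"NPAR = {count_nodes(nodefile_lines)}"


# Hypothetical node file contents for nodes=2:ppn=8
example = ["n001\n"] * 8 + ["n002\n"] * 8
print(npar_line(example))  # NPAR = 2
```

In a real job the lines would be read from the file that PBS_NODEFILE points to, and the resulting line appended to the INCAR file before VASP starts.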

Charge rates

Given the above benchmark results, the charge rate on gen1 nodes was set to 0.5 early on (soon after the gen2 nodes were put into production). This means that a task will incur approximately the same charge when run on either gen1 or gen2 nodes. Short tasks even have a certain advantage to run on gen1 nodes, since often the wait time there is lower, while the longer execution time matters less for the user (being small to begin with).

Charges in the benchmark above are raw, without this factor, i.e., charge cores × wall time, where charge cores is the number of cores reserved by Moab.
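To see the effect of the charge rate, the raw charge can be scaled per node generation. This Python sketch uses the gen1 factor of 0.5 stated above and assumes a factor of 1.0 for gen2, which this page does not state explicitly; the names are illustrative.

```python
# gen1 factor from the text; the gen2 factor of 1.0 is an assumption.
CHARGE_FACTOR = {"gen1": 0.5, "gen2": 1.0}


def effective_charge(gen, charge_cores, wall_s):
    """Effective charge in core-hours: factor x charge cores x wall time."""
    return CHARGE_FACTOR[gen] * charge_cores * wall_s / 3600.0


# Same request (nodes=1:ppn=4) on both generations: run=01 vs. run=02
print(round(effective_charge("gen1", 8, 1138.83), 2))  # 1.27
print(round(effective_charge("gen2", 8, 516.69), 2))   # 1.15
```

With the factor applied, the two charges differ by roughly 10%, compared with a factor of more than two in the raw values.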

--stern