HPC/Benchmarks/Generation 1 vs 2
Introduction
Earlier this year, we received 200 additional nodes with E5540 processors. Each node has 8 cores and supports Hyperthreading, a feature that allows 2 threads per core. This benchmark investigates the benefit of hyperthreading (HT) and suggests optimal values for the nodes and processors per node (ppn) parameters in PBS.
Parameters
- napps
- The number of applications run in parallel within the job. Typically, a user will run only one (MPI) application at a time. This benchmark allows running more than one application in parallel, equally subdividing the available cores on each participating node via the OpenMPI --npernode n flag. Motivation: As a hypothesis, I considered it possibly beneficial to run several (related or unrelated) applications on a processor, but not within the same MPI job. Different workloads would reduce the chance of congestion in the processors' pipelining architecture.
- cores/app
- The number of cores that a single application workload is executed on. Typically, this value is used in studies of the parallel scaling (parallelization efficiency) of an application.
- gen
- The node hardware generation
- gen1 = Intel Xeon X5355, 2.66GHz, 8 cores per node, 16 GB RAM per node (2 GB/core)
- gen2 = Intel Xeon E5540, 2.53GHz, 8 cores per node, 24 GB RAM per node (3 GB/core), Hyperthreading enabled
- nodes
- Number of nodes requested from the queueing system.
- ppn
- Number of processors per node requested from the queueing system.
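To make the relationship between napps, nodes, ppn, and cores/app concrete, here is a minimal Python sketch. It assumes that the cores of each node are split evenly, so that each application gets ppn // napps MPI processes per node. The function name plan_job and the placeholder executable ./app are hypothetical; the benchmark's actual driver script is not shown on this page and may differ.

```python
def plan_job(nodes, ppn, napps, exe="./app"):
    """Split the ppn cores of each node evenly among napps applications.

    Assumption: ppn is divisible by napps. The executable name is a placeholder.
    """
    if ppn % napps != 0:
        raise ValueError("ppn must be divisible by napps for an even split")
    npernode = ppn // napps           # MPI processes per node for each application
    cores_per_app = nodes * npernode  # total cores one application runs on
    # One mpirun command line per application, using the --npernode flag named above.
    commands = [["mpirun", "--npernode", str(npernode), exe] for _ in range(napps)]
    return {"npernode": npernode, "cores_per_app": cores_per_app, "commands": commands}


if __name__ == "__main__":
    plan = plan_job(nodes=2, ppn=8, napps=2)
    print("processes per node per app:", plan["npernode"])
    print("cores/app:", plan["cores_per_app"])
    for cmd in plan["commands"]:
        print(" ".join(cmd))
```

With nodes=2, ppn=8, and napps=2, each application would run on 4 cores per node, i.e., 8 cores in total.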
Objective variables
- tmax
- The wallclock runtime of a job in seconds. This is the time metric used for calculating the job charge.
- charge
- The charge for the job, expressed in core-hours per application. Since the present test may use more than one application in a job, the value is scaled by the number of applications run within the job. This reflects the situation that the user will be interested in getting things done for minimal charge.
Note: In the near future, a charge factor will be introduced for gen2 nodes which will scale actual node-hours to effective node hours, in a manner that levels the performance difference between the node generations. At present, no charge factor is applied.
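As a worked illustration of the charge definition, the following Python sketch computes core-hours per application from tmax. It assumes charge = tmax / 3600 × charge_cores / napps, where charge_cores is the number of cores blocked by the job (see Miscellaneous), and that every requested node is charged with all 8 of its cores because the jobs ran with the singlejob access policy. Neither assumption is spelled out in the text, but together they reproduce the charge values listed under Recommendations. The function name job_charge is hypothetical.

```python
CORES_PER_NODE = 8  # both gen1 and gen2 nodes have 8 cores per node


def job_charge(tmax, nodes, napps=1, cores_per_node=CORES_PER_NODE):
    """Charge in core-hours per application.

    Assumption: whole nodes are charged, so charge_cores = nodes * cores_per_node.
    No charge factor is applied.
    """
    charge_cores = nodes * cores_per_node
    return tmax / 3600.0 * charge_cores / napps


if __name__ == "__main__":
    # (run, tmax in seconds, nodes) taken from the Recommendations table
    for run, tmax, nodes in [("22", 472.16, 1), ("52", 329.10, 2), ("35", 503.05, 4)]:
        print(f"run={run} charge={job_charge(tmax, nodes):.2f}")
```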
Miscellaneous
- run
- a sequence number, used to identify the run and its various files and directories.
- charge_cores
- The number of cores requested from the queueing system, i.e., those blocked from use by other users. The tests within the benchmark ran with the qsub option -l naccesspolicy=singlejob (the option letter is a lowercase "ell").
- perf
- A measure for the performance per core. It should be proportional to the FLoating point Operations Per Second (FLOPS) achieved. Since the actual number of operations for the chosen workload is unknown, the value here is calculated using an arbitrary constant to produce convenient values:
perf := 100000 * napps / tmax / charge_cores
- HT
- Indicates that hyperthreading is in effect for this run.
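The perf formula above translates directly into code. The sketch below is a minimal Python version; the function name perf_per_core is hypothetical, and the example call assumes charge_cores = 8 for a single-node job, which is an assumption about how whole nodes are charged rather than something stated in the text.

```python
def perf_per_core(tmax, charge_cores, napps=1):
    """perf := 100000 * napps / tmax / charge_cores (arbitrary scaling constant)."""
    if tmax <= 0 or charge_cores <= 0:
        raise ValueError("tmax and charge_cores must be positive")
    return 100000 * napps / tmax / charge_cores


if __name__ == "__main__":
    # run=22 from the Recommendations table: tmax=472.16 s, one node (assumed 8 charged cores)
    print(f"perf = {perf_per_core(472.16, charge_cores=8):.2f}")
```

Because the constant is arbitrary, only ratios of perf values between runs are meaningful.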
Test description
The test runs /opt/soft/vasp-4.6.35-mkl-8/bin/vasp with the following workload (Credit: D. Shin, Northwestern Univ.).
INCAR
SYSTEM = Al12Mg17
ISTART = 0
ISMEAR = 1
SIGMA = 0.1
ISIF = 3
PREC = HIGH
IBRION = 2
LWAVE = .FALSE.
LCHARG = .FALSE.
LREAL = .TRUE.
ENCUT = 346
KPOINTS
KPOINTS file
0
Monkhorst-Pack
10 10 10
0 0 0
POSCAR
Al12Mg17
1.0000000000
-5.2719000000 5.2719000000 5.2719000000
5.2719000000 -5.2719000000 5.2719000000
5.2719000000 5.2719000000 -5.2719000000
12 17
Direct
0.3679000000 0.3679000000 0.1908000000 Al
0.1771000000 0.1771000000 0.8092000000 Al
0.6321000000 0.8229000000 0.0000000000 Al
0.8229000000 0.6321000000 0.0000000000 Al
0.1908000000 0.3679000000 0.3679000000 Al
0.3679000000 0.1908000000 0.3679000000 Al
0.0000000000 0.8229000000 0.6321000000 Al
0.1771000000 0.8092000000 0.1771000000 Al
0.8092000000 0.1771000000 0.1771000000 Al
0.8229000000 0.0000000000 0.6321000000 Al
0.0000000000 0.6321000000 0.8229000000 Al
0.6321000000 0.0000000000 0.8229000000 Al
0.3975000000 0.3975000000 0.7164000000 Mg
0.6811000000 0.6811000000 0.2836000000 Mg
0.6025000000 0.3189000000 0.0000000000 Mg
0.3189000000 0.6025000000 0.0000000000 Mg
0.7164000000 0.3975000000 0.3975000000 Mg
0.3975000000 0.7164000000 0.3975000000 Mg
0.0000000000 0.3189000000 0.6025000000 Mg
0.6811000000 0.2836000000 0.6811000000 Mg
0.2836000000 0.6811000000 0.6811000000 Mg
0.3189000000 0.0000000000 0.6025000000 Mg
0.0000000000 0.6025000000 0.3189000000 Mg
0.6025000000 0.0000000000 0.3189000000 Mg
0.6480000000 0.6480000000 0.6480000000 Mg
0.0000000000 0.0000000000 0.3520000000 Mg
0.3520000000 0.0000000000 0.0000000000 Mg
0.0000000000 0.3520000000 0.0000000000 Mg
0.0000000000 0.0000000000 0.0000000000 Mg
POTCAR
PAW_GGA Al 05Jan2001
3.00000000000000000
parameters from PSCTR are:
VRHFIN =Al: s2p1
LEXCH = 91
EATOM = 53.6910 eV, 3.9462 Ry
...
PAW_GGA Mg 05Jan2001
2.00000000000000000
parameters from PSCTR are:
VRHFIN =Mg: s2p0
LEXCH = 91
EATOM = 23.0823 eV, 1.6965 Ry
...
(abbreviated)
Results
Raw data
- HPC/Generation-2 nodes/vasp/vasp.lst, grep-able
- media:Vasp.txt, tab-separated CSV
- media:Vasp.pdf, PDF – This is an extensive analysis of all runs, with comparisons made by tmax, charge, and both. The last page contains a direct comparison of the performance of gen1 vs. gen2 nodes.
Observations
- For the same core configuration in a workload, gen2 nodes are 2 to 3 times faster than gen1 nodes.
- 4-core runs give the highest numerical throughput in each node type (run=01 to 04).
- gen2 nodes are fine for VASP with nodes=1:ppn=8; gen1 nodes are not (run=22 vs. 21).
Hyperthreading and node-sharing
- Naïve application of hyperthreading (HT) is detrimental: it leads to increased runtimes and thus increased charges.
- run=50 (ppn=16), charge 26% higher than run=22 (ppn=8)
- run=30 (ppn=12), charge 20% higher than run=22 (ppn=8)
- However, when non-MPI jobs or two unsynced MPI jobs are running (run=15, 25, 17), HT yields 10 to 20% lower charge rates. This is a pittance and makes HT largely unattractive.
- Running unsynced MPI jobs (i.e., sharing nodes) is mildly beneficial in all cases, whether hyperthreading is used or not (runs=04, 15, 25; see pg. 4 in the PDF). The best-case scenario gives a charge savings of 24% (run=15 vs. 30, at nodes=1:ppn=12). This holds even on gen1 nodes (e.g., runs=26 vs. 51, at nodes=2:ppn=8, charge savings = 25%).
- Conclusion:
- Running two apps in a single job is mostly not worth the effort of managing them.
- Sharing nodes confers a mild benefit and should be rewarded in a charge discount.
Recommendations
For the given workload, the following values are recommended for optimal performance with respect to each objective:
Node type | Objective: time → min | Objective: charge → min | Objective: time × charge → min
---|---|---|---
gen1 | nodes=4:ppn=3 (run=35, tmax=503.05, charge=4.47) | nodes=1:ppn=4 (run=01, tmax=1138.83, charge=2.53) | nodes=3:ppn=4 (run=33, tmax=544.74, charge=3.63)
gen2 | nodes=4:ppn=4 (run=54, tmax=237.59, charge=2.11) | nodes=1:ppn=8 (run=22, tmax=472.16, charge=1.05) | nodes=2:ppn=8 (run=52, tmax=329.10, charge=1.46)
The last column is an empirically formulated objective to minimize both time and charge. Compared to the minimum-charge runs (01 and 22, respectively), adding nodes reduces the runtime and only slightly increases the charge. The fastest runs (first Objective column) use the same number of cores for calculation as the runs in the last column, but since the number of nodes is higher, so is charge_cores, and thus the job charge is also higher.
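The combined objective can be checked against the numbers in the table. The Python sketch below takes the objective to be the plain product tmax × charge, as the column header suggests, and picks the run with the smallest product for each node type. It uses only the three runs per generation listed above, so it is a consistency check on the table rather than a search over all benchmark runs. The names RUNS and best_tradeoff are hypothetical.

```python
# (run, PBS request, tmax in seconds, charge in core-hours), copied from the table
RUNS = {
    "gen1": [
        ("35", "nodes=4:ppn=3", 503.05, 4.47),
        ("01", "nodes=1:ppn=4", 1138.83, 2.53),
        ("33", "nodes=3:ppn=4", 544.74, 3.63),
    ],
    "gen2": [
        ("54", "nodes=4:ppn=4", 237.59, 2.11),
        ("22", "nodes=1:ppn=8", 472.16, 1.05),
        ("52", "nodes=2:ppn=8", 329.10, 1.46),
    ],
}


def best_tradeoff(candidates):
    """Return the candidate with the smallest tmax * charge product."""
    return min(candidates, key=lambda r: r[2] * r[3])


if __name__ == "__main__":
    for gen, candidates in RUNS.items():
        run, request, tmax, charge = best_tradeoff(candidates)
        print(f"{gen}: run={run} {request} time*charge={tmax * charge:.1f}")
```

Among the listed runs, the product is smallest for run=33 on gen1 and run=52 on gen2, matching the last column.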
--stern