HPC/Hardware Details
User nodes
Carbon has several major node generations, named genN for short, where N is an integer. Within some generations, nodes differ further in the amount of memory.
Node Types
Node names, types | Node generation | Node extra properties | Node count | Cores per node (max. ppn) | Cores total, by type | Account charge rate | CPU model | CPUs per node | CPU nominal clock (GHz) | Mem. per node (GB) | Mem. per core (GB) | GPU model | GPU per node | VRAM per GPU (GB) | Disk per node (GB) | Year added | Note |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Login | | | | | | | | | | | | | | | | | |
login5…6 | gen7a | gpus=2 | 2 | 16 | 32 | 1.0 | Xeon Silver 4125 | 2 | 2.50 | 192 | 12 | Tesla V100 | 2 | 32 | 250 | 2019 | |
Compute | | | | | | | | | | | | | | | | | |
n421…460 | gen5 | | 40 | 16 | 640 | 1.0 | Xeon E5-2650 v4 | 2 | 2.10 | 128 | 8 | | | | 250 | 2017 | |
n461…476 | gen6 | | 16 | 16 | 256 | 1.0 | Xeon Silver 4110 | 2 | 2.10 | 96 | 6 | | | | 1000 | 2018 | |
n477…512 | gen6 | | 36 | 16 | 576 | 1.0 | Xeon Silver 4110 | 2 | 2.10 | 192 | 12 | | | | 1000 | 2018 | |
n513…534 | gen7 | gpus=2 | 22 | 32 | 704 | 1.5 | Xeon Gold 6226R | 2 | 2.90 | 192 | 6 | Tesla V100S | 2 | 32 | 250 | 2020 | |
n541…580 | gen8 | | 20 | 64 | 2560 | 1.0 | Xeon Gold 6430 | 2 | 2.10 | 1024 | 16 | | | | 420 | 2024 | |
Total | | | 134 | | 4736 | | | | | | | | 48 | | | | |
- Compute time is charged as the product of cores reserved × wallclock time × charge rate; a worked example follows these notes. The charge rate accommodates nominal differences in CPU speed.
- gen7 nodes have two GPUs each; GPU usage is currently not "charged" (accounted for) separately.
- Virtual memory usage on a node may reach up to about 2 × the physical memory size. Processes running under PBS may allocate that much vmem, but cannot practically use it all because of the limited swap space size and bandwidth. If a node actively uses swap for more than a few minutes (which drastically slows down compute performance), the job is automatically killed.
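For example (a hypothetical job, using the numbers from the table above): reserving one full gen7 node (32 cores, charge rate 1.5) for 10 hours of wallclock time is charged 32 × 10 × 1.5 = 480 core-hours, while the same 10 hours on one gen5 node (16 cores, charge rate 1.0) is charged 16 × 10 × 1.0 = 160 core-hours.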
Major CPU flags
CPU capabilities grow with each node generation, and executables can be compiled to leverage specific CPU capabilities. Jobs using such executables must use the qsub option -l nodes=...:genX to be directed to nodes that have the required capability; a minimal job script sketch is shown below.
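A minimal sketch of such a job script, assuming Torque/PBS-style directives as used with qsub on Carbon; the core count, walltime, and executable name (my_avx512_binary) are placeholders:

    #!/bin/bash
    #PBS -l nodes=1:ppn=64:gen8       # request one full gen8 node; substitute gen5…gen8 as needed
    #PBS -l walltime=04:00:00         # placeholder walltime
    cd $PBS_O_WORKDIR
    # Confirm the node advertises the expected CPU flag (see the table below), e.g. avx512f:
    grep -m1 -o 'avx512f' /proc/cpuinfo
    ./my_avx512_binary                # hypothetical executable built with AVX-512 instructions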
Flag name | gen5 | gen6 | gen7 | gen8 |
---|---|---|---|---|
cat_l2 cdp_l2 cldemote gfni movdir64b movdiri pconfig sha_ni umip vaes vpclmulqdq | – | – | – | x |
avx512_bitalg | – | – | – | x |
avx512_vbmi2 | – | – | – | x |
avx512_vpopcntdq | – | – | – | x |
avx512ifma | – | – | – | x |
avx512vbmi | – | – | – | x |
avx512_vnni | – | – | x | x |
mpx | – | x | x | – |
avx512bw | – | x | x | x |
avx512cd | – | x | x | x |
avx512dq | – | x | x | x |
avx512f | – | x | x | x |
avx512vl | – | x | x | x |
art clwb flush_l1d ibpb mba md_clear ospke pku ssbd stibp tsc_deadline_timer xgetbv1 xsavec | – | x | x | x |
3dnowprefetch abm acpi aes aperfmperf apic arat arch_perfmon bmi1 bmi2 bts cat_l3 cdp_l3 cmov constant_tsc cqm cqm_llc cqm_mbm_local cqm_mbm_total cqm_occup_llc cx16 cx8 dca de ds_cpl dtes64 dtherm dts eagerfpu epb ept erms est f16c flexpriority fpu fsgsbase fxsr hle ht ida invpcid invpcid_single lahf_lm lm mca mce mmx monitor movbe msr mtrr nonstop_tsc nopl nx pae pat pbe pcid pclmulqdq pdcm pdpe1gb pebs pge pln pni popcnt pse pse36 pts rdrand rdseed rdt_a rdtscp rep_good rsb_ctxsw rtm sdbg sep smap smep smx ss sse sse2 sse4_1 ssse3 syscall tm tm2 tpr_shadow tsc tsc_adjust vme vmx vnmi vpid x2apic xsave xsaveopt xtopology xtpr | x | x | x | x |
avx | x | x | x | x |
avx2 | x | x | x | x |
fma | x | x | x | x |
adx | x | x | x | x |
sse4_2 | x | x | x | x |
Storage
- Lustre parallel file system for /home and /sandbox (see the usage sketch after this list)
- ≈600 TB total
- local disk per compute node, 160–250 GB
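A brief sketch of how to check the space described above, assuming the standard Lustre client tools (lfs) are available and using /tmp as a stand-in for the node-local disk path:

    # Free space on the Lustre file system serving /home and /sandbox
    lfs df -h /home
    # Per-user usage and quota, if quotas are enabled on this system
    lfs quota -u $USER /home
    # Node-local disk (path is an assumption; check your job's scratch location)
    df -h /tmp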
Interconnect
- InfiniBand – used for parallel communication and storage
- Gigabit Ethernet – used for general node access and management
Power
- Power consumption at typical load: ≈125 kW