ARROW Cluster


Introduction To ARROW

TRACC previously combined the original hardware from the Phoenix and Zephyr clusters into the ARROW cluster. To avoid load-balancing problems, the different types of hardware nodes on the ARROW cluster are partitioned into separate queues. When new hardware is installed to expand cluster resources, it is made available via a new queue. The documentation at Using the Cluster describes procedures for using ARROW. ARROW is arranged so that a single set of 4 login nodes, a single file system, and a single user home directory serve all of the nodes in all of the queues.

ARROW currently uses the Torque scheduler together with the Maui job manager. Jobs can be submitted from any of the login nodes. Once a job starts, the nodes assigned to that user are generally accessible via additional ssh sessions from any other node in the system. For example, if you submit a job from login1, you can go to login2 and open an ssh session to the node that the scheduler handed out. Think of it as a global resource allocation that gives you exclusive access to a few nodes to use as you wish until the job time expires. This is true for both interactive and batch sessions. Any node assigned to a user is fully allocated to that user, and a job can only request whole nodes; no other users can share a node that has been handed out. The queues are used to obtain specific CPU types for a job.
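As a rough sketch of the Torque workflow (the node counts, walltime, node name n042, and script name myjob.sh below are placeholders, not site recommendations):

  # submit a batch job asking for 2 full nodes with 16 cores each (Torque syntax)
  qsub -q batch -l nodes=2:ppn=16 -l walltime=04:00:00 myjob.sh

  # list the nodes assigned to the running job
  qstat -n <jobid>

  # from any login node, ssh into one of the assigned nodes while the job is running
  ssh n042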

We are now adding a second scheduler to ARROW called PBS (OpenPBS or PBS Pro). We thus have two schedulers, Torque and PBS, and every node on the cluster is operated under one of them. For example, the cluster now has two GPU nodes: neither GPU node can be scheduled through Torque, but both can be scheduled through PBS.
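A minimal sketch of requesting one of the GPU nodes under PBS; the queue name gpu, the core count, and the script name are assumptions, and the site's actual queue names should be checked with qstat -Q:

  # request one GPU and a few cores on a GPU node via the PBS scheduler
  qsub -q gpu -l select=1:ncpus=8:ngpus=1 -l walltime=02:00:00 gpu_job.sh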

The “module” command changes your environment variables to enable or disable certain software collections or components, including Python, OpenFOAM, StarCCM+, and many more. The module command is now configured so that the PBS scheduler can be used on the current production system via: module load PBS/PBS. PBS is the new scheduler that is installed but not activated by default. It behaves roughly like the old scheduler, but is much more modern and flexible. Because it is installed but not activated, you have to switch your environment off the standard scheduler and enable OpenPBS.
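A sketch of switching a login session to the PBS environment with the module command (only the PBS/PBS module name comes from this page; anything else shown by module avail depends on the site configuration):

  # see which software collections are available
  module avail

  # point this session's scheduler commands (qsub, qstat, ...) at PBS
  module load PBS/PBS

  # confirm what is loaded in the current environment
  module list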

The syntax of the scheduler commands is a bit different and allows for more concise job specifications. By default, users can ask for specific queues, a number of nodes, a number of cores per node, a number of GPUs, and an amount of memory. If space is available on a node already partially in use by somebody else, your job may start on that node and share it. Each job's resources (wall time and so on) are independent of the other jobs on the node, so think of each job as executing within the resources it specified at submission time. It is not very different from the current system, but much more fine grained. The operating system uses cgroups to keep jobs from affecting each other: if your job starts using more cores than it asked for, the operating system packs all of its load onto the allocated cores, so your job cannot eat into the resources of other jobs running on the same machine. If your job starts to use more memory than was allocated to it, the job is killed. There may be a few ways users could interfere with other users' jobs, but they would have to be deliberately hacking things.
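A sketch of a PBS job script using the select syntax; the queue name, resource amounts, walltime, and the solver and input file names are placeholders:

  #!/bin/bash
  # example PBS job script: one chunk of 16 cores and 32 GB of memory,
  # possibly sharing a node with other jobs if space allows
  #PBS -q batch
  #PBS -l select=1:ncpus=16:mem=32gb
  #PBS -l walltime=08:00:00
  cd $PBS_O_WORKDIR
  ./my_solver input.dat

Such a script would be submitted with qsub myjob.pbs; a larger request would simply increase the chunk count or the ncpus and mem values in the select statement.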

The environment modules software switches as well when the new scheduler is loaded. Not all software is available under it yet; at the moment it is mostly being used for compatibility testing.

ARROW Queues

There are currently several queues available, some with restrictions on who can use them, as described below. Also be aware that in some queues all nodes have the same characteristics (RAM, etc.), while other queues contain nodes with different characteristics; jobs using those queues must specify the names of the nodes to be used. A sketch of selecting a queue and specific nodes follows the list.

  • batch queue (default queue)
    • 95 nodes numbered n005 through n099
    • 2 x AMD Opteron 6276
    • 16 floating point cores per node
    • 32GB of RAM per node
    • available for general use
  • batch128 queue
    • 2 nodes numbered n001 and n002
    • Same design as batch queue
    • 128GB of RAM per node
    • available for general use
  • batch64 queue
    • 2 nodes numbered n003 and n004
    • Same design as batch queue
    • 64GB of RAM per node
    • available for general use
  • nhtsa queue
    • 12 nodes numbered p001 through p012
    • 2 x Intel Xeon E5-2690 v4
    • 28 floating point cores per node
    • 64GB of RAM per node
    • only available to the NHTSA project
  • arrow queue
    • 15 nodes numbered a001 through a015
    • 1 x AMD EPYC 7702P
    • 64 floating point cores per node
    • 256GB of RAM per node, 512GB on nodes a001 through a003
    • available for general use
  • extra queue
    • 12 nodes numbered a016 through a027
    • 1 x AMD EPYC 7713P
    • 64 floating point cores per node
    • 256GB of RAM per node, 512GB on nodes a018 through a022
    • available for general use
    • note: this queue will likely be merged into the arrow queue in the future
  • virtual queue
    • 5 nodes numbered v001 through v005
    • Mostly for internal testing and validation; can be used as 2-core machines with 32GB of RAM
    • Minimal virtual hardware, not capable of running engineering applications
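As an example of directing a job to a particular queue, and to specific nodes in a queue whose nodes differ, here is a sketch in Torque syntax followed by PBS syntax; the walltime and script name are placeholders, and the queue and node names are taken from the list above:

  # Torque: run on two named nodes of the arrow queue, using all 64 cores of each
  qsub -q arrow -l nodes=a001:ppn=64+a002:ppn=64 -l walltime=06:00:00 job.sh

  # PBS: request one full node of the arrow queue by host name (a001 is a 512GB node)
  qsub -q arrow -l select=1:ncpus=64:host=a001 -l walltime=06:00:00 job.sh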