Job Submission and Monitoring: Difference between revisions
| (42 intermediate revisions by the same user not shown) | |||
| Line 4: | Line 4: | ||
==Resource Summary View== | ==Resource Summary View== | ||
To get started, users can query the overall status of resources on the cluster. The "qsum" script will list all queues and nodes, as well as how many are offline, down, free, or assigned to users. This is a script developed by our team, and may need to be updated if something goes wrong. Please contact us if you experience any problems. | To get started, users can query the overall status of resources on the cluster. The '''"qsum"''' script will list all queues and nodes, as well as how many are offline, down, free, or assigned to users. This is a script developed by our team, and may need to be updated if something goes wrong. Please contact us if you experience any problems. | ||
Each queue groups a number of nodes together based on their hardware and software configurations. Nodes can be part of more than one queue, and there are other complex details that we are ignoring here for the purpose of keeping it simple. | Each queue groups a number of nodes together based on their hardware and software configurations. Nodes can be part of more than one queue, and there are other complex details that we are ignoring here for the purpose of keeping it simple. | ||
| Line 12: | Line 12: | ||
Here is a very brief summary of what each of the queues is, and how to use them efficiently: | Here is a very brief summary of what each of the queues is, and how to use them efficiently: | ||
; a4000: This is a queue that has three 16 | ; a4000: This is a queue that has three machines with 16 cores each; each of these machines is furthermore equipped with three A4000 GPUs. That makes a total of 9 A4000 GPUs available to users. Neither the GPUs nor the processors are particularly powerful these days, but they make for a good software development environment. The machines have 512GB of memory, which makes them a good platform for experimenting with GPU capabilities. | ||
; a6000: This is a queue that has only one 64 | ; a6000: This is a queue that has only one single machine with 64 cores total, and is equipped with four A6000 GPUs. The system can be upgraded to 8 A6000 GPUs if needed. This is a decent GPU machine that can take a solid workload. The machine has 750GB of memory, which makes for a good production platform. | ||
; amd16: This is a queue with many of our older AMD-based 16-core machines, each of which | ; amd16: This is a queue with many of our older AMD-based 16-core machines, each of which is equipped with 32GB of memory. While individual machines are a bit outdated, they are all interconnected with Infiniband and can provide a solid production workload in multi-node jobs over MPI without blocking the more current (and thus expensive) systems. | ||
; epyc1/epyc2: These are 2 separate queues with slightly different performance characteristics. Each of the groups is interconnected with Infiniband to provide a platform for large and demanding software packages, such as LS-Dyna and StarCCM+. They have between | ; epyc1/epyc2: These are 2 separate queues with slightly different performance characteristics. Each of the groups is interconnected with Infiniband to provide a platform for large and demanding software packages, such as LS-Dyna and StarCCM+. They have between 256GB and 512GB of memory. Because licenses for these software packages (LS-Dyna and StarCCM+) are very expensive, these applications should use the two epyc queues for making optimum use of limited core licenses available to each package. | ||
; xeon28: This is a set of intermediate machines with 28 cores and 64GB of memory. They can be used for a variety of purposes, including MPI jobs and single node application software. | ; xeon28: This is a set of intermediate machines with 28 cores and 64GB of memory. They can be used for a variety of purposes, including MPI jobs and single node application software. | ||
; virtual: This is a set of nodes without MPI capabilities. They are virtual machines with 32GB each. They can be used for higher demand applications that would interfere | ; virtual: This is a set of nodes without MPI capabilities. They are virtual machines with 32GB each. They can be used for higher demand interactive applications that would interfere otherwise with other users on the login node machines. A user would submit interactive jobs to individual virtual machines and thus avoid any significant load on login nodes. | ||
===The Queue Summary Script (qsum)=== | ===The Queue Summary Script (qsum)=== | ||
| Line 30: | Line 30: | ||
AVAILABLE (1): lambda01 | AVAILABLE (1): lambda01 | ||
=============== amd16 ========================================================== | =============== amd16 ========================================================== | ||
Queue: "amd16" / nodes: | Queue: "amd16" / nodes: 98 / down: 28 / offline: 5 / busy: 4 / available: 61 | ||
DOWN (28): n012, n017, n018, n022, n024, n030, n034, n037, n038, n040 | |||
n041, n042, n056, n057, n060, n062, n064, n069, n071, n072 | |||
AVAILABLE ( | n076, n079, n080, n086, n087, n089, n093, n094 | ||
OFFLINE (5): n001, n002, n003, n004, n010 | |||
ac.xliu1 (4): n026, n027, n028, n029 | |||
AVAILABLE (61): n005, n006, n007, n008, n009, n011, n013, n014, n015, n016 | |||
n019, n020, n021, n023, n025, n031, n032, n033, n035, n036 | |||
n039, n043, n044, n045, n046, n047, n048, n049, n050, n051 | |||
n052, n053, n054, n055, n058, n059, n061, n063, n065, n066 | |||
n067, n068, n070, n073, n074, n075, n077, n078, n081, n082 | |||
n083, n084, n085, n088, n090, n091, n095, n096, n097, n098 | |||
n099 | |||
=============== epyc1 ========================================================== | =============== epyc1 ========================================================== | ||
Queue: "epyc1" / nodes: | Queue: "epyc1" / nodes: 27 / down: 1 / offline: 0 / busy: 10 / available: 16 | ||
DOWN (1): a011 | |||
ac.ge.w* (2): a004, a027 | |||
ac.razm* (2): a015, a016 | |||
ac.vpra* (4): a005, a006, a008, a009 | |||
msitek (2): a001, a002 | |||
AVAILABLE (16): a003, a007, a010, a012, a013, a014, a017, a018, a019, a020 | |||
a021, a022, a023, a024, a025, a026 | |||
=============== epyc2 ========================================================== | =============== epyc2 ========================================================== | ||
Queue: "epyc2" / nodes: 20 / down: 0 / offline: 0 / busy: | Queue: "epyc2" / nodes: 20 / down: 0 / offline: 0 / busy: 16 / available: 4 | ||
cbojano* (16): a028, a030, a031, a032, a033, a034, a035, a036, a037, a038 | |||
a039, a040, a044, a045, a046, a047 | |||
AVAILABLE (4): a029, a041, a042, a043 | |||
=============== virtual ======================================================== | =============== virtual ======================================================== | ||
Queue: "virtual" / nodes: 6 / down: 0 / offline: 0 / busy: 0 / available: 6 | Queue: "virtual" / nodes: 6 / down: 0 / offline: 0 / busy: 0 / available: 6 | ||
| Line 59: | Line 71: | ||
===qstat=== | ===qstat=== | ||
To find out out about the status of all running jobs on the cluster you can use the "qstat" command. Here is an example: | To find out about the status of all running jobs on the cluster, you can use the '''"qstat"''' command. Here is an example: | ||
<PRE> | <PRE> | ||
| Line 78: | Line 90: | ||
</PRE> | </PRE> | ||
The first column shows the job id, a unique identifier for all jobs ever submitted to the cluster. This job id is important when killing jobs, or for other actions you may need to take. | The first column shows the '''job id''', a unique identifier for all jobs ever submitted to the cluster. This job id is important when killing jobs, or for other actions you may need to take. | ||
The next column shows the name of the job script. If the column shows STDIN, it means that this is an interactive job where a user can enter commands in a terminal window. This is particularly useful for model and software development task where the application has to be started and killed repeatedly. | The next column shows the name of the job script. If the column shows '''STDIN''', it means that this is an '''interactive job''' where a user can enter commands in a terminal window. This is particularly useful for model and software development tasks where the application has to be started and killed repeatedly. | ||
The owner of the job is shown next. These are the user names of the various people using the cluster. | The owner of the job is shown next. These are the user names of the various people using the cluster. | ||
| Line 112: | Line 124: | ||
In this table, you can see how many nodes and cores are being used by each job. For example, job 3029 of the user "ley" shows that it is running on 2 nodes using a total of 128 cores. In addition to the elapsed time, the table also show the reserved time for this job. This allow you to estimate when a job will be definitely finalized (or killed by the system if still running). | In this table, you can see how many nodes and cores are being used by each job. For example, job 3029 of the user "ley" shows that it is running on 2 nodes using a total of 128 cores. In addition to the elapsed time, the table also shows the reserved time for this job. This allows you to estimate the latest time by which a job will have finished (or been killed by the system if still running). | ||
The last column (without a header) is written because the option "- | The last column (without a header) is written because the option '''"-an1"''' was used. This is useful for learning which nodes are used by each job. | ||
===qstat -q=== | ===qstat -q=== | ||
To learn more about the queues on the cluster, the 'q' option turns out to be useful. | To learn more about the queues on the cluster, the '''"-q"''' option turns out to be useful. | ||
<PRE> | <PRE> | ||
| Line 136: | Line 148: | ||
</PRE> | </PRE> | ||
For each queue, some basic values are displayed. The first three queues listed above have a default memory allocation as shown, and the column "Node" indicates the maximum number of nodes that can be asked for at job submission time. For example, you can request just one node for a job from the first three queues (because these are queues where MPI makes no sense). The "xeon28" queue also for a maximum of 4 nodes per MPI job. The "amd16" queue has a maximum of 8 nodes per job, and the "epyc1" and "epyc2" queues have maxima of two nodes per job. These limitations can be changed by the administrator as needed. As shown above, | For each queue, some basic values are displayed. The first three queues listed above have a default memory allocation as shown, and the column '''"Node"''' indicates the maximum number of nodes that can be asked for at job submission time. For example, you can request just one node for a job from the first three queues (because these are queues where MPI makes no sense). The '''"xeon28"''' queue allows for a maximum of 4 nodes per MPI job. The '''"amd16"''' queue has a maximum of 8 nodes per job, and the '''"epyc1"''' and '''"epyc2"''' queues have maxima of two nodes per job. These limitations can be changed by the administrator as needed. As shown above, this will prevent inefficient resource requests. | ||
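The node limits described above can be checked before a job is submitted. The following is a minimal sketch, not part of the cluster tooling: the function names are made up for illustration, the limits are copied from the paragraph above, and the fallback of one node for any other queue is an assumption. The authoritative values are always the ones printed by "qstat -q".

```shell
#!/bin/sh
# Per-queue node limits as quoted in the text above.
# ASSUMPTION: every queue not listed here is treated as single-node.
max_nodes_for_queue() {
    case "$1" in
        xeon28)      echo 4 ;;
        amd16)       echo 8 ;;
        epyc1|epyc2) echo 2 ;;
        *)           echo 1 ;;
    esac
}

# Report whether a requested node count fits within the queue limit.
# Returns 0 if the request fits, 1 otherwise.
check_node_request() {
    queue=$1
    nodes=$2
    limit=$(max_nodes_for_queue "$queue")
    if [ "$nodes" -le "$limit" ]; then
        echo "ok: $nodes node(s) on $queue (limit $limit)"
    else
        echo "too many: $nodes node(s) on $queue (limit $limit)"
        return 1
    fi
}

check_node_request amd16 8
check_node_request epyc1 4 || true
```

Since the administrator can change these limits, a hard-coded table like this one goes stale; it is only meant to show how the "Node" column is used.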
===qstat -f=== | ===qstat -f=== | ||
The command '''"qstat -f -F json 3029"''' retrieves extremely detailed stats on the running job 3029. The result can be returned in JSON format to be ready for further processing (shown below). | |||
<PRE> | <PRE> | ||
$ qstat -f | $ qstat -f -F json 3029 | ||
{ | { | ||
"timestamp":1763705353, | "timestamp":1763705353, | ||
| Line 217: | Line 229: | ||
===Manual pages for qstat=== | ===Manual pages for qstat=== | ||
To learn more about the "qstat" command, you can use the command "man qstat", which will print a lot of detailed information about the capabilities of this command. | To learn more about the '''"qstat"''' command, you can use the command '''"man qstat"''', which will print a lot of detailed information about the capabilities of this command. | ||
<PRE> | <PRE> | ||
| Line 240: | Line 252: | ||
==Job Submission Basics== | ==Job Submission Basics== | ||
Jobs are submitted into the system using the "qsub" application. This application can take many different options and allows for a lot of different resource requests to tell the cluster what to do. We are running OpenPBS 23.06.06 as our scheduler. Here is a link to the User's Manual (of PBS PRO) if you want to explore gory details and capabilities. The User's Guide has about 240 pages, the Reference Guide has 500 pages, and the Big Book has 2500 pages. So there is a lot of information available. I also added job submission info for the LCRC cluster. | Jobs are submitted into the system using the '''"qsub"''' application. This application can take many different options and allows for a lot of different resource requests to tell the cluster what to do. We are running '''OpenPBS 23.06.06''' as our job scheduler. Here is a link to the User's Manual (of PBS PRO) if you want to explore gory details and capabilities. The User's Guide has about 240 pages, the Reference Guide has 500 pages, and the Big Book has 2500 pages. So there is a lot of information available. I also added job submission info for the LCRC cluster. | ||
* [https://argonne-lcrc.github.io/user-guides/running-jobs-at-lcrc/pbs-pro/ Argonne's LCRC pages on job submissions on their clusters] | * [https://argonne-lcrc.github.io/user-guides/running-jobs-at-lcrc/pbs-pro/ Argonne's LCRC pages on job submissions on their clusters] | ||
| Line 252: | Line 264: | ||
The big book is what I had to use when configuring OpenPBS earlier this year. This includes all the tricky details needed to make the system work smoothly for us. It's a bit scary to look at a PDF file that is 2500 pages long, but that is nothing compared to the StarCCM+ manuals. | The big book is what I had to use when configuring OpenPBS earlier this year. This includes all the tricky details needed to make the system work smoothly for us. It's a bit scary to look at a PDF file that is 2500 pages long, but that is nothing compared to the StarCCM+ manuals. | ||
<BLOCKQUOTE> | |||
[[File:Attention.jpg|25px]] '''IMPORTANT NOTE:''' ''The following sections are important to understand. They explain how jobs are submitted and then scheduled for execution based on the resources available and the specific needs of the user.'' | |||
</BLOCKQUOTE> | |||
The following sections explain the various tasks you may want to submit for execution, ordered from simple to complex. | |||
* General Batch Jobs | |||
** Requesting a Single Node for a Job | |||
** Requesting Multiple Nodes for a Job | |||
* Embedded Job Resource Requests | |||
* Interactive Jobs | |||
* Interactive Jobs with X-Windows GUI Applications | |||
* Running Multiple Jobs on Single Nodes | |||
* Running Jobs using GPUs | |||
===General Batch Jobs=== | ===General Batch Jobs=== | ||
Let's get started with a very basic usage of the system. Let's assume you have a simple application, and you want to execute it on a cluster node. Let's also assume that this is a very simple application, one that runs on one or | Let's get started with a very basic usage of the system. Let's assume you have a simple application, and you want to execute it on a cluster node. Let's also assume that this is a very simple application, one that runs on one or a few cores, doesn't require any keyboard interaction with the user, doesn't need the user to see what's typically written to the screen, and writes its output to files. In this case, we can submit this application as a batch job, which will place it into an execution queue and process it as soon as a node becomes available. | ||
If the application requires more cores than a single node can provide, we can run the application over Infiniband with MPI message passing. In this case, we need to understand the concept of MPI applications a bit better. In both cases, we get started by creating a folder on the file system. Naming conventions are important so that you can distinguish the jobs by folder. | If the application requires more cores than a single node can provide, we can run the application over Infiniband with MPI message passing. In this case, we need to understand the concept of MPI applications a bit better. In both cases, we get started by creating a folder on the file system. Naming conventions are important so that you can distinguish the jobs by folder name. | ||
For both of the above scenarios, you would typically create a Bash shell script, and then submit the script into one of the queues for eventual execution. | For both of the above scenarios, you would typically create a Bash shell script, and then submit the script into one of the queues for eventual execution. | ||
====Requesting a Single Node for a | ====Requesting a Single Node for a Job==== | ||
Let's try something rather trivial to get used to the concept. Create yourself a folder, for example "myjobfolder". Within that folder, create a job submission script. That script can have any name, but something short and simple may be best. Let's assume you create a file called "cluster.job". The file doesn't have to have that extension. Any file name will do. But using the same filename for all of your jobs helps finding your way around the many files that will be created over time. The "cluster.job" file should look something like this: | Let's try something rather trivial to get used to the concept. Create yourself a folder, for example '''"myjobfolder"'''. Within that folder, create a job submission script. That script can have any name, but something short and simple may be best. Let's assume you create a file called '''"cluster.job"'''. The file doesn't have to have that extension. Any file name will do. But using the same filename for all of your jobs helps finding your way around the many files that will be created over time. The '''"cluster.job"''' file should look something like this: | ||
<PRE> | <PRE> | ||
| Line 309: | Line 309: | ||
</PRE> | </PRE> | ||
This can be submitted without detailed resource specifications: | This can be submitted without detailed resource specifications (except for the walltime, which is by default 0:00:00): | ||
<PRE> | <PRE> | ||
$ qsub -q virtual -l walltime=1: | $ qsub -q virtual -l walltime=1:00:00 cluster.job | ||
3072.pbs | 3072.pbs | ||
</PRE> | </PRE> | ||
Wait a little, | Wait a little, then check the status of running jobs: | ||
<PRE> | <PRE> | ||
| Line 335: | Line 335: | ||
</PRE> | </PRE> | ||
In this particular example, we are sending this job to the queue "virtual". This queue, by default, allocates 30GB of memory to the job, and runs | In this particular example, we are sending this job to the '''queue "virtual"'''. This queue, by default, allocates 30GB of memory to the job, and runs on 1 node with 4 cores. This is sufficient capacity to run quite a workload. When submitting a job to a single node, '''reasonable maximum allocations are automatically assigned''', and the user doesn't have to worry about running out of memory or how many cores he will be using. | ||
The only required argument is the "walltime" argument. By default, the job will quit as soon as it is submitted. This indicates to the user that he forgot to provide the "walltime" argument. | The only required argument is the '''"walltime"''' argument. By default, the job will quit as soon as it is submitted. This indicates to the user that he forgot to provide the '''"walltime"''' argument. | ||
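Because a job without a walltime quits immediately, it can help to build the submission command through a small wrapper that refuses an empty request. The sketch below only prints the command instead of running it. The function names are hypothetical, and restricting the walltime to whole hours is a simplification chosen for this example.

```shell
#!/bin/sh
# Convert a number of whole hours into the H:MM:SS form used by walltime.
# Rejects zero or negative values, which would let the job quit at once.
walltime_from_hours() {
    hours=$1
    if [ "$hours" -le 0 ]; then
        echo "walltime must be at least one hour in this sketch" >&2
        return 1
    fi
    echo "${hours}:00:00"
}

# Print (not execute) the qsub command for a queue, a walltime, and a script.
print_qsub() {
    queue=$1
    hours=$2
    script=$3
    wt=$(walltime_from_hours "$hours") || return 1
    echo "qsub -q $queue -l walltime=$wt $script"
}

print_qsub virtual 1 cluster.job
```

Piping the printed line to "sh" on a login node would perform the actual submission.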
When the job disappears from the job list, it is done. At this point, you will find the file "info.log" in your job folder. | When the job disappears from the job list, it is done. At this point, you will find the file "info.log" in your job folder. | ||
| Line 346: | Line 346: | ||
</PRE> | </PRE> | ||
====Requesting Multiple Nodes for a | ====Requesting Multiple Nodes for a Job==== | ||
To run jobs on multiple nodes, you will be likely executing jobs using MPI, the message passing interface. This | To run jobs on multiple nodes, you will likely be '''executing jobs using MPI''', the message passing interface. This establishes high-speed low-latency interconnections between the cores on one machine and the cores on the other machines. Data transfer does not require involvement of the cores themselves. Instead, the cores tell the InfiniBand interconnect (and cores on the same node through shared memory) to transfer the data through RDMA, remote direct memory access. The cores don't need to spend CPU cycles on copying data, but rather simply access the data once it has been copied by the Infiniband fabric. This makes for extremely efficient remote memory access, and message passing is used to coordinate data transfer between the cores no matter where they are located on any of the nodes. | ||
On our cluster, MPI-aware applications like OpenFOAM, StarCCM+, and LS-Dyna can be loaded as modules, which then automatically selects the most appropriate MPI library to use. The software applications have been tested to ensure that they work out-of-the box if a user selects any specific version of any of the applications. | On our cluster, MPI-aware applications like '''OpenFOAM''', '''StarCCM+''', and '''LS-Dyna''' can be loaded as modules, which then automatically select the most appropriate MPI library to use. The software applications have been tested to ensure that they work out of the box when a user selects any specific version of any of the applications. | ||
The following is a very trivial example for the MPI execution of a very simple executable, with one copy running on each core of the nodes allocated to the job. It doesn't perform any real work and just wastes resources for a short time, but it illustrates how execution on the cores of various nodes works. | The following is a very trivial example for the MPI execution of a very simple executable, with one copy running on each core of the nodes allocated to the job. It doesn't perform any real work and just wastes resources for a short time, but it illustrates how execution on the cores of various nodes works. | ||
Like in the previous section, we start with a simple job script that we submit to an appropriate queues. In this case, we pick a queue that has machines with Infiniband interfaces supporting efficient communications. Let's assume we edit a file with the name "parallel.job" like this: | As in the previous section, we start with a simple job script that we submit to an appropriate queue. In this case, we pick a queue that has machines with Infiniband interfaces supporting efficient communications. Let's assume we edit a file with the name '''"parallel.job"''' like this: | ||
<PRE> | <PRE> | ||
| Line 386: | Line 386: | ||
</PRE> | </PRE> | ||
A good queue to test | A good queue to test scripts is the '''"xeon28"''' queue. In this queue, we have two 14-core Xeon processors per node, which means that each node has 28 actual cores. We do not consider hyperthreading when doing parallel computing. With two nodes, 56 actual cores are being used here. The job submission will look like this: | ||
<PRE> | <PRE> | ||
| Line 416: | Line 416: | ||
</PRE> | </PRE> | ||
In this simple example, the lines look all the same. Upon close examination through, you can find slightly different values for some of the lines. Some lines say that the machine is up for 23 days and 9:28, while others say 23 days and 9:53. Because all 28 cores of a node would see the same uptime of the server, half of the entries show one time stamp, and the other 28 cores show the other one. That demonstrates | In this simple example, the lines all look the same. Upon close examination, though, you can find slightly different values for some of the lines. Some lines say that the machine has been up for 23 days and 9:28, while others say 23 days and 9:53. Because all 28 cores of a node see the same uptime of the server, half of the entries show one value, and the other 28 show the other one. That demonstrates that the 56 processes have been running independently on 2 nodes. | ||
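The relationship between nodes, cores per node, and the number of MPI processes can be written down as a small helper. This is a sketch under the assumption that one MPI process is placed on every core; the function names are invented for illustration, and the "select" syntax follows the form used in the submission examples on this page.

```shell
#!/bin/sh
# Build a PBS select statement that asks for whole nodes,
# with one MPI process per core (ncpus equals mpiprocs).
pbs_select() {
    nodes=$1
    cores_per_node=$2
    echo "select=${nodes}:ncpus=${cores_per_node}:mpiprocs=${cores_per_node}"
}

# Total number of MPI processes that such a request will start.
total_ranks() {
    echo $(( $1 * $2 ))
}

pbs_select 2 28    # two xeon28 nodes with 28 cores each
total_ranks 2 28   # 56 processes in total
```

For two nodes of the "xeon28" queue, the helper yields a request for 2 x 28 cores, which matches the 56 lines of output discussed above.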
===Embedded Job Resource Requests=== | ===Embedded Job Resource Requests=== | ||
The job script can be modified to embed the resource requests in form of a series of #PBS statements at the beginning of the script file. This is a very common practice use at many HPC installations and job submission engines. Let's go back to the previous example where we run the script on two nodes in parallel. That is the "parallel.job" script file again: | The job script can be modified to embed the resource requests in the form of a series of '''#PBS''' statements at the beginning of the script file. This is a very common practice used at many HPC installations and job submission engines. Let's go back to the previous example where we run the script on two nodes in parallel. That is the '''"parallel.job"''' script file again: | ||
<PRE> | <PRE> | ||
| Line 480: | Line 480: | ||
</PRE> | </PRE> | ||
I leave this to you as an exercise to figure out what the various options mean and how to specify them. There are many more, all documented in the manual. Most of them are not terribly relevant and can be omitted. | I leave this to you as an exercise to figure out what the various options mean and how to specify them. There are many more, all documented in the PBS PRO manual (see above). Most of them are not terribly relevant and can be omitted. | ||
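As a starting point for that exercise, the sketch below prints a job script whose resource requests are embedded as #PBS lines, so that the same values do not have to be typed on the qsub command line. The function name is hypothetical, and the script body (changing to the submission directory and running "uptime" through "mpirun") is only a placeholder assumed for this example, not the actual content of "parallel.job".

```shell
#!/bin/sh
# Print a job script with embedded resource requests.
# Arguments: queue name, select statement, walltime.
# ASSUMPTION: the body below is a placeholder workload.
make_job_script() {
    queue=$1
    select=$2
    walltime=$3
    cat <<EOF
#!/bin/bash
#PBS -q $queue
#PBS -l $select
#PBS -l walltime=$walltime
# PBS_O_WORKDIR is set by PBS to the directory the job was submitted from
cd "\$PBS_O_WORKDIR"
mpirun uptime
EOF
}

make_job_script xeon28 select=2:ncpus=28:mpiprocs=28 1:00:00
```

Redirecting the output into a file and submitting it with a plain "qsub" followed by the file name should then be equivalent to passing the same options on the command line.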
===Interactive Jobs=== | ===Interactive Jobs=== | ||
On ARROW, we don't restrict queues to be used only in batch mode. While '''batch mode''' is efficient for lining up a lot of work to be executed one after the other, ARROW has been designed to '''allow efficient model and software development in interactive mode'''. We have always ensured to have more computers than minimally needed to make it possible to dedicate resources to developers as needed, even if that means wasted CPU cycles. At times, we may ask you to limit the number of interactive jobs so that a large batch workload can be processed efficiently. This happens from time to time, and we have our users coordinate this with each other. | |||
Let's assume that you are developing an MPI application, or you are working on a complex OpenFOAM model that requires to start parallel processes over and over again just to find a bug and | Let's assume that you are developing an MPI application, or you are working on a complex '''OpenFOAM''' model that requires starting parallel processes over and over again just to find a bug and then fix it quickly. To do that, you can '''request an interactive job''' by adding the '''"-I"''' option to the job submission command (this is an uppercase I). Let's go back to the parallel multi-node example from above: | ||
<PRE> | <PRE> | ||
| Line 501: | Line 501: | ||
</PRE> | </PRE> | ||
When | When running interactive jobs with the '''"-I"''' parameter, we don't specify a job script at the end of the submission command. The interactive job will instead start (once the nodes are available) in interactive mode, meaning that the terminal session changes over from being a series of commands executed on the login server to being a series of commands executed on the first node of the group of nodes that are allocated to the job. At this point, you can change to the desired working directory, but what you do with the allocated resources is entirely up to you. You can load modules, including MPI libraries, and then issue the commands for your application interactively and see how they execute. If you start an '''"mpirun"''', the cores on your allocated secondary node will work as expected. There is no difference to batch mode, other than you having the ability to execute lines of commands at will. | ||
===Interactive Jobs with X-Windows GUI Applications=== | ===Interactive Jobs with X-Windows GUI Applications=== | ||
Interactive use can go further than that. With some of our software applications, like StarCCM+, you can run an interactive GUI application where you control the computational work from within | Interactive use can go further than that. With some of our software applications, like '''StarCCM+''', you can run an '''interactive GUI application''' where you control the computational work from within the application's GUI. Within the GUI, you can control execution of the numerical solver that runs on as many cores as you requested, while being able to reconfigure the case through the GUI as well. Furthermore, you can visualize developing results on the fly by creating complex plots and visualizations. | ||
All that is need is an option "-X" being used as part of the job submission, like this: | All that is needed is the option '''"-X"''' being used as part of the job submission, like this: | ||
<PRE> | <PRE> | ||
| Line 525: | Line 525: | ||
===Running Multiple Jobs on Single Nodes=== | ===Running Multiple Jobs on Single Nodes=== | ||
A feature that is new on ARROW is the ability to run multiple jobs on a single node. Let's assume that you are performing a sensitivity analysis on an existing model, and the model is simple enough to return results within a reasonable time on just a few cores of a higher end machine. Our high end machines have 64 cores, so lets assume you have an LS-Dyna model that runs well on 8 cores and doesn't use a whole lot of memory. In this case, you can submit individual jobs that request simply 8 cores and a fraction of the available memory, and all execute independently from each other. Each job is fit into a slot where available. It is not very different from using whole nodes for everything. The important consideration is that each job is cleanly constrained into it's allotted resources using the CGROUPS functionality of modern operating systems. Because an abusive user cannot use more cores or more memory than allocated to his job, other users can safely run smaller jobs on the same node. | A feature that is new on ARROW is the ability to run multiple jobs on a single node. Let's assume that you are performing a sensitivity analysis on an existing model, and the model is simple enough to return results within a reasonable time on just a few cores of a higher end machine (maybe you are running SMP versions of '''LS-Dyna'''). Our high end machines have 64 cores, so let's assume you have an '''LS-Dyna''' model that runs well on 8 cores and doesn't use a whole lot of memory. In this case, you can submit individual jobs that each request just 8 cores and a fraction of the memory available on the node, and all jobs execute independently from each other. Each job is fit into a slot where available. It is not very different from using whole nodes for everything. The important consideration is that each job is cleanly constrained to its allotted resources using the '''CGROUPS''' functionality of modern operating systems. 
Because an abusive user cannot use more cores or more memory than allocated to his job, other users can safely run smaller jobs on the same node. | ||
Lets assume that we have a number of smaller jobs that we want to run on a single node in the | Let's assume that we have a number of smaller jobs that we want to run on a single node in the '''"xeon28"''' queue. Each job would be submitted by using reduced resources that allow for sharing but that guarantee that the jobs will be run successfully. In this case, you can '''submit many jobs''' in the following manner (with a job script for the small jobs, each of which can '''request varying resources''' if needed - some may want to run on 5 cores, others on 3): | ||
<PRE> | <PRE> | ||
| Line 540: | Line 540: | ||
sleep 300 | sleep 300 | ||
# | # | ||
</ | </PRE> | ||
Now we submit a variety of these jobs (11 total) to the | Now we submit a variety of these jobs (11 total in this example) to the '''"xeon28"''' queue for execution (note that the first few jobs request different amounts of memory and core counts): | ||
<PRE> | <PRE> | ||
| Line 556: | Line 556: | ||
qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=2:mpiprocs=2:mem=2G small.job | qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=2:mpiprocs=2:mem=2G small.job | ||
qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=2:mpiprocs=2:mem=2G small.job | qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=2:mpiprocs=2:mem=2G small.job | ||
/PRE> | </PRE> | ||
They are now running in the order of submission, allocated on as few nodes in the "xeon28" queue as necessary. Only 2 nodes are being loaded quite heavily, and 4 more cores are in use on a third node. | They are now running in the order of submission, allocated on as few nodes in the "xeon28" queue as necessary. Only 2 nodes are being loaded quite heavily, and 4 more cores are in use on a third node. | ||
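A batch like the one above can also be generated with a small loop. The following is only a sketch: the function name "submit_small" and the DRY_RUN switch are made up for illustration, while the qsub options are the ones used in the example above. With DRY_RUN=1 (the default in this sketch) the commands are only printed, which makes it easy to review them before submitting anything.

```shell
#!/bin/bash
# Sketch: build qsub command lines for a batch of small jobs that share nodes.
# DRY_RUN and submit_small are hypothetical helpers, not cluster tooling.
# DRY_RUN=1 (default) only prints the commands; DRY_RUN=0 submits them.
DRY_RUN=${DRY_RUN:-1}

submit_small() {
    # $1 = number of cores, $2 = memory (e.g. 2G)
    local ncpus=$1 mem=$2
    local cmd="qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=${ncpus}:mpiprocs=${ncpus}:mem=${mem} small.job"
    if [ "$DRY_RUN" = "1" ]; then
        echo "$cmd"
    else
        $cmd
    fi
}

# Three identical 2-core jobs, as in the listing above.
for i in 1 2 3; do
    submit_small 2 2G
done
```

Jobs with different core counts or memory sizes can be added with further calls, for example "submit_small 5 4G".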
| Line 585: | Line 585: | ||
===Running Jobs using GPUs=== | ===Running Jobs using GPUs=== | ||
The principle of running multiple jobs on a single node becomes particularly important when using servers equipped with GPUs for ML/AI applications. The cluster doesn't have a whole lot of GPUs at this point. We have three machines with three A4000 GPUs, a total of 9 GPUs. Then we have a much more powerful single machine with our A6000 GPUs. | The principle of running multiple jobs on a single node becomes particularly important when using servers equipped with '''GPUs''' for '''ML/AI''' applications. The cluster doesn't have a whole lot of '''GPUs''' at this point. We have three machines with three '''A4000''' GPUs, a '''total of 9 A4000 GPUs'''. Then we have a much more powerful single machine with our '''four A6000 GPUs'''. | ||
Using multiple GPUs in a single application is still something where the software has to be designed with hardware in mind. GPUs have several methods of communicating with each other, e.g. very fast NVLINK between pairs of GPUs, GPUs being directly connected to one of the two CPUs in the system and thus being able to communicate faster, and | Using multiple GPUs in a single application is still something where the software has to be designed with hardware in mind. GPUs have several methods of communicating with each other, e.g. very fast '''NVLINK''' between pairs of GPUs, GPUs being directly connected to one of the two CPUs in the system and thus being able to communicate faster, and GPUs that have to jump between processors when communicating, and then the whole issue of having to go possibly through PCIe bridges. | ||
On our system, we are providing the ability to work mostly with individual GPUs. Users can also reserve entire nodes and develop or run applications that are adapted to that hardware, including several GPUs installed on that node. One thing we do not provide is the ability of GPU to GPU communication between nodes. Thus, a job cannot request more than one GPU node at a time. | On our system, we are providing the ability to '''work mostly with individual GPUs'''. Users can also reserve entire nodes and develop or run applications that are adapted to that hardware, including several GPUs installed on that node. One thing we do not provide is the ability of GPU to GPU communication between nodes. Thus, a job cannot request more than one GPU node at a time. | ||
<PRE> | <PRE> | ||
| Line 595: | Line 595: | ||
</PRE> | </PRE> | ||
With these specifications, three single GPU jobs can run on a single server. Each job sees only one of the | With these specifications, three single GPU jobs can run on a single server. Each job sees only the GPU reserved for it. | ||
To run a massive GPU job on 64 cores with 4 A6000 GPUs, submit the job like this: | To run a massive GPU job on 64 cores with 4 '''A6000 GPUs''', submit the job like this: | ||
<PRE> | <PRE> | ||
| Line 637: | Line 637: | ||
===Currently Available LS-Dyna Versions=== | ===Currently Available LS-Dyna Versions=== | ||
The following is a list of '''LS-Dyna versions''' available on '''ARROW''' after the latest reconfiguration of the system. | The following is a list of '''LS-Dyna versions''' available on '''ARROW''' after the latest reconfiguration of the system. As per LSTC/ANSYS, '''versions before 14.0.0 are not necessarily fully supported any longer''' because they are supposedly not compatible with modern operating systems and cannot be made to work reliably. We have tested the listed older versions of LS-Dyna and they have passed basic tests. They may not behave exactly as they did on the old CentOS 7 operating system, and time will show whether they can still be used or whether you will need to update your models and use a fully supported version. | ||
All versions are loaded using the '''"module load"''' command. Versions can be listed with the '''"module avail ls-dyna"''' command. To load one of the modules, use the following syntax: | All versions are loaded using the '''"module load"''' command. Versions can be listed with the '''"module avail ls-dyna"''' command. To load one of the modules, use the following syntax: | ||
| Line 668: | Line 668: | ||
$ module avail ls-dyna | $ module avail ls-dyna | ||
--------------------------------------------- /shared/apps/modulefiles --------------------------------------------- | --------------------------------------------- /shared/apps/modulefiles --------------------------------------------- | ||
ls-dyna/ | ls-dyna/09.3.1/1-d8-ifort131 ls-dyna/12.2.1/mpi-f4-ifort160-sse2 ls-dyna/14.2.0/mpi-f4-ifort190-avx512 | ||
ls-dyna/ | ls-dyna/09.3.1/1-f4-ifort131 ls-dyna/12.2.2/1-d8-ifort160-sse2 ls-dyna/14.2.0/mpi-f4-ifort190-sse2 | ||
ls-dyna/ | ls-dyna/09.3.1/mpi-d8-ifort131-avx2 ls-dyna/12.2.2/1-f4-ifort160-sse2 ls-dyna/15.0.2/1-d8-ifort190-sse2 | ||
ls-dyna/ | ls-dyna/09.3.1/mpi-d8-ifort131-avx512 ls-dyna/12.2.2/mpi-d8-aocc400-avx2 ls-dyna/15.0.2/1-f4-ifort190-sse2 | ||
ls-dyna/ | ls-dyna/09.3.1/mpi-f4-ifort131-avx2 ls-dyna/12.2.2/mpi-d8-ifort160-avx2 ls-dyna/15.0.2/mpi-d8-aocc400-avx2 | ||
ls-dyna/ | ls-dyna/09.3.1/mpi-f4-ifort131-avx512 ls-dyna/12.2.2/mpi-d8-ifort160-sse2 ls-dyna/15.0.2/mpi-d8-ifort190-avx2 | ||
ls-dyna/ | ls-dyna/10.2.0/1-d8-ifort160 ls-dyna/12.2.2/mpi-f4-aocc400-avx2 ls-dyna/15.0.2/mpi-d8-ifort190-avx512 | ||
ls-dyna/ | ls-dyna/10.2.0/1-f4-ifort160 ls-dyna/12.2.2/mpi-f4-ifort160-avx2 ls-dyna/15.0.2/mpi-d8-ifort190-sse2 | ||
ls-dyna/ | ls-dyna/11.0.0/1-d8-ifort160 ls-dyna/12.2.2/mpi-f4-ifort160-sse2 ls-dyna/15.0.2/mpi-f4-aocc400-avx2 | ||
ls-dyna/ | ls-dyna/11.0.0/1-f4-ifort160 ls-dyna/13.0.0/1-d8-ifort190 ls-dyna/15.0.2/mpi-f4-ifort190-avx2 | ||
ls-dyna/ | ls-dyna/11.1.0/1-d8-ifort160-sse2 ls-dyna/13.0.0/1-f4-ifort190 ls-dyna/15.0.2/mpi-f4-ifort190-avx512 | ||
ls-dyna/ | ls-dyna/11.1.0/1-f4-ifort160-sse2 ls-dyna/13.0.0/mpi-d8-ifort190-avx2 ls-dyna/15.0.2/mpi-f4-ifort190-sse2 | ||
ls-dyna/ | ls-dyna/11.2.0/1-d8-ifort160 ls-dyna/13.0.0/mpi-d8-ifort190-sse2 ls-dyna/16.0.0/1-d8-aocc420-avx2 | ||
ls-dyna/ | ls-dyna/11.2.0/1-f4-ifort160 ls-dyna/13.0.0/mpi-f4-ifort190-avx2 ls-dyna/16.0.0/1-d8-aocc420-avx512 | ||
ls-dyna/ | ls-dyna/11.2.0/mpi-f4-ifort160-avx2 ls-dyna/13.0.0/mpi-f4-ifort190-sse2 ls-dyna/16.0.0/1-d8-ifort190-sse2 | ||
ls-dyna/ | ls-dyna/11.2.0/mpi-f4-ifort160-sse2 ls-dyna/13.1.0/mpi-d8-aocc310-avx2 ls-dyna/16.0.0/1-f4-aocc420-avx2 | ||
ls-dyna/ | ls-dyna/11.2.1/1-d8-ifort160 ls-dyna/13.1.0/mpi-d8-ifort190-avx2 ls-dyna/16.0.0/1-f4-aocc420-avx512 | ||
ls-dyna/ | ls-dyna/11.2.1/1-f4-ifort160 ls-dyna/13.1.0/mpi-d8-ifort190-sse2 ls-dyna/16.0.0/1-f4-ifort190-sse2 | ||
ls-dyna/ | ls-dyna/11.2.1/mpi-d8-ifort160-avx2 ls-dyna/13.1.0/mpi-f4-aocc310-avx2 ls-dyna/16.0.0/mpi-d8-aocc420-avx2 | ||
ls-dyna/ | ls-dyna/11.2.1/mpi-d8-ifort160-sse2 ls-dyna/13.1.0/mpi-f4-ifort190-avx2 ls-dyna/16.0.0/mpi-d8-aocc420-avx512 | ||
ls-dyna/ | ls-dyna/11.2.1/mpi-f4-ifort160-avx2 ls-dyna/13.1.0/mpi-f4-ifort190-sse2 ls-dyna/16.0.0/mpi-d8-ifort190-avx2 | ||
ls-dyna/ | ls-dyna/11.2.1/mpi-f4-ifort160-sse2 ls-dyna/13.1.1/mpi-d8-ifort190-avx2 ls-dyna/16.0.0/mpi-d8-ifort190-avx512 | ||
ls-dyna/ | ls-dyna/11.2.2/mpi-d8-ifort160-avx2 ls-dyna/13.1.1/mpi-d8-ifort190-sse2 ls-dyna/16.0.0/mpi-d8-ifort190-sse2 | ||
ls-dyna/ | ls-dyna/11.2.2/mpi-d8-ifort160-sse2 ls-dyna/13.1.1/mpi-f4-ifort190-avx2 ls-dyna/16.0.0/mpi-f4-aocc420-avx2 | ||
ls-dyna/ | ls-dyna/11.2.2/mpi-f4-ifort160-avx2 ls-dyna/13.1.1/mpi-f4-ifort190-sse2 ls-dyna/16.0.0/mpi-f4-aocc420-avx512 | ||
ls-dyna/ | ls-dyna/11.2.2/mpi-f4-ifort160-sse2 ls-dyna/14.0.0/1-d8-ifort190 ls-dyna/16.0.0/mpi-f4-ifort190-avx2 | ||
ls-dyna/12. | ls-dyna/12.1.0/1-d8-ifort160 ls-dyna/14.0.0/1-f4-ifort190 ls-dyna/16.0.0/mpi-f4-ifort190-avx512 | ||
ls-dyna/12. | ls-dyna/12.1.0/1-f4-aocc310 ls-dyna/14.0.0/mpi-d8-aocc310-avx2 ls-dyna/16.0.0/mpi-f4-ifort190-sse2 | ||
ls-dyna/12. | ls-dyna/12.1.0/1-f4-ifort160 ls-dyna/14.0.0/mpi-d8-ifort190-avx2 ls-dyna/16.1.0/mpi-d8-aocc420-avx2 | ||
ls-dyna/12. | ls-dyna/12.1.0/mpi-d8-aocc310-avx2 ls-dyna/14.0.0/mpi-d8-ifort190-sse2 ls-dyna/16.1.0/mpi-d8-aocc420-avx512 | ||
ls-dyna/12. | ls-dyna/12.1.0/mpi-d8-ifort160-avx2 ls-dyna/14.0.0/mpi-f4-ifort190-avx2 ls-dyna/16.1.0/mpi-d8-ifort190-avx2 | ||
ls-dyna/12. | ls-dyna/12.1.0/mpi-d8-ifort160-sse2 ls-dyna/14.0.0/mpi-f4-ifort190-sse2 ls-dyna/16.1.0/mpi-d8-ifort190-avx512 | ||
ls-dyna/12. | ls-dyna/12.1.0/mpi-f4-aocc310-avx2 ls-dyna/14.1.0/1-d8-ifort190-sse2 ls-dyna/16.1.0/mpi-d8-ifort190-sse2 | ||
ls-dyna/12. | ls-dyna/12.1.0/mpi-f4-ifort160-avx2 ls-dyna/14.1.0/1-f4-ifort190-sse2 ls-dyna/16.1.0/mpi-f4-aocc420-avx2 | ||
ls-dyna/12. | ls-dyna/12.1.0/mpi-f4-ifort160-sse2 ls-dyna/14.1.0/mpi-d8-aocc400-avx2 ls-dyna/16.1.0/mpi-f4-aocc420-avx512 | ||
</PRE> | </PRE> | ||
| Line 715: | Line 708: | ||
<BLOCKQUOTE> | <BLOCKQUOTE> | ||
[[File:Attention.jpg|25px]] '''IMPORTANT NOTE:''' ''The job/queue manager can track the number of LS-Dyna licenses to some degree. If all '''LS-Dyna users''' cooperate and use a script like the one shown below when submitting their jobs, the total number of concurrent '''LS-Dyna licenses''' will be tracked by the job manager correctly. That means that users can submit any number of LS-Dyna jobs, and jobs will only start when a sufficient number of licenses is available. This is managed by the '''"dynalic"''' resource at the end of the select statement. In this example, a 2-node job on 64-core nodes will need a total of '''"dynalic=128"''' licenses. This accounting breaks down when users don't use the '''"dynalic=XXX"''' statement, or when they don't calculate the number of licenses correctly. In that case, LS-Dyna jobs of all users are subject to sudden failure when LS-Dyna licenses run out. Please understand the importance of this specific setting in your job.'' | [[File:Attention.jpg|25px]] '''IMPORTANT NOTE:''' The job/queue manager can now track the number of LS-Dyna licenses given out to individual | ||
jobs. At submission time, it is not possible to know what software a user may run. But by adding the clause "-l dynalic" at submission time, | |||
the queue manager can calculate the total number of cores required and keep track of LS-Dyna licenses used by the job. When loading a version of LS-Dyna, a check will be performed, and LS-Dyna will be prevented from running if the "-l dynalic" clause was not used when submitting the job.'' | |||
<!-- | |||
''The job/queue manager can track the number of LS-Dyna licenses to some degree. If all '''LS-Dyna users''' cooperate and use a script like the one shown below when submitting their jobs, the total number of concurrent '''LS-Dyna licenses''' will be tracked by the job manager correctly. That means that users can submit any number of LS-Dyna jobs, and jobs will only start when a sufficient number of licenses is available. This is managed by the '''"dynalic"''' resource at the end of the select statement. In this example, a 2-node job on 64-core nodes will need a total of '''"dynalic=128"''' licenses. This accounting breaks down when users don't use the '''"dynalic=XXX"''' statement, or when they don't calculate the number of licenses correctly. In that case, LS-Dyna jobs of all users are subject to sudden failure when LS-Dyna licenses run out. Please understand the importance of this specific setting in your job.'' | |||
--> | |||
</BLOCKQUOTE> | </BLOCKQUOTE> | ||
Furthermore, careful consideration should be given with regards to choice of resources for an '''LS-Dyna job'''. With 64 cores available on a single node in the '''"epyc1"''' and '''"epyc2"''' queues, it may be counterproductive to run a job on two nodes instead of a single node. Users should run their jobs with different numbers of nodes and determine whether performance increases. It may well decrease when running a job on two or more nodes. The outcome of such tests will tell what the best allocation of resources will be. | Furthermore, careful consideration should be given with regards to choice of resources for an '''LS-Dyna job'''. With 64 cores available on a single node in the '''"epyc1"''' and '''"epyc2"''' queues, it may be counterproductive to run a job on two nodes instead of a single node. Users should run their jobs with different numbers of nodes and determine whether performance increases. It may well decrease when running a job on two or more nodes. The outcome of such tests will tell what the best allocation of resources will be. | ||
Most | Most users use a job script like the following. All methods for job submission in the previous chapters apply as well, so there is a lot of flexibility: | ||
<PRE> | <PRE> | ||
| Line 727: | Line 726: | ||
#PBS -q epyc1 | #PBS -q epyc1 | ||
#PBS -l walltime=12:0:0 | #PBS -l walltime=12:0:0 | ||
#PBS -l select=2:ncpus=64:mpiprocs=64:mem=225G | #PBS -l select=2:ncpus=64:mpiprocs=64:mem=225G,dynalic | ||
#PBS -N JobName | #PBS -N JobName | ||
#PBS -e log.error | #PBS -e log.error | ||
| Line 737: | Line 736: | ||
module load dynamore/current | module load dynamore/current | ||
# | # | ||
mpirun ls-dyna i=main.k memory1=300m memory2=100m | mpirun ls-dyna i=main.k memory1=300m memory2=100m > dyna.log | ||
# | # | ||
# when using the Dynamore tools, you can start something like this at the end | # when using the Dynamore tools, you can start something like this at the end | ||
DM.plotcprs.lnx -merge | DM.plotcprs.lnx -merge >> dyna.log | ||
# | # | ||
</PRE> | </PRE> | ||
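Because LS-Dyna is prevented from running when the dynalic clause is missing, it can save a queue wait to check the job script before submitting it. The following is a minimal sketch of such a check. The function name "has_dynalic" is made up for illustration and is not part of the cluster tooling; it only looks for the word "dynalic" on a "#PBS -l" line, as in the script above.

```shell
#!/bin/bash
# Hypothetical pre-submission check (not part of the cluster tooling):
# succeed only if a "#PBS -l" line in the job script mentions dynalic.
has_dynalic() {
    grep -Eq '^#PBS[[:space:]]+-l[[:space:]].*dynalic' "$1"
}

# Demo with a throwaway script that mirrors the select line shown above.
demo=$(mktemp)
printf '%s\n' '#!/bin/bash' '#PBS -q epyc1' \
    '#PBS -l select=2:ncpus=64:mpiprocs=64:mem=225G,dynalic' > "$demo"

if has_dynalic "$demo"; then
    echo "dynalic requested"
else
    echo "dynalic missing: LS-Dyna would refuse to start"
fi
rm -f "$demo"
```

The check does not cover resources passed on the qsub command line, so a job submitted with the clause given there would be reported as missing it.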
===LSTC Tools: LS-OPT and LS-PREPOST=== | ===LSTC Tools: LS-OPT and LS-PREPOST=== | ||
For the new Rocky 9 cluster, I have not looked deeply into the ls-opt and ls-prepost packages that were installed. I noticed though that the LSTC server provided access to much newer versions of both software packages. If you would like to learn more or have a specific version in mind, I can easily download and install it for you. | |||
<PRE> | |||
$ module avail ls-opt | |||
----------------------------------------------- /shared/apps/modulefiles ------------------------------------------------ | |||
ls-opt/5.1.1 ls-opt/6.0.0 ls-opt/7.0.0 ls-opt/7.0.2 ls-opt/2022R2 | |||
ls-opt/5.2.1 ls-opt/6.1.0 ls-opt/7.0.1 ls-opt/2022R1 ls-opt/2023R1 | |||
</PRE> | |||
To start the software, type: | |||
lsopt | |||
<PRE> | |||
$ module avail ls-prepost | |||
----------------------------------------------- /shared/apps/modulefiles ------------------------------------------------ | |||
ls-prepost/4.5.10 ls-prepost/4.8.13 ls-prepost/4.8.30 ls-prepost/4.9.16 ls-prepost/4.10.7 | |||
</PRE> | |||
To start the software, type: | |||
lsprepost | |||
===Dynamore Software=== | ===Dynamore Software=== | ||
The Dynamore tools are available as a module: | |||
module load dynamore/current | |||
We typically acquire a yearly license for the tools as we purchase licenses for LS-Dyna. | |||
===Vendor License File Installation=== | ===Vendor License File Installation=== | ||
If you would like us to install a vendor license for LS-Dyna models, please contact us for the required information. We can send you the general LS-Dyna license file to show the host ids for the license server. Using that information, your vendor should be able to create a vendor license file. Please send that file to us by email or by other means. | |||
==StarCCM+ on the ARROW Cluster== | ==StarCCM+ on the ARROW Cluster== | ||
===Currently Available StarCCM+ Versions=== | ===Currently Available StarCCM+ Versions=== | ||
As of late 2025, we have the following versions of '''StarCCM+''' available on the cluster: | |||
<PRE> | |||
$ module avail starccm | |||
---------------------------- /shared/apps/modulefiles ---------------------------- | |||
starccm/15.02.007-R8 starccm/16.06.008-R8 starccm/18.06.006-R8 | |||
starccm/15.02.009-R8 starccm/17.02.007-R8 starccm/19.02.009-R8 | |||
starccm/15.04.008-R8 starccm/17.02.008-R8 starccm/20.04.007-R8 | |||
starccm/15.06.008-R8 starccm/17.04.007-R8 starccm/20.06.007-R8 | |||
starccm/16.02.008-R8 starccm/17.06.007-R8 | |||
starccm/16.04.007-R8 starccm/18.04.008-R8 | |||
</PRE> | |||
If using a '''single node''' for StarCCM+, job submission (for an interactive job) is simple and will use appropriate default settings: | |||
qsub -I -X -q epyc1 -l walltime=20:00:00 | |||
StarCCM+ can make use of the job scheduler attributes by automatically obtaining the number of cores and other resources from OpenPBS. In this case, the default number of cores and mpi processes for StarCCM+ are both 64 when using the epyc1 queue. So you can start your StarCCM+ run with: | |||
module load starccm/15.02.007-R8 (or any other version) | |||
starccm+ -bs pbs | |||
In this case, there is no need for StarCCM+ to be told to run the case in parallel with the selected number of cores/mpiprocs. | |||
This can get a bit more complex when running on multiple nodes or when requesting high memory nodes. In that case you would use job submission parameters as shown below: | |||
qsub -I -X -q epyc1 -l walltime=20:00:00,select=2:ncpus=64:mpiprocs=64:mem=500GB | |||
For this request to be satisfied, two nodes with these attributes must exist. We have multiple nodes with 512GB in the epyc1 queue, meaning that this job will run on two machines that have at least the required amount of memory installed (on each node). The job will be queued until two such machines are available. If no machines with these resources exist, the job will stay in the queue forever. Therefore, you have to craft the submission string carefully. | |||
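Since the memory value applies to each node, it helps to spell out what a select string actually asks for before submitting it. The following sketch does that for the request above. The function name "summarize_select" is made up for illustration, and the parsing assumes the simple "N:ncpus=..:mpiprocs=..:mem=.." form used on this page.

```shell
#!/bin/bash
# Sketch: summarize a PBS select string before submitting.
# Assumes the simple form N:ncpus=..:mpiprocs=..:mem=.. used on this page.
summarize_select() {
    local select=$1
    local chunks=${select%%:*}          # leading number = nodes requested
    local ncpus mem
    ncpus=$(echo "$select" | tr ':' '\n' | sed -n 's/^ncpus=//p')
    mem=$(echo "$select" | tr ':' '\n' | sed -n 's/^mem=//p')
    echo "nodes=$chunks total_cores=$((chunks * ncpus)) mem_per_node=$mem"
}

summarize_select "2:ncpus=64:mpiprocs=64:mem=500GB"
# prints: nodes=2 total_cores=128 mem_per_node=500GB
```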
To accommodate high memory jobs, the nodes have been assigned priorities for assignment. Low memory jobs have the highest priority and will be assigned to nodes that can accommodate the request. High memory nodes have the lowest priority, meaning that they are the last ones given out to users. This makes it more likely that a high memory job can be run soon when the cluster is moderately loaded with jobs. | |||
StarCCM+ will always use the Intel MPI fabric. Other MPI versions do not work, even when selected on the command line. | |||
==OpenFOAM on the ARROW Cluster== | ==OpenFOAM on the ARROW Cluster== | ||
===Currently Available OpenFOAM Versions=== | ===Currently Available OpenFOAM Versions=== | ||
As of late 2025, we have the following versions of OpenFOAM available on the cluster: | |||
<PRE> | |||
$ module avail openfoam | |||
------------ /shared/apps/modulefiles ------------ | |||
openfoam/9 openfoam/13 openfoam/v2312 | |||
openfoam/10 openfoam/13-amd openfoam/v2406 | |||
openfoam/11 openfoam/v2212 | |||
openfoam/12 openfoam/v2306 | |||
</PRE> | |||
Contact us if you encounter problems; there can be various reasons why OpenFOAM may have trouble on certain hardware or when compiling dynamic code. When loading OpenFOAM modules, a number of dependencies will be automatically loaded for you, and you don't have to load those yourself. For example: | |||
<PRE> | |||
$ module load openfoam/13 | |||
Loading openfoam/13 | |||
Loading requirement: intel/2024.2.0/mpi/2021.13 gcc/gcc-12.1.0 | |||
$ module list | |||
Currently Loaded Modulefiles: | |||
1) intel/2024.2.0/mpi/2021.13 2) gcc/gcc-12.1.0 3) openfoam/13 | |||
</PRE> | |||
In this case, OpenFOAM 13 loads the Intel 2024 MPI module, and loads the GCC compiler 12.1. OpenFOAM was compiled from source, and has been compiled specifically with that compiler and MPI version, so it makes little sense to use other compilers or MPI libraries. | |||
Note: We have found a problem with running the Intel 2024 MPI library in the amd64 queue. Therefore, we have a modified module that uses the Intel 2022 library (I know -- Intel 2022 gives you the 2021 MPI libraries, but that is the way Intel distributes this software): | |||
<PRE> | |||
$ module load openfoam/13-amd | |||
Loading mpi version 2021.7.0 | |||
Loading openfoam/13-amd | |||
Loading requirement: intel/2022.2.0/mpi/2021.7.0 gcc/gcc-12.1.0 | |||
$ module list | |||
Currently Loaded Modulefiles: | |||
1) intel/2022.2.0/mpi/2021.7.0 2) gcc/gcc-12.1.0 3) openfoam/13-amd | |||
</PRE> | |||
If you are compiling OpenFOAM yourself, the modules are of little help. You would need to select the appropriate MPI version and compiler before doing so, and then consistently load them before running your OpenFOAM executables. Within the "etc/bashrc" file in the source code tree, you want to set the MPI library to INTELMPI. As usual with self-compiled versions of OpenFOAM, you would "source etc/bashrc" to set up your personal environment to run your home-brew version of OpenFOAM. Contact us if you need to learn more about compiling OpenFOAM on the system. | |||
==Additional Software Applications and Libraries== | ==Additional Software Applications and Libraries== | ||
===Loadable GCC Compiler Versions=== | ===Loadable GCC Compiler Versions=== | ||
The Rocky 9.6 operating system uses the GCC 11.5 compiler. That should be sufficient for most users when compiling your own applications. In case you need to use either a more up-to-date compiler, or if you need an older compiler for compatibility, we make the following versions available as loadable modules. | |||
<PRE> | |||
$ module avail gcc | |||
------------ /shared/apps/modulefiles ------------ | |||
gcc/gcc-4.9.4 gcc/gcc-7.5.0 gcc/gcc-10.3.0 | |||
gcc/gcc-5.5.0 gcc/gcc-8.5.0 gcc/gcc-11.3.0 | |||
gcc/gcc-6.5.0 gcc/gcc-9.5.0 gcc/gcc-12.1.0 | |||
</PRE> | |||
Additional versions can be created and made available as modules as well. If you need a specific version that is not currently available, please ask us to compile and install it. If necessary, we may be able to provide access to other compilers, for example LLVM. We do not provide access to proprietary compilers at this time. | |||
===MPI Libraries and Runtimes=== | ===MPI Libraries and Runtimes=== | ||
While we seem to have a variety of MPI versions and flavors available to users, the only MPI versions that allow us to run software over Infiniband are the Intel MPI libraries. Some of the installed alternatives are likely to fail, or will have a set of environment variables that have to be set. All major engineering software packages that we offer are pre-configured with specific MPI versions and settings that have been tested and/or provided by the vendors. | |||
Note: Some MPI libraries may seem to work and may allow your MPI application to run. But inter-process network communication may then travel through the rather slow and high-latency Ethernet fabric, which makes MPI applications very inefficient and probably not worthwhile. | |||
===MatLab Runtimes=== | ===MatLab Runtimes=== | ||
We can install MatLAB run time libraries as needed and have them available as loadable modules. Recently, we had a problem with MatLAB run time libraries being identified as security vulnerabilities. Contact us if you need them installed for one of your projects. | |||
===Anaconda and variants (miniconda etc)=== | |||
Our current practice is to have users download and install their own versions of Anaconda and its variants in their own home directories. This allows for maximum flexibility when it comes to installable software modules, and users can maintain the installation, upgrades, and maintenance themselves. If you encounter issues, please contact us. One known side effect of Anaconda installations is a performance hit when starting your software, e.g. python scripts may take 30 seconds or more to execute. This is an artefact caused by the Lustre file system, which has been designed for large files accessible from many machines simultaneously. Performance on reading many small files has not been considered and is fairly poor. Again, contact us and we will design a solution for you as needed. | |||
Latest revision as of 21:38, February 26, 2026
Resource Summary View
To get started, users can query the overall status of resources on the cluster. The "qsum" script will list all queues and nodes, as well as how many are offline, down, free, or assigned to users. This is a script developed by our team, and may need to be updated if something goes wrong. Please contact us if you experience any problems.
Each queue groups a number of nodes together based on their hardware and software configurations. Nodes can be part of more than one queue, and there are other complex details that we are ignoring here for the purpose of keeping it simple.
Queues
Here is a very brief summary of what each of the queues is, and how to use them efficiently:
- a4000
- This is a queue that has three machines with 16 cores each; each of these machines is furthermore equipped with three A4000 GPUs. That makes a total of 9 A4000 GPUs available to users. Neither the GPUs nor the processors are particularly powerful these days, but they make for a good software development environment. The machines have 512GB of memory, which makes them a good platform for experimenting with GPU capabilities.
- a6000
- This is a queue that has only one single machine with 64 cores total, and is equipped with four A6000 GPUs. The system can be upgraded to 8 A6000 GPUs if needed. This is a decent GPU machine that can take a solid workload. The machine has 750GB of memory, which makes for a good production platform.
- amd16
- This is a queue with many of our older AMD-based 16-core machines, each of which is equipped with 32GB of memory. While individual machines are a bit outdated, they are all interconnected with Infiniband and can provide a solid production workload in multi-node jobs over MPI without blocking the more current (and thus expensive) systems.
- epyc1/epyc2
- These are 2 separate queues with slightly different performance characteristics. Each of the groups is interconnected with Infiniband to provide a platform for large and demanding software packages, such as LS-Dyna and StarCCM+. They have between 256GB and 512GB of memory. Because licenses for these software packages (LS-Dyna and StarCCM+) are very expensive, these applications should use the two epyc queues for making optimum use of limited core licenses available to each package.
- xeon28
- This is a set of intermediate machines with 28 cores and 64GB of memory. They can be used for a variety of purposes, including MPI jobs and single node application software.
- virtual
- This is a set of nodes without MPI capabilities. They are virtual machines with 32GB each. They can be used for higher demand interactive applications that would interfere otherwise with other users on the login node machines. A user would submit interactive jobs to individual virtual machines and thus avoid any significant load on login nodes.
The Queue Summary Script (qsum)
$ qsum
=============== a4000 ==========================================================
Queue: "a4000" / nodes: 3 / down: 0 / offline: 0 / busy: 0 / available: 3
AVAILABLE (3): g001, g002, g003
=============== a6000 ==========================================================
Queue: "a6000" / nodes: 1 / down: 0 / offline: 0 / busy: 0 / available: 1
AVAILABLE (1): lambda01
=============== amd16 ==========================================================
Queue: "amd16" / nodes: 98 / down: 28 / offline: 5 / busy: 4 / available: 61
DOWN (28): n012, n017, n018, n022, n024, n030, n034, n037, n038, n040
n041, n042, n056, n057, n060, n062, n064, n069, n071, n072
n076, n079, n080, n086, n087, n089, n093, n094
OFFLINE (5): n001, n002, n003, n004, n010
ac.xliu1 (4): n026, n027, n028, n029
AVAILABLE (61): n005, n006, n007, n008, n009, n011, n013, n014, n015, n016
n019, n020, n021, n023, n025, n031, n032, n033, n035, n036
n039, n043, n044, n045, n046, n047, n048, n049, n050, n051
n052, n053, n054, n055, n058, n059, n061, n063, n065, n066
n067, n068, n070, n073, n074, n075, n077, n078, n081, n082
n083, n084, n085, n088, n090, n091, n095, n096, n097, n098
n099
=============== epyc1 ==========================================================
Queue: "epyc1" / nodes: 27 / down: 1 / offline: 0 / busy: 10 / available: 16
DOWN (1): a011
ac.ge.w* (2): a004, a027
ac.razm* (2): a015, a016
ac.vpra* (4): a005, a006, a008, a009
msitek (2): a001, a002
AVAILABLE (16): a003, a007, a010, a012, a013, a014, a017, a018, a019, a020
a021, a022, a023, a024, a025, a026
=============== epyc2 ==========================================================
Queue: "epyc2" / nodes: 20 / down: 0 / offline: 0 / busy: 16 / available: 4
cbojano* (16): a028, a030, a031, a032, a033, a034, a035, a036, a037, a038
a039, a040, a044, a045, a046, a047
AVAILABLE (4): a029, a041, a042, a043
=============== virtual ========================================================
Queue: "virtual" / nodes: 6 / down: 0 / offline: 0 / busy: 0 / available: 6
AVAILABLE (6): v001, v002, v003, v004, v005, v006
=============== xeon28 =========================================================
Queue: "xeon28" / nodes: 12 / down: 0 / offline: 0 / busy: 0 / available: 12
AVAILABLE (12): p001, p002, p003, p004, p005, p006, p007, p008, p009, p010
p011, p012
================================================================================
Queue Status and Monitoring Jobs
qstat
To find out about the status of all running jobs on the cluster, you can use the "qstat" command. Here is an example:
$ qstat
Nov 20 18:30 ley@login3:Plots$ qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
3023.pbs         STDIN            msitek           4144:14* R epyc2
3029.pbs         STDIN            ley              76:46:53 R epyc2
3032.pbs         STDIN            msitek           2879:52* R epyc2
3033.pbs         STDIN            msitek           3687:29* R epyc2
3048.pbs         foo.sh           james.cook              0 Q amd16
3060.pbs         of13.sh          ley              310:47:* R epyc2
3061.pbs         of13.sh          ley              308:37:* R epyc2
3062.pbs         of13.sh          ley              308:02:* R epyc2
3063.pbs         of13.sh          ley              308:15:* R epyc2
The first column shows the job id, a unique identifier for all jobs ever submitted to the cluster. This job id is important when killing jobs, or for other actions you may need to take.
The next column shows the name of the job script. If the column shows STDIN, it means that this is an interactive job where a user can enter commands in a terminal window. This is particularly useful for model and software development tasks where the application has to be started and killed repeatedly.
The owner of the job is shown next. These are the user names of the various people using the cluster.
The last three columns show the time used by the job so far, its state (R for running, Q for waiting in the queue), and the queue to which the job was submitted.
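The state column lends itself to a quick summary of how busy the cluster is. The following sketch counts running and queued jobs. The function name "count_states" is made up for illustration, and the code assumes the six-column layout of the default "qstat" listing, with the state in the fifth field and two header lines at the top.

```shell
#!/bin/bash
# Sketch: count running and queued jobs from default "qstat" output.
# Assumes two header lines and the job state in the fifth field.
count_states() {
    awk 'NR > 2 { c[$5]++ } END { printf "running=%d queued=%d\n", c["R"], c["Q"] }'
}

# On the cluster: qstat | count_states
# Here, three rows from the example listing stand in for live output.
count_states <<'EOF'
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
3029.pbs         STDIN            ley              76:46:53 R epyc2
3048.pbs         foo.sh           james.cook              0 Q amd16
3060.pbs         of13.sh          ley              310:47:* R epyc2
EOF
# prints: running=2 queued=1
```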
qstat -an1
Adding a few options gives much more detail about each job.
qstat -an1
Nov 20 13:09 ley@login3:Plots$ qstat -an1
pbs:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
3023.pbs msitek epyc2 STDIN 24360* 1 64 350gb 100:0 R 81:46 a028/0*64
3029.pbs ley epyc2 STDIN 21719* 2 128 100gb 200:0 R 72:31 a030/0*64+a031/0*64
3032.pbs msitek epyc2 STDIN 18102* 1 64 350gb 100:0 R 57:57 a029/0*64
3033.pbs msitek epyc2 STDIN 830486 1 64 350gb 100:0 R 57:53 a032/0*64
3048.pbs james.c* amd16 foo.sh -- 1 28 30gb 01:00 Q -- --
3060.pbs ley epyc2 STDIN 763101 1 64 350gb 48:00 R 06:42 a033/0*64
3061.pbs ley epyc2 STDIN 763947 1 64 350gb 48:00 R 06:40 a034/0*64
3062.pbs ley epyc2 STDIN 761473 1 64 350gb 48:00 R 06:39 a035/0*64
3063.pbs ley epyc2 STDIN 766205 1 64 350gb 48:00 R 06:40 a036/0*64
In this table, you can see how many nodes and cores are being used by each job. For example, job 3029 of the user "ley" is running on 2 nodes using a total of 128 cores. In addition to the elapsed time, the table also shows the requested (reserved) time for each job. This allows you to estimate the latest time at which a job will have finished (or will be killed by the system if it is still running).
The last column (without a header) appears because the option "-an1" was used. It is useful for learning which nodes are used by each job.
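The node column packs several nodes into one string, such as "a030/0*64+a031/0*64". Here is a small sketch that splits such a string into plain node names. The function name nodes_from_exec_host is an assumption made for this example.

```shell
#!/bin/bash
# Print the distinct node names contained in an exec host string,
# e.g. "a030/0*64+a031/0*64" from the last column of "qstat -an1".
# Entries are separated by "+", and the node name precedes the "/".
nodes_from_exec_host() {
    printf '%s\n' "$1" | tr '+' '\n' | cut -d/ -f1 | sort -u
}

# Job 3029 from the table above: prints a030 and a031, one per line.
nodes_from_exec_host 'a030/0*64+a031/0*64'
```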
qstat -q
To learn more about the queues on the cluster, the "-q" option turns out to be useful.
$ qstat -q
server: pbs
Queue Memory CPU Time Walltime Node Run Que Lm State
---------------- ------ -------- -------- ---- ----- ----- ---- -----
virtual 30gb -- -- 1 0 0 -- E R
a4000 500gb -- -- 1 0 0 -- E R
a6000 750gb -- -- 1 0 0 -- E R
xeon28 -- -- -- 4 0 0 -- E R
amd16 -- -- -- 8 0 1 -- E R
epyc2 -- -- -- 2 14 0 -- E R
epyc1 -- -- -- 2 0 0 -- E R
----- -----
14 1
For each queue, some basic values are displayed. The first three queues listed above have a default memory allocation as shown, and the column "Node" indicates the maximum number of nodes that can be requested at job submission time. For example, you can request just one node per job from the first three queues (because these are queues where MPI makes no sense). The "xeon28" queue allows a maximum of 4 nodes per MPI job. The "amd16" queue has a maximum of 8 nodes per job, and the "epyc1" and "epyc2" queues have maxima of two nodes per job. These limits can be changed by the administrator as needed. Their purpose is to prevent inefficient resource requests.
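If you want to look up the node limit of a single queue before submitting, a short filter is enough. This is only a sketch, assuming the column order of the "qstat -q" listing above, where the "Node" value is the fifth field of each queue line. The function name queue_node_limit is hypothetical.

```shell
#!/bin/bash
# Print the per-job node limit of one queue from "qstat -q" output on stdin.
# Assumption: the column order matches the listing above, so the "Node"
# value is the fifth whitespace-separated field of a queue line.
queue_node_limit() {
    awk -v q="$1" '$1 == q && NF >= 9 { print $5 }'
}

# Typical use on a login node:
#   qstat -q | queue_node_limit xeon28
```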
qstat -f
The command "qstat -f -F json 3029" retrieves extremely detailed stats on the running job 3029. The result can be returned in JSON format to be ready for further processing (shown below).
$ qstat -f -F json 3029
{
"timestamp":1763705353,
"pbs_version":"23.06.06",
"pbs_server":"pbs",
"Jobs":{
"3029.pbs":{
"Job_Name":"STDIN",
"Job_Owner":"ley@login4",
"resources_used":{
"cpupercent":98,
"cput":"76:46:53",
"hpmem":"0b",
"mem":"52428800kb",
"ncpus":128,
"vmem":"52428800kb",
"walltime":"78:09:32"
},
"job_state":"R",
"queue":"epyc2",
"server":"pbs",
"Checkpoint":"u",
"ctime":"Mon Nov 17 17:58:25 2025",
"Error_Path":"/dev/pts/0",
"exec_host":"a030/0*64+a031/0*64",
"exec_vnode":"(a030:ncpus=64:mem=52428800kb)+(a031:ncpus=64:mem=52428800kb)",
"Hold_Types":"n",
"interactive":"True",
"Join_Path":"n",
"Keep_Files":"n",
"Mail_Points":"a",
"mtime":"Fri Nov 21 00:07:59 2025",
"Output_Path":"/dev/pts/0",
"Priority":0,
"qtime":"Mon Nov 17 17:58:25 2025",
"Rerunable":"False",
"Resource_List":{
"mem":"100gb",
"mpiprocs":128,
"ncpus":128,
"nodect":2,
"place":"free",
"select":"2:ncpus=64:mem=50gb:mpiprocs=64",
"walltime":"200:00:00"
},
"stime":"Mon Nov 17 17:58:25 2025",
"session_id":2171964,
"jobdir":"/mnt/lustre/arrow/home/ley",
"substate":42,
"Variable_List":{
"PBS_O_HOME":"/mnt/lustre/arrow/home/ley",
"PBS_O_LANG":"en_US.UTF-8",
"PBS_O_LOGNAME":"ley",
"PBS_O_PATH":"/shared/apps/active/lstc/lsprepost/SP-4.5:/shared/apps/active/lstc/lsprepost/DP-4.3:/shared/bin:/usr/share/Modules/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/pbs/bin:/opt/thinlinc/bin:/opt/thinlinc/sbin:/mnt/lustre/arrow/home/ley/.local/bin:/mnt/lustre/arrow/home/ley/bin",
"PBS_O_MAIL":"/var/spool/mail/ley",
"PBS_O_SHELL":"/bin/bash",
"PBS_O_WORKDIR":"/mnt/lustre/arrow/home/ley/Qualification/LS-Dyna/Rocky9/Seatbelt/Template",
"PBS_O_SYSTEM":"Linux",
"PBS_O_QUEUE":"epyc2",
"PBS_O_HOST":"login4"
},
"comment":"Job run at Mon Nov 17 at 17:58 on (a030:ncpus=64:mem=52428800kb)+(a031:ncpus=64:mem=52428800kb)",
"etime":"Mon Nov 17 17:58:25 2025",
"run_count":1,
"Submit_arguments":"-I -q epyc2 -l walltime=200:00:00,select=2:ncpus=64:mem=50gb:mpiprocs=64",
"project":"_pbs_project_default",
"Submit_Host":"login4"
}
}
}
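The walltime entries in this output (the requested "walltime" under "Resource_List" and the used "walltime" under "resources_used") tell you how much time a job has left. The following sketch does the arithmetic with values copied by hand from the JSON above. The function names are made up for this example, and a job that has overrun its request is not handled.

```shell
#!/bin/bash
# Convert H:MM:SS (hours may exceed 24) to seconds.
hms_to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Convert seconds back to H:MM:SS.
seconds_to_hms() {
    awk -v t="$1" 'BEGIN {
        printf "%d:%02d:%02d\n", int(t / 3600), int((t % 3600) / 60), t % 60
    }'
}

# remaining_walltime REQUESTED USED
remaining_walltime() {
    seconds_to_hms $(( $(hms_to_seconds "$1") - $(hms_to_seconds "$2") ))
}

# Values of job 3029 from the JSON above:
# requested 200:00:00, used 78:09:32.
remaining_walltime 200:00:00 78:09:32   # prints 121:50:28
```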
Manual pages for qstat
To learn more about the "qstat" command, you can use the command "man qstat", which will print a lot of detailed information about the capabilities of this command.
$ man qstat
qstat(1B) PBS Professional qstat(1B)
NAME
qstat - display status of PBS jobs, queues, or servers
SYNOPSIS
Displaying Job Status
Default format:
qstat [-E] [-J] [-p] [-t] [-w] [-x] [[<job ID> | <destination>] ...]
Long format:
qstat -f [-F json | dsv [-D <delimiter>]] [-E] [-J] [-p] [-t] [-w]
[-x] [[<job ID> | <destination>] ...]
... <many more pages>
Job Submission Basics
Jobs are submitted into the system using the "qsub" application. This application can take many different options and allows for a lot of different resource requests to tell the cluster what to do. We are running OpenPBS 23.06.06 as our job scheduler. Here is a link to the User's Manual (of PBS PRO) if you want to explore gory details and capabilities. The User's Guide has about 240 pages, the Reference Guide has 500 pages, and the Big Book has 2500 pages. So there is a lot of information available. I also added job submission info for the LCRC cluster.
- Argonne's LCRC pages on job submissions on their clusters
- PBS Professional 2022.1 User's Guide
- PBS Professional 2022.1 Reference Guide
- Altair PBS Professional 2022.1 Big Book
The User's Guide can be very helpful to clarify some of the concepts and capabilities, but it can be hard to find the specific information you may be looking for. Please understand that we are no longer running TORQUE and MAUI, so the syntax for job submission looks familiar but differs in important details.
The reference guide may be helpful to understand the complete syntax and full capabilities of the software.
The big book is what I had to use when configuring OpenPBS earlier this year. This includes all the tricky details needed to make the system work smoothly for us. It's a bit scary to look at a PDF file that is 2500 pages long, but that is nothing compared to the StarCCM+ manuals.
IMPORTANT NOTE: The following sections are important to understand. They explain how jobs are submitted and then scheduled for execution based on the resources available and the specific needs of the user.
The following sections explain the various tasks you may want to submit for execution, ordered from simple to complex.
- General Batch Jobs
- Requesting a Single Node for a Job
- Requesting Multiple Nodes for a Job
- Embedded Job Resource Requests
- Interactive Jobs
- Interactive Jobs with X-Windows GUI Applications
- Running Multiple Jobs on Single Nodes
- Running Jobs using GPUs
General Batch Jobs
Let's get started with a very basic usage of the system. Let's assume you have a simple application, and you want to execute it on a cluster node. Let's also assume that this is a very simple application, one that runs on one or a few cores, doesn't require any keyboard interaction with the user, doesn't need the user to see what's typically written to the screen, and writes its output to files. In this case, we can submit this application as a batch job, which will place it into an execution queue and process it as soon as a node becomes available.
If the application requires more cores than a single node can provide, we can run the application over Infiniband with MPI message passing. In this case, we need to understand the concept of MPI applications a bit better. In both cases, we get started by creating a folder on the file system. Naming conventions are important so that you can distinguish the jobs by folder name.
For both of the above scenarios, you would typically create a Bash shell script, and then submit the script into one of the queues for eventual execution.
Requesting a Single Node for a Job
Let's try something rather trivial to get used to the concept. Create yourself a folder, for example "myjobfolder". Within that folder, create a job submission script. That script can have any name, but something short and simple may be best. Let's assume you create a file called "cluster.job". The file doesn't have to have that extension. Any file name will do. But using the same filename for all of your jobs helps you find your way around the many files that will be created over time. The "cluster.job" file should look something like this:
#!/bin/bash
#
# the following ensures that you will change into the directory where you are
# submitting the job from.
cd "$PBS_O_WORKDIR"
#
# now we sleep for 60 seconds and waste time. This is a placeholder for your application,
# which would be doing useful work for you.
sleep 60
#
# and after doing things, we may want to write something into a file to show that
# our job is done.
echo `date` > info.log
#
This can be submitted without detailed resource specifications (except for the walltime, which is by default 0:00:00):
$ qsub -q virtual -l walltime=1:00:00 cluster.job
3072.pbs
Wait a little, then check the status of running jobs:
$ qstat -an1
pbs:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
3023.pbs msitek epyc2 STDIN 24360* 1 64 350gb 100:0 R 83:17 a028/0*64
3029.pbs ley epyc2 STDIN 21719* 2 128 100gb 200:0 R 74:00 a030/0*64+a031/0*64
3033.pbs msitek epyc2 STDIN 830486 1 64 350gb 100:0 R 59:23 a032/0*64
3048.pbs james.c* amd16 foo.sh -- 1 28 30gb 01:00 Q -- --
3060.pbs ley epyc2 STDIN 763101 1 64 350gb 48:00 R 08:10 a033/0*64
3061.pbs ley epyc2 STDIN 763947 1 64 350gb 48:00 R 08:10 a034/0*64
3070.pbs ley epyc2 STDIN 766847 1 64 350gb 48:00 R 07:23 a042/0*64
3072.pbs ley virtual cluster.j* 230230 1 4 30gb 01:00 E 00:01 v001/0*4
In this particular example, we are sending this job to the queue "virtual". By default, this queue allocates 30GB of memory to the job and runs it on 1 node with 4 cores. This is sufficient capacity to run quite a workload. When submitting a job to a single node, reasonable maximum allocations are assigned automatically, and users don't have to worry about running out of memory or about how many cores they will be using.
The only required argument is the "walltime" argument. Without it, the default walltime of 0:00:00 applies and the job quits as soon as it starts. This indicates to the user that the "walltime" argument was forgotten.
When the job disappears from the job list, it is done. At this point, you will find the file "info.log" in your job folder.
$ cat info.log
Thu Nov 20 08:00:31 PM CST 2025
Requesting Multiple Nodes for a Job
To run jobs on multiple nodes, you will most likely be executing jobs using MPI, the message passing interface. This establishes high-speed low-latency interconnections between the cores on one machine and the cores on the other machines. Data transfer does not require involvement of the cores themselves. Instead, the cores tell the InfiniBand interconnect (and cores on the same node through shared memory) to transfer the data through RDMA, remote direct memory access. The cores don't need to spend CPU cycles on copying data, but rather simply access the data once it has been copied by the InfiniBand fabric. This makes for extremely efficient remote memory access, and message passing is used to coordinate data transfer between the cores no matter where they are located on any of the nodes.
On our cluster, MPI-aware applications like OpenFOAM, StarCCM+, and LS-Dyna can be loaded as modules, which then automatically select the most appropriate MPI library to use. The software applications have been tested to ensure that they work out of the box if a user selects any specific version of any of the applications.
The following is a very trivial example for the MPI execution of a very simple executable, with one copy running on each core of the nodes allocated to the job. It doesn't perform any real work and just wastes resources for a short time, but it illustrates how execution on the cores of various nodes works.
Like in the previous section, we start with a simple job script that we submit to an appropriate queue. In this case, we pick a queue that has machines with Infiniband interfaces supporting efficient communications. Let's assume we edit a file with the name "parallel.job" like this:
#!/bin/bash
#
# the following ensures that you will change into the directory where you are
# submitting the job from.
cd "$PBS_O_WORKDIR"
#
# to execute a simple command on all of the cores of all of the nodes allocated to the job,
# we need to make one of the MPI versions available. Let's use one of the most up-to-date
# MPI libraries available on the cluster
module load intel/2024.2.0/mpi/2021.13
#
# now we apply a few settings that ensure that the MPI library will use the highest-performing
# Infiniband interconnect, as well as a few options to tell MPI how to interface nodes with
# each other and which specific Infiniband adapter to use. This is complex and requires in-depth
# knowledge of the QLogic Infiniband adapters we are using. It is unlikely that you will ever have to
# deal with these options, because the "module load" command for the engineering applications we provide
# on ARROW will handle all those details transparently without the user needing to understand the details.
export I_MPI_HYDRA_BOOTSTRAP=ssh
export MPI_DEVICE=rdma:ofa-v2-ib0
export UCX_NET_DEVICES=qib0:1
#
# it doesn't make much sense, but in this example we are executing the OS command "uptime" on all cores
# of the nodes allocated to this job. The output from each core is written to the file info.log. We
# will find 56 lines of output in the file info.log, each created by the corresponding core executing
# the uptime command.
mpirun uptime > info.log
#
A good queue to test scripts is the "xeon28" queue. In this queue, we have two 14-core Xeon processors per node, which means that each node has 28 actual cores. We do not consider hyperthreading when doing parallel computing. With the two nodes requested below, 56 actual cores are being used here. The job submission will look like this:
qsub -q xeon28 -l walltime=1:0:0 -l select=2:ncpus=28:mpiprocs=28:mem=60G parallel.job
        ^                  ^               ^       ^           ^      ^ ^ ^
        |                  |               |       |           |      | | +--- the name of the job script to execute
        |                  |               |       |           |      | +----- don't forget to specify gigabytes
        |                  |               |       |           |      +------- the amount of memory to request per node
        |                  |               |       |           +-------------- the number of MPI tasks per node
        |                  |               |       +-------------------------- the number of cores per node
        |                  |               +---------------------------------- the number of nodes to select in the queue
        |                  +-------------------------------------------------- the requested time, in this case 1h
        +--------------------------------------------------------------------- the queue to be used for the job
At this point, the job has created a file "info.log" with 56 lines, one per core:
22:26:05 up 23 days, 9:28, 0 users, load average: 0.00, 0.00, 0.00
22:26:05 up 23 days, 9:28, 0 users, load average: 0.00, 0.00, 0.00
22:26:05 up 23 days, 9:28, 0 users, load average: 0.00, 0.00, 0.00
22:26:05 up 23 days, 9:28, 0 users, load average: 0.00, 0.00, 0.00
22:26:05 up 23 days, 9:28, 0 users, load average: 0.00, 0.00, 0.00
22:26:05 up 23 days, 9:28, 0 users, load average: 0.00, 0.00, 0.00
22:26:05 up 23 days, 9:53, 0 users, load average: 0.06, 0.03, 0.00
22:26:05 up 23 days, 9:53, 0 users, load average: 0.06, 0.03, 0.00
22:26:05 up 23 days, 9:53, 0 users, load average: 0.06, 0.03, 0.00
...
In this simple example, the lines all look the same. Upon close examination though, you can find slightly different values in some of the lines. Some lines say that the machine has been up for 23 days and 9:28, while others say 23 days and 9:53. Because all 28 cores of a node see the same uptime of the server, half of the entries show one value, and the other 28 entries show the other one. That demonstrates that the 56 processes have been running independently on 2 nodes.
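Rather than reading 56 lines by eye, you can let the shell count how many lines share the same uptime. This is a minimal sketch, assuming every line has the shape shown above ("... up 23 days, 9:28, 0 users, ..."). The function name count_lines_by_uptime is made up for this example.

```shell
#!/bin/bash
# Count how many lines of an "uptime" log share the same uptime value.
# Assumption: each line looks like the output above, with the uptime
# between " up " and the ", N users," part.
count_lines_by_uptime() {
    sed -E 's/^.* up +(.*), +[0-9]+ users?,.*$/\1/' "$1" |
        awk '{ n[$0]++ } END { for (k in n) print n[k], k }' | sort
}

# Typical use in the job folder:
#   count_lines_by_uptime info.log
```

For the two-node run described above, you would expect two output lines, each with a count of 28.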
Embedded Job Resource Requests
The job script can be modified to embed the resource requests in the form of a series of #PBS statements at the beginning of the script file. This is a very common practice used at many HPC installations and with many job submission engines. Let's go back to the previous example where we run the script on two nodes in parallel. That is the "parallel.job" script file again:
#!/bin/bash
#
#PBS -q xeon28
#PBS -l walltime=1:0:0
#PBS -l select=2:ncpus=28:mpiprocs=28:mem=60G
#
# the following ensures that you will change into the directory where you are
# submitting the job from.
cd "$PBS_O_WORKDIR"
#
# to execute a simple command on all of the cores of all of the nodes allocated to the job,
# we need to make one of the MPI versions available. Let's use one of the most up-to-date
# MPI libraries available on the cluster
module load intel/2024.2.0/mpi/2021.13
#
# now we apply a few settings that ensure that the MPI library will use the highest-performing
# Infiniband interconnect, as well as a few options to tell MPI how to interface nodes with
# each other and which specific Infiniband adapter to use. This is complex and requires in-depth
# knowledge of the QLogic Infiniband adapters we are using. It is unlikely that you will ever have to
# deal with these options, because the "module load" command for the engineering applications we provide
# on ARROW will handle all those details transparently without the user needing to understand the details.
export I_MPI_HYDRA_BOOTSTRAP=ssh
export MPI_DEVICE=rdma:ofa-v2-ib0
export UCX_NET_DEVICES=qib0:1
#
# it doesn't make much sense, but in this example we are executing the OS command "uptime" on all cores
# of the nodes allocated to this job. The output from each core is written to the file info.log. We
# will find 56 lines of output in the file info.log, each created by the corresponding core executing
# the uptime command.
mpirun uptime > info.log
#
If the resource requests are embedded within the file, they don't have to be specified on the command line any longer (the command line overrides the embedded specifications though). This may be convenient, because all the user has to do for job submission is the following:
qsub parallel.job
Here is an example with more resource specifications and job settings that affect the behavior of the job:
#!/bin/bash
#
#PBS -q xeon28
#PBS -l walltime=1:0:0
#PBS -l select=2:ncpus=28:mpiprocs=28:mem=60G
#PBS -A Account
#PBS -j oe
#PBS -N JobName
#PBS -e log.error
#PBS -o log.output
#PBS -m bae
#
...
I leave this to you as an exercise to figure out what the various options mean and how to specify them. There are many more, all documented in the PBS PRO manual (see above). Most of them are not terribly relevant and can be omitted.
Interactive Jobs
On ARROW, we don't restrict queues to batch mode. While batch mode is efficient for lining up a lot of work to be executed one after the other, ARROW has been designed to allow efficient model and software development in interactive mode. We have always made sure to have more computers than minimally needed so that resources can be dedicated to developers, even if that means wasted CPU cycles. At times, we may ask you to limit the number of interactive jobs so that a large batch workload can be processed efficiently. This happens from time to time, and we have our users coordinate this with each other.
Let's assume that you are developing an MPI application, or you are working on a complex OpenFOAM model that requires starting parallel processes over and over again just to find a bug and then fix it quickly. To do that, you can request an interactive job by adding the "-I" option to the job submission command (this is an uppercase I). Let's go back to the parallel multi-node example from above:
qsub -I -q xeon28 -l walltime=1:0:0 -l select=2:ncpus=28:mpiprocs=28:mem=60G
     ^     ^                  ^               ^       ^           ^      ^ ^
     |     |                  |               |       |           |      | +--- don't forget to specify gigabytes
     |     |                  |               |       |           |      +----- the amount of memory to request per node
     |     |                  |               |       |           +------------ the number of MPI tasks per node
     |     |                  |               |       +------------------------ the number of cores per node
     |     |                  |               +-------------------------------- the number of nodes to select in the queue
     |     |                  +------------------------------------------------ the requested time, in this case 1h
     |     +------------------------------------------------------------------- the queue to be used for the job
     +------------------------------------------------------------------------- request an interactive job <<===
When running interactive jobs with the "-I" parameter, we don't specify a job script at the end of the submission command. The interactive job will instead start (once the nodes are available) in interactive mode, meaning that the terminal session changes over from executing commands on the login server to executing commands on the first node of the group of nodes allocated to the job. At this point, you can change to the desired working directory, but what you do with the allocated resources is entirely up to you. You can load modules, including MPI libraries, and then issue the commands for your application interactively and see how they execute. If you start an "mpirun", the cores on your allocated secondary node will work as expected. There is no difference from batch mode, other than you having the ability to execute lines of commands at will.
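Once the interactive session has started, it can be reassuring to check which nodes you were actually given. PBS names the file listing the allocated hosts in the environment variable PBS_NODEFILE. The sketch below assumes the usual layout of that file, one host name per line repeated once per MPI process slot. The function name summarize_nodefile is made up for this example.

```shell
#!/bin/bash
# Summarize a PBS node file: print each host with its number of entries.
# Assumption: the file holds one host name per line, repeated once per
# MPI process slot, which is the usual layout of $PBS_NODEFILE.
summarize_nodefile() {
    awk '{ n[$1]++ } END { for (h in n) print h, n[h] }' "$1" | sort
}

# Inside a job, PBS sets PBS_NODEFILE; outside of a job this does nothing.
if [ -n "${PBS_NODEFILE:-}" ]; then
    summarize_nodefile "$PBS_NODEFILE"
fi
```

For the two-node request above, you would expect two host names with 28 entries each.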
Interactive Jobs with X-Windows GUI Applications
Interactive use can go further than that. With some of our software applications, like StarCCM+, you can run an interactive GUI application where you control the computational work from within the application's GUI. Within the GUI, you can control execution of the numerical solver that runs on as many cores as you requested, while being able to reconfigure the case through the GUI as well. Furthermore, you can visualize developing results on the fly by creating complex plots and visualizations.
All that is needed is the "-X" option as part of the job submission, like this:
qsub -X -I -q xeon28 -l walltime=1:0:0 -l select=2:ncpus=28:mpiprocs=28:mem=60G
     ^  ^     ^                  ^               ^       ^           ^      ^ ^
     |  |     |                  |               |       |           |      | +--- don't forget to specify gigabytes
     |  |     |                  |               |       |           |      +----- the amount of memory to request per node
     |  |     |                  |               |       |           +------------ the number of MPI tasks per node
     |  |     |                  |               |       +------------------------ the number of cores per node
     |  |     |                  |               +-------------------------------- the number of nodes to select in the queue
     |  |     |                  +------------------------------------------------ the requested time, in this case 1h
     |  |     +------------------------------------------------------------------- the queue to be used for the job
     |  +------------------------------------------------------------------------- request an interactive job
     +---------------------------------------------------------------------------- request GUI capabilities <<===
Running Multiple Jobs on Single Nodes
A feature that is new on ARROW is the ability to run multiple jobs on a single node. Let's assume that you are performing a sensitivity analysis on an existing model, and the model is simple enough to return results within a reasonable time on just a few cores of a higher end machine (maybe you are running SMP versions of LS-Dyna). Our high end machines have 64 cores, so let's assume you have an LS-Dyna model that runs well on 8 cores and doesn't use a whole lot of memory. In this case, you can submit individual jobs that each request just 8 cores and a fraction of the memory available on the node, and all jobs execute independently of each other. Each job is fit into a slot where one is available. It is not very different from using whole nodes for everything. The important consideration is that each job is cleanly constrained to its allotted resources using the cgroups functionality of modern operating systems. Because an abusive user cannot use more cores or more memory than allocated to their job, other users can safely run smaller jobs on the same node.
Let's assume that we have a number of smaller jobs that we want to run on a single node in the "xeon28" queue. Each job would be submitted with reduced resource requests that allow for sharing but still guarantee that the jobs will run successfully. In this case, you can submit many jobs in the following manner (with a job script for the small jobs, each of which can request varying resources if needed - some may want to run on 5 cores, others on 3):
#!/bin/bash
#
# the following ensures that you will change into the directory where you are
# submitting the job from.
cd "$PBS_O_WORKDIR"
#
# now we sleep for 300 seconds and waste time. This is a placeholder for your application,
# which would be doing useful work for you.
sleep 300
#
Now we submit a variety of these jobs (11 total in this example) to the "xeon28" queue for execution (note that the first few jobs request different amounts of memory and core counts):
qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=12:mpiprocs=12:mem=5G small.job
qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=10:mpiprocs=10:mem=7G small.job
qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=8:mpiprocs=8:mem=9G small.job
qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=16:mpiprocs=16:mem=20G small.job
qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=2:mpiprocs=2:mem=2G small.job
qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=2:mpiprocs=2:mem=2G small.job
qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=2:mpiprocs=2:mem=2G small.job
qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=2:mpiprocs=2:mem=2G small.job
qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=2:mpiprocs=2:mem=2G small.job
qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=2:mpiprocs=2:mem=2G small.job
qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=2:mpiprocs=2:mem=2G small.job
They are now running in the order of submission, allocated on as few nodes in the "xeon28" queue as necessary. Only 2 nodes are being loaded quite heavily, and 4 more cores are in use on a third node.
Nov 20 23:34 ley@login3:myjobfolder$ qstat -an1
pbs:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
3082.pbs ley xeon28 small.job 813221 1 12 5gb 01:00 R 00:01 p001/0*12
3083.pbs ley xeon28 small.job 813288 1 10 7gb 01:00 R 00:01 p001/1*10
3084.pbs ley xeon28 small.job 671792 1 8 9gb 01:00 R 00:01 p002/0*8
3085.pbs ley xeon28 small.job 671845 1 16 20gb 01:00 R 00:01 p002/1*16
3086.pbs ley xeon28 small.job 813361 1 2 2gb 01:00 R 00:00 p001/2*2
3087.pbs ley xeon28 small.job 813413 1 2 2gb 01:00 R 00:00 p001/3*2
3088.pbs ley xeon28 small.job 813464 1 2 2gb 01:00 R 00:00 p001/4*2
3089.pbs ley xeon28 small.job 671912 1 2 2gb 01:00 R 00:00 p002/2*2
3090.pbs ley xeon28 small.job 671969 1 2 2gb 01:00 R 00:00 p002/3*2
3091.pbs ley xeon28 small.job 632092 1 2 2gb 01:00 R 00:00 p003/0*2
3092.pbs ley xeon28 small.job 632100 1 2 2gb 01:00 R 00:00 p003/1*2
This is a particularly effective strategy to run many cases concurrently that don't scale well beyond a few cores. When running each of them on fewer cores but many of them at the same time, the overall processing rate will be much higher than executing them the traditional way.
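For a sensitivity analysis with many cases, typing each submission by hand gets tedious. The following sketch prints one qsub command per case folder without running anything, so the commands can be reviewed first. The folder names (case01 and so on) and the function name print_sweep_commands are hypothetical, and the sketch assumes that each case folder contains its own copy of "small.job".

```shell
#!/bin/bash
# Print (but do not run) one qsub command per case folder.
# Hypothetical names: the case folders and this function are made up for
# illustration; "small.job" is the job script from the example above.
# The function body runs in a subshell so its variables do not leak.
print_sweep_commands() (
    queue=$1; ncpus=$2; mem=$3
    shift 3
    for case_dir in "$@"; do
        printf '(cd %s && qsub -q %s -l walltime=1:0:0 -l select=1:ncpus=%s:mpiprocs=%s:mem=%s small.job)\n' \
            "$case_dir" "$queue" "$ncpus" "$ncpus" "$mem"
    done
)

# Eight cores and 9G per case, matching one of the requests above.
print_sweep_commands xeon28 8 9G case01 case02 case03
```

Once the printed commands look right, they can be piped into "sh" on the login node to perform the actual submissions.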
Running Jobs using GPUs
The principle of running multiple jobs on a single node becomes particularly important when using servers equipped with GPUs for ML/AI applications. The cluster doesn't have a whole lot of GPUs at this point. We have three machines with three A4000 GPUs each, a total of 9 A4000 GPUs. Then we have a single, much more powerful machine with four A6000 GPUs.
Using multiple GPUs in a single application still requires software that has been designed with the hardware in mind. GPUs have several ways of communicating with each other: very fast NVLINK between pairs of GPUs, GPUs attached to the same one of the two CPUs in the system (which can therefore communicate faster), GPUs that have to cross between processors when communicating, and possibly PCIe bridges on top of that.
On our system, we are providing the ability to work mostly with individual GPUs. Users can also reserve entire nodes and develop or run applications that are adapted to that hardware, including several GPUs installed on that node. One thing we do not provide is the ability of GPU to GPU communication between nodes. Thus, a job cannot request more than one GPU node at a time.
qsub -q a4000 -I -l walltime=1:0:0 -l select=1:ncpus=5:mem=150G:ngpus=1
With these specifications, three single-GPU jobs can run on a single server. Each job sees only the GPU reserved for it.
To run a massive GPU job on 64 cores with 4 A6000 GPUs, submit the job like this:
qsub -q a6000 -I -l walltime=1:0:0 -l select=1:ncpus=64:mem=725G:ngpus=4
Manual pages for qsub
To learn more about the "qsub" command, you can use the command "man qsub", which will print a lot of detailed information about the capabilities of this command.
$ man qsub
qsub(1B) PBS Professional qsub(1B)
NAME
qsub - submit a job to PBS
SYNOPSIS
qsub [-a <date and time>] [-A <account string>] [-c <checkpoint spec>]
[-C <directive prefix>] [-e <path>] [-f] [-h]
[-I [-G [-- <GUI application/script>]] | [-X]] [-j <join>]
[-J <range> [%<max subjobs]] [-k <discard>] [-l <resource list>]
[-m <mail events>] [-M <user list>] [-N <name>] [-o <path>]
[-p <priority>] [-P <project>] [-q <destination>] [-r <y|n>]
[-R <remove options>] [-S <path list>] [-u <user list>]
[-v <variable list>] [-V] [-W <additional attributes>] [-z]
[- | <script> | -- <executable> [<arguments to executable>]]
qsub --version
DESCRIPTION
You use the qsub command to submit a batch job to PBS. Submitting a PBS job specifies a task, requests resources, and
sets job attributes.
... <many more pages>
LS-Dyna on the ARROW Cluster
Currently Available LS-Dyna Versions
The following is a list of LS-Dyna versions available on ARROW after the latest reconfiguration of the system. As per LSTC/ANSYS, versions before 14.0.0 are not necessarily fully supported any longer because they are supposedly not compatible with modern operating systems and cannot be made to work reliably. We have tested the listed older versions of LS-Dyna and they have passed basic tests. They may not behave exactly as they did on the old CentOS 7 operating system, and time will show whether they can still be used or whether you will need to update your models and use a fully supported version.
All versions are loaded using the "module load" command. Versions can be listed with the "module avail ls-dyna" command. To load one of the modules, use the following syntax:
module load ls-dyna/14.2.0/mpi-d8-ifort190-avx512
            ^       ^      ^   ^  ^        ^
            |       |      |   |  |        +--- specify the extended instruction set needed for execution
            |       |      |   |  +------------ load the version of the compiler that was used to create this
            |       |      |   +--------------- load the version that supports double precision variables
            |       |      +------------------- load the MPP (MPI) version of LS-Dyna
            |       +-------------------------- load specifically version 14.2.0
            +---------------------------------- load a version of LS-Dyna
The version string is composed of multiple elements to indicate variants in compilers and compiler options. Use the following guideline to choose an appropriate version to load:
- "1" or "mpi" indicates whether this is a single node version of LS-Dyna (SMP) or whether this is a multi-node MPI version (MPP). All MPI versions use the IntelMPI 2022 libraries which have been tested thoroughly on ARROW. MPI versions will use the Infiniband Network of ARROW for high-speed and low-latency inter-process communication using RDMA (remote direct memory access).
- All LS-Dyna versions are available in either single precision (floating point) or double precision variants. Single precision variants use 4 bytes to represent a value, and double precision variants use 8 bytes. There are pros and cons for choosing one variant over the other. With regard to computational efficiency, both perform nearly the same because all machines are equipped with 64-bit CPUs.
- "f4" floating point versions
- Pros: These require significantly less memory to run. Results occupy less disk space, and can be transferred significantly faster into and out of ARROW.
- Cons: The numerical resolution is limited to 7 significant digits, which is often undesirable when dealing with mathematical operations on small and large numbers at the same time.
- "r8" double precision versions
- Pros: The numerical resolution is about twice the number of significant digits compare to "f4", which helps when when dealing with mathematical operations on small and large numbers at the same time.
- Cons: These require more memory to run. Results occupy more disk space, and it takes longer to transfer data into and out of ARROW.
- There are two more identifiers to choose from when it comes to the variants of the executables: the specific compiler used to create the executable and the specific processor instruction set required for running the executable.
  - For modern versions of LS-Dyna, two compilers have been used by the developers to create LS-Dyna executables: the Intel Fortran Compiler and the AOCC (AMD Optimizing C/C++ and Fortran) compiler. Both variants of the software are supported on ARROW. This gives users the opportunity to choose an alternate variant of the same LS-Dyna version when running into bugs or crashes.
  - The variants based on the various instruction set extensions (SSE2, AVX2, AVX512, and so on) give users even more options when choosing an alternate LS-Dyna variant of the same version when running into bugs or crashes. These instruction sets are mostly related to performance gains on specific processors. We have not performed thorough performance tests and cannot recommend specific versions right now.
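To put the memory and disk trade-off between the 4-byte and 8-byte variants into numbers, here is a small arithmetic sketch. The helper name storage_mib and the figure of 300 million values are arbitrary choices for illustration; real LS-Dyna memory use depends on the model and is not computed here.

```shell
# Approximate storage for COUNT values of BYTES bytes each, in whole MiB.
# storage_mib is a made-up helper name for this example.
storage_mib() (
    count=$1
    bytes=$2
    echo $(( count * bytes / 1024 / 1024 ))
)

storage_mib 300000000 4   # 4-byte (f4) values: prints 1144
storage_mib 300000000 8   # 8-byte (d8) values: prints 2288
```

The same factor of two applies to result files, which is why the 4-byte variants transfer faster into and out of the cluster.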
$ module avail ls-dyna
--------------------------------------------- /shared/apps/modulefiles ---------------------------------------------
ls-dyna/09.3.1/1-d8-ifort131           ls-dyna/12.2.1/mpi-f4-ifort160-sse2    ls-dyna/14.2.0/mpi-f4-ifort190-avx512
ls-dyna/09.3.1/1-f4-ifort131           ls-dyna/12.2.2/1-d8-ifort160-sse2      ls-dyna/14.2.0/mpi-f4-ifort190-sse2
ls-dyna/09.3.1/mpi-d8-ifort131-avx2    ls-dyna/12.2.2/1-f4-ifort160-sse2      ls-dyna/15.0.2/1-d8-ifort190-sse2
ls-dyna/09.3.1/mpi-d8-ifort131-avx512  ls-dyna/12.2.2/mpi-d8-aocc400-avx2     ls-dyna/15.0.2/1-f4-ifort190-sse2
ls-dyna/09.3.1/mpi-f4-ifort131-avx2    ls-dyna/12.2.2/mpi-d8-ifort160-avx2    ls-dyna/15.0.2/mpi-d8-aocc400-avx2
ls-dyna/09.3.1/mpi-f4-ifort131-avx512  ls-dyna/12.2.2/mpi-d8-ifort160-sse2    ls-dyna/15.0.2/mpi-d8-ifort190-avx2
ls-dyna/10.2.0/1-d8-ifort160           ls-dyna/12.2.2/mpi-f4-aocc400-avx2     ls-dyna/15.0.2/mpi-d8-ifort190-avx512
ls-dyna/10.2.0/1-f4-ifort160           ls-dyna/12.2.2/mpi-f4-ifort160-avx2    ls-dyna/15.0.2/mpi-d8-ifort190-sse2
ls-dyna/11.0.0/1-d8-ifort160           ls-dyna/12.2.2/mpi-f4-ifort160-sse2    ls-dyna/15.0.2/mpi-f4-aocc400-avx2
ls-dyna/11.0.0/1-f4-ifort160           ls-dyna/13.0.0/1-d8-ifort190           ls-dyna/15.0.2/mpi-f4-ifort190-avx2
ls-dyna/11.1.0/1-d8-ifort160-sse2      ls-dyna/13.0.0/1-f4-ifort190           ls-dyna/15.0.2/mpi-f4-ifort190-avx512
ls-dyna/11.1.0/1-f4-ifort160-sse2      ls-dyna/13.0.0/mpi-d8-ifort190-avx2    ls-dyna/15.0.2/mpi-f4-ifort190-sse2
ls-dyna/11.2.0/1-d8-ifort160           ls-dyna/13.0.0/mpi-d8-ifort190-sse2    ls-dyna/16.0.0/1-d8-aocc420-avx2
ls-dyna/11.2.0/1-f4-ifort160           ls-dyna/13.0.0/mpi-f4-ifort190-avx2    ls-dyna/16.0.0/1-d8-aocc420-avx512
ls-dyna/11.2.0/mpi-f4-ifort160-avx2    ls-dyna/13.0.0/mpi-f4-ifort190-sse2    ls-dyna/16.0.0/1-d8-ifort190-sse2
ls-dyna/11.2.0/mpi-f4-ifort160-sse2    ls-dyna/13.1.0/mpi-d8-aocc310-avx2     ls-dyna/16.0.0/1-f4-aocc420-avx2
ls-dyna/11.2.1/1-d8-ifort160           ls-dyna/13.1.0/mpi-d8-ifort190-avx2    ls-dyna/16.0.0/1-f4-aocc420-avx512
ls-dyna/11.2.1/1-f4-ifort160           ls-dyna/13.1.0/mpi-d8-ifort190-sse2    ls-dyna/16.0.0/1-f4-ifort190-sse2
ls-dyna/11.2.1/mpi-d8-ifort160-avx2    ls-dyna/13.1.0/mpi-f4-aocc310-avx2     ls-dyna/16.0.0/mpi-d8-aocc420-avx2
ls-dyna/11.2.1/mpi-d8-ifort160-sse2    ls-dyna/13.1.0/mpi-f4-ifort190-avx2    ls-dyna/16.0.0/mpi-d8-aocc420-avx512
ls-dyna/11.2.1/mpi-f4-ifort160-avx2    ls-dyna/13.1.0/mpi-f4-ifort190-sse2    ls-dyna/16.0.0/mpi-d8-ifort190-avx2
ls-dyna/11.2.1/mpi-f4-ifort160-sse2    ls-dyna/13.1.1/mpi-d8-ifort190-avx2    ls-dyna/16.0.0/mpi-d8-ifort190-avx512
ls-dyna/11.2.2/mpi-d8-ifort160-avx2    ls-dyna/13.1.1/mpi-d8-ifort190-sse2    ls-dyna/16.0.0/mpi-d8-ifort190-sse2
ls-dyna/11.2.2/mpi-d8-ifort160-sse2    ls-dyna/13.1.1/mpi-f4-ifort190-avx2    ls-dyna/16.0.0/mpi-f4-aocc420-avx2
ls-dyna/11.2.2/mpi-f4-ifort160-avx2    ls-dyna/13.1.1/mpi-f4-ifort190-sse2    ls-dyna/16.0.0/mpi-f4-aocc420-avx512
ls-dyna/11.2.2/mpi-f4-ifort160-sse2    ls-dyna/14.0.0/1-d8-ifort190           ls-dyna/16.0.0/mpi-f4-ifort190-avx2
ls-dyna/12.1.0/1-d8-ifort160           ls-dyna/14.0.0/1-f4-ifort190           ls-dyna/16.0.0/mpi-f4-ifort190-avx512
ls-dyna/12.1.0/1-f4-aocc310            ls-dyna/14.0.0/mpi-d8-aocc310-avx2     ls-dyna/16.0.0/mpi-f4-ifort190-sse2
ls-dyna/12.1.0/1-f4-ifort160           ls-dyna/14.0.0/mpi-d8-ifort190-avx2    ls-dyna/16.1.0/mpi-d8-aocc420-avx2
ls-dyna/12.1.0/mpi-d8-aocc310-avx2     ls-dyna/14.0.0/mpi-d8-ifort190-sse2    ls-dyna/16.1.0/mpi-d8-aocc420-avx512
ls-dyna/12.1.0/mpi-d8-ifort160-avx2    ls-dyna/14.0.0/mpi-f4-ifort190-avx2    ls-dyna/16.1.0/mpi-d8-ifort190-avx2
ls-dyna/12.1.0/mpi-d8-ifort160-sse2    ls-dyna/14.0.0/mpi-f4-ifort190-sse2    ls-dyna/16.1.0/mpi-d8-ifort190-avx512
ls-dyna/12.1.0/mpi-f4-aocc310-avx2     ls-dyna/14.1.0/1-d8-ifort190-sse2      ls-dyna/16.1.0/mpi-d8-ifort190-sse2
ls-dyna/12.1.0/mpi-f4-ifort160-avx2    ls-dyna/14.1.0/1-f4-ifort190-sse2      ls-dyna/16.1.0/mpi-f4-aocc420-avx2
ls-dyna/12.1.0/mpi-f4-ifort160-sse2    ls-dyna/14.1.0/mpi-d8-aocc400-avx2     ls-dyna/16.1.0/mpi-f4-aocc420-avx512
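The naming scheme of the modules listed above can be unpacked mechanically, which may help when comparing variants of the same version. The following is a minimal sketch using only shell parameter expansion; the function name describe_lsdyna_module is made up for this example and is not a tool installed on ARROW. Older SMP modules carry no instruction set suffix, which the sketch reports as "unspecified".

```shell
# Split a module name such as ls-dyna/14.2.0/mpi-d8-ifort190-avx512
# into its parts. describe_lsdyna_module is a hypothetical helper name.
describe_lsdyna_module() (
    name=$1
    variant=${name##*/}        # mpi-d8-ifort190-avx512
    rest=${name%/*}            # ls-dyna/14.2.0
    version=${rest#*/}         # 14.2.0

    parallel=${variant%%-*}    # "1" or "mpi"
    rest=${variant#*-}         # d8-ifort190-avx512
    precision=${rest%%-*}      # f4 or d8
    rest=${rest#*-}            # ifort190-avx512 (or just the compiler)
    compiler=${rest%%-*}       # ifort190
    case $rest in
        *-*) isa=${rest#*-} ;; # avx512
        *)   isa=unspecified ;;
    esac

    case $parallel in
        1)   parallel=SMP ;;
        mpi) parallel=MPP ;;
    esac
    case $precision in
        f4) precision=single ;;
        d8) precision=double ;;
    esac

    echo "version=$version parallel=$parallel precision=$precision compiler=$compiler isa=$isa"
)

describe_lsdyna_module ls-dyna/16.1.0/mpi-d8-ifort190-avx2
# prints: version=16.1.0 parallel=MPP precision=double compiler=ifort190 isa=avx2
```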
Submitting an LS-Dyna Job
IMPORTANT NOTE: The job/queue manager can now track the number of LS-Dyna licenses given out to individual jobs. At submission time, it is not possible to know what software a user may run. But by adding the clause "-l dynalic" at submission time, the queue manager can calculate the total number of cores required and keep track of LS-Dyna licenses used by the job. When loading a version of LS-Dyna, a check will be performed, and LS-Dyna will be prevented from running if the "-l dynalic" clause was not used when submitting the job.
Furthermore, careful consideration should be given to the choice of resources for an LS-Dyna job. With 64 cores available on a single node in the "epyc1" and "epyc2" queues, it may be counterproductive to run a job on two nodes instead of a single node. Users should run their jobs with different numbers of nodes and check whether performance actually increases; it may well decrease when running a job on two or more nodes. The outcome of such tests will show what the best allocation of resources is.
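One way to organize such a test is to submit the same job script several times with a different node count each time. The sketch below only prints the qsub commands so they can be reviewed before anything is submitted. The helper name scaling_commands and the job names are made up for this example, the resource string is copied from the typical LS-Dyna request on this page, and we assume that "-l" options given on the qsub command line take precedence over "#PBS -l" directives inside the script.

```shell
# Print (do not run) one qsub command per requested node count.
# scaling_commands is a hypothetical helper name for this example.
scaling_commands() (
    script=$1
    shift
    for nodes in "$@"; do
        echo "qsub -N scale_${nodes}n -l select=${nodes}:ncpus=64:mpiprocs=64:mem=225G,dynalic ${script}"
    done
)

scaling_commands job.sh 1 2
```

Comparing the elapsed times reported at the end of each run then shows whether the second node pays off for the model in question.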
Most users use a job script like the following. All methods for job submission from the previous chapters apply as well, so there is a lot of flexibility:
#!/bin/bash
#
#PBS -q epyc1
#PBS -l walltime=12:0:0
#PBS -l select=2:ncpus=64:mpiprocs=64:mem=225G,dynalic
#PBS -N JobName
#PBS -e log.error
#PBS -o log.output
#
cd $PBS_O_WORKDIR
#
module load ls-dyna/12.2.1/mpi-f4-ifort160-avx2
module load dynamore/current
#
mpirun ls-dyna i=main.k memory1=300m memory2=100m > dyna.log
#
# when using the Dynamore tools, you can start something like this at the end
DM.plotcprs.lnx -merge >> dyna.log
#
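Because LS-Dyna will refuse to start when the "dynalic" clause is missing, it can be worth checking a job script before it waits in the queue. The following is a minimal sketch and not part of the cluster tooling: the function name has_dynalic is made up for this example, and it only inspects "#PBS -l" lines inside the script, so a clause passed directly on the qsub command line would not be seen.

```shell
# Succeed if the script contains a "#PBS -l" directive that includes dynalic.
# has_dynalic is a hypothetical helper name for this example.
has_dynalic() (
    grep -Eq '^#PBS[[:space:]]+-l[[:space:]]+(.*,)?dynalic(,|[[:space:]]|$)' "$1"
)

# Example: only submit when the clause is present.
# if has_dynalic job.sh; then qsub job.sh; else echo "missing -l dynalic" >&2; fi
```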
LSTC Tools: LS-OPT and LS-PREPOST
For the new Rocky 9 cluster, I have not looked deeply into the ls-opt and ls-prepost packages that were installed. I noticed though that the LSTC server provided access to much newer versions of both software packages. If you would like to learn more or have a specific version in mind, I can easily download and install it for you.
$ module avail ls-opt
----------------------------------------------- /shared/apps/modulefiles ------------------------------------------------
ls-opt/5.1.1  ls-opt/6.0.0  ls-opt/7.0.0  ls-opt/7.0.2   ls-opt/2022R2
ls-opt/5.2.1  ls-opt/6.1.0  ls-opt/7.0.1  ls-opt/2022R1  ls-opt/2023R1
To start the software, type:
lsopt
$ module avail ls-prepost
----------------------------------------------- /shared/apps/modulefiles ------------------------------------------------
ls-prepost/4.5.10  ls-prepost/4.8.13  ls-prepost/4.8.30  ls-prepost/4.9.16  ls-prepost/4.10.7
To start the software, type:
lsprepost
Dynamore Software
The Dynamore tools are available as a module:
module load dynamore/current
We typically acquire a yearly license for the tools as we purchase licenses for LS-Dyna.
Vendor License File Installation
If you would like us to install a vendor license for LS-Dyna models, please contact us for the required information. We can send you the general LS-Dyna license file to show the host ids for the license server. Using that information, your vendor should be able to create a vendor license file. Please send that file to us by email or by other means.
StarCCM+ on the ARROW Cluster
Currently Available StarCCM+ Versions
As of late 2025, we have the following versions of StarCCM+ available on the cluster:
$ module avail starccm
---------------------------- /shared/apps/modulefiles ----------------------------
starccm/15.02.007-R8  starccm/16.06.008-R8  starccm/18.06.006-R8
starccm/15.02.009-R8  starccm/17.02.007-R8  starccm/19.02.009-R8
starccm/15.04.008-R8  starccm/17.02.008-R8  starccm/20.04.007-R8
starccm/15.06.008-R8  starccm/17.04.007-R8  starccm/20.06.007-R8
starccm/16.02.008-R8  starccm/17.06.007-R8
starccm/16.04.007-R8  starccm/18.04.008-R8
If using a single node for StarCCM+, job submission (for an interactive job) is simple and will use appropriate default settings:
qsub -I -X -q epyc1 -l walltime=20:00:00
StarCCM+ can make use of the job scheduler attributes by automatically obtaining the number of cores and other resources from OpenPBS. In this case, the default number of cores and mpi processes for StarCCM+ are both 64 when using the epyc1 queue. So you can start your StarCCM+ run with:
module load starccm/15.02.007-R8   # or any other version
starccm+ -bs pbs
In this case, there is no need for StarCCM+ to be told to run the case in parallel with the selected number of cores/mpiprocs.
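To see what the scheduler actually handed to a job, you can inspect the node file from inside the job. OpenPBS names this file in the PBS_NODEFILE environment variable and typically writes one line per MPI process. The sketch below summarizes it; the helper name summarize_nodefile is made up for this example, and that StarCCM+ arrives at the same counts is our assumption rather than something verified here.

```shell
# Count distinct hosts and total process slots in a PBS node file.
# summarize_nodefile is a hypothetical helper name for this example.
summarize_nodefile() (
    nodefile=${1:-$PBS_NODEFILE}
    procs=$(grep -c . "$nodefile")              # non-empty lines = process slots
    hosts=$(sort -u "$nodefile" | grep -c .)    # distinct host names
    echo "hosts=${hosts} procs=${procs}"
)

# Inside an interactive job:
#   summarize_nodefile
```

For a default single node job in the epyc1 queue this would be expected to report one host and 64 process slots.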
This can get a bit more complex when running on multiple nodes or when requesting high memory nodes. In that case you would use job submission parameters as shown below:
qsub -I -X -q epyc1 -l walltime=20:00:00,select=2:ncpus=64:mpiprocs=64:mem=500GB
For this request to be satisfied, two nodes with these attributes must exist. We have multiple nodes with 512GB in the epyc1 queue, so this job will run on two machines that each have at least the required amount of memory installed. The job will be queued until two such machines are available. If no machines with these resources exist, the job will stay in the queue forever. Therefore, you have to craft the submission string carefully.
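Before submitting a large memory request, it can help to check how many nodes could satisfy it at all. The following sketch counts nodes by memory from the output of "pbsnodes -a". It assumes that each node reports a line of the form "resources_available.mem = <number>kb", which is what we would expect from OpenPBS but should be checked on the system. The helper name count_nodes_with_mem is made up for this example, and it does not filter by queue.

```shell
# Read "pbsnodes -a" output on stdin and count nodes whose available
# memory is at least MIN_GB. count_nodes_with_mem is a hypothetical name.
count_nodes_with_mem() (
    min_gb=$1
    awk -v min="$min_gb" '
        $1 == "resources_available.mem" && $2 == "=" {
            value = $3
            sub(/kb$/, "", value)                 # strip the unit suffix
            if (value / 1024 / 1024 >= min) count++
        }
        END { print count + 0 }
    '
)

# Usage on the cluster:
#   pbsnodes -a | count_nodes_with_mem 500
```

If the count is lower than the number of chunks in the select statement, the request cannot be satisfied and the job would wait indefinitely.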
To accommodate high memory jobs, the nodes have been assigned priorities. Low memory nodes have the highest priority and are given out first to jobs whose requests they can accommodate. High memory nodes have the lowest priority, meaning that they are the last ones given out to users. This makes it more likely that a high memory job can run soon when the cluster is moderately loaded with jobs.
StarCCM+ will always use the Intel MPI fabric. Other MPI versions do not work, even when selected on the command line.
OpenFOAM on the ARROW Cluster
Currently Available OpenFOAM Versions
As of late 2025, we have the following versions of OpenFOAM available on the cluster:
$ module avail openfoam
------------ /shared/apps/modulefiles ------------
openfoam/9   openfoam/13      openfoam/v2312
openfoam/10  openfoam/13-amd  openfoam/v2406
openfoam/11  openfoam/v2212
openfoam/12  openfoam/v2306
Contact us if you encounter problems; there can be various reasons why OpenFOAM may have trouble on certain hardware or when compiling dynamic code. When loading OpenFOAM modules, a number of dependencies will be automatically loaded for you, and you don't have to load those yourself. For example:
$ module load openfoam/13
Loading openfoam/13
  Loading requirement: intel/2024.2.0/mpi/2021.13 gcc/gcc-12.1.0
$ module list
Currently Loaded Modulefiles:
 1) intel/2024.2.0/mpi/2021.13   2) gcc/gcc-12.1.0   3) openfoam/13
In this case, OpenFOAM 13 loads the Intel 2024 MPI module and the GCC 12.1 compiler. OpenFOAM was compiled from source specifically with that compiler and MPI version, so it makes little sense to use other compilers or MPI libraries.
Note: We have found a problem with running the Intel 2024 MPI library in the amd64 queue. Therefore, we have a modified module that uses the Intel 2022 library (I know -- Intel 2022 gives you the 2021 MPI libraries, but that is the way Intel distributes this software):
$ module load openfoam/13-amd
Loading mpi version 2021.7.0
Loading openfoam/13-amd
  Loading requirement: intel/2022.2.0/mpi/2021.7.0 gcc/gcc-12.1.0
$ module list
Currently Loaded Modulefiles:
 1) intel/2022.2.0/mpi/2021.7.0   2) gcc/gcc-12.1.0   3) openfoam/13-amd
If you are compiling OpenFOAM yourself, the modules are of little help. You would need to select the appropriate MPI version and compiler before doing so, and then consistently load them before running your OpenFOAM executables. Within the "etc/bashrc" file in the source code tree, you want to set the MPI library to INTELMPI. As usual with self-compiled versions of OpenFOAM, you would "source etc/bashrc" to set up your personal environment to run your home-brew version of OpenFOAM. Contact us if you need to learn more about compiling OpenFOAM on the system.
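As a rough sketch of the bashrc step: in the OpenFOAM source trees we are aware of, the MPI choice in "etc/bashrc" is a line of the form "export WM_MPLIB=...". Treat that variable name as something to verify in your own source tree. The helper name set_foam_mplib_intel is made up for this example.

```shell
# Point the WM_MPLIB line of an OpenFOAM etc/bashrc at INTELMPI.
# set_foam_mplib_intel is a hypothetical helper name for this example.
set_foam_mplib_intel() (
    bashrc=$1
    tmp=$(mktemp) || exit 1
    # Rewrite the line into a temporary file, then copy the result back
    # so that the permissions of the original file are kept.
    sed 's/^export WM_MPLIB=.*/export WM_MPLIB=INTELMPI/' "$bashrc" > "$tmp" &&
        cat "$tmp" > "$bashrc"
    status=$?
    rm -f "$tmp"
    exit "$status"
)

# Possible use from the top of the source tree, after loading the MPI and
# compiler modules you intend to build with, for example:
#   module load intel/2022.2.0/mpi/2021.7.0 gcc/gcc-12.1.0
#   set_foam_mplib_intel etc/bashrc
#   source etc/bashrc
```

The same modules then need to be loaded in every job script that runs the self-compiled executables.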
Additional Software Applications and Libraries
Loadable GCC Compiler Versions
The Rocky 9.6 operating system uses the GCC 11.5 compiler. That should be sufficient for most users compiling their own applications. If you need either a more up-to-date compiler or an older compiler for compatibility, we make the following versions available as loadable modules:
$ module avail gcc
------------ /shared/apps/modulefiles ------------
gcc/gcc-4.9.4  gcc/gcc-7.5.0  gcc/gcc-10.3.0
gcc/gcc-5.5.0  gcc/gcc-8.5.0  gcc/gcc-11.3.0
gcc/gcc-6.5.0  gcc/gcc-9.5.0  gcc/gcc-12.1.0
Additional versions can be created and made available as modules as well. If you need a specific version that is not currently available, please ask us to compile and install it. If necessary, we may be able to provide access to other compilers, for example LLVM. We do not provide access to proprietary compilers at this time.
MPI Libraries and Runtimes
While we seem to have a variety of MPI versions and flavors available to users, the only MPI versions that allow us to run software over Infiniband are the Intel MPI libraries. Some of the installed alternatives are likely to fail, or will have a set of environment variables that have to be set. All major engineering software packages that we offer are pre-configured with specific MPI versions and settings that have been tested and/or provided by the vendors.
Note: Some MPI libraries may seem to work and may allow your MPI application to run. However, inter-process communication may then travel through the rather slow, high-latency Ethernet fabric, which makes MPI applications very inefficient and probably not worthwhile.
MATLAB Runtimes
We can install MATLAB runtime libraries as needed and make them available as loadable modules. Recently, we had a problem with MATLAB runtime libraries being identified as security vulnerabilities. Contact us if you need them installed for one of your projects.
Anaconda and variants (miniconda etc)
Our current practice is to have users download and install their own versions of Anaconda and its variants in their own home directories. This allows for maximum flexibility when it comes to installable software modules, and users can handle installation, upgrades, and maintenance themselves. If you encounter issues, please contact us. One known side effect of Anaconda installations is a performance hit when starting your software, e.g. Python scripts may take 30 seconds or more to start. This is an artefact caused by the Lustre file system, which has been designed for large files accessible from many machines simultaneously. Reading many small files was not a design priority, and performance in that case is fairly poor. Again, contact us and we will design a solution for you as needed.