Job Submission and Monitoring
Revision as of 00:05, November 22, 2025
Resource Summary View
To get started, users can query the overall status of resources on the cluster. The "qsum" script will list all queues and nodes, as well as how many are offline, down, free, or assigned to users. This is a script developed by our team, and may need to be updated if something goes wrong. Please contact us if you experience any problems.
Each queue groups a number of nodes together based on their hardware and software configurations. Nodes can be part of more than one queue, and there are other complex details that we are ignoring here for the purpose of keeping it simple.
Queues
Here is a very brief summary of what each of the queues is, and how to use them efficiently:
- a4000
- This is a queue that has three 16-core CPU machines, each of which is furthermore equipped with three A4000 GPUs. That makes a total of 9 A4000 GPUs available to users. Neither the GPUs nor the processors are particularly powerful these days. The machines have 500GB of memory though, which makes for a good platform for experimenting with GPU capabilities.
- a6000
- This is a queue that has only one 64-core CPU machine, which is equipped with four A6000 GPUs. The system can be upgraded to 8 A6000 GPUs if needed. This is a decent GPU machine that can take a solid workload these days. The machine has 750GB of memory, which makes for a good production platform.
- amd16
- This is a queue with many of our older AMD-based 16-core machines, each of which has 30GB of memory. While individual machines are a bit outdated, they are all interconnected with Infiniband and can provide a solid production workload in multi-node jobs over MPI without blocking the more current (and thus expensive) systems.
- epyc1/epyc2
- These are 2 separate queues with slightly different performance characteristics. Each of the groups is interconnected with Infiniband to provide a platform for large and demanding software packages, such as LS-Dyna and StarCCM+. They have between 250GB and 500GB of memory. Because licenses for these software packages are very expensive, jobs that use them should run on these two queues to make optimum use of the limited core licenses available to each package.
- xeon28
- This is a set of intermediate machines with 28 cores and 64GB of memory. They can be used for a variety of purposes, including MPI jobs and single node application software.
- virtual
- This is a set of nodes without MPI capabilities. They are virtual machines with 32GB each. They can be used for higher demand applications that would interfere with the login nodes, and therefore with other users of these login machines. A user would submit interactive jobs to individual virtual machines and avoid any significant load on login nodes.
The Queue Summary Script (qsum)
$ qsum
=============== a4000 ==========================================================
Queue: "a4000" / nodes: 3 / down: 0 / offline: 0 / busy: 0 / available: 3
AVAILABLE (3): g001, g002, g003
=============== a6000 ==========================================================
Queue: "a6000" / nodes: 1 / down: 0 / offline: 0 / busy: 0 / available: 1
AVAILABLE (1): lambda01
=============== amd16 ==========================================================
Queue: "amd16" / nodes: 33 / down: 2 / offline: 0 / busy: 2 / available: 29
DOWN (2): n017, n030
ley (2): n001, n002
AVAILABLE (29): n003, n004, n005, n006, n007, n008, n009, n010, n011, n012
n013, n014, n015, n016, n018, n019, n020, n021, n022, n023
n024, n025, n026, n027, n028, n029, n031, n032, n039
=============== epyc1 ==========================================================
Queue: "epyc1" / nodes: 1 / down: 0 / offline: 0 / busy: 0 / available: 1
AVAILABLE (1): a027
=============== epyc2 ==========================================================
Queue: "epyc2" / nodes: 20 / down: 0 / offline: 0 / busy: 5 / available: 15
ley (2): a030, a031
msitek (3): a028, a029, a032
AVAILABLE (15): a033, a034, a035, a036, a037, a038, a039, a040, a041, a042
a043, a044, a045, a046, a047
=============== virtual ========================================================
Queue: "virtual" / nodes: 6 / down: 0 / offline: 0 / busy: 0 / available: 6
AVAILABLE (6): v001, v002, v003, v004, v005, v006
=============== xeon28 =========================================================
Queue: "xeon28" / nodes: 12 / down: 0 / offline: 0 / busy: 0 / available: 12
AVAILABLE (12): p001, p002, p003, p004, p005, p006, p007, p008, p009, p010
p011, p012
================================================================================
Queue Status and Monitoring Jobs
qstat
To find out about the status of all running jobs on the cluster, you can use the "qstat" command. Here is an example:
$ qstat
Nov 20 18:30 ley@login3:Plots$ qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
3023.pbs          STDIN            msitek            4144:14* R epyc2
3029.pbs          STDIN            ley               76:46:53 R epyc2
3032.pbs          STDIN            msitek            2879:52* R epyc2
3033.pbs          STDIN            msitek            3687:29* R epyc2
3048.pbs          foo.sh           james.cook               0 Q amd16
3060.pbs          of13.sh          ley               310:47:* R epyc2
3061.pbs          of13.sh          ley               308:37:* R epyc2
3062.pbs          of13.sh          ley               308:02:* R epyc2
3063.pbs          of13.sh          ley               308:15:* R epyc2
The first column shows the job id, a unique identifier for all jobs ever submitted to the cluster. This job id is important when killing jobs, or for other actions you may need to take.
The next column shows the name of the job script. If the column shows STDIN, it means that this is an interactive job where a user can enter commands in a terminal window. This is particularly useful for model and software development tasks where the application has to be started and killed repeatedly.
The owner of the job is shown next. These are the user names of the various people using the cluster.
The last three columns show the time used by the job so far, the job state (R for running, Q for waiting in the queue), and the queue to which the job was submitted.
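As a rough sketch of how these columns can be used in a script, the following shell function tallies jobs by state from the default "qstat" listing. The function name count_jobs_by_state is made up for this illustration and is not part of PBS, and the parsing assumes the six-column layout shown in the example above.

```shell
#!/bin/bash
# count_jobs_by_state: hypothetical helper, not part of PBS.
# Reads the default "qstat" listing on stdin and prints one line per
# job state (e.g. R, Q) followed by the number of jobs in that state.
count_jobs_by_state() {
    # Job rows start with a numeric job id. The state is the
    # second-to-last column and the queue is the last one.
    awk '$1 ~ /^[0-9]/ && NF >= 6 { n[$(NF-1)]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Small self-contained demo on sample text in the format shown above.
sample='Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
3029.pbs          STDIN            ley               76:46:53 R epyc2
3048.pbs          foo.sh           james.cook               0 Q amd16
3060.pbs          of13.sh          ley               310:47:* R epyc2'
printf '%s\n' "$sample" | count_jobs_by_state
```

On the cluster, the same function would presumably be fed directly from the scheduler with "qstat | count_jobs_by_state".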
qstat -an1
Adding a few options gives much more detail about each job.
qstat -an1
Nov 20 13:09 ley@login3:Plots$ qstat -an1
pbs:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
3023.pbs msitek epyc2 STDIN 24360* 1 64 350gb 100:0 R 81:46 a028/0*64
3029.pbs ley epyc2 STDIN 21719* 2 128 100gb 200:0 R 72:31 a030/0*64+a031/0*64
3032.pbs msitek epyc2 STDIN 18102* 1 64 350gb 100:0 R 57:57 a029/0*64
3033.pbs msitek epyc2 STDIN 830486 1 64 350gb 100:0 R 57:53 a032/0*64
3048.pbs james.c* amd16 foo.sh -- 1 28 30gb 01:00 Q -- --
3060.pbs ley epyc2 STDIN 763101 1 64 350gb 48:00 R 06:42 a033/0*64
3061.pbs ley epyc2 STDIN 763947 1 64 350gb 48:00 R 06:40 a034/0*64
3062.pbs ley epyc2 STDIN 761473 1 64 350gb 48:00 R 06:39 a035/0*64
3063.pbs ley epyc2 STDIN 766205 1 64 350gb 48:00 R 06:40 a036/0*64
In this table, you can see how many nodes and cores are being used by each job. For example, job 3029 of the user "ley" shows that it is running on 2 nodes using a total of 128 cores. In addition to the elapsed time, the table also shows the reserved time for this job. This allows you to estimate when a job will be finished at the latest (or killed by the system if it is still running).
The last column (without a header) is written because the option "-an1" was used. This is useful to learn which nodes are used by each job.
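The node list in that last column follows the pattern node/slot*cores, with several nodes joined by a plus sign. A minimal sketch for turning such a string into a plain list of node names is shown here. The function name nodes_from_exec_host is an assumption made for this example and is not a PBS command.

```shell
#!/bin/bash
# nodes_from_exec_host: hypothetical helper, not part of PBS.
# Takes a string such as "a030/0*64+a031/0*64" (the unlabeled last
# column of "qstat -an1") and prints one node name per line.
nodes_from_exec_host() {
    # 1. split the string at "+" so each node chunk is on its own line
    # 2. drop everything after the node name (from the "/" onwards)
    # 3. list each node only once
    printf '%s\n' "$1" | tr '+' '\n' | sed 's|/.*$||' | sort -u
}

# Example with the value shown for job 3029 above.
nodes_from_exec_host "a030/0*64+a031/0*64"
```

The resulting names can then be used, for example, to look up the same nodes in the "qsum" listing.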
qstat -q
To learn more about the queues on the cluster, the "-q" option turns out to be useful.
$ qstat -q
server: pbs
Queue Memory CPU Time Walltime Node Run Que Lm State
---------------- ------ -------- -------- ---- ----- ----- ---- -----
virtual 30gb -- -- 1 0 0 -- E R
a4000 500gb -- -- 1 0 0 -- E R
a6000 750gb -- -- 1 0 0 -- E R
xeon28 -- -- -- 4 0 0 -- E R
amd16 -- -- -- 8 0 1 -- E R
epyc2 -- -- -- 2 14 0 -- E R
epyc1 -- -- -- 2 0 0 -- E R
----- -----
14 1
For each queue, some basic values are displayed. The first three queues listed above have a default memory allocation as shown, and the column "Node" indicates the maximum number of nodes that can be requested at job submission time. For example, you can request just one node for a job from the first three queues (because these are queues where MPI makes no sense). The "xeon28" queue allows for a maximum of 4 nodes per MPI job. The "amd16" queue has a maximum of 8 nodes per job, and the "epyc1" and "epyc2" queues have maxima of two nodes per job. These limits prevent inefficient resource requests and can be changed by the administrator as needed.
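Because the limits may change over time, a script might read the current value from the "qstat -q" listing instead of hard-coding it. The following is only a sketch: the function name queue_node_limit is hypothetical, and it assumes that the "Node" value is the fifth whitespace-separated column, as in the listing above.

```shell
#!/bin/bash
# queue_node_limit: hypothetical helper, not part of PBS.
# Reads "qstat -q" output on stdin and prints the value of the "Node"
# column (fifth column) for the queue named in the first argument.
queue_node_limit() {
    awk -v q="$1" '$1 == q { print $5 }'
}

# Self-contained demo on a shortened copy of the listing above.
sample='Queue            Memory CPU Time Walltime Node   Run   Que   Lm  State
---------------- ------ -------- -------- ---- ----- ----- ----  -----
virtual            30gb    --       --       1     0     0   --   E R
xeon28              --     --       --       4     0     0   --   E R
amd16               --     --       --       8     0     1   --   E R'
printf '%s\n' "$sample" | queue_node_limit amd16
```

On the cluster, the call would presumably look like "qstat -q | queue_node_limit xeon28".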
qstat -f
The command "qstat -f -F json 3029" retrieves extremely detailed stats on the running job 3029. The result can be returned in JSON format to be ready for further processing (shown below).
$ qstat -f -F json 3029
{
"timestamp":1763705353,
"pbs_version":"23.06.06",
"pbs_server":"pbs",
"Jobs":{
"3029.pbs":{
"Job_Name":"STDIN",
"Job_Owner":"ley@login4",
"resources_used":{
"cpupercent":98,
"cput":"76:46:53",
"hpmem":"0b",
"mem":"52428800kb",
"ncpus":128,
"vmem":"52428800kb",
"walltime":"78:09:32"
},
"job_state":"R",
"queue":"epyc2",
"server":"pbs",
"Checkpoint":"u",
"ctime":"Mon Nov 17 17:58:25 2025",
"Error_Path":"/dev/pts/0",
"exec_host":"a030/0*64+a031/0*64",
"exec_vnode":"(a030:ncpus=64:mem=52428800kb)+(a031:ncpus=64:mem=52428800kb)",
"Hold_Types":"n",
"interactive":"True",
"Join_Path":"n",
"Keep_Files":"n",
"Mail_Points":"a",
"mtime":"Fri Nov 21 00:07:59 2025",
"Output_Path":"/dev/pts/0",
"Priority":0,
"qtime":"Mon Nov 17 17:58:25 2025",
"Rerunable":"False",
"Resource_List":{
"mem":"100gb",
"mpiprocs":128,
"ncpus":128,
"nodect":2,
"place":"free",
"select":"2:ncpus=64:mem=50gb:mpiprocs=64",
"walltime":"200:00:00"
},
"stime":"Mon Nov 17 17:58:25 2025",
"session_id":2171964,
"jobdir":"/mnt/lustre/arrow/home/ley",
"substate":42,
"Variable_List":{
"PBS_O_HOME":"/mnt/lustre/arrow/home/ley",
"PBS_O_LANG":"en_US.UTF-8",
"PBS_O_LOGNAME":"ley",
"PBS_O_PATH":"/shared/apps/active/lstc/lsprepost/SP-4.5:/shared/apps/active/lstc/lsprepost/DP-4.3:/shared/bin:/usr/share/Modules/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/pbs/bin:/opt/thinlinc/bin:/opt/thinlinc/sbin:/mnt/lustre/arrow/home/ley/.local/bin:/mnt/lustre/arrow/home/ley/bin",
"PBS_O_MAIL":"/var/spool/mail/ley",
"PBS_O_SHELL":"/bin/bash",
"PBS_O_WORKDIR":"/mnt/lustre/arrow/home/ley/Qualification/LS-Dyna/Rocky9/Seatbelt/Template",
"PBS_O_SYSTEM":"Linux",
"PBS_O_QUEUE":"epyc2",
"PBS_O_HOST":"login4"
},
"comment":"Job run at Mon Nov 17 at 17:58 on (a030:ncpus=64:mem=52428800kb)+(a031:ncpus=64:mem=52428800kb)",
"etime":"Mon Nov 17 17:58:25 2025",
"run_count":1,
"Submit_arguments":"-I -q epyc2 -l walltime=200:00:00,select=2:ncpus=64:mem=50gb:mpiprocs=64",
"project":"_pbs_project_default",
"Submit_Host":"login4"
}
}
}
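The JSON output contains both the requested walltime (under "Resource_List") and the walltime used so far (under "resources_used"). As a small worked example, the sketch below computes how much time a job has left from those two values. Both function names are made up for this illustration. Extracting the values from the JSON is left out here, since that depends on which JSON tools are installed.

```shell
#!/bin/bash
# Hypothetical helpers, not part of PBS.

# walltime_to_seconds HH:MM:SS -> number of seconds
walltime_to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# remaining_walltime REQUESTED USED -> time left as H:MM:SS
remaining_walltime() {
    req=$(walltime_to_seconds "$1")
    used=$(walltime_to_seconds "$2")
    awk -v r="$req" -v u="$used" 'BEGIN {
        left = r - u
        if (left < 0) left = 0          # job is already over its limit
        printf "%d:%02d:%02d\n", int(left / 3600), int((left % 3600) / 60), left % 60
    }'
}

# Values taken from the JSON output above for job 3029.
remaining_walltime "200:00:00" "78:09:32"
```

For job 3029, this gives 121:50:28, which is the time left before the scheduler would end the job.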
Manual pages for qstat
To learn more about the "qstat" command, you can use the command "man qstat", which will print a lot of detailed information about the capabilities of this command.
$ man qstat
qstat(1B) PBS Professional qstat(1B)
NAME
qstat - display status of PBS jobs, queues, or servers
SYNOPSIS
Displaying Job Status
Default format:
qstat [-E] [-J] [-p] [-t] [-w] [-x] [[<job ID> | <destination>] ...]
Long format:
qstat -f [-F json | dsv [-D <delimiter>]] [-E] [-J] [-p] [-t] [-w]
[-x] [[<job ID> | <destination>] ...]
... <many more pages>
Job Submission Basics
Jobs are submitted into the system using the "qsub" application. This application can take many different options and allows for a lot of different resource requests to tell the cluster what to do. We are running OpenPBS 23.06.06 as our job scheduler. Here is a link to the User's Manual (of PBS PRO) if you want to explore gory details and capabilities. The User's Guide has about 240 pages, the Reference Guide has 500 pages, and the Big Book has 2500 pages. So there is a lot of information available. I also added job submission info for the LCRC cluster.
- Argonne's LCRC pages on job submissions on their clusters
- PBS Professional 2022.1 User's Guide
- PBS Professional 2022.1 Reference Guide
- Altair PBS Professional 2022.1 Big Book
The User's Guide can be very helpful to clarify some of the concepts and capabilities, but it can be hard to find the specific information you may be looking for. Please understand that we are no longer running TORQUE and MAUI, so the syntax for job submission is distinctly different, even though the concepts are quite similar.
The reference guide may be helpful to understand the complete syntax and full capabilities of the software.
The big book is what I had to use when configuring OpenPBS earlier this year. This includes all the tricky details needed to make the system work smoothly for us. It's a bit scary to look at a PDF file that is 2500 pages long, but that is nothing compared to the StarCCM+ manuals.
IMPORTANT NOTE: The following sections are important to understand. They explain how jobs are submitted and then scheduled for execution based on the resources available and the specific needs of the user.
The following sections explain the various tasks you may want to submit for execution, ordered from simple to complex.
- General Batch Jobs
- Requesting a Single Node for a Job
- Requesting Multiple Nodes for a Job
- Embedded Job Resource Requests
- Interactive Jobs
- Interactive Jobs with X-Windows GUI Applications
- Running Multiple Jobs on Single Nodes
- Running Jobs using GPUs
General Batch Jobs
Let's get started with a very basic usage of the system. Let's assume you have a simple application, and you want to execute it on a cluster node. Let's also assume that this is a very simple application, one that runs on one or a few cores, doesn't require any keyboard interaction with the user, doesn't need the user to see what's typically written to the screen, and writes its output to files. In this case, we can submit this application as a batch job, which will place it into an execution queue and process it as soon as a node becomes available.
If the application requires more cores than a single node can provide, we can run the application over Infiniband with MPI message passing. In this case, we need to understand the concept of MPI applications a bit better. In both cases, we get started by creating a folder on the file system. Naming conventions are important so that you can distinguish the jobs by folder name.
For both of the above scenarios, you would typically create a Bash shell script, and then submit the script into one of the queues for eventual execution.
Requesting a Single Node for a Job
Let's try something rather trivial to get used to the concept. Create yourself a folder, for example "myjobfolder". Within that folder, create a job submission script. That script can have any name, but something short and simple may be best. Let's assume you create a file called "cluster.job". The file doesn't have to have that extension. Any file name will do. But using the same filename for all of your jobs helps you find your way around the many files that will be created over time. The "cluster.job" file should look something like this:
#!/bin/bash
#
# the following ensures that you will change into the directory where you are
# submitting the job from.
cd $PBS_O_WORKDIR
#
# now we sleep for 60 seconds and waste time. This is a placeholder for your application,
# which would be doing useful work for you.
sleep 60
#
# and after doing things, we may want to write something into a file to show that
# our job is done.
echo `date` > info.log
#
This can be submitted without detailed resource specifications (except for the walltime, which is by default 0:00:00):
$ qsub -q virtual -l walltime=1:00:00 cluster.job
3072.pbs
Wait a little, then check the status of running jobs:
$ qstat -an1
pbs:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
3023.pbs msitek epyc2 STDIN 24360* 1 64 350gb 100:0 R 83:17 a028/0*64
3029.pbs ley epyc2 STDIN 21719* 2 128 100gb 200:0 R 74:00 a030/0*64+a031/0*64
3033.pbs msitek epyc2 STDIN 830486 1 64 350gb 100:0 R 59:23 a032/0*64
3048.pbs james.c* amd16 foo.sh -- 1 28 30gb 01:00 Q -- --
3060.pbs ley epyc2 STDIN 763101 1 64 350gb 48:00 R 08:10 a033/0*64
3061.pbs ley epyc2 STDIN 763947 1 64 350gb 48:00 R 08:10 a034/0*64
3070.pbs ley epyc2 STDIN 766847 1 64 350gb 48:00 R 07:23 a042/0*64
3072.pbs ley virtual cluster.j* 230230 1 4 30gb 01:00 E 00:01 v001/0*4
In this particular example, we are sending this job to the queue "virtual". This queue, by default, allocates 30GB of memory to the job, and runs on 1 node with 4 cores. This is sufficient capacity to run quite a workload. When submitting a job to a single node, reasonable maximum allocations are automatically assigned, and users don't have to worry about running out of memory or how many cores they will be using.
The only required argument is the "walltime" argument. Without it, the default walltime of 0:00:00 applies, and the job will quit as soon as it is submitted. This indicates to users that they forgot to provide the "walltime" argument.
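Since a missing walltime is an easy mistake to make, a submission wrapper could check for it up front. The following is a minimal sketch under stated assumptions: the function name has_walltime is hypothetical, and it only inspects the command-line arguments, so a walltime given through a #PBS line inside the job script would not be detected.

```shell
#!/bin/bash
# has_walltime: hypothetical helper, not part of PBS.
# Succeeds if any of the given qsub arguments contains "walltime=".
has_walltime() {
    for arg in "$@"; do
        case "$arg" in
            *walltime=*) return 0 ;;
        esac
    done
    return 1
}

# Example: check a command line before handing it to qsub.
if has_walltime -q virtual -l walltime=1:00:00 cluster.job; then
    echo "walltime given"
else
    echo "walltime missing: the job would end right after submission"
fi
```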
When the job disappears from the job list, it is done. At this point, you will find the file "info.log" in your job folder.
$ cat info.log
Thu Nov 20 08:00:31 PM CST 2025
Requesting Multiple Nodes for a Job
To run jobs on multiple nodes, you will likely be executing jobs using MPI, the message passing interface. This establishes high-speed, low-latency interconnections between the cores on one machine and the cores on the other machines. Data transfer does not require involvement of the cores themselves. Instead, the cores tell the Infiniband interconnect (and cores on the same node through shared memory) to transfer the data through RDMA, remote direct memory access. The cores don't need to spend CPU cycles on copying data, but rather simply access the data once it has been copied by the Infiniband fabric. This makes for extremely efficient remote memory access, and message passing is used to coordinate data transfer between the cores no matter where they are located on any of the nodes.
On our cluster, MPI-aware applications like OpenFOAM, StarCCM+, and LS-Dyna can be loaded as modules, which then automatically select the most appropriate MPI library to use. The software applications have been tested to ensure that they work out of the box if a user selects any specific version of any of the applications.
The following is a very trivial example for the MPI execution of a very simple executable, with one copy running on each core of the nodes allocated to the job. It doesn't perform any real work and just wastes resources for a short time, but it illustrates how execution on the cores of various nodes works.
Like in the previous section, we start with a simple job script that we submit to an appropriate queue. In this case, we pick a queue that has machines with Infiniband interfaces supporting efficient communications. Let's assume we edit a file with the name "parallel.job" like this:
#!/bin/bash
#
# the following ensures that you will change into the directory where you are
# submitting the job from.
cd $PBS_O_WORKDIR
#
# to execute a simple command on all of the cores of all of the nodes allocated to the job,
# we need to make one of the MPI versions available. Let's use one of the most up-to-date
# MPI libraries available on the cluster
module load intel/2024.2.0/mpi/2021.13
#
# now we apply a few settings that ensure that the MPI library will use the highest-performing
# Infiniband Interconnect, as well as a few options to tell MPI how to interface nodes with
# each other and which specific Infiniband adapter to use. This is complex and requires in-depth
# knowledge of the QLogic Infiniband adapters we are using. It is unlikely that you will ever have to
# deal with these options, because the "module load" command for the engineering applications we provide
# on ARROW will handle all those details transparently without the user needing to understand the details.
export I_MPI_HYDRA_BOOTSTRAP=ssh
export MPI_DEVICE=rdma:ofa-v2-ib0
export UCX_NET_DEVICES=qib0:1
#
# it doesn't make much sense, but in this example we are executing the OS command "uptime" on all cores
# of the nodes allocated to this job. The output from each core is written to the file info.log. We
# will find 56 lines of output in the file info.log, each created by the corresponding core executing
# the uptime command.
mpirun uptime > info.log
#
A good queue to test scripts is the "xeon28" queue. In this queue, each node has two 14-core Xeon processors, which means that each node has 28 physical cores. We do not consider hyperthreading when doing parallel computing. With two nodes requested, 56 physical cores are being used here. The job submission will look like this:
qsub -q xeon28 -l walltime=1:0:0 -l select=2:ncpus=28:mpiprocs=28:mem=60G parallel.job
^ ^ ^ ^ ^ ^ ^ ^
| | | | | | | + --- the name of the job script to execute
| | | | | | + ----- don't forget to specify gigabytes
| | | | | + ------- the amount of memory to request per node
| | | | + -------------- the number of MPI tasks per node
| | | + -------------------------- the number of cores per node
| | + ---------------------------------- the number of nodes to select in the queue
| + ------------------------------------------------- the requested time, in this case 1h
+ --------------------------------------------------------------------- the queue to be used for the job
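The parts of the select statement annotated above can also be assembled by a small helper, which makes the arithmetic behind the 56 MPI tasks explicit (2 nodes times 28 tasks per node). This is only a sketch: build_select and total_mpi_tasks are names invented for this example, and the helper assumes one MPI task per core and memory given in gigabytes.

```shell
#!/bin/bash
# Hypothetical helpers, not part of PBS.

# build_select NODES CORES_PER_NODE MEM_GB
# Prints a select statement with one MPI task per core.
build_select() {
    printf 'select=%s:ncpus=%s:mpiprocs=%s:mem=%sG\n' "$1" "$2" "$2" "$3"
}

# total_mpi_tasks NODES TASKS_PER_NODE
# Prints the total number of MPI tasks the job will start.
total_mpi_tasks() {
    echo $(( $1 * $2 ))
}

build_select 2 28 60      # select=2:ncpus=28:mpiprocs=28:mem=60G
total_mpi_tasks 2 28      # 56
```

The generated string could then be passed to the submission command, for example as qsub -q xeon28 -l walltime=1:0:0 -l "$(build_select 2 28 60)" parallel.job.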
Once the job has finished, it will have created a file "info.log" with 56 lines, one per core:
22:26:05 up 23 days, 9:28, 0 users, load average: 0.00, 0.00, 0.00
22:26:05 up 23 days, 9:28, 0 users, load average: 0.00, 0.00, 0.00
22:26:05 up 23 days, 9:28, 0 users, load average: 0.00, 0.00, 0.00
22:26:05 up 23 days, 9:28, 0 users, load average: 0.00, 0.00, 0.00
22:26:05 up 23 days, 9:28, 0 users, load average: 0.00, 0.00, 0.00
22:26:05 up 23 days, 9:28, 0 users, load average: 0.00, 0.00, 0.00
22:26:05 up 23 days, 9:53, 0 users, load average: 0.06, 0.03, 0.00
22:26:05 up 23 days, 9:53, 0 users, load average: 0.06, 0.03, 0.00
22:26:05 up 23 days, 9:53, 0 users, load average: 0.06, 0.03, 0.00
...
In this simple example, the lines all look the same. Upon close examination though, you can find slightly different values for some of the lines. Some lines say that the machine has been up for 23 days and 9:28, while others say 23 days and 9:53. Because all 28 cores of a node see the same uptime of the server, half of the entries show one time stamp, and the other 28 show the other one. That demonstrates that the 56 processes have been running independently on 2 nodes.
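Rather than reading through all 56 lines, the check can be done by counting how often each distinct line occurs. The sketch below does that with standard tools. The function name lines_per_node is made up for this example, and the approach assumes that all processes on one node report exactly the same "uptime" line, as they do in the output above.

```shell
#!/bin/bash
# lines_per_node: hypothetical helper for this example only.
# Counts how often each distinct line occurs in a file. For the
# info.log above, identical "uptime" lines come from the same node.
lines_per_node() {
    sort "$1" | uniq -c | awk '{ print $1 }'
}

# Demo with a small stand-in file (3 lines from one node, 2 from another).
tmp=$(mktemp)
printf '%s\n' "up 23 days, 9:28" "up 23 days, 9:28" "up 23 days, 9:28" \
              "up 23 days, 9:53" "up 23 days, 9:53" > "$tmp"
lines_per_node "$tmp"
rm -f "$tmp"
```

For the real info.log, one would expect two counts of 28 each. Running "hostname" instead of "uptime" in the job script would likely give a more direct picture of which node each process ran on.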
Embedded Job Resource Requests
The job script can be modified to embed the resource requests in the form of a series of #PBS statements at the beginning of the script file. This is a very common practice used at many HPC installations and job submission engines. Let's go back to the previous example where we run the script on two nodes in parallel. That is the "parallel.job" script file again:
#!/bin/bash
#
#PBS -q xeon28
#PBS -l walltime=1:0:0
#PBS -l select=2:ncpus=28:mpiprocs=28:mem=60G
#
# the following ensures that you will change into the directory where you are
# submitting the job from.
cd $PBS_O_WORKDIR
#
# to execute a simple command on all of the cores of all of the nodes allocated to the job,
# we need to make one of the MPI versions available. Let's use one of the most up-to-date
# MPI libraries available on the cluster
module load intel/2024.2.0/mpi/2021.13
#
# now we apply a few settings that ensure that the MPI library will use the highest-performing
# Infiniband Interconnect, as well as a few options to tell MPI how to interface nodes with
# each other and which specific Infiniband adapter to use. This is complex and requires in-depth
# knowledge of the QLogic Infiniband adapters we are using. It is unlikely that you will ever have to
# deal with these options, because the "module load" command for the engineering applications we provide
# on ARROW will handle all those details transparently without the user needing to understand the details.
export I_MPI_HYDRA_BOOTSTRAP=ssh
export MPI_DEVICE=rdma:ofa-v2-ib0
export UCX_NET_DEVICES=qib0:1
#
# it doesn't make much sense, but in this example we are executing the OS command "uptime" on all cores
# of the nodes allocated to this job. The output from each core is written to the file info.log. We
# will find 56 lines of output in the file info.log, each created by the corresponding core executing
# the uptime command.
mpirun uptime > info.log
#
If the resource requests are embedded within the file, they don't have to be specified on the command line any longer (the command line overrides the embedded specifications though). This may be convenient, because all the user has to do for job submission is the following:
qsub parallel.job
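Before submitting a script with embedded requests, it can be handy to see at a glance which requests it actually contains. The following sketch prints the arguments of all #PBS lines in a job script. The function name list_pbs_directives is an assumption made for this example and is not a PBS command.

```shell
#!/bin/bash
# list_pbs_directives: hypothetical helper, not part of PBS.
# Prints the arguments of all "#PBS" lines found in a job script.
list_pbs_directives() {
    sed -n 's/^#PBS[[:space:]]*//p' "$1"
}

# Demo with a throwaway script file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
#!/bin/bash
#PBS -q xeon28
#PBS -l walltime=1:0:0
cd $PBS_O_WORKDIR
EOF
list_pbs_directives "$tmp"
rm -f "$tmp"
```

For the demo file, this prints the two lines "-q xeon28" and "-l walltime=1:0:0". Any of these can still be overridden on the qsub command line, as described above.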
Here is an example with more resource specifications and job settings that affect the behavior of the job:
#!/bin/bash
#
#PBS -q xeon28
#PBS -l walltime=1:0:0
#PBS -l select=2:ncpus=28:mpiprocs=28:mem=60G
#PBS -A Account
#PBS -j oe
#PBS -N JobName
#PBS -e log.error
#PBS -o log.output
#PBS -m bae
#
...
I leave this to you as an exercise to figure out what the various options mean and how to specify them. There are many more, all documented in the PBS PRO manual (see above). Most of them are not terribly relevant and can be omitted.
Interactive Jobs
On ARROW, we don't restrict queues to be used only in batch mode. While batch mode is efficient for lining up a lot of work to be executed one after the other, ARROW has been designed to allow efficient model and software development in interactive mode. We have always made sure to have more computers than minimally needed, so that resources can be dedicated to developers as needed, even if that means wasted CPU cycles. At times, we may ask you to limit the number of interactive jobs so that a large batch workload can be processed efficiently. This happens from time to time, and we have our users coordinate this with each other.
Let's assume that you are developing an MPI application, or you are working on a complex OpenFOAM model that requires starting parallel processes over and over again just to find a bug and then fix it quickly. To do that, you can request an interactive job by adding the "-I" option to the job submission command (this is an uppercase I). Let's go back to the parallel multi-node example from above:
qsub -I -q xeon28 -l walltime=1:0:0 -l select=2:ncpus=28:mpiprocs=28:mem=60G
^ ^ ^ ^ ^ ^ ^ ^
| | | | | | | + --- don't forget to specify gigabytes
| | | | | | + ----- the amount of memory to request per node
| | | | | + ------------ the number of MPI tasks per node
| | | | + ------------------------ the number of cores per node
| | | + -------------------------------- the number of nodes to select in the queue
| | + ----------------------------------------------- the requested time, in this case 1h
| + ------------------------------------------------------------------- the queue to be used for the job
+ ------------------------------------------------------------------------ request an interactive job <<===
When running interactive jobs with the "-I" parameter, we don't specify a job script at the end of the submission command. The interactive job will instead start (once the nodes are available) in interactive mode, meaning that the terminal session changes over from being a series of commands executed on the login server to being a series of commands executed on the first node of the group of nodes that are allocated to the job. At this point, you can change to the desired working directory, but what you do with the allocated resources is entirely up to you. You can load modules, including MPI libraries, and then issue the commands for your application interactively and see how they execute. If you start an "mpirun", the cores on your allocated secondary node will work as expected. There is no difference from batch mode, other than you having the ability to execute lines of commands at will.
Interactive Jobs with X-Windows GUI Applications
Interactive use can go further than that. With some of our software applications, like StarCCM+, you can run an interactive GUI application where you control the computational work from within the application's GUI. Within the GUI, you can control execution of the numerical solver that runs on as many cores as you requested, while being able to reconfigure the case through the GUI as well. Furthermore, you can visualize developing results on the fly by creating complex plots and visualizations.
All that is needed is the option "-X" being used as part of the job submission, like this:
qsub -X -I -q xeon28 -l walltime=1:0:0 -l select=2:ncpus=28:mpiprocs=28:mem=60G
^ ^ ^ ^ ^ ^ ^ ^ ^
| | | | | | | | + --- don't forget to specify gigabytes
| | | | | | | + ----- the amount of memory to request per node
| | | | | | + ------------ the number of MPI tasks per node
| | | | | + ------------------------ the number of cores per node
| | | | + -------------------------------- the number of nodes to select in the queue
| | | + ----------------------------------------------- the requested time, in this case 1h
| | + ------------------------------------------------------------------- the queue to be used for the job
| + ------------------------------------------------------------------------ request an interactive job
+ --------------------------------------------------------------------------- request GUI capabilities <<===
Running Multiple Jobs on Single Nodes
A feature that is new on ARROW is the ability to run multiple jobs on a single node. Let's assume that you are performing a sensitivity analysis on an existing model, and the model is simple enough to return results within a reasonable time on just a few cores of a higher end machine (maybe you are running SMP versions of LS-Dyna). Our high end machines have 64 cores, so let's assume you have an LS-Dyna model that runs well on 8 cores and doesn't use a whole lot of memory. In this case, you can submit individual jobs that each request just 8 cores and a fraction of the memory available on the node, and all jobs execute independently of each other. Each job is fit into a slot where one is available. It is not very different from using whole nodes for everything. The important consideration is that each job is cleanly constrained to its allotted resources using the CGROUPS functionality of modern operating systems. Because an abusive user cannot use more cores or more memory than allocated to their job, other users can safely run smaller jobs on the same node.
Let's assume that we have a number of smaller jobs that we want to run on a single node in the "xeon28" queue. Each job would be submitted with reduced resources that allow for sharing but still guarantee that the job will run successfully. In this case, you can submit many jobs in the following manner (with a job script for the small jobs, each of which can request varying resources if needed - some may want to run on 5 cores, others on 3):
#!/bin/bash
#
# the following ensures that you will change into the directory where you are
# submitting the job from.
cd $PBS_O_WORKDIR
#
# now we sleep for 300 seconds and waste time. This is a placeholder for your application,
# which would be doing useful work for you.
sleep 300
#
Now we submit a variety of these jobs (11 total in this example) to the "xeon28" queue for execution (note that the first few jobs request different amounts of memory and core counts):
qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=12:mpiprocs=12:mem=5G small.job
qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=10:mpiprocs=10:mem=7G small.job
qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=8:mpiprocs=8:mem=9G small.job
qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=16:mpiprocs=16:mem=20G small.job
qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=2:mpiprocs=2:mem=2G small.job
qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=2:mpiprocs=2:mem=2G small.job
qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=2:mpiprocs=2:mem=2G small.job
qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=2:mpiprocs=2:mem=2G small.job
qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=2:mpiprocs=2:mem=2G small.job
qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=2:mpiprocs=2:mem=2G small.job
qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=2:mpiprocs=2:mem=2G small.job
They are now running in the order of submission, allocated on as few nodes in the "xeon28" queue as necessary. Two nodes are fully loaded with 28 cores each, and 4 more cores are in use on a third node.
Nov 20 23:34 ley@login3:myjobfolder$ qstat -an1
pbs:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
3082.pbs ley xeon28 small.job 813221 1 12 5gb 01:00 R 00:01 p001/0*12
3083.pbs ley xeon28 small.job 813288 1 10 7gb 01:00 R 00:01 p001/1*10
3084.pbs ley xeon28 small.job 671792 1 8 9gb 01:00 R 00:01 p002/0*8
3085.pbs ley xeon28 small.job 671845 1 16 20gb 01:00 R 00:01 p002/1*16
3086.pbs ley xeon28 small.job 813361 1 2 2gb 01:00 R 00:00 p001/2*2
3087.pbs ley xeon28 small.job 813413 1 2 2gb 01:00 R 00:00 p001/3*2
3088.pbs ley xeon28 small.job 813464 1 2 2gb 01:00 R 00:00 p001/4*2
3089.pbs ley xeon28 small.job 671912 1 2 2gb 01:00 R 00:00 p002/2*2
3090.pbs ley xeon28 small.job 671969 1 2 2gb 01:00 R 00:00 p002/3*2
3091.pbs ley xeon28 small.job 632092 1 2 2gb 01:00 R 00:00 p003/0*2
3092.pbs ley xeon28 small.job 632100 1 2 2gb 01:00 R 00:00 p003/1*2
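The per-node load in this listing can be checked by summing the core counts in the last column. The following is a minimal sketch, assuming the entries have the form node/slot*cores as shown above; the function name `tally_cores` is made up for this illustration.

```shell
#!/bin/bash
# Sum the cores per node from exec_host entries of the form node/slot*cores.
# tally_cores is a hypothetical helper, not part of PBS.
tally_cores() {
    awk '{
        split($1, host, "/")         # host[1] = node, host[2] = "slot*cores"
        split(host[2], slot, "[*]")  # slot[2] = number of cores
        cores[host[1]] += slot[2]
    }
    END { for (node in cores) print node, cores[node] }' | sort
}

# Entries copied from the last column of the qstat listing above.
printf '%s\n' 'p001/0*12' 'p001/1*10' 'p002/0*8' 'p002/1*16' \
    'p001/2*2' 'p001/3*2' 'p001/4*2' 'p002/2*2' 'p002/3*2' \
    'p003/0*2' 'p003/1*2' | tally_cores
```

For the eleven jobs above this prints 28 cores for p001, 28 cores for p002, and 4 cores for p003.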
This is a particularly effective strategy for running many cases concurrently when they don't scale well beyond a few cores. Running each case on fewer cores, but many of them at the same time, gives a much higher overall throughput than running them one after another on whole nodes.
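Typing one "qsub" line per case becomes tedious for larger studies. The following is a minimal sketch of how the command lines could be generated from a list of "cores:memory" pairs. The queue, walltime, and job script name are taken from the example above; the function names `small_job_cmd` and `small_job_cmds` are made up for this illustration. The sketch only prints the commands so that they can be inspected first.

```shell
#!/bin/bash
# Print the qsub command for one small job with the given cores and memory.
# Queue, walltime and script name follow the example in this section.
small_job_cmd() {
    local ncpus=$1 mem=$2
    echo "qsub -q xeon28 -l walltime=1:0:0 -l select=1:ncpus=${ncpus}:mpiprocs=${ncpus}:mem=${mem} small.job"
}

# Print one command per "cores:memory" pair given on the command line.
small_job_cmds() {
    local spec
    for spec in "$@"; do
        # Text before the colon is the core count, text after it the memory.
        small_job_cmd "${spec%%:*}" "${spec##*:}"
    done
}

small_job_cmds 12:5G 10:7G 8:9G 16:20G 2:2G 2:2G
```

Once the printed lines look right, the output can be piped into "bash" on the login server to actually submit the jobs.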
Running Jobs using GPUs
The principle of running multiple jobs on a single node becomes particularly important when using servers equipped with GPUs for ML/AI applications. The cluster doesn't have a whole lot of GPUs at this point. We have three machines with three A4000 GPUs each, for a total of 9 A4000 GPUs. In addition, we have a single, much more powerful machine with four A6000 GPUs.
Using multiple GPUs in a single application still requires software that is designed with the hardware in mind. GPUs can communicate with each other in several ways: over very fast NVLINK connections between pairs of GPUs; through a shared CPU, when both GPUs are directly connected to the same one of the two CPUs in the system, which is faster; or across processors, when the GPUs are connected to different CPUs. On top of that, the traffic may have to go through PCIe bridges.
On our system, we are providing the ability to work mostly with individual GPUs. Users can also reserve entire nodes and develop or run applications that are adapted to that hardware, including several GPUs installed on that node. One thing we do not provide is GPU-to-GPU communication between nodes. Thus, a job cannot request more than one GPU node at a time. For example, the following requests an interactive job with a single A4000 GPU:
qsub -q a4000 -I -l walltime=1:0:0 -l select=1:ncpus=5:mem=150G:ngpus=1
With these specifications, three single-GPU jobs can run on a single server. Each job sees only the one GPU reserved for it.
To run a massive GPU job on 64 cores with 4 A6000 GPUs, submit the job like this:
qsub -q a6000 -I -l walltime=1:0:0 -l select=1:ncpus=64:mem=725G:ngpus=4
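The same requests can also be placed in a batch script instead of being run interactively. The following is a minimal sketch for the single-GPU case, reusing the resource request from the A4000 example above. The job name "gpu-check" is made up, and the availability of the "nvidia-smi" tool on the GPU nodes is an assumption, so the script falls back to a message if the tool is missing.

```shell
#!/bin/bash
#PBS -q a4000
#PBS -l walltime=1:0:0
#PBS -l select=1:ncpus=5:mem=150G:ngpus=1
#PBS -N gpu-check

# Change into the submission directory (falls back to the current
# directory when the script is run outside of PBS).
cd "${PBS_O_WORKDIR:-.}"

# List the GPUs visible to this job. Assumption: nvidia-smi is installed
# on the GPU nodes; with ngpus=1 a single device is expected in the list.
if command -v nvidia-smi >/dev/null 2>&1; then
    gpu_list=$(nvidia-smi -L 2>&1) || gpu_list="nvidia-smi failed: ${gpu_list}"
else
    gpu_list="nvidia-smi not found on this node"
fi
echo "$gpu_list"

# Your GPU application would be started here.
```

Checking the visible devices at the start of a job is a cheap way to confirm that the job received the GPU it asked for before the real work begins.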
Manual pages for qsub
To learn more about the "qsub" command, you can use the command "man qsub", which will print a lot of detailed information about the capabilities of this command.
$ man qsub
qsub(1B) PBS Professional qsub(1B)
NAME
qsub - submit a job to PBS
SYNOPSIS
qsub [-a <date and time>] [-A <account string>] [-c <checkpoint spec>]
[-C <directive prefix>] [-e <path>] [-f] [-h]
[-I [-G [-- <GUI application/script>]] | [-X]] [-j <join>]
[-J <range> [%<max subjobs]] [-k <discard>] [-l <resource list>]
[-m <mail events>] [-M <user list>] [-N <name>] [-o <path>]
[-p <priority>] [-P <project>] [-q <destination>] [-r <y|n>]
[-R <remove options>] [-S <path list>] [-u <user list>]
[-v <variable list>] [-V] [-W <additional attributes>] [-z]
[- | <script> | -- <executable> [<arguments to executable>]]
qsub --version
DESCRIPTION
You use the qsub command to submit a batch job to PBS. Submitting a PBS job specifies a task, requests resources, and
sets job attributes.
... <many more pages>
LS-Dyna on the ARROW Cluster
Currently Available LS-Dyna Versions
The following is a list of LS-Dyna versions available on ARROW after the latest reconfiguration of the system. Versions below 11.2.2 are no longer available because they are not compatible with modern operating systems and cannot be made to work correctly.
All versions are loaded using the "module load" command. Versions can be listed with the "module avail ls-dyna" command. To load one of the modules, use the following syntax:
module load ls-dyna/14.2.0/mpi-d8-ifort190-avx512
^ ^ ^ ^ ^ ^
| | | | | + --- specify the extended instruction set needed for execution
| | | | + ------------ load the version of the compiler that was used to create this
| | | + --------------- load the version that supports double precision variables
| | + ------------------- load the MPP (MPI) version of LS-Dyna
| + -------------------------- load specifically version 14.2.0
+ ---------------------------------- load a version of LS-Dyna
The version string is composed of multiple elements to indicate variants in compilers and compiler options. Use the following guideline to choose an appropriate version to load:
- "1" or "mpi" indicates whether this is a single node version of LS-Dyna (SMP) or whether this is a multi-node MPI version (MPP). All MPI versions use the IntelMPI 2022 libraries which have been tested thoroughly on ARROW. MPI versions will use the Infiniband Network of ARROW for high-speed and low-latency inter-process communication using RDMA (remote direct memory access).
- All LS-Dyna versions are available in either single precision floating point ("f4") or double precision ("d8") variants. Single precision variants use 4 bytes to represent a value, and double precision variants use 8 bytes. There are pros and cons for choosing one variant over the other. With regards to computational efficiency, both perform nearly the same because all machines are equipped with 64-bit CPUs.
- "f4" floating point versions
- Pros: These require significantly less memory to run. Results occupy less disk space, and can be transferred significantly faster into and out of ARROW.
- Cons: The numerical resolution is limited to 7 significant digits, which is often undesirable when dealing with mathematical operations on small and large numbers at the same time.
- "d8" double precision versions
- Pros: The numerical resolution is about twice the number of significant digits compared to "f4" (roughly 15 to 16), which helps when dealing with mathematical operations on small and large numbers at the same time.
- Cons: These require more memory to run. Results occupy more disk space, and it takes longer to transfer data into and out of ARROW.
- There are two more identifiers to choose from when it comes to the variants of the executables: the specific compiler used to create the executable and the specific processor instruction set required for running the executable.
- For modern versions of LS-Dyna, two compilers have been used by the developers to create LS-Dyna executables: the Intel Fortran Compiler and the AOCC (AMD Optimizing C/C++ and Fortran) compiler. Both variants of the software are supported on ARROW. This gives users the opportunity to choose an alternate variant of the same LS-Dyna version when running into bugs or crashes.
- The variants based on the various instruction set extensions (SSE2, AVX2, AVX512, and so on) give users even more options when choosing an alternate LS-Dyna variant of the same version when running into bugs or crashes. These instruction sets are mostly related to performance gains on specific processors. We have not performed thorough performance tests and cannot recommend specific versions right now.
$ module avail ls-dyna
--------------------------------------------- /shared/apps/modulefiles ---------------------------------------------
ls-dyna/11.2.2/mpi-d8-ifort160-avx2   ls-dyna/13.0.0/mpi-f4-ifort190-sse2     ls-dyna/15.0.2/mpi-d8-ifort190-avx512
ls-dyna/11.2.2/mpi-d8-ifort160-sse2   ls-dyna/13.1.0/mpi-d8-aocc310-avx2      ls-dyna/15.0.2/mpi-d8-ifort190-sse2
ls-dyna/11.2.2/mpi-f4-ifort160-avx2   ls-dyna/13.1.0/mpi-d8-ifort190-avx2     ls-dyna/15.0.2/mpi-f4-aocc400-avx2
ls-dyna/11.2.2/mpi-f4-ifort160-sse2   ls-dyna/13.1.0/mpi-d8-ifort190-sse2     ls-dyna/15.0.2/mpi-f4-ifort190-avx2
ls-dyna/12.1.0/1-d8-ifort160          ls-dyna/13.1.0/mpi-f4-aocc310-avx2      ls-dyna/15.0.2/mpi-f4-ifort190-avx512
ls-dyna/12.1.0/1-f4-aocc310           ls-dyna/13.1.0/mpi-f4-ifort190-avx2     ls-dyna/15.0.2/mpi-f4-ifort190-sse2
ls-dyna/12.1.0/1-f4-ifort160          ls-dyna/13.1.0/mpi-f4-ifort190-sse2     ls-dyna/16.0.0/1-d8-aocc420-avx2
ls-dyna/12.1.0/mpi-d8-aocc310-avx2    ls-dyna/13.1.1/mpi-d8-ifort190-avx2     ls-dyna/16.0.0/1-d8-aocc420-avx512
ls-dyna/12.1.0/mpi-d8-ifort160-avx2   ls-dyna/13.1.1/mpi-d8-ifort190-sse2     ls-dyna/16.0.0/1-d8-ifort190-sse2
ls-dyna/12.1.0/mpi-d8-ifort160-sse2   ls-dyna/13.1.1/mpi-f4-ifort190-avx2     ls-dyna/16.0.0/1-f4-aocc420-avx2
ls-dyna/12.1.0/mpi-f4-aocc310-avx2    ls-dyna/13.1.1/mpi-f4-ifort190-sse2     ls-dyna/16.0.0/1-f4-aocc420-avx512
ls-dyna/12.1.0/mpi-f4-ifort160-avx2   ls-dyna/14.0.0/1-d8-ifort190            ls-dyna/16.0.0/1-f4-ifort190-sse2
ls-dyna/12.1.0/mpi-f4-ifort160-sse2   ls-dyna/14.0.0/1-f4-ifort190            ls-dyna/16.0.0/mpi-d8-aocc420-avx2
ls-dyna/12.2.0/1-d8-ifort160          ls-dyna/14.0.0/mpi-d8-aocc310-avx2      ls-dyna/16.0.0/mpi-d8-aocc420-avx512
ls-dyna/12.2.0/1-f4-ifort160          ls-dyna/14.0.0/mpi-d8-ifort190-avx2     ls-dyna/16.0.0/mpi-d8-ifort190-avx2
ls-dyna/12.2.0/mpi-d8-aocc400-avx2    ls-dyna/14.0.0/mpi-d8-ifort190-sse2     ls-dyna/16.0.0/mpi-d8-ifort190-avx512
ls-dyna/12.2.0/mpi-d8-ifort160-avx2   ls-dyna/14.0.0/mpi-f4-ifort190-avx2     ls-dyna/16.0.0/mpi-d8-ifort190-sse2
ls-dyna/12.2.0/mpi-d8-ifort160-sse2   ls-dyna/14.0.0/mpi-f4-ifort190-sse2     ls-dyna/16.0.0/mpi-f4-aocc420-avx2
ls-dyna/12.2.0/mpi-f4-aocc400-avx2    ls-dyna/14.1.0/1-d8-ifort190-sse2       ls-dyna/16.0.0/mpi-f4-aocc420-avx512
ls-dyna/12.2.0/mpi-f4-ifort160-avx2   ls-dyna/14.1.0/1-f4-ifort190-sse2       ls-dyna/16.0.0/mpi-f4-ifort190-avx2
ls-dyna/12.2.0/mpi-f4-ifort160-sse2   ls-dyna/14.1.0/mpi-d8-aocc400-avx2      ls-dyna/16.0.0/mpi-f4-ifort190-avx512
ls-dyna/12.2.1/1-d8-ifort160-sse2     ls-dyna/14.1.0/mpi-d8-ifort190-avx2     ls-dyna/16.0.0/mpi-f4-ifort190-sse2
ls-dyna/12.2.1/1-f4-ifort160-sse2     ls-dyna/14.1.0/mpi-d8-ifort190-avx512   ls-dyna/16.1.0/mpi-d8-aocc420-avx2
ls-dyna/12.2.1/mpi-d8-aocc400-avx2    ls-dyna/14.1.0/mpi-d8-ifort190-sse2     ls-dyna/16.1.0/mpi-d8-aocc420-avx512
ls-dyna/12.2.1/mpi-d8-ifort160-avx2   ls-dyna/14.1.0/mpi-f4-aocc400-avx2      ls-dyna/16.1.0/mpi-d8-ifort190-avx2
ls-dyna/12.2.1/mpi-d8-ifort160-sse2   ls-dyna/14.1.0/mpi-f4-ifort190-avx2     ls-dyna/16.1.0/mpi-d8-ifort190-avx512
ls-dyna/12.2.1/mpi-f4-aocc400-avx2    ls-dyna/14.1.0/mpi-f4-ifort190-avx512   ls-dyna/16.1.0/mpi-d8-ifort190-sse2
ls-dyna/12.2.1/mpi-f4-ifort160-avx2   ls-dyna/14.1.0/mpi-f4-ifort190-sse2     ls-dyna/16.1.0/mpi-f4-aocc420-avx2
ls-dyna/12.2.1/mpi-f4-ifort160-sse2   ls-dyna/14.2.0/1-d8-ifort190-sse2       ls-dyna/16.1.0/mpi-f4-aocc420-avx512
ls-dyna/12.2.2/1-d8-ifort160-sse2     ls-dyna/14.2.0/1-f4-ifort190-sse2       ls-dyna/16.1.0/mpi-f4-ifort190-avx2
ls-dyna/12.2.2/1-f4-ifort160-sse2     ls-dyna/14.2.0/mpi-d8-aocc400-avx2      ls-dyna/16.1.0/mpi-f4-ifort190-avx512
ls-dyna/12.2.2/mpi-d8-aocc400-avx2    ls-dyna/14.2.0/mpi-d8-ifort190-avx2     ls-dyna/16.1.0/mpi-f4-ifort190-sse2
ls-dyna/12.2.2/mpi-d8-ifort160-avx2   ls-dyna/14.2.0/mpi-d8-ifort190-avx512   ls-dyna/16.1.1/mpi-d8-aocc420-avx2
ls-dyna/12.2.2/mpi-d8-ifort160-sse2   ls-dyna/14.2.0/mpi-d8-ifort190-sse2     ls-dyna/16.1.1/mpi-d8-aocc420-avx512
ls-dyna/12.2.2/mpi-f4-aocc400-avx2    ls-dyna/14.2.0/mpi-f4-aocc400-avx2      ls-dyna/16.1.1/mpi-d8-ifort190-avx2
ls-dyna/12.2.2/mpi-f4-ifort160-avx2   ls-dyna/14.2.0/mpi-f4-ifort190-avx2     ls-dyna/16.1.1/mpi-d8-ifort190-avx512
ls-dyna/12.2.2/mpi-f4-ifort160-sse2   ls-dyna/14.2.0/mpi-f4-ifort190-avx512   ls-dyna/16.1.1/mpi-d8-ifort190-sse2
ls-dyna/13.0.0/1-d8-ifort190          ls-dyna/14.2.0/mpi-f4-ifort190-sse2     ls-dyna/16.1.1/mpi-f4-aocc420-avx2
ls-dyna/13.0.0/1-f4-ifort190          ls-dyna/15.0.2/1-d8-ifort190-sse2       ls-dyna/16.1.1/mpi-f4-aocc420-avx512
ls-dyna/13.0.0/mpi-d8-ifort190-avx2   ls-dyna/15.0.2/1-f4-ifort190-sse2       ls-dyna/16.1.1/mpi-f4-ifort190-avx2
ls-dyna/13.0.0/mpi-d8-ifort190-sse2   ls-dyna/15.0.2/mpi-d8-aocc400-avx2      ls-dyna/16.1.1/mpi-f4-ifort190-avx512
ls-dyna/13.0.0/mpi-f4-ifort190-avx2   ls-dyna/15.0.2/mpi-d8-ifort190-avx2     ls-dyna/16.1.1/mpi-f4-ifort190-sse2
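The naming scheme described above can be taken apart mechanically, which helps when comparing variants in a long listing. The following is a minimal sketch, assuming module names of the form ls-dyna/VERSION/MODE-PRECISION-COMPILER[-INSTRUCTIONSET] as they appear in the listing; the function name `describe_module` is made up for this illustration.

```shell
#!/bin/bash
# Split an LS-Dyna module name into the parts described in this section.
# describe_module is a hypothetical helper, not part of the module system.
describe_module() {
    local name=$1 version variant mode precision compiler isa
    version=${name#ls-dyna/}     # drop the leading "ls-dyna/"
    version=${version%%/*}       # keep the text up to the next "/"
    variant=${name##*/}          # text after the last "/"
    IFS=- read -r mode precision compiler isa <<< "$variant"

    case $mode in
        1)   mode=SMP ;;         # single node version
        mpi) mode=MPP ;;         # multi-node MPI version
    esac
    case $precision in
        f4) precision=single ;;
        d8) precision=double ;;
    esac

    # Some older modules carry no instruction set suffix.
    echo "version=${version} mode=${mode} precision=${precision} compiler=${compiler} isa=${isa:-unspecified}"
}

describe_module ls-dyna/14.2.0/mpi-d8-ifort190-avx512
describe_module ls-dyna/12.1.0/1-f4-aocc310
```

The first call reports an MPP, double precision build made with ifort190 for AVX512; the second reports an SMP, single precision build made with aocc310 without an instruction set suffix.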
Submitting an LS-Dyna Job
IMPORTANT NOTE: The job/queue manager can track the number of LS-Dyna licenses to some degree. If all LS-Dyna users cooperate and use a script like the one shown below when submitting their jobs, the total number of concurrent LS-Dyna licenses will be tracked by the job manager correctly. That means that users can submit any number of LS-Dyna jobs, and jobs will only start when a sufficient number of licenses is available. This is managed by the "dynalic" resource at the end of the select statement. In this example, a 2-node job on 64-core nodes will need a total of "dynalic=128" licenses. This accounting breaks down when users don't use the "dynalic=XXX" statement, or when they don't calculate the number of licenses correctly. In that case, LS-Dyna jobs of all users are subject to sudden failure when LS-Dyna licenses run out. Please understand the importance of this specific setting in your job.
Furthermore, careful consideration should be given to the choice of resources for an LS-Dyna job. With 64 cores available on a single node in the "epyc1" and "epyc2" queues, it may be counterproductive to run a job on two nodes instead of a single node. Users should run their jobs with different numbers of nodes and determine whether performance increases. It may well decrease when running a job on two or more nodes. The outcome of such tests will tell what the best allocation of resources will be.
Most users use a job script like the following. All methods for job submission described in the previous chapters apply as well, so there is a lot of flexibility:
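Because the "dynalic" value has to match the resources requested, it is safer to compute it than to type it. The following is a minimal sketch that builds the select statement for a given number of nodes, assuming one license per MPI process as implied by the 2 x 64 = 128 example above. The function name `dyna_select` is made up for this illustration, and the 225G memory value is only an example.

```shell
#!/bin/bash
# Build the select statement for an LS-Dyna job so that "dynalic" always
# equals nodes x cores per node. Assumption: one license per MPI process.
# dyna_select is a hypothetical helper, not part of PBS.
dyna_select() {
    local nodes=$1 cores=$2 mem=$3
    echo "select=${nodes}:ncpus=${cores}:mpiprocs=${cores}:mem=${mem}:dynalic=$(( nodes * cores ))"
}

# Select statements for a small scaling test on 64-core nodes.
for nodes in 1 2 4; do
    dyna_select "$nodes" 64 225G
done
```

Each printed statement can be passed to "qsub" with the "-l" option when submitting the same job script several times for a scaling test, so that only the node count and the matching license count change between runs.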
#!/bin/bash
#
#PBS -q epyc1
#PBS -l walltime=12:0:0
#PBS -l select=2:ncpus=64:mpiprocs=64:mem=225G:dynalic=128
#PBS -N JobName
#PBS -e log.error
#PBS -o log.output
#
cd $PBS_O_WORKDIR
#
module load ls-dyna/12.2.1/mpi-f4-ifort160-avx2
module load dynamore/current
#
mpirun ls-dyna i=main.k memory1=300m memory2=100m
#
# when using the Dynamore tools, you can start something like this at the end
DM.plotcprs.lnx -merge
#