
Batch Processing

Batch, or batch processing, here refers to the capability of running jobs through an intermediate (software) subsystem rather than directly through the UNIX kernel. The latter occurs, for instance, when you run a command from the login prompt. A job is a collection of related processes. Batch-mode execution provides controlled job scheduling and resource allocation. LoadLeveler (LL), the IBM batch system on hydra, like other such systems, accomplishes this control by defining various queues or "classes" (as LL calls them). These are a major component of the software infrastructure for managing the arrival, scheduling, and execution of jobs. Each queue is configured with a set of attributes such as name, queue priority, queue resource limits, and job run limits. The batch system allows users to overcome the resource limits imposed on interactive (login) processing and to regulate the flow of job execution according to specific policies.

Before proceeding further, please make sure you are familiar with the material in the Basic System Information page for hydra.

To run a job, you must first construct a job file: a text file that contains Unix executables and LL directives which specify the job's features and run characteristics. Upon submission, LL assigns your job a job id (and a step id, but more on that later), places it in one of the queues or "classes", assigns it a scheduling priority, and executes it when its turn comes. Below we provide information about job files (through samples), basic job submission and monitoring, environment variables, LL commands, classes, and other related topics. We proceed mostly through examples. To state the obvious, this section covers only the most common aspects of batch processing using LL. Much more can be found in LL's reference manual, listed on our IBM Documentation Page.

Job Files

To run a batch job, you first construct a job file. That is a text file with one (the typical case at SC) or more job steps. A job step consists of various command keywords (or parameters) with (usually) corresponding values, comments, and executables (including standard Unix commands). A text line in a job file that begins with #@ signifies the start of a command keyword. A command keyword line can be continued onto the next line with a slash (/). A line with a command keyword either sets the keyword to a value or signifies some direct action. A comment line begins with a #. Other non-null lines must consist of an executable, a Unix command line, or any combination of the two. Below, we first show the typical format of a one-step job file, and after that a two-step one. You can easily generalize the latter to a multi-step job file.

# ------------ A 1-step Job File ----------
#@ command keyword    = $(var)
#@ command keyword    = string1$(var)string2 ...
#@ command keyword    = appropriate value(s) in appropriate syntax
# a comment line
#@ ... other command keywords ...
#
#@ queue
#
executable or Unix Commands
... other executables or Unix commands ...

Command keywords or parameters are reserved LL entities. Here are some of them: shell, initialdir, input, output, resources, environment, queue. You cannot define new ones. In the examples cited further below you will find the most commonly used keywords, their syntax, and an explanation. The $(var) format yields the value of var, where var is an LL preset variable such as home, host, jobid, stepid, schedd_host, or user. But var can also be a previously set command keyword such as comment, class, job_name, or step_name. You can concatenate any number of appropriate $(var)'s and strings for the appropriate command keyword. Most command keywords have default values; these too are indicated in the examples below. The last command keyword in any step must be #@ queue. In the case of a 1-step job the jobid is the same as the stepid.
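
As a small illustration of $(var) substitution, the following fragment (job name hypothetical) builds the standard output and error file names from preset variables and a previously set keyword, the same pattern used in the examples further below:

#@ job_name             = my_run
#@ output               = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ error                = $(job_name).o$(schedd_host).$(jobid).$(stepid)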

More job file examples can be found here.

# -------- Step 1 of a 2-step Job File --------
#@ command keyword    = $(var)
#@ command keyword    = string1$(var)string2 ...
#@ command keyword    = appropriate value(s) in appropriate syntax
# a comment line
#@ ... other command keywords ...
#
#@ queue

# ------- Step 2 of a 2-step Job File -------
#@ command keyword    = $(var)
#@ command keyword    = string1$(var)string2 ...
#@ ... other command keywords ...
#
#@ queue
#
executable or Unix Commands
... other executables or Unix commands ...

The set of executables and/or UNIX commands in the job file is shared by both job steps. When scheduled, the above 2-step job file is assigned by LL a job id ($(jobid)) and two step id's:

llsubmit multistep.job
llsubmit: Processed command file through Submit Filter: "/home/loadl/job_filter.pl".
llsubmit: The job "f1n9.1234" with 2 job steps has been submitted.

When not specified in the currently active step, a command keyword inherits its value from previous job steps. The important thing to keep in mind about a multi-step job is that its steps are scheduled sequentially. A job step may, however, be bypassed and not executed if it depends upon the successful completion of a prior step and that step did not complete successfully (see the sketch below). Another exception to the sequential execution of steps is the coscheduling of two or more steps to run concurrently. For more information on either feature, contact the helpdesk or see the LoadLeveler manual.
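
For instance, step ordering and bypassing can be controlled with the step_name and dependency keywords; the fragment below is a minimal sketch (step names are hypothetical) in which step2 runs only if step1 exits with status 0:

#@ step_name            = step1
# ... command keywords for step 1 ...
#@ queue
#
#@ step_name            = step2
#@ dependency           = (step1 == 0)
# ... command keywords for step 2 ...
#@ queue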

An example of a multi-step job file can be found here.

Example Job File and Detailed Explanation

The following is a sample MPI batch job file for the LL batch system. It illustrates many of the issues we discussed above:

#------------ Example job file: sample_mpi.job --------------
#@ shell                = /bin/ksh
#@ comment              = 32-proc, 2-node MPI job
#@ initialdir           = $(home)/mpi/jacobi/
#@ job_name             = p_poisson_1d
#@ input                = $(initialdir)/input.dat
#@ error                = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ output               = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ job_type             = parallel
#@ environment          = MP_SINGLE_THREAD=yes
# Specifies, per task/process, one cpu/core and 0.5 giga-bytes of memory
#@ resources            = ConsumableCpus(1) ConsumableMemory(500mb)
#@ wall_clock_limit     = 00:50:00
#@ network.MPI_LAPI     = sn_single, shared, US
#@ node                 = 2
# Specifies, per node, 16 tasks (e.g., MPI processes) and, therefore, 16*500mb = 8 GB of memory
#@ tasks_per_node       = 16
#@ notification         = always
#@ queue
#
poe ./p_poisson_1d.exe -shared_memory yes
#
/usr/local/bin/jobinfo
#------------------ End of sample_mpi.job -------------------

A detailed description of this job's attributes follows:

LL Command Keyword and Value        Description
#@ shell = /bin/ksh The job will be executed under the ksh shell, the recommended shell for batch job scripts. The Korn shell is the standard Unix shell on hydra.
#@ comment = 32-proc, 2-node MPI job A comment describing the batch job.
#@ initialdir = $(home)/mpi/jacobi Sets the initial work directory for the job to /home/username/mpi/jacobi. When not specified, LL sets it by default to the directory you submitted the job from. IMPORTANT: LL will hold your job indefinitely if for whatever reason it cannot access the specified directory.
#@ job_name = p_poisson_1d Sets the LL job name to p_poisson_1d
#@ input = $(initialdir)/input.dat Sets the job's standard input to /home/username/mpi/jacobi/input.dat. When not set, LL sets it to /dev/null. IMPORTANT: LL will hold your job indefinitely if for whatever reason it cannot access the specified file.
#@ error = $(job_name).o$(schedd_host).$(jobid).$(stepid) Sets the job's standard error to p_poisson_1d.oscheddhost.jjjj.ssss, where:
  • scheddhost is either of the LL scheduler nodes f1n2 or f1n9
  • jjjj is the job id
  • ssss is the step id.
The above value for the standard error file, which LL returns to you upon termination, is recommended by the staff; otherwise, you can set it to anything you like (e.g., badnews.txt)
#@ output = $(job_name).o$(schedd_host).$(jobid).$(stepid) Sets the job's standard output to the same file as standard error (see above). That is, in this job, std out and std error are concatenated.
#@ job_type = parallel Declares this job to be parallel, i.e., it activates POE. The other setting, serial, is the default when job_type is not specified.
#@ environment = MP_SINGLE_THREAD=yes Sets the POE environment variable MP_SINGLE_THREAD (see man poe). Use a semicolon (";") to separate multiple specifications, which are also possible. This is the correct place to set any environment variable whose value must be propagated to all the tasks and/or nodes associated with a job. Any environment variable that appears and is set on this line before a COPY_ALL (see farther below) and that is also specified in your login control files will have its value set to that of the login process.
#@ resources = ConsumableCpus(1) ConsumableMemory(500mb) This LL directive allocates resources per task. Here it allocates one Power5+ CPU and 500mb of memory per MPI process (task). One CPU per task is the standard setting for an MPI program. If the MPI processes also spawn OpenMP threads, however, then set, say, resources = ConsumableCpus(4) ConsumableMemory(1500mb) per task, where 1500mb is the aggregate amount of memory used by the 4 threads and OMP_NUM_THREADS=4. Pure OpenMP programs should likewise set ConsumableCpus equal to the value of OMP_NUM_THREADS. The value of OMP_NUM_THREADS cannot exceed 16. Note that in setting memory you may only use integral values: 1500mb, not 1.5gb; 2gb, not 2.0gb; 6500kb, not 6.5mb.
#@ wall_clock_limit = 00:50:00 Sets the wall clock time for the duration of the job to 50 minutes. The format here is hh:mm:ss
#@ network.MPI_LAPI = sn_single, shared, US Specifies, for a parallel job, the use of the MPI and LAPI communication protocols over an HPS adapter: communication goes across a single switch network adapter (of the HPS), the adapter is used in shared mode, and the US (user space) communication mode is in effect. IMPORTANT: The network keyword is ONLY necessary for multi-node MPI jobs.
#@ node = 2 Specifies that the job use two p575 nodes. A POE option. The default value is 1 if the directive is not specified.
#@ tasks_per_node = 16 Specifies that the job use 16 tasks (MPI processes here) per node. This is the maximum per node and the recommended number for MPI jobs of 16 or more processes. A POE option. The default value is 1 if the directive is not specified.
#@ notification = always LL will report to the user via e-mail on hydra the job's start, completion, and any errors encountered. Other values are error, complete (the default), never, and start
#@ queue Schedule job for execution by routing it to the appropriate queue. A required command keyword
poe ./p_poisson_1d.exe -shared_memory yes This is the area for specifying UNIX commands and executables. Here POE executes the p_poisson_1d.exe binary using the shared-memory communication protocol for processes within a node; otherwise, other POE parameters (e.g., node, procs, tasks_per_node) assume the values set through the command keywords associated with POE. A POE parameter specified via an LL command keyword, say #@ node = nn, supersedes any corresponding setting on the POE command line itself.
/usr/local/bin/jobinfo This is a custom program that reports the resource utilization of the job at the point where it executes. See this link for more information.

IMPORTANT: Correct Resource Specifications

The total number of physical cpu cores per node is 16. The product of the tasks per node and the consumable cpus per task MUST BE no more than 16.

For 49 of the nodes, the total memory per node available for applications is 25GB, NOT 32GB; the deficit (~7GB) is taken up by the operating system. The other 3 nodes have 57GB of memory available to applications, with 7GB reserved. See the llstatus -R command below. Hence, a 16-task job requesting only one node and specifying 2gb of consumable memory will not run on the majority of the nodes. For a 16-task job, the specification,

#@ resources         = ConsumableCpus(1) ConsumableMemory(2gb)

is wrong when requesting only one node. The 2gb must be reduced to around 1500mb (1.5 gigabytes, which can only be specified as 1500mb), or the tasks_per_node and node specifications must change accordingly. Please reread carefully the explanation given above on the resources LL directive, and also the statements about the jobinfo command.
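
Two alternative specifications that do fit, assuming the rest of the 16-task job file stays as in the example above, are sketched below:

# Option A: keep one node but reduce memory per task (16 x 1500mb = 24GB, under the ~25GB limit)
#@ resources            = ConsumableCpus(1) ConsumableMemory(1500mb)
#@ node                 = 1
#@ tasks_per_node       = 16

# Option B: keep 2gb per task but spread the 16 tasks over two nodes (8 x 2gb = 16GB per node)
#@ resources            = ConsumableCpus(1) ConsumableMemory(2gb)
#@ node                 = 2
#@ tasks_per_node       = 8
#@ network.MPI          = sn_single, shared, US
# The network keyword is needed here because Option B is a multi-node MPI job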

Job Submission & Monitoring

Use the llsubmit command to submit your job as shown below:

$ llsubmit sample_mpi.job
llsubmit: Processed command file through Submit Filter: "/home/loadl/job_filter.pl".
llsubmit: The job "f1n2.338" has been submitted.

When submitted, this job is given a unique job ID, f1n2.338. Each step for a job will share the same job ID. The above is a single-step job. Hence, the job's ID is the same as the step ID for that job. You can use the llq command with various options to monitor your job's progress, as shown below:

$ llq
Id                       Owner      Submitted   ST PRI Class        Running On 
------------------------ ---------- ----------- -- --- ------------ -----------
f1n9.245.0               user1       1/29 10:52 R  50  mpi64        f4n6
f1n2.338.0               user2       1/29 10:54 R  50  mpi32        f3n5

2 job step(s) in queue, 0 waiting, 0 pending, 2 running, 0 held, 0 preempted

$ llq -l jobid
# lists details about the job with the given jobid

Determining Resource Usage for a Job Step: jobinfo

The jobinfo command is a homegrown program that reports information about a running job. We urge you to always include it at the end of your job scripts and to use its output as a guide for setting realistic wallclock time and memory limits. The sample output below is self-explanatory.

Job Step ID:                   f1n2.1234.0
Job Name:                      somejob
Submission Time:               Thu Dec 11 15:13:06 CST 2008
Queue:                         mpi64
Executable:                    /scratch/someuser/job_script

Time Information for Job Step:

Wallclock Time Used:           23:36:19 <--------------------------
Wallclock Time Requested:      48:00:00 <--------------------------
Wait Time:                     00:00:00
Preempted Time:                04:00:00

Your job step used less than 50% of its requested walltime.  A more accurate
wallclock request can help reduce the scheduling delay for your jobs.

Resource Utilization Per Node:

Node   CPU Util CPUTime Used Memory Util Memory Used
------ -------- ------------ ----------- -----------
f1n5     99.81%    140:32:38       8.83%        1696
f3n6     99.81%    140:32:33       8.74%        1679
f4n1     99.78%    140:29:57       8.59%        1649
f4n5     99.80%    140:31:25       8.88%        1704

TOTAL:             562:06:33                    6728

All times are in HH:MM:SS format.  Memory values are in megabytes.
Utilization is the ratio of consumed resources versus requested resources.
The node utilization statistics are currently inaccurate when a job is
preempted.  We are currently (circa Dec 2008) investigating this problem with IBM.

Determining Why a Job Step is Waiting

The llq -s command shows the scheduling information for a job. In particular, it can explain why a job has not been scheduled yet.

hydra# llq -s f1n2.198173.0

==================== EVALUATIONS FOR JOB STEP f1n2.198173.0 ====================

Step state                       : Idle
Considered for scheduling at     : Wed Apr  1 14:46:06 CDT 2009
Top dog estimated start time     : Wed Apr  1 17:09:47 CDT 2009

Minimum initiators needed: 16 per machine, 32 total.
Machine f2n4 does not support job class mpi32.
14 machines do not support job class mpi32.
Not enough resources to start now.
max_total_tasks=384 is to be exceeded for class mpi32: 367 is in use, 367 is reserved, 32 minimum is needed.
This step is a top-dog.

Some UNIX-level LL Environment Variables

Variable        Description
$LOADLBATCH When set to yes it indicates running under LL.
$LOADL_STEP_ID Contains the job's step id.
$LOADL_JOB_NAME Its value is the same as the value of #@ job_name in your job file.
$LOADL_PROCESSOR_LIST Contains a list of hostnames allocated for the job.
IMPORTANT: This variable is NOT set if your job is using more than 128 tasks.
$LOADL_STEP_INITDIR Contains the value of the initial working directory
$LOADL_STEP_TYPE Contains the job type (SERIAL, PARALLEL)
$LOADL_STEP_CLASS Contains the name of the class/queue that the job was scheduled under
$TMPDIR In batch you can choose to execute your job in any permissible directory. The default, when the initialdir command keyword is not set, is the directory you submitted your job from. A typically good choice, especially for high-volume I/O jobs, is $TMPDIR, which is created at a job's start and deleted at its termination. Any needed input files must be moved there first. For batch jobs the value of TMPDIR is set automatically to /work/jobid. This directory exists only for the duration of the job, so you must explicitly save any files you need before the job completes. It is a large disk area that typically yields better I/O performance. Direct file transfers between a batch job and the tape archive or a remote host are not possible, except for jobs submitted to the interactive queue.
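
As a rough sketch of how these variables are typically used inside a job script (the file names here are hypothetical):

cd $TMPDIR                               # per-job scratch area, deleted when the job ends
cp $LOADL_STEP_INITDIR/my_prog.exe .     # stage the executable from the initial directory
cp $LOADL_STEP_INITDIR/input.dat .       # stage any needed input files
./my_prog.exe
cp results.out $LOADL_STEP_INITDIR/      # save results before $TMPDIR disappears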

Useful LL commands

The following commands are for common tasks involving the LL batch system on hydra. More information about batch processing can be obtained from the following man pages: llsubmit, llq, llclass, llcancel, llstatus, etc.

Command        Action
llsubmit jobfile Submits a jobfile
llq [-l jobid] [-w jobid] [-s jobid] All return information about the job with id=jobid. The -w option returns run-time utilization information (cpu and memory)
llclass [-l] [class] Lists info about configured classes/queues
llcancel jobid Terminates a job with id=jobid
llstatus [-R] Lists status info about all the nodes used by LL. The -R option describes the available and maximum resources (processors, memory, adapter windows) per node
qlimit Lists resource limit information per class/queue. A practical and useful command
listjobs Lists all batch jobs in a different format; accepts most of the command-line options of the llq command
listnodeusage [-j] [-n] Lists the nodes used per job (-j, the default) or the jobs assigned per node (-n)
jobinfo [stepid] Returns the resource utilization information for any running job. See this link for an example and for more information.

Queue Configuration

The jobs you submit are typically routed to "service stations" called classes or queues, which we described at the beginning. This routing delivers a job to a specific class on the basis of the resources you request and other parameters you specify. Different classes are configured with different resource limits (e.g., CPUs, memory), execution priorities, numbers of run slots, and so on. Some types of jobs are allocated more resources than others. Also, specific sets of nodes are assigned to run jobs that originate only from certain queues. Information on all of the above can be obtained from the output of two commands, qlimit and listnodeusage.

The output of the qlimit command can be found at this link which is updated every 5 minutes.

All MPI jobs, 16-way or higher, intended for the mpi queues should occupy full nodes, i.e., their per-node specifications should consume all 16 cpus of a node (for example, 16 tasks per node with ConsumableCpus(1)), as sketched below. For related discussion, see the subsections on the interactive queue and task distribution.
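
A sketch of such a full-node specification for a 32-process MPI job, consistent with the MPI examples elsewhere in this section:

#@ node                 = 2
#@ tasks_per_node       = 16
#@ resources            = ConsumableCpus(1) ConsumableMemory(1500mb)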

hydra# listnodeusage
Job ID            Owner        Class  Cpus  ST  Node(Tasks,CCpusPerTask)
------            -----   ----------  ----  --  ------------------------
f1n2.157582.0     mudai        mpi32    32   R  f1n4(16,1), f1n6(16,1)
f1n2.157610.0   ouo2224        mpi16    16   R  f4n8(16,1)
f1n2.157616.0   hhp0872        mpi32    32   R  f3n10(16,1), f4n5(16,1)
f1n2.157712.0   s0l4620        mpi32    32   R  f2n9(16,1), f4n7(16,1)
f1n2.157727.0   rjc1212        mpi32    32   R  f3n2(16,1), f4n4(16,1)
f1n2.157924.0     giese    geo_group    32   R  f2n10(16,1), f2n8(16,1)
f1n2.158027.0   pingluo     smp_long     1   R  f1n2(1,1)
f1n2.158037.0    jenn04   smp_normal     8   R  f2n1(1,8)
f1n2.158053.0   s0l4620   smp_normal     8   R  f2n2(8,1)
f1n2.158060.0    jenn04   smp_normal     8   R  f2n6(1,8)
f1n9.156779.0   yubofan        mpi32    32   R  f3n6(16,1), f4n2(16,1)
f1n9.156859.0   yubofan        mpi32    32   R  f3n3(16,1), f3n5(16,1)
f1n9.156860.0   yubofan        mpi32    32   R  f3n1(16,1), f3n7(16,1)
f1n9.156861.0   yubofan        mpi32    32   R  f4n1(16,1), f4n3(16,1)
f1n9.157127.0     mudai        mpi32    32   R  f3n8(16,1), f3n9(16,1)
f1n9.157239.0   q0s1711        mpi32    32   R  f1n5(16,1), f1n7(16,1)
f1n9.157273.0   rjc1212        mpi32    32   R  f3n4(16,1), f4n10(16,1)
f1n9.157482.0   hhp0872     smp_long     8   R  f2n7(8,1)
f1n9.157484.0   hhp0872     smp_long     8   R  f4n6(8,1)
f1n9.157568.0    rivera        mpi32    28   R  f1n8(14,1), f4n9(14,1)
f1n9.157573.0   pingluo     smp_long     1   R  f1n9(1,1)
f1n9.157577.0    jenn04   smp_normal     8   R  f2n5(1,8)

Total                                  478

The Interactive Queue

This is a queue that services only interactive jobs. These run only on the two login nodes, hydra and hydra2. The maximum number of cpus you can engage during a run is 16, but with a maximum of 8 per node (ConsumableCpus(8)). There are two ways to use the interactive queue for MPI jobs: poe command-line style, and regular job file submission using the llsubmit command with #@ class = interactive specified in the job file. In the former case, the batch system is activated by the -resd yes specification.

Login command-line Example

poe ./prog.exe -resd yes -procs 4 -tasks_per_node 4 -nodes 1 -euilib us -single_thread yes
poe ./prog.exe -resd yes -procs 8 -tasks_per_node 4 -nodes 2 -euilib us -single_thread yes

An example job file for the interactive queue can be found here.

The "smp_only" Queue

The smp_only queue has been configured for single-node jobs (serial, OpenMP, MPI, or hybrid). Four nodes are available exclusively for this queue. The caveats for using this queue are listed below:

The limits (per-job, per-user, and class-wide) for the smp_only queue can be found here.

Special Queues

LL will direct your job to an appropriate queue (or class) based upon the resources (mostly the number of processors, then memory) you specify in the job file. For your job to be executed via the special-access queues, cs_group, geo_group, or special, you must have permission from your departmental representative to become a member of the queue. In addition, you must add the #@ class = queue_name directive to your job file, as shown below.
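
For example, an authorized member of the geo_group queue would add:

#@ class                = geo_group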

There are several aspects of the above three queues that you need to know, either as a general user or as a special user who has access to one of them. Their existence, allocated resources, and higher scheduling priority reflect the fact that the corresponding university entities (the Vice President for Research Office, the Computer Science Department, the School of Geoscience) made financial contributions toward the purchase of the HYDRA cluster. A user gains access to one of these queues after being authorized by his/her departmental representative, who in turn notifies us.

All such jobs have higher scheduling priority. In addition, jobs submitted through the special, geo_group, and cs_group queues can also preempt jobs running in other queues. That is, other jobs can be suspended and swapped out, if need be, to make room for these special jobs to run. A suspended job resumes normal execution when sufficient resources (e.g., processors and memory) become available again in the originally assigned configuration, i.e., the same nodes, the same memory, etc. The list of preemptible classes/queues may vary depending on operational needs. Currently (circa mid May 2008), jobs from the cs_group and geo_group queues can preempt only jobs from the smp_ queues, while jobs from the special queue can preempt jobs from the smp_ and mpi queues. Typically, the special queue is intermittently active during the hurricane season (June - December).

If a user can show that he/she has a job with special needs that cannot be accommodated by the existing queue structure and resource limits, we may be able to temporarily create special queues to accommodate those needs. Such requests must be conveyed to the full-time staff at the facility along with a sufficiently detailed description of the problem being encountered. If authorized, special-purpose queues are restricted to the exclusive use of the requesting person(s) and are deleted once the need is satisfied. To make such a request, please email the Help Desk with a justification.

Example Job Files

The following are sample batch job files for various job types:

Example Serial Job File

#----------------- start of sample_serial.job ----------------------
#@ shell                = /bin/ksh
#@ comment              = Serial (1-proc) job
#@ initialdir           = $(home)/mpi/jacobi/
#@ job_name             = serial_jacobi
#@ input                = $(initialdir)/input2.dat
#@ error                = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ output               = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ job_type             = serial
#@ wall_clock_limit     = 00:50:00
# Specifies, per task/process, one cpu/core and 4 giga-bytes of memory
#@ resources            = ConsumableCpus(1) ConsumableMemory(4gb)
#@ notification         = error
#@ queue
#
./serial_jacobi.exe
#
/usr/local/bin/jobinfo
#---------------- end of sample_serial.job ----------------------

This serial example requests 4 GB of memory for 50 minutes of execution for a serial program.

Example OpenMP Job File

#------------ Example job file: sample_openmp.job --------------
#@ shell                = /bin/ksh
#@ comment              = 8-proc OpenMP Job
#@ initialdir           = $(home)/mpi/jacobi/
#@ job_name             = omp_poisson_1d
#@ input                = $(initialdir)/input.dat
#@ error                = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ output               = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ job_type             = parallel   
# Specifies, per task/process, 8 cpus/cores and 4 giga-bytes of memory
#@ resources            = ConsumableCpus(8) ConsumableMemory(4gb)
#@ wall_clock_limit     = 00:50:00
#@ node                 = 1
#@ total_tasks          = 1
#@ notification         = error
#@ queue
#
cd $TMPDIR
cp $LOADL_STEP_INITDIR/omp_poisson_1d.exe .
#
export OMP_NUM_THREADS=8
export OBJECT_MODE=64
export OMP_DYNAMIC=FALSE           # Most common but not always
export MALLOCMULTIHEAP=HEAPS:8
export AIXTHREAD_SCOPE=S
#
./omp_poisson_1d.exe
#
/usr/local/bin/jobinfo
#------------------ End of sample_openmp.job -------------------

Please note that the LL directives node and total_tasks (or tasks_per_node) must be set to 1 in pure OpenMP jobs. It is in the ConsumableCpus(nn) directive that you set nn equal to the value of OMP_NUM_THREADS; this number cannot exceed 16. The ConsumableMemory setting must reflect the total memory needed by all the OpenMP threads under the control of the one process/task within a p575 node; this amount cannot exceed ~25GB. Note also that the environment variables OMP_NUM_THREADS and OBJECT_MODE could alternatively have been set as part of the #@ environment directive.
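
A minimal sketch of that alternative, in which the OpenMP-related variables are set via the #@ environment directive instead of export statements in the script body (values as in the example above):

#@ environment          = OMP_NUM_THREADS=8; OBJECT_MODE=64
#@ resources            = ConsumableCpus(8) ConsumableMemory(4gb)
#@ node                 = 1
#@ total_tasks          = 1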

Example MPI Job File

#----------------- start of sample_mpi.job ----------------------
#@ shell                = /bin/ksh
#@ comment              = 128-proc, 8-node MPI job
#@ initialdir           = $(home)/mpi/jacobi
#@ job_name             = p_poisson_1d
#@ input                = $(initialdir)/input.dat
#@ error                = $(job_name).o$(schedd_host).$(jobid).$(stepid) 
#@ output               = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ job_type             = parallel
#@ environment          = MP_SINGLE_THREAD=yes
# Specifies, per task/process, one cpu/core and 1.5 giga-bytes of memory
#@ resources            = ConsumableCpus(1) ConsumableMemory(1500mb)
#@ wall_clock_limit     = 01:50:00
#@ network.MPI          = sn_single, shared, US
#@ node                 = 8
#@ tasks_per_node       = 16
# Specifies, per node, 16 tasks (e.g., MPI processes) and 16*1.5 giga-bytes of memory
#@ notification         = error
#@ queue
#
cd $TMPDIR              # Operate in a high-performance I/O area
cp $LOADL_STEP_INITDIR/p_poisson_1d.exe .
#
poe ./p_poisson_1d.exe -shared_memory yes
#
/usr/local/bin/jobinfo
#---------------- end of sample_mpi.job ----------------------

This MPI example requests 128 cpus across 8 nodes exclusively with 16 tasks per node for 1 hour and 50 minutes of execution time.

Example OpenMP-MPI Job File

#----------------- start of sample_hybrid.job ----------------------
#@ shell                = /bin/ksh
#@ comment              = 128-proc, 8-node hybrid OpenMP-MPI job
#@ initialdir           = $(home)/mpi/jacobi
#@ job_name             = p_poisson_1d
#@ input                = $(initialdir)/input.dat
#@ error                = $(job_name).o$(schedd_host).$(jobid).$(stepid) 
#@ output               = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ job_type             = parallel
#@ environment          = MP_SINGLE_THREAD=yes; OMP_NUM_THREADS=16
# Specifies, per task/process, 16 cpus/cores and 24 giga-bytes of memory
#@ resources            = ConsumableCpus(16) ConsumableMemory(24000mb)
#@ wall_clock_limit     = 01:50:00
#@ network.MPI          = sn_single, shared, US
#@ node                 = 8
# Specifies, per node, 1 task (e.g., MPI processes) and 1*24 giga-bytes of memory
#@ tasks_per_node       = 1
#@ notification         = error
#@ queue
#
cd $TMPDIR              # Operate in a high-performance I/O area
cp $LOADL_STEP_INITDIR/p_poisson_1d.exe .
#
poe ./p_poisson_1d.exe -shared_memory yes
#
/usr/local/bin/jobinfo
#---------------- end of sample_hybrid.job ----------------------

This OpenMP-MPI example requests 128 cpus across 8 nodes exclusively, with 1 task per node and 16 consumable cpus (here mapped onto OpenMP threads) per task, for 1 hour and 50 minutes of execution time. The product of the tasks per node and the consumable cpus per task must not exceed 16, which is the number of physical cores on a node. Similarly, the amount of memory per node (= tasks_per_node * memory size per task), here set to 24 GB, must not exceed 25 GB. Please note two more things: (1) your program needs to be compiled and linked with both OpenMP and MPI support in order to be used with such a job file, as sketched below; and (2) the OMP_NUM_THREADS environment variable and other OpenMP environment variables must be set through the LL #@ environment directive if they are to be propagated to all the MPI tasks.
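
As a rough sketch of point (1), assuming the IBM thread-safe MPI compiler wrappers available on hydra and a hypothetical source file name, a hybrid code would be built with both the MPI wrapper and the OpenMP compile flag:

mpxlf90_r -qsmp=omp -q64 -o p_poisson_1d.exe p_poisson_1d.f90     # Fortran 90
mpcc_r    -qsmp=omp -q64 -o p_poisson_1d.exe p_poisson_1d.c       # C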

Example Interactive Job File

#----------------- start of sample_interactive.job ----------------------
#@ shell                = /bin/ksh
#@ comment              = 8-proc, 2-node MPI job for INTERACTIVE CLASS
#@ initialdir           = $(home)/mpi/jacobi
#@ job_name             = p_poisson_1d
#@ input                = $(initialdir)/input.dat
#@ error                = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ output               = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ job_type             = parallel
#@ environment          = MP_SINGLE_THREAD=yes
#@ resources            = ConsumableCpus(1) ConsumableMemory(2000mb)
#@ wall_clock_limit     = 01:50:00
#@ network.MPI          = sn_single, shared, US
#@ node                 = 2
#@ tasks_per_node       = 4
#@ notification         = error
#@ class                = interactive
#@ queue
#
cd $TMPDIR              # Operate in a high-performance I/O area
cp $LOADL_STEP_INITDIR/p_poisson_1d.exe .
#
poe ./p_poisson_1d.exe -shared_memory yes
#
/usr/local/bin/jobinfo
#
#---------------- end of sample_interactive.job ----------------------

This interactive queue example file requests 8 tasks across 2 nodes in the interactive queue. For more information about the interactive queue, see this section.

Example Multi-Step Job File

#@ shell		= /bin/ksh
#@ job_name             = p_poisson_1d
#@ network.MPI          = sn_single, shared, US
#@ environment          = MP_SINGLE_THREAD=yes
#@ notification         = error

#@ error                = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ output               = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ job_type             = parallel
#@ initialdir           = $(home)/mpi/jacobi/
#@ resources            = ConsumableCpus(1) ConsumableMemory(1000mb)
#@ wall_clock_limit     = 02:00:00
#@ node                 = 1
#@ tasks_per_node       = 2
#@ queue

#@ error                = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ output               = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ job_type             = parallel
#@ initialdir           = $(home)/mpi/jacobi/
#@ resources            = ConsumableCpus(1) ConsumableMemory(1000mb)
#@ wall_clock_limit     = 01:00:00
#@ node                 = 1
#@ tasks_per_node       = 4
#@ queue

#@ error                = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ output               = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ job_type             = parallel
#@ initialdir           = $(home)/mpi/jacobi/
#@ resources            = ConsumableCpus(1) ConsumableMemory(1000mb)
#@ wall_clock_limit     = 00:30:00
#@ node                 = 1
#@ tasks_per_node       = 8
#@ queue
poe ./p_poisson_1d.exe -shared_memory yes
#
/usr/local/bin/jobinfo
#

This example job file results in 3 job steps with 2, 4, and 8 tasks, respectively. Each job step executes the same program with POE, but with a different number of tasks and a different wall clock limit.

Task Distribution for MPI Jobs

There are several considerations you may want to keep in mind when specifying how LL should distribute your multi-tasked jobs across the 38 nodes. We deal with this issue here in a limited but, we hope, useful manner. Consider, for instance, the following LL specifications, both consistent with a 128-processor run:

#@ node                 = 8
#@ tasks_per_node       = 16

or

#@ node                 = 16
#@ tasks_per_node       = 8

The first setting will, for an MPI code, on average and other things being equal, yield the better wall-clock performance because it causes fewer communication delays across the HPS network. (The implicit assumption here is a rough load balance across all tasks in a job, in both computation and communication.) The second specification is justifiable when each task (MPI process) requires considerably more than 1.5GB of memory. A particular communication pattern may also make the second option attractive (but see also the explanation of the task_geometry keyword); that might occur, for instance, if 16 tasks per node engage in a communication pattern that saturates the two HPS adapters. The seriously unattractive aspect of the second option is that, by requesting a high number of nodes (16), it will on average experience longer scheduling delays when the task distribution across the nodes is dominated by 16 tasks per node or is mixed. From the system-efficiency and management points of view, a uniform task distribution pattern is desirable: all jobs should place only 4 tasks, or only 8, or only 16 in a node. Whether with 1 or 16 tasks per node, a uniform distribution across the whole system is highly desirable; otherwise protracted scheduling delays can occur. Since we must select one, let it be tasks_per_node = 16 combined with the minimum number of nodes needed, for jobs directed to the mpi16, mpi32, ... queues (see the subsection on the Interactive Queue for an important exception). For MPI jobs that require 16 or fewer processes, the node specification should be 1, unless again large per-task memory requirements are in effect. Here are two typical specifications:

#@ node                 = 1
#@ tasks_per_node       = 16

#@ node                 = 1
#@ tasks_per_node       = 4

When a multiprocessor job can benefit in performance from having certain combinations of its tasks (i.e., MPI processes) scheduled together on the same node, you can use the task_geometry command keyword. For a 12-processor job, for instance, you could specify,

#@ task_geometry        = {(0,2,4,6) (3,9) (1,5,7,8) (10,11)}

This assignment asks for four nodes such that the first engaged node picks up tasks (e.g., MPI processes) 0, 2, 4, and 6, the second tasks 3 and 9, and so on. All 12 tasks must be listed, and each pair of parentheses contains the task IDs assigned to a single node. When used, this directive by itself specifies the node and task distribution and/or allocation. The task_geometry feature can be very helpful, for instance, in a parallel job with heavy communication traffic only between specific tasks; such tasks ought to be placed on the same node as much as possible.

Policies and Best Practices

Batch system policies are approved by the Steering Committee (review@sc.tamu.edu) and may on occasion change to reflect changing needs and load conditions. Your adherence to the guidance below will be appreciated: a little care on your part in doing certain things right goes a long way toward keeping hydra running efficiently and fairly. Very reluctantly, in order to maintain fairness and efficiency, we will on occasion prematurely terminate jobs. The Batch System Policies page lists common reasons for which the staff may terminate a job.

Appropriate Job Resource Requirements

You should not, as a matter of practice, set resource levels for your job to the maximal queue values unless you actually need them. Larger settings are harder to satisfy and hence will delay your job's execution on a busy system. This is particularly true when the resource is memory and/or the number of CPUs. Set job resource limits to the lowest level consistent with successful completion. On this point, for example, if you run commercial code, say Gaussian, ABAQUS, or FLUENT, the native/internal resource limits you specify for it and the resource limits you specify in the job file MUST match. If you need help setting the latter, please contact the Help Desk for assistance.

Queued Jobs Not Executing

The batch system puts limits on the total amount of resources a user may consume and the total number of jobs a user can run. Also, each queue has a limit on the total number of jobs it can run and on the number of jobs it can run per user.

You may therefore find that resources (e.g., cpus) appear to be available, yet your job remains queued because of one of these limits.

Jobs Using Files From the Tape Archive

If your job requires files from the tape archive (or a remote host), we recommend that you first manually copy those files from the archive to, say, your /tmp1 directory on hydra before you submit your job. The objective is to avoid possible delays during batch processing.

Proper LL batch job scripts

The first line of ANY LL batch job script should be:

#@ shell = /bin/ksh

For your scripts, please use only the Bourne shell and its derivatives: ksh (Korn), bash (BASH), sh (Bourne). Our main testing efforts are against these shells. Avoid coding in the C-shell: it has widely known bugs and vulnerabilities as well as limited I/O functionality. We do not support it, nor, for the most part, do any vendors.

Proper startup scripts

We strongly discourage you from transferring startup scripts (.login, .profile, etc.) from other systems, or from casually changing the ones your account comes with, unless you understand the effects of those changes. Also, do not expect PBS or LSF batch scripts to work on this cluster. This admonition applies to application-controlled files as well, such as those used by ABAQUS, CFX, etc.

Additional Information

More information about batch processing can be obtained from the following man pages: llsubmit, llq, llclass, llcancel, and llstatus.