Batch Processing using the LoadLeveler (LL)

Last update on Tuesday, 15-Apr-2008 18:03:30 CDT.

Batch, or batch processing, here will refer to the capability of running jobs through an intermediate (software) subsystem, and not directly through the UNIX kernel. The latter occurs, for instance, when you run a command from the login prompt. A job is a collection of related processes. Batch mode execution provides for a controlled process over job scheduling and resource allocation. The LoadLeveler (LL), the batch system on agave, and other such systems accomplish this control by defining various queues. These are a major component of the software infrastructure for managing the arrival and execution of jobs. Each queue is configured with a set of attributes such as name, queue priority, queue resource limits, and job run limits. The batch system allows users to overcome resource limits imposed on interactive (sometimes referred to as "command-line") processing and to regulate the execution flow of jobs on the basis of specific policies.

The LL is IBM's batch facility on agave. It can run jobs on up to 32 processors. In order to run a job, you must first construct a job file. That is a text file that contains LL directives which specify the job's features and run characteristics. Upon submission, your job receives a job id, lands into one of the configured queues or "classes", as LL calls them, and from there it is appropriately scheduled for execution. Below we provide a minimum of information about LL commands, batch files (through samples), classes, environment variables, and other related information. We proceed mostly with examples.

Useful LL commands

The following commands are for common tasks involving the LL batch system on agave. More information about batch processing can be obtained from the following man pages: llsubmit, llq, llclass, llcancel, llstatus, etc.

CommandTask
llsubmit jobfile Submit a jobfile to LL
llq [-l jobid] Returns info about job with id=jobid
llclass [-l] [class] Lists info about configured classes/queues
llcancel jobid Terminates a job with id=jobid
llstatus [-R] Lists status info about all the nodes used by LL. The -R option describes the available and maximum resources (processors, memory, adapter windows) per node
qlimit Lists resource limit information per class/queue. A practical and usefull command
listjobs List all batch jobs in a different format, accepts most of the command line options from the llq command
listnodeusage List nodes used per job (-j option) or list jobs assigned per node (-n option). The default option is -j.
llhist Returns job resource information on completed jobs for accounting purposes. This script scans the batch system history files from the last 28 days.

Job Files

To run batch job you first construct a job file. A job file is a text file with LL directives and Unix commands. Job file entries consists of:

  1. LL command keyword statements. These start with #@
  2. Shell command statements.
  3. Job command variables. These you specify using the, $(var), syntax as parameters to LL keywords.
  4. Comments, which begin with a #.

The following is a sample MPI batch job file for the LL batch system:


#------------ Example job file: sample_mpi.job --------------
#@ shell                = /bin/ksh
#@ comment              = 8-proc, MPI job
#@ initialdir           = $(home)/mpi/jacobi/
#@ job_name             = p_poisson_1d
#@ input                = $(initialdir)/input.dat
#@ error                = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ output               = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ job_type             = parallel
#@ environment          = COPY_ALL; MP_SINGLE_THREAD=yes; MP_PROCS=8
#@ resources            = ConsumableCpus(1) ConsumableMemory(500mb)
#@ wall_clock_limit     = 00:50:00
#@ node                 = 1
#@ tasks_per_node       = 8
#@ notification         = error
#@ queue
#
poe ./p_poisson_1d.exe -shared_memory yes    
#
#------------------ End of sample_mpi.job -------------------

A detailed description of this job's attributes follows:

Line Description
#@ shell = /bin/ksh The job will be executed under the ksh shell, the recommended and the only supported shell for batch job scripts.
#@ comment = 8-proc, MPI job The comments for the batch job file.
#@ initialdir = $(home)/mpi/jacobi/ Sets the initial work directory for the job to /home/username/mpi/jacobi. When not specified, LL sets it by default to the directory you submitted the job from. IMPORTANT: LL will hold your job indefinitely if for whatever reason it cannot access the specified directory.
#@ job_name = p_poisson_1d Sets the LL job name to p_poisson_1d
#@ input = $(initialdir)/input.dat Sets the job's standard input to /home/username/mpi/jacobi/input.dat. When not set, LL sets it to /dev/null. IMPORTANT: LL will hold your job indefinitely if for whatever reason it cannot access the specified file.
#@ error = $(job_name).o$(schedd_host).$(jobid).$(stepid) Sets the job's standard error to p_poisson_1d.oscheddhost.jjjj.ssss. Where scheddhost is either of the scheduling nodes f1n2 or f1n9, jjjj is the job id, and step id is ssss.
#@ output = $(job_name).o$(schedd_host).$(jobid).$(stepid) Sets the job's standard output to the same file as standard error. That is, std out and std error are concatenated
#@ job_type = parallel Instructs that this job is parallel. Other setting is serial. The default setting is serial if job_type is not specified.
#@ environment = COPY_ALL; MP_SINGLE_THREAD=yes; MP_PROCS=8 Specifies, via COPY_ALL, that all the environment variables of your login process be made available to your job. Next, it sets the poe enviroment variables MP_SINGLE_THREAD and MP_PROCS (see man poe). Any environment variables appearing and set on this line before the COPY_ALL and also specified in your login control files, will have their values set to those of the login process. Many times use of COPY_ALL is not needed
#@ resources = ConsumableCpus(1) ConsumableMemory(500mb) This LL directive pertains to the allocation of resources per-task_per-node. Here it allocates one processor and 500mb of memory per MPI process or task. This is the standard setting (1 cpu) in an MPI program. If MPI processes also spawn OpenMP threads, however, then set resources = ConsumableCpus(num_threads) ConsumableMemory(msz) per task, where msz is the aggregate amount of memory taken up by num_threads threads, and where num_threads=$OMP_NUM_THREADS. For pure OpenMP programs, num_threads=$OMP_NUM_THREADS also. This number cannot exceed 24
#@ wall_clock_limit = 00:50:00 Sets the wall clock time for the duration of the job to 50 minutes. The format here is hh:mm:ss
#@ node = 1 Specifies that LL use one node
#@ tasks_per_node = 8 Specifies that LL use 8 tasks (MPI processes here) per node.
#@ notification = error LL will report to user via e-mail on agave in case an error condition ocurrs. To receive the e-mail on another machine, say, on NEO, specify also: #@notify_user = userid@neo.tamu.edu
#@ queue Schedule job for execution. A required keyword
poe ./p_poisson_1d.exe -shared_memory yes poe executes the p_poisson_1d.exe binary using shared memory communication protocol for processes within a node

Job Submission & Monitoring

Use the llsubmit command to submit your job as shown below:

$ llsubmit sample_mpi.job
llsubmit: Processed command file through Submit Filter: "/home/loadl/job_filter.pl".
llsubmit: The job "agave.tamu.edu.377" has been submitted.

When submitted to LL, this job is given a unique job ID, agave.tamu.edu.337. You can use the llq command with various options to monitor your job as shown below:

$ llq
Id                       Owner      Submitted   ST PRI Class        Running On 
------------------------ ---------- ----------- -- --- ------------ -----------
agave.377.0               user1       2/22 11:12 R  50  p8       agave


1 job step(s) in queue, 0 waiting, 0 pending, 1 running, 0 held, 0 preempted

$ llq -l agave.377.0 
# will list details about job whose job id is f1n2.338.0

$ llq -w agave.377.0 
# will list cpu time and main memory per node where he jobs happens to have
processes on

$ llq -s agave.377.0 
# will list same information as the -l option, but will also list LL scheduling messages. Useful
especially in detecting the reasons for a job not running!


The following is a sample OpenMP batch job file for the LL batch system on agave ONLY:


#------------ Example job file: sample_openmp.job --------------
#@ shell                = /bin/ksh
#@ comment              = 8-proc OpenMP Job
#@ initialdir           = $(home)/mpi/jacobi/
#@ job_name             = omp_poisson_1d
#@ input                = $(initialdir)/input.dat
#@ error                = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ output               = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ job_type             = parallel
#@ resources            = ConsumableCpus(1) ConsumableMemory(2gb)
#@ wall_clock_limit     = 00:50:00
#@ node                 = 1
#@ total_tasks          = 8
#@ notification         = error
#@ queue
#
cd $TMPDIR
cp $LOADL_STEP_INITDIR/omp_poisson_1d.exe .
#
export OMP_NUM_THREADS=8
#
omp_poisson_1d.exe
#
#------------------ End of sample_openmp.job -------------------

In this job the execution takes place in the high-performance $TMPDIR area. The value of TMPDIR for batch jobs is set automatically to /work/jobid. This directory stays in existence only for the duration of the job. LOADL_STEP_INITDIR is an LL shell environment variable whose value is the setting for #@ initialdir. Any generated output that is needed should obviously be copied elsewhere, $LOADL_STEP_INITDIR, /tmp1/username, $HOME/username, etc.

Please note that LL directives, ConsumableCpus must be set to 1 in OpenMP jobs. It is in the total_tasks=nn directive that you will set nn equal to the value of OMP_NUM_THREADS. This number, nn, cannot exceed 24. The ConsumableMemory setting must reflect the total memory needed by all OpenMP threads under the control of one process/task. This number cannot exceed ~2GB/task

The following is anothe sample batch job file:

#----------------- start of sample1.job ----------------------
#@ shell                = /bin/ksh
#@ comment              = Serial (1-proc) job
#@ initialdir           = $(home)/mpi/jacobi/
#@ job_name             = serial_jacobi
#@ input                = $(initialdir)/input2.dat
#@ error                = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ output               = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ job_type             = serial
#@ wall_clock_limit     = 00:50:00
#@ resources            = ConsumableCpus(1) ConsumableMemory(2gb)
#@ notification         = error
#@ queue
#
serial_jacobi.exe
#
echo "=============================================================="
llq -w $LOADL_STEP_ID
echo "=============================================================="
llq -l $LOADL_STEP_ID
#---------------- end of sample1.job ----------------------

Common LL Environment Variables

LL Environment VariableDescription
$LOADLBATCH When set to yes it indicates running under LL.
$LOADL_STEP_ID Contains the job's step id.
$LOADL_JOB_NAME Its value is the same as the value of #@ job_name in your job file.
$LOADL_STEP_INITDIR Contains value of initial working directory
$LOADL_STEP_TYPE contains job type (SERIAL, PARALLEL)
$LOADL_STEP_CLASS Contains contains name of class/queue of that the job was scheduled under
$TMPDIR In batch you can choose to execute your job in any permissible directory. The best choice is $TMPDIR, which is created at a job's start and deleted at its termination. The value of TMPDIR for batch jobs is set automatically to /work/jobid. This direcory stays in existence only for the duration of the job. You must explicitly save any files you need before job completion. This is a large disk area that, typically, yields better I/O performance. Direct file transfers to or from a batch job and the tape archive or a remote host are not possible.

Queue Configuration

The configuration of any queue may undergo occasional changes, either to reflect the changing needs of the user community or to improve machine utilization and to better manage limited resources. In order to determine the current attributes and resource limits imposed on any queue you can use the qlimit command. The qlimit command provides a partial account of queue characteristics for all queues on the system.

Special Queues

LL will direct your job to an apppropriate queue (or class) based upon the resources (mostly number of processors and then memory) you specify in the job file. For your job to be executed via the special access queues, p16 or p24, you must have permission from your departamental representative in order to become a member of the queue.

Jobs belonging to users who are members in one of these queues, and who in addition want to specify that their jobs be scheduled through such queues, must use the #@ class directive as in the example below,

 
#@ class        = p16

Policies and Best Practices

Batch system policies are approved by the Steering Committee, review@sc.tamu.edu, and may on occasion change to reflect changing needs and load conditions. Your adherence to what we say below will be appreciated. What we aim at is to convince you that a little care on your part in doing certain things right will go a long way to keep agave efficiently and fairly run. Very reluctantly, in order to maintain fairness and efficiency we will on occasion prematurely terminate jobs. The subsection Abnormal Job Termination lists common reasons for terminating a job by the staff.

Appropriate Job Resource Requirements

You should not, as a matter of practice, set resource levels for your job to maximal queue values unless you actually need to. Larger settings are harder to satisfy and, hence, will delay your job's execution on a busy system. This is particularly true when the resource is memory and/or the number of CPUs. Set job resource limits to the lowest possible level consistent with a successful completion. On this point, for example, you need to make sure that if you run commercial code, say, Gaussian, ABAQUS, or FLUENT, the native/internal resource limits which you specify for them and the resource limits you specify in the job file MUST match. If you need help in setting the latter, please contact the Help Desk for assistance.

Invalid Parallel Batch Jobs

Jobs requesting multiple cpus must use multiple cpus simultaneously from a single command. Running multiple independent commands in the background in a batch job script is NOT parallel processing and is not permitted. Just so there is NO misunderstanding, the following example constitutes an illustration of what is invalid parallel processing and therefore is NOT permitted.

command1 &
command2 &

Queued Jobs Not Executing

The batch system puts limits on the total number of resources a user may use and the total number of jobs a user can run. Also, each queue has limits on the total number of jobs it can run and a limit on the number of jobs it can run per user.

You may find that there may be available resources (eg. cpus) but your job may still be queued because of one of these limits.

Jobs Using Files From the Tape Archive

If your job requires files from the tape archive (or a remote host), we recommend that you first manually copy these files from the archive to your. The objective here is to avoid possible delays during batch processing.

Proper LL batch job scripts

The first line of ANY LL batch job script should be:

#@ shell =/bin/ksh

For your scripts, please use only Bourne Shell compatibles or derivatives: ksh (Korn), bash (BASH), sh (Bourne). Avoid coding in the c-shell. The c-shell has widely known bugs and vulnerabilities, as well as limited I/O functionality. We do not support it nor for the most part do any vendors.

Proper startup scripts

We strongly discourage you from transfering startup scripts (.login, .profile, etc.) from other systems, or casually changing the ones that your account comes with, unless you understand the effects of those changes. And forget about PBS or LSF batch scripts working on this cluster. This admonition applies to application controlled files as well, such as the ones used by ABAQUS, CFX, etc.

Additional Information

More information about batch processing can be obtained from the following man pages: llsubmit, llq, llclass, llcancel, and llstatus.