Batch Processing using the LoadLeveler (LL)
Last update on Tuesday, 15-Apr-2008 18:03:30 CDT.
Batch, or batch processing, here will refer to the capability of running
jobs through an intermediate (software) subsystem, and not directly through the
UNIX kernel. The latter occurs, for instance, when you run a command from the
login prompt. A job is a collection of related processes. Batch mode execution
provides for a controlled process over job scheduling and resource allocation.
The LoadLeveler (LL), the batch system on agave, and other such systems
accomplish this control by defining various queues. These are a major component
of the software infrastructure for managing the arrival and execution of jobs.
Each queue is configured with a set of attributes such as name, queue priority,
queue resource limits, and job run limits. The batch system allows users to
overcome resource limits imposed on interactive (sometimes referred to as
"command-line") processing and to regulate the execution flow of jobs on the
basis of specific policies.
The LL is IBM's batch facility on agave. It can run jobs on up to 32 processors.
In order to run a job, you must first construct a job file.
That is a text file that contains LL directives which specify the job's
features and run characteristics. Upon submission, your job receives a job id,
lands into one of the configured queues or "classes", as LL calls them, and
from there it is appropriately scheduled for execution. Below we provide a
minimum of information about LL commands, batch files (through samples),
classes, environment variables, and other related information. We proceed
mostly with examples.
Useful LL commands
The following commands are for common tasks involving the LL batch system
on agave. More information about batch processing can be obtained from the
following man pages: llsubmit, llq, llclass, llcancel, llstatus, etc.
| llsubmit jobfile |
Submit a jobfile to LL |
| llq [-l jobid] |
Returns info about job with id=jobid |
| llclass [-l] [class] |
Lists info about configured classes/queues |
| llcancel jobid |
Terminates a job with id=jobid |
| llstatus [-R] |
Lists status info about all the nodes used by LL. The -R option describes the available and maximum resources
(processors, memory, adapter windows) per node |
| qlimit |
Lists resource limit information per class/queue. A practical and usefull command |
| listjobs |
List all batch jobs in a different format, accepts most of the command line options from the llq command |
| listnodeusage |
List nodes used per job (-j option) or list jobs assigned per node (-n option). The default option is -j. |
| llhist |
Returns job resource information on completed jobs for accounting purposes. This script scans the batch system history files from the last 28 days. |
Job Files
To run batch job you first construct a job file. A job file is a text file
with LL directives and Unix commands. Job file entries consists of:
- LL command keyword statements. These start with #@
- Shell command statements.
- Job command variables. These you specify using the, $(var), syntax as parameters to LL keywords.
- Comments, which begin with a #.
The following is a sample MPI batch job file for the LL batch system:
#------------ Example job file: sample_mpi.job --------------
#@ shell = /bin/ksh
#@ comment = 8-proc, MPI job
#@ initialdir = $(home)/mpi/jacobi/
#@ job_name = p_poisson_1d
#@ input = $(initialdir)/input.dat
#@ error = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ output = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ job_type = parallel
#@ environment = COPY_ALL; MP_SINGLE_THREAD=yes; MP_PROCS=8
#@ resources = ConsumableCpus(1) ConsumableMemory(500mb)
#@ wall_clock_limit = 00:50:00
#@ node = 1
#@ tasks_per_node = 8
#@ notification = error
#@ queue
#
poe ./p_poisson_1d.exe -shared_memory yes
#
#------------------ End of sample_mpi.job -------------------
|
A detailed description of this job's attributes follows:
| #@ shell = /bin/ksh
| The job will be executed under the ksh shell,
the recommended and the only supported shell for batch job scripts. |
| #@ comment = 8-proc, MPI job
| The comments for the batch job file.
|
| #@ initialdir = $(home)/mpi/jacobi/
| Sets the initial work directory for the job to /home/username/mpi/jacobi. When not specified, LL sets it by default to the directory you submitted the job from. IMPORTANT: LL will hold your job indefinitely if for whatever reason it cannot access the specified directory.
|
| #@ job_name = p_poisson_1d
| Sets the LL job name to p_poisson_1d |
| #@ input = $(initialdir)/input.dat
| Sets the job's standard input to /home/username/mpi/jacobi/input.dat. When not set, LL sets it to /dev/null. IMPORTANT: LL will hold your job indefinitely if for whatever reason it cannot access the specified file.
|
| #@ error = $(job_name).o$(schedd_host).$(jobid).$(stepid)
| Sets the job's standard error to p_poisson_1d.oscheddhost.jjjj.ssss. Where scheddhost is either of the scheduling
nodes f1n2 or f1n9, jjjj is the job id, and step id is ssss.
|
| #@ output = $(job_name).o$(schedd_host).$(jobid).$(stepid)
| Sets the job's standard output to the same file as standard error. That is, std out and std error are concatenated
|
| #@ job_type = parallel
| Instructs that this job is parallel. Other setting is serial. The default setting is serial if job_type is not specified.
|
| #@ environment = COPY_ALL; MP_SINGLE_THREAD=yes; MP_PROCS=8
| Specifies, via COPY_ALL, that all the environment variables of your login process be made available to your job.
Next, it sets the poe enviroment variables MP_SINGLE_THREAD and MP_PROCS (see man poe). Any environment variables
appearing and set on this line before the COPY_ALL and also specified in your login control files, will have their
values set to those of the login process. Many times use of COPY_ALL is not needed |
| #@ resources = ConsumableCpus(1) ConsumableMemory(500mb)
| This LL directive pertains to the allocation of resources per-task_per-node. Here it allocates one processor and 500mb of memory per MPI process or task. This is the standard setting (1 cpu) in an MPI program. If MPI processes also spawn OpenMP threads, however, then set resources = ConsumableCpus(num_threads) ConsumableMemory(msz) per task, where msz is the aggregate amount of memory taken up by num_threads threads, and where num_threads=$OMP_NUM_THREADS. For pure OpenMP programs, num_threads=$OMP_NUM_THREADS also. This number cannot exceed 24
|
| #@ wall_clock_limit = 00:50:00
| Sets the wall clock time for the duration of the job to 50 minutes. The format here is hh:mm:ss
|
| #@ node = 1
| Specifies that LL use one node |
| #@ tasks_per_node = 8
| Specifies that LL use 8 tasks (MPI processes here) per node.
|
| #@ notification = error
| LL will report to user via e-mail on agave in case an error condition ocurrs. To receive the e-mail on another machine, say, on NEO, specify also: #@notify_user = userid@neo.tamu.edu |
| #@ queue
| Schedule job for execution. A required keyword
|
| poe ./p_poisson_1d.exe -shared_memory yes
| poe executes the p_poisson_1d.exe binary using shared memory communication protocol for processes within a node
|
Job Submission & Monitoring
Use the llsubmit command to submit your job as shown below:
$ llsubmit sample_mpi.job
llsubmit: Processed command file through Submit Filter: "/home/loadl/job_filter.pl".
llsubmit: The job "agave.tamu.edu.377" has been submitted.
|
When submitted to LL, this job is given a unique job ID, agave.tamu.edu.337. You can use the llq command with various options to monitor your job as shown below:
$ llq
Id Owner Submitted ST PRI Class Running On
------------------------ ---------- ----------- -- --- ------------ -----------
agave.377.0 user1 2/22 11:12 R 50 p8 agave
1 job step(s) in queue, 0 waiting, 0 pending, 1 running, 0 held, 0 preempted
$ llq -l agave.377.0
# will list details about job whose job id is f1n2.338.0
$ llq -w agave.377.0
# will list cpu time and main memory per node where he jobs happens to have
processes on
$ llq -s agave.377.0
# will list same information as the -l option, but will also list LL scheduling messages. Useful
especially in detecting the reasons for a job not running!
|
The following is a sample OpenMP batch job file for the LL batch system on agave ONLY:
#------------ Example job file: sample_openmp.job --------------
#@ shell = /bin/ksh
#@ comment = 8-proc OpenMP Job
#@ initialdir = $(home)/mpi/jacobi/
#@ job_name = omp_poisson_1d
#@ input = $(initialdir)/input.dat
#@ error = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ output = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ job_type = parallel
#@ resources = ConsumableCpus(1) ConsumableMemory(2gb)
#@ wall_clock_limit = 00:50:00
#@ node = 1
#@ total_tasks = 8
#@ notification = error
#@ queue
#
cd $TMPDIR
cp $LOADL_STEP_INITDIR/omp_poisson_1d.exe .
#
export OMP_NUM_THREADS=8
#
omp_poisson_1d.exe
#
#------------------ End of sample_openmp.job -------------------
|
In this job the execution takes place in the high-performance $TMPDIR area.
The value of TMPDIR for batch jobs is set automatically to /work/jobid. This
directory stays in existence only for the duration of the job.
LOADL_STEP_INITDIR is an LL shell environment variable whose value is the
setting for #@ initialdir. Any generated output that is needed should obviously
be copied elsewhere, $LOADL_STEP_INITDIR, /tmp1/username, $HOME/username,
etc.
Please note that LL directives, ConsumableCpus
must be set to 1 in OpenMP jobs. It is in the total_tasks=nn directive that
you will set nn equal to the value of OMP_NUM_THREADS. This number, nn, cannot
exceed 24. The ConsumableMemory setting must reflect the total memory needed by all OpenMP threads under
the control of one process/task. This number cannot exceed ~2GB/task
The following is anothe sample batch job file:
#----------------- start of sample1.job ----------------------
#@ shell = /bin/ksh
#@ comment = Serial (1-proc) job
#@ initialdir = $(home)/mpi/jacobi/
#@ job_name = serial_jacobi
#@ input = $(initialdir)/input2.dat
#@ error = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ output = $(job_name).o$(schedd_host).$(jobid).$(stepid)
#@ job_type = serial
#@ wall_clock_limit = 00:50:00
#@ resources = ConsumableCpus(1) ConsumableMemory(2gb)
#@ notification = error
#@ queue
#
serial_jacobi.exe
#
echo "=============================================================="
llq -w $LOADL_STEP_ID
echo "=============================================================="
llq -l $LOADL_STEP_ID
#---------------- end of sample1.job ----------------------
|
| $LOADLBATCH |
When set to yes it indicates running under LL. |
| $LOADL_STEP_ID |
Contains the job's step id. |
| $LOADL_JOB_NAME |
Its value is the same as the value of #@ job_name in your job file.
|
| $LOADL_STEP_INITDIR |
Contains value of initial working directory |
| $LOADL_STEP_TYPE |
contains job type (SERIAL, PARALLEL) |
| $LOADL_STEP_CLASS |
Contains contains name of class/queue of that the job was scheduled under |
| $TMPDIR |
In batch you can choose to execute your job in any permissible directory. The best choice is $TMPDIR, which is created
at a job's start and deleted at its termination. The value of TMPDIR for batch jobs is set automatically to /work/jobid. This direcory stays in existence only for the duration of the job. You must explicitly save any files you need before job completion. This is a large disk area that, typically, yields better I/O performance. Direct file transfers to or from a batch job and the tape archive or a remote host are not possible. |
Queue Configuration
The configuration of any queue may undergo occasional changes, either to
reflect the changing needs of the user community or to improve machine
utilization and to better manage limited resources. In order to determine the
current attributes and resource limits imposed on any queue you can use the
qlimit command. The qlimit command provides a partial account of queue
characteristics for all queues on the system.
Special Queues
LL will direct your job to an apppropriate queue (or class) based upon the
resources (mostly number of processors and then memory) you specify in the job
file. For your job to be executed via the special access queues,
p16 or p24, you must have permission
from your departamental representative in order to become a member of the
queue.
Jobs belonging to users who are
members in one of these queues, and who in addition want to specify that their
jobs be scheduled through such queues, must use the #@ class directive as in
the example below,
Policies and Best Practices
Batch system policies are approved by the Steering Committee, review@sc.tamu.edu, and may on occasion
change to reflect changing needs and load conditions. Your adherence to what
we say below will be appreciated. What we aim at is to convince you that a
little care on your part in doing certain things right will go a long way to
keep agave efficiently and fairly run. Very reluctantly, in order to maintain
fairness and efficiency we will on occasion prematurely terminate jobs. The
subsection Abnormal Job Termination lists common
reasons for terminating a job by the staff.
Appropriate Job Resource Requirements You
should not, as a matter of practice, set resource levels for your job to
maximal queue values unless you actually need to. Larger settings are harder
to satisfy and, hence, will delay your job's execution on a busy system. This
is particularly true when the resource is memory and/or the number of CPUs. Set
job resource limits to the lowest possible level consistent with a successful
completion. On this point, for example, you need to make sure that if you run
commercial code, say, Gaussian, ABAQUS, or FLUENT, the native/internal resource
limits which you specify for them and the resource limits you specify in the
job file MUST match. If you need help in setting the latter, please
contact the Help Desk for assistance.
Invalid Parallel Batch Jobs
Jobs requesting multiple cpus must use multiple cpus simultaneously from a
single command. Running multiple independent commands in the background in a
batch job script is NOT parallel processing and is not permitted. Just
so there is NO misunderstanding, the following example constitutes an
illustration of what is invalid parallel processing and therefore is NOT
permitted.
Queued Jobs Not Executing
The batch system puts limits on the total number of resources a user may use
and the total number of jobs a user can run. Also, each queue has limits on
the total number of jobs it can run and a limit on the number of jobs it can
run per user.
You may find that there may be available resources (eg. cpus) but your
job may still be queued because of one of these limits.
Jobs Using Files From the Tape Archive
If your job requires files from the tape archive (or a remote host), we
recommend that you first manually copy these files from the archive to your.
The objective here is to avoid possible delays during batch processing.
Proper LL batch job scripts
The first line of ANY LL batch job script should be:
For your scripts, please use only Bourne Shell compatibles or derivatives:
ksh (Korn), bash (BASH), sh (Bourne). Avoid coding in the c-shell. The c-shell
has widely known bugs and vulnerabilities, as well as limited I/O
functionality. We do not support it nor for the most part do any vendors.
Proper startup scripts
We strongly discourage you from transfering startup scripts (.login,
.profile, etc.) from other systems, or casually changing the ones that your
account comes with, unless you understand the effects of those changes. And
forget about PBS or LSF batch scripts working on this cluster. This admonition
applies to application controlled files as well, such as the ones used by
ABAQUS, CFX, etc.
Additional Information
More information about batch processing can be obtained from the following
man pages: llsubmit, llq, llclass, llcancel, and llstatus.
|