|
Batch, or batch processing, is the capability of running jobs outside of the
interactive login session. In this document, batch implies a complex subsystem
which provides for control over job scheduling and resource contention. On
cosmos, the batch system is part of the Portable Batch System (PBS). PBS
defines various queues, which are collections of ordered jobs lined up for
execution. The use of the term "queue", however, does not imply that the ordering is
"first in, first out." Each queue is defined as a set of attributes such as
queue name, queue priority, queue resource limits, and job count limits. The
batch system allows users to overcome resource limits imposed on interactive
(sometimes referred to as "command-line") processing and to evenly and
efficiently regulate the execution flow of jobs.
The interactive limit (CPU time) per login session on all systems is 20
minutes. Any violation of this limit will result in process termination. A
user may use at most two processors simultaneously for interactive
processing, and is expected to use fewer under heavy system loads.
Exceptions to this policy will be considered by the staff on a case-by-case basis.
Work that exceeds these limits must be submitted in "batch mode" as
described below.
PBS Job Files
A PBS batch job script is a text file containing PBS directives and Unix
commands. The PBS directives always come at the beginning of the file, on
lines that start with the #PBS keyword followed by job specifications.
They typically describe the job's characteristics (e.g., job name, job shell)
and the resources it needs (e.g., number of cpus, memory).
There are also several PBS environment
variables that you should be aware of.
The following is a sample batch job file for the PBS batch system on
cosmos:
#PBS -N myjob
#PBS -S /bin/bash
#PBS -j oe
#PBS -l walltime=4:00:00
#PBS -l ncpus=2,mem=1gb
ja -m # Activate Job Accounting
cd $TMPDIR
cp $PBS_O_WORKDIR/inputfile1 .
cp $PBS_O_WORKDIR/inputfile2 .
cp $PBS_O_WORKDIR/myprog .
./myprog
cp outputfile $PBS_O_WORKDIR
#
ja -st # Output a summary resource use
# can also use: qstat -f $PBS_JOBID
|
Each line is explained below. Lines that
begin with #PBS specify batch directives.
| #PBS -N myjob |
The name of the batch job will be myjob. |
| #PBS -S /bin/bash |
The bash shell will be used to interpret the batch job script. |
| #PBS -j oe |
The standard output and error streams will be merged into
the standard output stream file. The standard output stream
will be implicitly stored in $PBS_O_WORKDIR/jobname.oNNN, where
jobname is the name of the job and NNN is the job identifier. |
| #PBS -l walltime=4:00:00
#PBS -l ncpus=2,mem=1gb |
This job requests 4 hours of walltime, 2 cpus, and 1 GB of physical memory. |
| ja -m |
Activates job accounting, which captures, among other things, cpu time,
memory size, and I/O use when the job terminates. |
| cd $TMPDIR |
Make $TMPDIR the job's working directory. |
| cp $PBS_O_WORKDIR/inputfile1 .
cp $PBS_O_WORKDIR/inputfile2 .
cp $PBS_O_WORKDIR/myprog . |
Copy the files to be used for the job from the job submission directory,
$PBS_O_WORKDIR, to the $TMPDIR directory. |
| ./myprog |
Execute the program myprog. |
| cp outputfile $PBS_O_WORKDIR |
Copy the output file generated by the execution of myprog
to the $PBS_O_WORKDIR directory. |
| ja -st |
Prints summary (-s) job accounting information and terminates (-t) job
accounting. Lists, in cumulative figures, among other things, wall-clock time,
cpu time, memory, and I/O information for the commands that executed between
the ja -m and ja -st lines. We recommend the use of ja in all jobs. |
The number of cpus specified for a job in a #PBS directive (-l ncpus=##)
MUST be the same as the number the program itself is told to use.
Specifically, for an MPI program the -np parameter of the mpirun
command must be set equal to the value of ncpus. Similarly, for OpenMP
programs the OMP_NUM_THREADS environment variable must be
set to the same value as ncpus. This requirement also applies to commercial
application programs, such as Gaussian and ABAQUS. The two sample batch job files
below illustrate the point.
Sample Batch Job File for Gaussian
#PBS -N sample -j oe
#PBS -S /bin/bash
#PBS -l walltime=10:00:00,mem=500mb,ncpus=2
ja -m
# Initialize environment with the G03 B05 SCSL module
module add g03.b05.scsl
set -x # Show issued commands in output (bash equivalent of csh "set echo")
# Copy input files to $TMPDIR
cp sample.com $TMPDIR
# Run Gaussian 03
cd $TMPDIR
g03 < sample.com
# Copy output file to home directory
cp sample.log $HOME
# Get CPU time and other info about job
ja -st
qstat -f $PBS_JOBID
|
The Gaussian input and/or the Default.route file must specify the same number
of cpus as the PBS ncpus argument. The job output will go to sample.oNNN, where
NNN is the job ID.
Sample Batch Job for ABAQUS
#PBS -N test_axi1 -S /bin/bash -j oe
#PBS -l ncpus=1,walltime=22:00:00,mem=500mb,vmem=5gb
ja -m
# uncomment this line if the abaqus module is not in your module initlist
# module load abaqus
cd $TMPDIR
cp $PBS_O_WORKDIR/axi1.inp .
abaqus job=test_axi1 cpus=1 input=axi1.inp
cp test_axi1.* $PBS_O_WORKDIR
ja -st
|
The ABAQUS cpus argument must match the PBS ncpus argument. The job output
will go to test_axi1.oNNN, where NNN is the job ID.
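The matching requirement described above can be checked mechanically before a job is submitted. The following is a minimal sketch, not a PBS tool: check_ncpus is a hypothetical helper that compares the ncpus value in the #PBS directives of a job file with the -np value given to mpirun in the same file. It assumes that the job file contains an mpirun line and that the last ncpus request and the last mpirun line are the ones that matter.

```bash
# check_ncpus JOBFILE
# Hypothetical pre-submission check (not part of PBS): compares ncpus in the
# #PBS directives with the -np value passed to mpirun in the same file.
check_ncpus() {
    jobfile=$1
    # Last ncpus=N that appears on a #PBS directive line
    ncpus=$(grep '^#PBS' "$jobfile" | sed -n 's/.*ncpus=\([0-9][0-9]*\).*/\1/p' | tail -n 1)
    # -np value of the last mpirun command outside comment lines
    np=$(grep -v '^#' "$jobfile" | sed -n 's/.*mpirun.*-np  *\([0-9][0-9]*\).*/\1/p' | tail -n 1)
    if [ -z "$ncpus" ] || [ -z "$np" ]; then
        echo "cannot find both ncpus and -np in $jobfile"
        return 2
    fi
    if [ "$ncpus" -eq "$np" ]; then
        echo "OK: ncpus=$ncpus matches -np $np"
    else
        echo "MISMATCH: ncpus=$ncpus but mpirun -np $np"
        return 1
    fi
}
```

A possible use is `check_ncpus myjob && qsub myjob`, so that a job file with mismatched values is never submitted. The same idea could be extended to OMP_NUM_THREADS or to the cpus argument of ABAQUS.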
Job Submission: The qsub command
Use the qsub command to submit a job as shown below:
cosmos% qsub myjob
1234.cosmos
|
One of the first things that happens when a job is submitted
is that PBS assigns it a unique job id. You may refer to a job
by using only the numerical part of the job id (e.g., 1234).
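As a small illustration, the numerical part of a job id can be extracted in the shell with a parameter expansion. The sketch below uses the job id from the example above as a fixed string; the job name myjob is the one set with -N in the sample job file, and the output file name follows the jobname.oNNN pattern described earlier.

```bash
# Job id exactly as printed by qsub in the example above.
# In a real session it could be captured with: jobid=$(qsub myjob)
jobid="1234.cosmos"
jobname="myjob"

# Strip everything from the first dot onward to get the numerical part
jobnum=${jobid%%.*}

echo "numerical job id: $jobnum"
echo "merged output file: $jobname.o$jobnum"
```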
Job Submission Options
The more commonly used options for submitting batch
jobs are listed below:
| -e path |
Defines the path to be used for the standard error stream of
the batch job. |
| -j join |
A join argument of oe merges the
standard output and standard error streams into standard output.
A join argument of eo merges the two streams into standard error.
If the join argument is n or the option is not specified,
the two streams are written to two separate files.
|
| -l resource_list |
Specifies resources and the maximum levels of use by the job.
Commonly used resources are ncpus, cput (cpu time), walltime, mem, vmem, and file.
Resources that are not explicitly specified take the default
values that are in effect for each queue. Additional sources of information
are the listings of the qlimit command, the qstat -Qf queue
command, and the pbs_resources
man page. |
| -m mail_options |
Specifies the conditions under which the server will send an
email message about the job. |
| -N name |
Declares a name for the job. |
| -o path |
Defines the path to be used for the standard output stream of
the batch job. |
| -S shell |
Declares the shell that interprets the job script. We strongly
recommend that you use the bash shell. |
| -v variable_list |
Any environment variables specified in this list will be exported
from the qsub command's environment to the job's environment. |
| -V |
All environment variables will be exported from the qsub command's
environment to the job's environment. We recommend that you instead
use the -v variable_list option to export only the necessary environment
variables. |
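The options above can be given on the qsub command line as well as in #PBS directives. The sketch below assembles such a command line and only prints it, so it can be inspected before use. The resource values repeat those of the sample job file; INPUT is a hypothetical environment variable used purely for illustration.

```bash
# Resource requests, kept in one place
walltime=4:00:00
ncpus=2
mem=1gb

# INPUT is a hypothetical variable that the job script would read.
cmdline="qsub -N myjob -j oe -l walltime=$walltime -l ncpus=$ncpus,mem=$mem -v INPUT=inputfile1 myjob"

# Print the command instead of running it
echo "$cmdline"
```

The printed line can then be entered at the cosmos% prompt to submit the job.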
Queue Structure
A queue is a software structure through which PBS manages the processing of
jobs. Batch queues are defined by a number of parameters of which the most
important are resource limits. There are several such "execution" queues from
which PBS schedules jobs for execution. Jobs are routed to the appropriate
queue based upon, for the most part, a job's resource limit specifications.
Some queues can be used only by special permission. These are generally the
higher priority, high-cpu queues p16, p32, p64, and ded_bench, but xlong
also belongs in this special-permission category. The special-access
queues must be used only for jobs that match the queues' special
characteristics. You can also see the output of the qlimit command at this
link, which is updated every 5 minutes.
PBS Resources
The following resources are the more commonly used in the PBS batch
system on cosmos. Additional sources of information are the listings
of the qlimit command, the qstat -Qf queue command and the
pbs_resources man page.
| WALLTIME |
Maximum wall-clock time that the job may take, measured from
the beginning of execution. The format is hh:mm:ss. ALL jobs
should specify walltime, not just cpu time. Failure to specify walltime will
cause PBS to assign the job the default value of just 5 minutes. |
| CPUTIME |
Maximum amount of cpu time that the job can consume. The format
is hh:mm:ss. |
| MEM |
Maximum amount of physical resident memory that the job can occupy. |
| PMEM |
Maximum amount of physical resident memory that any single process
belonging to the job can occupy. |
| VMEM |
Maximum virtual memory per job. |
| PVMEM |
Maximum virtual memory per process in the job. |
| NCPUS |
Maximum number of cpus allowed per job. |
| FILE |
Maximum size a file can attain per job. |
| MAXR |
Maximum number of jobs that can be executing concurrently in a given queue. |
| USERR |
Maximum number of jobs a user may run concurrently in a given queue. |
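Because walltime and cpu time limits use the hh:mm:ss format, it can be handy to convert a duration given in seconds. The helper below is a minimal sketch; to_hms is a hypothetical name and not a PBS command.

```bash
# to_hms SECONDS: print a duration in the hh:mm:ss form used by walltime
to_hms() {
    secs=$1
    printf '%02d:%02d:%02d\n' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}

to_hms 14400    # 4 hours
to_hms 5400     # 90 minutes
echo "#PBS -l walltime=$(to_hms 14400)"
```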
Job Monitoring and PBS commands
The following commands are for common tasks involving the PBS batch
system on cosmos. More information about batch processing can be obtained
from the following man pages: pbs, qsub, qstat, and qdel.
| Submit a job |
qsub jobfile |
| Show running jobs. Note, Req'd Time column is CPU time, not walltime. |
qstat -r |
| Show running jobs and when they began executing. Note, Req'd Time column is CPU time, not walltime. |
qstat -rs |
| Show the jobs that are not running. Note, Req'd Time column is CPU time, not walltime. |
qstat -i |
| Show the jobs that are not running and why they are not running. Note, Req'd Time column is CPU time, not walltime. |
qstat -is |
| Show the status of all the queues |
qstat -q |
| Show which queues you have access to |
qaccess |
| Show detailed information for a given job |
qstat -f jobid |
| Show detailed information for all queues or a specific queue |
qstat -Qf [queue_name] |
| Show all jobs |
qstat -a |
| Show all jobs for a given user |
qstat -u user |
| Show the processes under a given running job |
p_qstat jobid |
| Show the status of the batch system in a manner like top. Has a built-in
help screen for available commands. Note, used CPU time is not being reported
accurately by PBS at this time. |
bmonitor |
| Delete a given job |
qdel jobid |
| Show the job and queue limits of the various execution queues |
qlimit |
| Find all jobs over the last N days for a given user |
findjobs -n N -u username |
| Show the job history over the last N days for a given job. Format
the output to 80 columns. Note, the -w flag is necessary when the output
is sent to a pipe or a file. |
tracejob -n N -w 80 jobid |
Job Accounting
The ja command: -m -s and -t options
The ja command provides information on resource use for a whole job
or a segment of it. The -m option initiates job accounting. The -s
option outputs a summary report, and the -t option terminates job accounting.
Its use is illustrated in the sample job file. We recommend its use in all
job files. More information on ja can be found in its man page.
Job CSA Accounting - Summary Report
====================================
Job Accounting File Name : /work/28032.cosmos/.jacct5ba5041000000e24
Operating System : Linux cosmos.tamu.edu ...#1 SMP Sun Jan 23 13:49
User Name (ID) : ? (-2)
Group Name (ID) : ? (-2)
Project Name (ID) : ? (0)
Job ID : 0x5ba5041000000e24
Report Starts : 02/14/05 12:09:25
Report Ends : 02/14/05 12:10:38
Elapsed Time : 73 Seconds
User CPU Time : 111.6496 Seconds
System CPU Time : 3.2979 Seconds
Block I/O Wait Time : 0.0000 Seconds
Raw I/O Wait Time : 0.0000 Seconds
CPU Time Core Memory Integral : 457.7507 Mbyte-seconds
CPU Time Virtual Memory Integral : 2007.0859 Mbyte-seconds
Maximum Core Memory Used : 153.3438 Mbytes
Maximum Virtual Memory Used : 1011.0312 Mbytes
Characters Read : 221.6048 Mbytes
Characters Written : 231.2401 Mbytes
Blocks Read : 0
Blocks Written : 0
Logical I/O Read Requests : 23545
Logical I/O Write Requests : 18180
Number of Commands : 92
System Billing Units : 114.9475
|
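Individual figures can be pulled out of the summary report with standard text tools. The sketch below adds the user and system cpu times, using two lines copied from the report above as sample input; the spacing inside the sample text is an assumption. In this particular report the sum coincides with the System Billing Units figure.

```bash
# Two lines copied from the summary report above.
# In a job script the report itself is produced by: ja -st
report='User CPU Time                    :     111.6496 Seconds
System CPU Time                  :       3.2979 Seconds'

# Add the numbers that follow the colon on the two cpu time lines
total=$(echo "$report" | awk -F: '/^(User|System) CPU Time/ { sum += $2 } END { printf "%.4f", sum }')
echo "total cpu time: $total seconds"
```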
The qstat -f jobid command
The qstat -f command also provides detailed information about a job. Some
notable fields in the output are resources_used, queue, qtime, comment, and etime.
% qstat -f 813
Job Id: 813.cosmos
Job_Name = os2-bs4
Job_Owner = gooduser@cosmos.tamu.edu
resources_used.cpupercent = 399
resources_used.cput = 13:18:48
resources_used.mem = 1271568kb
resources_used.ncpus = 4
resources_used.vmem = 6653024kb
resources_used.walltime = 03:19:49
job_state = R
queue = long
server = cosmos
Checkpoint = u
ctime = Wed Jun 2 09:01:32 2004
Error_Path = cosmos.tamu.edu:/scratch/web/os/os-Cp+-pph3-4-h/os2-Cp+-pph3-4
-h/bs4/os2-bs4.e813
exec_host = cosmos/0*4
Hold_Types = n
Join_Path = oe
Keep_Files = n
Mail_Points = a
mtime = Wed Jun 2 09:01:32 2004
Output_Path = cosmos.tamu.edu:/scratch/web/os/os-Cp+-pph3-4-h/os2-Cp+-pph3-
4-h/bs4/1.g98.out
Priority = 0
qtime = Wed Jun 2 09:01:32 2004
Rerunable = True
Resource_List.cput = 96:00:00
Resource_List.file = 10gb
Resource_List.mem = 7880704kb
Resource_List.ncpus = 4
Resource_List.pcput = 96:00:00
Resource_List.pmem = 500mb
Resource_List.pvmem = 528gb
Resource_List.ssinodes = 2
Resource_List.vmem = 528gb
session_id = 16409
Variable_List = PBS_O_HOME=/home/gooduser,PBS_O_LANG=en_US,
PBS_O_LOGNAME=gooduser,
PBS_O_PATH=/opt/intel/idb73/bin:/opt/intel/8.0-20040416/bin:/usr/kerbe
ros/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/
opt/pbs54/bin:/scratch/web/g03/bsd:/scratch/web/g03/local:/scratch/web/
g03/extras:/scratch/web/g03:/usr/X11R6/bin:/opt/pbs54/bin:/scratch/web/
g03/bsd:/scratch/web/g03/local:/scratch/web/g03/extras:/scratch/web/g03
:/home/gooduser/bin,PBS_O_MAIL=/var/mail/gooduser,PBS_O_SHELL=/bin/tcsh,
PBS_O_HOST=cosmos.tamu.edu,
PBS_O_WORKDIR=/scratch/web/os/os-Cp+-pph3-4-h/os2-Cp+-pph3-4-h/bs4,
PBS_O_SYSTEM=Linux,PBS_O_QUEUE=regular
comment = Job run on node cosmos - at Wed Jun 02 at 09:01
alt_id = cpuset=web813.c
etime = Wed Jun 2 09:01:32 2004
|
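Single attributes can be extracted from this output with awk. The following is a minimal sketch: get_field is a hypothetical helper, and the sample text consists of a few lines copied from the output above. It handles only attributes whose value fits on one line, not wrapped values such as Variable_List.

```bash
# A few lines copied from the qstat -f output above.
# In a real session the text would come from: qstat -f 813
sample='    job_state = R
    queue = long
    resources_used.cput = 13:18:48
    resources_used.walltime = 03:19:49'

# get_field NAME: print the value of the attribute NAME read from stdin
get_field() {
    awk -v key="$1" -F' = ' '{ sub(/^[ \t]+/, "", $1) } $1 == key { print $2 }'
}

echo "$sample" | get_field job_state
echo "$sample" | get_field resources_used.walltime
```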
Job Process Profile
The p_qstat jobid command
The p_qstat jobid command lists information about the processes making up a job.
% p_qstat 14049
Job id: 14049
Job owner: zonk
Req mem: 15781888kb
Req cpus: 8
Req cpu time: 160:00:00 (hh:mm:ss)
Req walltime: 20:00:00 (hh:mm:ss)
F UID PID PPID PRI NI VSZK RSSK WCHAN STAT TTY TIME COMMAND
4 1246 9025 3868 25 0 38704K 2864K wait4 S ? 0:00 -bash
0 1246 9092 9025 25 0 38640K 2672K wait4 S ? 0:00 bash
0 1246 9097 9092 17 0 38432K 2288K ia64_r S ? 0:00 csh
0 1246 9536 9097 22 0 38640K 2688K wait4 S ? 0:00 sh
0 1246 9539 9536 22 0 38800K 2736K wait4 S ? 0:00 sh
0 1246 9540 9539 25 0 38736K 2928K wait4 S ? 0:00 sh
0 1246 9547 9540 15 0 4672K 2432K schedu S ? 0:00 mpirun
0 1246 9550 9547 15 0 15966544K 10016K schedu SL ? 0:00 adf.exe
1 1246 9551 9550 25 0 16838544K 1047344K - RL ? 242:24 adf.exe
1 1246 9552 9550 25 0 16838528K 950592K - RL ? 242:24 adf.exe
1 1246 9553 9550 25 0 16838544K 990272K - RL ? 242:21 adf.exe
1 1246 9554 9550 25 0 16838544K 952048K - RL ? 242:23 adf.exe
1 1246 9555 9550 25 0 16838544K 1044816K - RL ? 242:26 adf.exe
1 1246 9556 9550 25 0 16838544K 948704K - RL ? 242:21 adf.exe
1 1246 9557 9550 25 0 16838544K 988160K - RL ? 242:23 adf.exe
1 1246 9558 9550 25 0 16838528K 951856K - RL ? 242:23 adf.exe
150911488K 7902416K 1939:05
|
Common PBS Environment Variables
| $PBS_O_WORKDIR |
The absolute path of the directory from which the job was submitted. |
| $PBS_JOBID |
The job identifier assigned to the job by the batch system. The job
identifier will typically be nnn.cosmos where nnn is a positive non-zero
integer. |
| $PBS_JOBNAME |
The job name supplied by the user. |
| $PBS_QUEUE |
The name of the queue from which the job is executed. |
| $TMPDIR |
A job's default working directory is $HOME. That is frequently undesirable
because of space limitations and lower I/O performance. Going to $TMPDIR
(=/work/$PBS_JOBID), which is created at a job's start and deleted at its end,
affords a large disk area and, typically, better I/O performance. You must
explicitly save any files you need before job completion. Preferably,
in batch jobs you should save such files on local disk areas, such as
/scratch/$USER or $HOME. File transfers in a batch job involving the tape
archive or a remote host should be strongly avoided because of the possible
long delays they can cause. |
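Since $TMPDIR is deleted when the job ends, results must be copied out before the job script finishes. The sketch below shows one way to do that; save_results is a hypothetical helper, and the destination under /scratch/$USER is only an example of a local disk area.

```bash
# save_results DEST FILE...
# Hypothetical helper: copies result files out of the job's working
# directory into DEST, creating DEST if needed.
save_results() {
    dest=$1
    shift
    mkdir -p "$dest" || return 1
    for f in "$@"; do
        if [ -f "$f" ]; then
            cp "$f" "$dest/" || return 1
        else
            echo "warning: $f not found, nothing saved" >&2
        fi
    done
}

# At the end of a job script, for example:
#   cd $TMPDIR
#   save_results /scratch/$USER/$PBS_JOBID outputfile
```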
Policies and Best Practices
Batch system policies are approved by the Steering Committee, review@sc.tamu.edu, and may on occasion
change to reflect changing needs and load conditions. A key policy item
regarding queues, made to encourage parallel computation, is that
the parallel queues p16, p32, and p64 have higher priority. All
other queues have the same priority among themselves. The implication is
that jobs authorized to execute in p16, p32, and p64 will be
scheduled for execution ahead of other jobs. A further
effect of this policy is the automatic suspension of jobs executing
in the lower priority queues. Such jobs are later automatically reactivated.
A little care on your part in doing certain things right will go a long
way toward keeping cosmos running efficiently and fairly for everyone.
Very reluctantly, in order
to maintain fairness and efficiency, we will on occasion prematurely terminate
jobs. The subsection Abnormal Job Termination lists common
reasons for which the staff terminates a job.
Setting Appropriate Job Resource Limits
You should not, as a matter of practice, set resource levels for your job to
maximal queue values unless you actually need to. Larger settings are harder
to satisfy and, hence, will delay your job's execution on a busy system. This
is particularly true when the resource is memory and/or the number of CPUs. Set
job resource limits to the lowest possible level consistent with a successful
completion. On this point, for example, you need to make sure that if you run
commercial code, say, Gaussian, ABAQUS, or FLUENT, the native/internal resource
limits which you specify for them and the resource limits you specify in the
#PBS -l directive MUST match. If you need help in setting the latter,
please contact the Help Desk for assistance.
Invalid Parallel Batch Jobs
Jobs requesting multiple cpus must use multiple cpus simultaneously from a
single command. Running multiple independent commands in the background in a
batch job script is NOT parallel processing and is not permitted. Just
so there is NO misunderstanding: a job script that requests several cpus and
then starts several independent serial programs in the background, each
command ending in &, is invalid parallel processing and therefore is NOT
permitted.
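A hypothetical job file of this invalid kind, together with a simple check for it, might look like the sketch below. The program names prog1 through prog4 are made up for illustration, and the check is only a heuristic that counts non-comment lines ending in a single &.

```bash
# A hypothetical job file of the kind that is NOT permitted: it requests
# four cpus and then starts four independent serial programs in the background.
jobfile=$(mktemp)
cat > "$jobfile" <<'EOF'
#PBS -l ncpus=4,walltime=1:00:00
./prog1 input1 &
./prog2 input2 &
./prog3 input3 &
./prog4 input4 &
wait
EOF

# Count non-comment lines that end in a single "&" (background commands)
nbg=$(grep -v '^#' "$jobfile" | grep -c '[^&]&[[:space:]]*$')
echo "$nbg background commands found"
rm -f "$jobfile"
```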
Abnormal Job Termination
The SC staff reserves the right to terminate batch jobs when one or a
combination of the following conditions occurs:
- Use by your program of a larger number of cpus than its parallel
efficiency warrants.
- Use by your program of a smaller number of cpus than that specified
through PBS (-l ncpus=##). This is a particularly unacceptable practice
since it wastes resources that might otherwise be used
by others. When you request, say, four cpus by setting -l ncpus=4 in the
#PBS directives, PBS sets aside four cpu slots. It knows nothing about
the actual number of cpus that your program will use.
- Submitting jobs with an artificially large wall-clock or cpu-time.
- Use/abuse of a special access queue (e.g., xlong, p16) to run a job
that could very well run in one of the common queues.
- Excessive I/O with large files, which in turn overwhelms memory
due to excessive file caching.
- Any use of large amounts of disk and/or memory that causes a significant
disruption to the smooth operation of the system.
- Delayed file transfers with source or destination hosts that are remote.
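One rough way to judge whether a job really uses the cpus it requested is to compare its cpu time with its walltime multiplied by ncpus. The sketch below does this for the figures shown in the qstat -f 813 sample output in this document; to_secs is a hypothetical helper, and the ratio is only an approximate indicator, not a measure used by PBS.

```bash
# to_secs HH:MM:SS: convert a time to seconds
to_secs() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Values taken from the qstat -f 813 sample output in this document
cput=$(to_secs 13:18:48)       # resources_used.cput
wall=$(to_secs 03:19:49)       # resources_used.walltime
ncpus=4                        # Resource_List.ncpus

# cpu time actually used, as a percentage of walltime x requested cpus
pct=$((100 * cput / (wall * ncpus)))
echo "cpu use: $pct% of the $ncpus cpus requested"
```

A value close to 100, as here, suggests that all requested cpus are busy; a value near 25 for a four-cpu job would suggest that only one cpu is doing work.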
Queued Jobs Not Executing
The batch system has limits on the total number of resources a user may
use and the total number of jobs a user can run. Also, each queue has limits
on the total number of jobs it can run and a limit on the number of jobs it
can run per user.
You may find that resources (e.g., cpus) are available but your job
is still queued because of one of these limits. Use the
'qstat -s jobid' command to see why your job is still queued.
Jobs Using Files From the Tape Archive
If your job requires files from the tape archive (or a remote host), we
recommend that you first manually copy these files to a local area on
cosmos, such as your /scratch directory, before you submit your job. The objective
is to avoid possible delays during batch processing.
Dedicated Jobs
Every other Tuesday the whole machine will be available to run special
jobs and/or do maintenance. The maximum processing time per job and/or
per user is 4 wall-clock hours. Users who need dedicated use must submit
their jobs to the ded_bench queue no later than 12:00 noon on the previous Friday.
Use of the ded_bench queue is by permission only. Please direct your
requests for access to the Steering Committee,
review@sc.tamu.edu. Some
of the specific aspects of dedicated use may change from time to time.
Starving Jobs and Backfilling
Queued jobs may become "starved" when delayed by other long-running jobs
for some time. If the batch system cannot schedule starving
jobs due to a resource or queue limit, it will attempt to schedule other
non-starving queued jobs given the available resources. This is known as
backfilling.
High Priority Queues
Occasionally, jobs in the high priority queues will
preempt lower priority jobs. In this case, the lower priority jobs will be
suspended until the high priority jobs have completed.
Known Problems
MPI batch jobs are being killed
SGI's MPI requires a very large amount of virtual memory when using the default
MPI settings. The MPI program in your PBS job script may get killed for
excessive virtual memory (vmem) usage:
=>> PBS: job killed: vmem 4219024080kb exceeded limit 536870912kb
|
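To put the figures in this message into perspective, they can be converted from kb to GB with shell arithmetic. The sketch below assumes binary units (1 GB = 1024 x 1024 kb) and a shell with 64-bit integer arithmetic.

```bash
# Numbers copied from the error message above (both in kb)
used_kb=4219024080
limit_kb=536870912

# 1 GB = 1024 * 1024 kb; integer division drops the fraction
echo "limit: $((limit_kb / 1024 / 1024)) GB"
echo "used:  $((used_kb / 1024 / 1024)) GB"
```

Under these assumptions the limit is 512 GB, while the job had mapped roughly 4023 GB of virtual memory.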
There are several environment variables that you can use to lower the virtual
memory requirements of your MPI program:
- Disable memory mapping by using the MPI_MEMMAP_OFF environment variable.
- Reduce the amount of heap and stack that is memory mapped per MPI process
with the MPI_MAPPED_HEAP_SIZE and MPI_MAPPED_STACK_SIZE
environment variables respectively.
The effect of these settings on performance will likely vary from one MPI
application to another. See the mpi man page for more information about these environment
variables.
Additional Information
More information about batch processing can be obtained from the
following man pages: pbs, qsub, qstat, qdel, and pbs_resources.
|