Using the PBS Batch System

Last update on Thursday, 24-Jan-2008 16:39:50 CST.

Batch, or batch processing, is the capability of running jobs outside of the interactive login session. In this document, batch implies a complex subsystem which provides for control over job scheduling and resource contention. On cosmos, the batch system is part of the Portable Batch System (PBS). PBS defines various queues, which are collections of ordered jobs lined up for execution. The term "queue," however, does not imply that the ordering is "first in, first out." Each queue is defined by a set of attributes such as queue name, queue priority, queue resource limits, and job count limits. The batch system allows users to overcome resource limits imposed on interactive (sometimes referred to as "command-line") processing and to evenly and efficiently regulate the execution flow of jobs.

The interactive limit (CPU time) per login session on all systems is 20 minutes. Any violation of this limit will result in process termination. A user may use at most two processors simultaneously for interactive processing and is expected to lower this limit under heavy system loads. Exceptions to this policy will be considered by the staff on a case-by-case basis. Work that exceeds these limits must be submitted in "batch mode" as described below.

PBS Job Files

A PBS batch job script is a text file with PBS directives and Unix commands. The PBS directives are always at the beginning of the file and are specified in lines that start with the #PBS keyword and continue with other job specifications. These typically describe the job's characteristics (e.g., job name, job shell, etc.) and the resources (e.g., number of cpus, memory, etc.) it needs. There are also several PBS environment variables that you should be aware of (see Common PBS Environment Variables).

The following is a sample batch job file for the PBS batch system on cosmos:

#PBS -N myjob 
#PBS -S /bin/bash 
#PBS -j oe
#PBS -l walltime=4:00:00
#PBS -l ncpus=2,mem=1gb

ja -m                  # Activate Job Accounting
cd $TMPDIR

cp $PBS_O_WORKDIR/inputfile1 .
cp $PBS_O_WORKDIR/inputfile2 .
cp $PBS_O_WORKDIR/myprog .

./myprog

cp outputfile $PBS_O_WORKDIR
#
ja -st                # Output a resource-use summary and end job accounting
# can also use: qstat -f $PBS_JOBID

The explanation of each line is listed below. Lines that begin with #PBS specify batch directives.

Line Explanation
#PBS -N myjob The name of the batch job will be myjob.
#PBS -S /bin/bash The bash shell will be used to interpret the batch job script.
#PBS -j oe The standard output and error streams will be merged into the standard output stream file. By default, the standard output stream is stored in $PBS_O_WORKDIR/jobname.oNNN, where jobname is the name of the job and NNN is the job identifier.
#PBS -l walltime=4:00:00
#PBS -l ncpus=2,mem=1gb
This job requests 4 hours of walltime, 2 cpus, and 1 GB of physical memory.
ja -m Activates job accounting, which captures, among other things, cpu time, memory size, and I/O use when the job terminates.
cd $TMPDIR Make $TMPDIR the job's working directory.
cp $PBS_O_WORKDIR/inputfile1 .
cp $PBS_O_WORKDIR/inputfile2 .
cp $PBS_O_WORKDIR/myprog .
Copy the files to be used by the job from the job submission directory, $PBS_O_WORKDIR, to the $TMPDIR directory.
./myprog Execute the program myprog.
cp outputfile $PBS_O_WORKDIR Copy the output file generated by the execution of myprog to the $PBS_O_WORKDIR directory.
ja -st Prints summary (-s) job accounting information and terminates (-t) job accounting. Lists in cumulative figures, among other things, wall-clock time, cpu time, memory, and I/O information for the commands that executed between the ja -m and ja -st lines. We recommend the use of ja in all jobs.

The number of cpus specified for a job in a #PBS directive (-l ncpus=##) MUST be the same as that specified for the running of a program through its interface. Specifically, for MPI programs the -np parameter of the mpirun command must be set equal to the value of ncpus above. Similarly, for OpenMP programs the value of the OMP_NUM_THREADS environment variable must be set to the same value as ncpus. This requirement also applies to commercial application programs, such as Gaussian and ABAQUS. Two sample batch job files below illustrate the point.

Sample Batch Job File for Gaussian

#PBS -N sample -j oe
#PBS -S /bin/bash
#PBS -l walltime=10:00:00,mem=500mb,ncpus=2

ja -m
# Initialize environment with the G03 B05 SCSL module
module add g03.b05.scsl

set -x # Show issued commands in output (bash equivalent of csh "set echo")

# Copy input files to $TMPDIR
cp sample.com $TMPDIR

# Run Gaussian 03
cd $TMPDIR
g03 < sample.com

# Copy output file to home directory
cp sample.log $HOME

# Get CPU time and other info about job
ja -st
qstat -f $PBS_JOBID

The Gaussian input and/or the Default.route file must specify the same number of cpus as the PBS ncpus argument. The job output will go to sample.oNNN, where NNN is the job ID.

Sample Batch Job for ABAQUS

#PBS -N test_axi1 -S /bin/bash -j oe
#PBS -l ncpus=1,walltime=22:00:00,mem=500mb,vmem=5gb

ja -m
# uncomment this line if the abaqus module is not in your module initlist
# module load abaqus

cd $TMPDIR
cp $PBS_O_WORKDIR/axi1.inp .

abaqus job=test_axi1 cpus=1 input=axi1.inp

cp test_axi1.* $PBS_O_WORKDIR
ja -st

The ABAQUS cpus argument must match the PBS ncpus argument. The job output will go to test_axi1.oNNN, where NNN is the job ID.
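
The requirement that the cpu counts match can be checked mechanically before a job is submitted. The following is only a sketch: the function names pbs_ncpus, mpirun_np, and check_cpu_counts are hypothetical (they are not PBS commands), and it assumes the job file requests cpus with a literal ncpus=N in a #PBS line and starts its MPI program with a literal "mpirun -np N".

```shell
# Hypothetical helpers, not part of PBS: a pre-submission consistency check.

# Print the ncpus value requested in the job file's #PBS directives.
pbs_ncpus() {
    sed -n 's/^#PBS.*ncpus=\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Print the process count given to the first "mpirun -np N" command.
mpirun_np() {
    sed -n 's/^[^#]*mpirun.*-np  *\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Succeed only if both counts were found and are equal.
check_cpu_counts() {
    want=$(pbs_ncpus "$1")
    have=$(mpirun_np "$1")
    if [ -n "$want" ] && [ "$want" = "$have" ]; then
        echo "OK: ncpus=$want matches mpirun -np $have"
    else
        echo "MISMATCH: ncpus=${want:-?} but mpirun -np ${have:-?}"
        return 1
    fi
}
```

A possible use is "check_cpu_counts myjob && qsub myjob", so that a job file with mismatched counts is never submitted. The same idea could be adapted to OMP_NUM_THREADS or to an application's own cpu setting.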

Job Submission: The qsub command

Use the qsub command to submit a job as shown below:

cosmos% qsub myjob
1234.cosmos

When a job is submitted, PBS assigns it a unique job id (1234.cosmos in the example above). You may refer to a job by using only the numerical part of the job id (e.g., 1234).
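
In a script, the numerical part can be taken from the job id with ordinary shell parameter expansion. The sketch below uses a literal job id in place of real qsub output so that it runs anywhere; the variable names are illustrative.

```shell
# In a real session this would be:  jobid=$(qsub myjob)
# The literal value below stands in for qsub's output.
jobid="1234.cosmos"

# Strip everything from the first dot onward to get the numerical part.
jobnum=${jobid%%.*}
echo "$jobnum"        # prints 1234
```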

Job Submission Options

The more commonly useful options for submitting batch jobs are listed below:

Option Description
-e path Defines the path to be used for the standard error stream of the batch job.
-j join A join argument of oe merges the standard output and standard error streams into standard output. A join argument of eo merges the two streams into standard error. If the join argument is n or the option is not specified, the two streams are written to two separate files.
-l resource_list Specifies resources and associated maximal levels of use by the job. Commonly used resources are ncpus, cputime, walltime, mem, vmem, and file. Resources that are not explicitly specified will cause the assumption of default values that are in effect for each queue. Additional sources of information here are the listings of the qlimit command, the qstat -Qf queue command and the pbs_resources man page.
-m mail_options Specifies the conditions under which the server will send an email message about the job.
-N name Declares a name for the job.
-o path Defines the path to be used for the standard output stream of the batch job.
-S shell Declares the shell that interprets the job script. We strongly recommend that you use the bash shell.
-v variable_list Any environment variables specified in this list will be exported from the qsub command's environment to the job's environment.
-V All environment variables will be exported from the qsub command's environment to the job's environment. We recommend that you use the -v varlist option to import only the necessary environment variables.
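
The options above can be given on the qsub command line as well as in #PBS directives. The sketch below assembles such a command line using the values from the first sample job file and prints it for inspection; it submits only where qsub and the job file actually exist. The variable INPUT passed with -v is a hypothetical example of exporting a single variable rather than the whole environment with -V.

```shell
# Job file name and option values mirror the first sample job file;
# INPUT=inputfile1 is a hypothetical variable exported with -v.
jobfile=myjob
qsub_args="-N myjob -S /bin/bash -j oe -l walltime=4:00:00 -l ncpus=2,mem=1gb -v INPUT=inputfile1"

# Print the command for inspection instead of running it blindly.
echo "qsub $qsub_args $jobfile"

# Submit only where qsub exists and the job file is present.
if command -v qsub >/dev/null 2>&1 && [ -f "$jobfile" ]; then
    qsub $qsub_args "$jobfile"   # unquoted on purpose: split into separate options
fi
```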

Queue Structure

A queue is a software structure through which PBS manages the processing of jobs. Batch queues are defined by a number of parameters, of which the most important are resource limits. There are several such "execution" queues from which PBS schedules jobs for execution. Jobs are routed to the appropriate queue based, for the most part, upon a job's resource limit specifications. Some queues can be used only by special permission. These are generally the higher priority, high-cpu queues p16, p32, p64, and ded_bench, but xlong belongs in this special-permission category as well. The special-access queues must be used only for jobs that match the queue's special characteristics. The output of the qlimit command, which is updated every 5 minutes, is also available online.

PBS Resources

The following resources are the more commonly used in the PBS batch system on cosmos. Additional sources of information are the listings of the qlimit command, the qstat -Qf queue command and the pbs_resources man page.

Resource Explanation
WALLTIME Maximum wall-clock time the job may use, measured from the beginning of execution. The format is hh:mm:ss. ALL jobs should specify walltime, not just cpu time. Failure to specify walltime will cause PBS to assign the job the default value of just 5 minutes.
CPUTIME Maximum amount of cpu time that the job can consume. The format is hh:mm:ss.
MEM Maximum amount of physical resident memory that the job can occupy.
PMEM Maximum amount of physical resident memory that any single process belonging to the job can occupy.
VMEM Maximum virtual memory per job.
PVMEM Maximum virtual memory per process in the job.
NCPUS Maximum number of cpus allowed per job.
FILE Maximum size a file can attain per job.
MAXR Maximum number of jobs that can be executing concurrently in a given queue.
USERR Maximum number of jobs a user may run concurrently in a given queue.
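
Because a job without a walltime request receives only the 5-minute default, it can be worth checking a job file before submitting it. The following is a minimal sketch; has_walltime is a hypothetical helper name, and it assumes the request is written as walltime=... in a #PBS line of the job file rather than on the qsub command line.

```shell
# Hypothetical helper: succeed if the job file contains a walltime request
# in one of its #PBS directive lines.
has_walltime() {
    grep -q '^#PBS.*walltime=[0-9]' "$1"
}

# Example use (myjob is the sample job file name used in this document):
#   has_walltime myjob || echo "myjob: no walltime request, default applies" >&2
```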

Job Monitoring and PBS commands

The following commands are for common tasks involving the PBS batch system on cosmos. More information about batch processing can be obtained from the following man pages: pbs, qsub, qstat, and qdel.

Command                      Task
qsub jobfile                 Submit a job.
qstat -r                     Show running jobs. (1)
qstat -rs                    Show running jobs and when they began executing. (1)
qstat -i                     Show the jobs that are not running. (1)
qstat -is                    Show the jobs that are not running and why they are not running. (1)
qstat -q                     Show the status of all the queues.
qaccess                      Show which queues you have access to.
qstat -f jobid               Show detailed information for a given job.
qstat -Qf [queue_name]       Show detailed information for all queues or a specific queue.
qstat -a                     Show all jobs.
qstat -u user                Show all jobs for a given user.
p_qstat jobid                Show the processes under a given running job.
bmonitor                     Show the status of the batch system in a manner like top. Has a built-in help screen for available commands. Note: used CPU time is not being reported accurately by PBS at this time.
qdel jobid                   Delete a given job.
qlimit                       Show the job and queue limits of the various execution queues.
findjobs -n N -u username    Find all jobs over the last N days for a given user.
tracejob -n N -w 80 jobid    Show the job history over the last N days for a given job, formatted to 80 columns. The -w flag is necessary when the output is sent to a pipe or a file.

(1) The Req'd Time column is CPU time, not walltime.

Job Accounting

The ja command: -m -s and -t options

The ja command provides information about the resource use of a whole job or a segment of it. The -m option initiates job accounting, the -s option outputs a summary report, and the -t option terminates job accounting. Its use is illustrated in the sample job file, and we recommend it in all job files. More information on ja can be found in its man page. A sample summary report is shown below.

Job CSA Accounting - Summary Report
====================================

Job Accounting File Name         : /work/28032.cosmos/.jacct5ba5041000000e24
Operating System                 : Linux cosmos.tamu.edu ...#1 SMP Sun Jan 23 13:49
User Name (ID)                   : ? (-2)
Group Name (ID)                  : ? (-2)
Project Name (ID)                : ? (0)
Job ID                           : 0x5ba5041000000e24
Report Starts                    : 02/14/05 12:09:25
Report Ends                      : 02/14/05 12:10:38
Elapsed Time                     :           73      Seconds
User CPU Time                    :          111.6496 Seconds
System CPU Time                  :            3.2979 Seconds
Block I/O Wait Time              :            0.0000 Seconds
Raw I/O Wait Time                :            0.0000 Seconds
CPU Time Core Memory Integral    :          457.7507 Mbyte-seconds
CPU Time Virtual Memory Integral :         2007.0859 Mbyte-seconds
Maximum Core Memory Used         :          153.3438 Mbytes
Maximum Virtual Memory Used      :         1011.0312 Mbytes
Characters Read                  :          221.6048 Mbytes
Characters Written               :          231.2401 Mbytes
Blocks Read                      :            0
Blocks Written                   :            0
Logical I/O Read Requests        :        23545
Logical I/O Write Requests       :        18180
Number of Commands               :           92
System Billing Units             :          114.9475

The qstat -f jobid command

The qstat -f command also provides detailed information about a job. Some notable fields in the output are resources_used, queue, qtime, comment, and etime.

% qstat -f 813
Job Id: 813.cosmos
    Job_Name = os2-bs4
    Job_Owner = gooduser@cosmos.tamu.edu
    resources_used.cpupercent = 399
    resources_used.cput = 13:18:48
    resources_used.mem = 1271568kb
    resources_used.ncpus = 4
    resources_used.vmem = 6653024kb
    resources_used.walltime = 03:19:49
    job_state = R
    queue = long
    server = cosmos
    Checkpoint = u
    ctime = Wed Jun  2 09:01:32 2004
    Error_Path = cosmos.tamu.edu:/scratch/web/os/os-Cp+-pph3-4-h/os2-Cp+-pph3-4
        -h/bs4/os2-bs4.e813
    exec_host = cosmos/0*4
    Hold_Types = n
    Join_Path = oe
    Keep_Files = n
    Mail_Points = a
    mtime = Wed Jun  2 09:01:32 2004
    Output_Path = cosmos.tamu.edu:/scratch/web/os/os-Cp+-pph3-4-h/os2-Cp+-pph3-
        4-h/bs4/1.g98.out
    Priority = 0
    qtime = Wed Jun  2 09:01:32 2004
    Rerunable = True
    Resource_List.cput = 96:00:00
    Resource_List.file = 10gb
    Resource_List.mem = 7880704kb
    Resource_List.ncpus = 4
    Resource_List.pcput = 96:00:00
    Resource_List.pmem = 500mb
    Resource_List.pvmem = 528gb
    Resource_List.ssinodes = 2
    Resource_List.vmem = 528gb
    session_id = 16409
    Variable_List = PBS_O_HOME=/home/gooduser,PBS_O_LANG=en_US,
        PBS_O_LOGNAME=gooduser,
        PBS_O_PATH=/opt/intel/idb73/bin:/opt/intel/8.0-20040416/bin:/usr/kerbe
        ros/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/
        opt/pbs54/bin:/scratch/web/g03/bsd:/scratch/web/g03/local:/scratch/web/
        g03/extras:/scratch/web/g03:/usr/X11R6/bin:/opt/pbs54/bin:/scratch/web/
        g03/bsd:/scratch/web/g03/local:/scratch/web/g03/extras:/scratch/web/g03
        :/home/gooduser/bin,PBS_O_MAIL=/var/mail/gooduser,PBS_O_SHELL=/bin/tcsh,
        PBS_O_HOST=cosmos.tamu.edu,
        PBS_O_WORKDIR=/scratch/web/os/os-Cp+-pph3-4-h/os2-Cp+-pph3-4-h/bs4,
        PBS_O_SYSTEM=Linux,PBS_O_QUEUE=regular
    comment = Job run on node cosmos - at Wed Jun 02 at 09:01
    alt_id = cpuset=web813.c
    etime = Wed Jun  2 09:01:32 2004

Job Process Profile

The p_qstat jobid command

The p_qstat jobid command lists information about the processes making up a job.

% p_qstat 14049
Job id:         14049
Job owner:      zonk
Req mem:        15781888kb
Req cpus:       8
Req cpu time:   160:00:00 (hh:mm:ss)
Req walltime:   20:00:00 (hh:mm:ss)

 F  UID  PID PPID PRI NI       VSZK     RSSK  WCHAN STAT TTY    TIME COMMAND
 4 1246 9025 3868  25  0     38704K    2864K  wait4    S   ?    0:00 -bash
 0 1246 9092 9025  25  0     38640K    2672K  wait4    S   ?    0:00 bash
 0 1246 9097 9092  17  0     38432K    2288K ia64_r    S   ?    0:00 csh
 0 1246 9536 9097  22  0     38640K    2688K  wait4    S   ?    0:00 sh
 0 1246 9539 9536  22  0     38800K    2736K  wait4    S   ?    0:00 sh
 0 1246 9540 9539  25  0     38736K    2928K  wait4    S   ?    0:00 sh
 0 1246 9547 9540  15  0      4672K    2432K schedu    S   ?    0:00 mpirun
 0 1246 9550 9547  15  0  15966544K   10016K schedu   SL   ?    0:00 adf.exe
 1 1246 9551 9550  25  0  16838544K 1047344K      -   RL   ?  242:24 adf.exe
 1 1246 9552 9550  25  0  16838528K  950592K      -   RL   ?  242:24 adf.exe
 1 1246 9553 9550  25  0  16838544K  990272K      -   RL   ?  242:21 adf.exe
 1 1246 9554 9550  25  0  16838544K  952048K      -   RL   ?  242:23 adf.exe
 1 1246 9555 9550  25  0  16838544K 1044816K      -   RL   ?  242:26 adf.exe
 1 1246 9556 9550  25  0  16838544K  948704K      -   RL   ?  242:21 adf.exe
 1 1246 9557 9550  25  0  16838544K  988160K      -   RL   ?  242:23 adf.exe
 1 1246 9558 9550  25  0  16838528K  951856K      -   RL   ?  242:23 adf.exe

                         150911488K 7902416K                 1939:05 

Common PBS Environment Variables

PBS Environment Variable Description
$PBS_O_WORKDIR The absolute path of the directory from which the job was submitted.
$PBS_JOBID The job identifier assigned to the job by the batch system. The job identifier will typically be nnn.cosmos, where nnn is a positive integer.
$PBS_JOBNAME The job name supplied by the user.
$PBS_QUEUE The name of the queue from which the job is executed.
$TMPDIR A job's default working directory is $HOME. That is frequently undesirable because of space limitations and lower I/O performance. Going to $TMPDIR (=/work/$PBS_JOBID), which is created at a job's start and deleted at its end, affords a large disk area and, typically, better I/O performance. You must explicitly save any files you need before job completion. Preferably, in batch jobs you should save such files on local disk areas, such as /scratch/$USER or $HOME. File transfers in a batch job involving the tape archive or a remote host should be strongly avoided because of the possible long delays they can cause.
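
Since $TMPDIR is deleted at the end of the job, a failed copy of a result file is easy to miss. The sketch below shows one way to make the save step explicit and checked. The function name save_results is hypothetical; it relies only on $TMPDIR as described above and on the standard cp command.

```shell
# Hypothetical helper: copy the named files from the job's $TMPDIR to a
# permanent directory, returning non-zero if any copy fails.
save_results() {
    dest=$1
    shift
    status=0
    for f in "$@"; do
        cp "$TMPDIR/$f" "$dest/" || { echo "could not save $f" >&2; status=1; }
    done
    return $status
}

# Example use near the end of a job script, before ja -st:
#   save_results "$PBS_O_WORKDIR" outputfile || echo "results incomplete" >&2
```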

Policies and Best Practices

Batch system policies are approved by the Steering Committee, review@sc.tamu.edu, and may on occasion change to reflect changing needs and load conditions. A key queue policy, made to encourage parallel computation, is that the parallel queues p16, p32, and p64 have higher priority. All other queues have the same priority among themselves. As a result, jobs authorized to execute in p16, p32, and p64 are scheduled for execution ahead of other jobs. A further effect of this policy is the automatic suspension of jobs executing in the lower priority queues. Such jobs are later automatically reactivated. A little care on your part in doing certain things right will go a long way toward keeping cosmos running efficiently and fairly for everyone. Very reluctantly, in order to maintain fairness and efficiency, we will on occasion prematurely terminate jobs. The subsection Abnormal Job Termination lists common reasons for which the staff terminates a job.

Setting Appropriate Job Resource Limits

You should not, as a matter of practice, set resource levels for your job to maximal queue values unless you actually need to. Larger settings are harder to satisfy and, hence, will delay your job's execution on a busy system. This is particularly true when the resource is memory and/or the number of CPUs. Set job resource limits to the lowest possible level consistent with a successful completion. On this point, for example, you need to make sure that if you run commercial code, say, Gaussian, ABAQUS, or FLUENT, the native/internal resource limits which you specify for them and the resource limits you specify in the #PBS -l directive MUST match. If you need help in setting the latter, please contact the Help Desk for assistance.

Invalid Parallel Batch Jobs

Jobs requesting multiple cpus must use multiple cpus simultaneously from a single command. Running multiple independent commands in the background in a batch job script is NOT parallel processing and is not permitted. So that there is NO misunderstanding, the following example illustrates invalid parallel processing, which is NOT permitted.

command1 &
command2 &
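
One alternative consistent with the rule above, assuming the commands are truly independent of each other, is to give each command its own single-cpu job. The sketch below only writes the job files; the file names, the placeholder walltime of 1:00:00, and the choice to run in $PBS_O_WORKDIR are illustrative assumptions and should be adapted to the actual work.

```shell
# Write one single-cpu job file per command (file names are hypothetical).
for cmd in command1 command2; do
    cat > "job_$cmd.pbs" <<EOF
#PBS -N $cmd
#PBS -S /bin/bash
#PBS -j oe
#PBS -l walltime=1:00:00
#PBS -l ncpus=1
cd \$PBS_O_WORKDIR
./$cmd
EOF
    echo "wrote job_$cmd.pbs"
    # Each file can then be submitted separately: qsub job_$cmd.pbs
done
```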

Abnormal Job Termination

The SC staff reserves the right to terminate batch jobs when one or a combination of the following occurs:

  1. Use by your program of a larger number of cpus than its parallel efficiency warrants.
  2. Use by your program of a smaller number of cpus than that specified through PBS (-l ncpus=##). This is a particularly unacceptable practice since it wastes resources that might otherwise be used by others. When you request, say, four cpus by setting -l ncpus=4 in the #PBS directives, PBS sets aside four cpu slots. It knows nothing about the actual number of cpus that your program will use.
  3. Submitting jobs with an artificially large wall-clock or cpu-time.
  4. Use/abuse of a special access queue (e.g., xlong, p16) to run a job that could very well run in one of the common queues.
  5. Excessive I/O with large files, which in turn overwhelms memory due to excessive file caching.
  6. Any use of large amounts of disk and/or memory that causes a significant disruption to the smooth operation of the system.
  7. Delayed file transfers with source or destination hosts that are remote.

Queued Jobs Not Executing

The batch system has limits on the total number of resources a user may use and the total number of jobs a user can run. Also, each queue has limits on the total number of jobs it can run and a limit on the number of jobs it can run per user.

You may find that resources (e.g., cpus) are available but your job is still queued because of one of these limits. Please use the 'qstat -s jobid' command to see why your job is still queued.

Jobs Using Files From the Tape Archive

If your job requires files from the tape archive (or a remote host), we recommend that you first manually copy these files from the archive to your, say, /scratch directory on cosmos before you submit your job. The objective here is to avoid possible delays during batch processing.

Dedicated Jobs

Every other Tuesday the whole machine will be available to run special jobs and/or do maintenance. The maximum processing time per job and/or per user is 4 wall-clock hours. Users who need dedicated use must submit their jobs to the ded_bench queue no later than 12:00 noon on the previous Friday. Use of the ded_bench queue is by permission only. Please direct your requests for access to the Steering Committee, review@sc.tamu.edu. Some of the specific aspects of dedicated use may change from time to time.

Starving Jobs and Backfilling

Queued jobs may become "starved" when delayed by other long-running jobs for some time. If the batch system cannot schedule starving jobs due to a resource or queue limit, it will attempt to schedule other non-starving queued jobs given the available resources. This is known as backfilling.

High Priority Queues

Occasionally, jobs in high priority queues may preempt lower priority jobs. In this case, the lower priority jobs will be suspended until the high priority jobs have completed.

Known Problems

MPI batch jobs are being killed

SGI's MPI requires a very high amount of virtual memory when using the default MPI settings. Your MPI program in your PBS job script may get killed for excessive virtual memory (vmem) usage:

 =>> PBS: job killed: vmem 4219024080kb exceeded limit 536870912kb

There are several environment variables that you can use to lower the virtual memory requirements of your MPI program:

  • Disable memory mapping by using the MPI_MEMMAP_OFF environment variable.
  • Reduce the amount of heap and stack that is memory mapped per MPI process with the MPI_MAPPED_HEAP_SIZE and MPI_MAPPED_STACK_SIZE environment variables respectively.

The effects on the performance of your MPI program will likely vary per MPI application. See the mpi man page for more information about these environment variables.
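
In a bash job script these variables would be set before the mpirun command. The lines below are a sketch only: setting MPI_MEMMAP_OFF to 1 assumes that any value enables it, and the two sizes are placeholder values assumed to be in bytes. Both assumptions should be confirmed in the mpi man page.

```shell
# Option 1: turn memory mapping off entirely.
# (Assumption: the variable only needs to be set; the value 1 is arbitrary.)
export MPI_MEMMAP_OFF=1

# Option 2 (instead of option 1): keep memory mapping but reduce the mapped
# heap and stack per MPI process. The sizes are placeholders, assumed to be
# in bytes; uncomment and adjust after checking the mpi man page.
# export MPI_MAPPED_HEAP_SIZE=536870912
# export MPI_MAPPED_STACK_SIZE=67108864
```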

Additional Information

More information about batch processing can be obtained from the following man pages: pbs, qsub, qstat, qdel, and pbs_resources.