Introduction to Batch Processing

Last modified: Friday October 04, 2013 10:33 AM

The batch facility on Eos is the public version of PBS (TORQUE resource manager and the Maui scheduler). The batch queues are broadly divided into those that process batch requests/jobs from any user and those that handle requests only from users that have contributed financially towards the purchase of Eos. Batch requests by "contributor" users are scheduled to execute at the highest priority consistent with the size of their contribution; those submitted by the general user are scheduled to execute only when the requested resources (e.g., number of nodes) are available. To view the list of configured queues and their associated resource limits, run the qlimit command or view the queue status web page.

A reminder. For the purpose of computation in batch mode, the Eos cluster has 314 nodes powered by the Nehalem processor and 48 by the Westmere. The Nehalem-based nodes have 8 cpus/cores each, while the Westmere nodes have 12. All compute nodes, Nehalem or Westmere, have 24 giga-bytes of shared memory each. Of the 24 giga-bytes, only about 22 giga-bytes are available to user processes. These figures are useful to bear in mind when constructing batch requests.
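
For instance, a request for one full Westmere node should keep the memory specification at or below the roughly 22 giga-bytes actually available on it. A minimal sketch (the walltime value is just a placeholder):

#PBS -l nodes=1:ppn=12,mem=22gb,walltime=01:00:00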

Batch Job files/Scripts

A batch request is expressed through a batch file, that is, a text file (commonly called a job script) containing the appropriate directives and other specifications or commands. A batch file, say, sample.job, consists of the batch directives section (top part) and the (UNIX) commands section. In the latter you specify all the commands that need to be executed. All PBS directives start with the #PBS string. Here is the general layout of a common PBS job file.


#PBS directive(s)1
#PBS directive(s)2
#PBS ...

#UNIX commands section. From here on down "#" on col 1 starts a comment
#<-- at this point $HOME is the current working directory
cmd1
cmd2
...

The UNIX command section is executed on a single node or multiple nodes. Serial and OpenMP programs execute on only one node, while MPI programs can use 1-358 nodes. The default current working directory is $HOME. If that is not a practical choice for you, you should explicitly change (cd) to the directory of your choice. Often a convenient choice is the directory you submit the job from. PBS stores that directory's location in the PBS_O_WORKDIR environment variable. Also, by default, the executing UNIX shell is the bash shell.
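
A minimal sketch of how the commands section then often begins:

#<-- commands section; the default working directory here is $HOME
cd $PBS_O_WORKDIR     # change to the directory the job was submitted from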

Because the correct specification of resources in PBS directives is important and a frequent source of error, we first present a number of examples. After that we lay out an example of a complete job file.

#PBS -l nodes=16:nehalem:ppn=8,mem=128gb,walltime=04:00:00
or, equivalently:
#PBS -l nodes=16:nehalem:ppn=8 <-- ppn is number of cores per node
#PBS -l mem=128gb
#PBS -l walltime=04:00:00

This directive will allocate 16 Nehalem nodes that have 8 cores/node, with an aggregate memory allocation of 128 giga-bytes; that is, the per node main memory allocation will be 8 giga-bytes. Memory specification per job is mandatory. The duration (wall-clock time) of execution is specified to be a maximum 4 hours.

#PBS -l nodes=1:ppn=4,mem=20gb,walltime=14:00:00,billto=0123456788
This directive will allocate 1 node and only 4 of its cores. The aggregate memory allocation will be 20 giga-bytes. Memory specification per job is mandatory. The duration (wall-clock time) of execution is specified to be a maximum 14 hours. Finally, the number of billing units (BUs) used will be charged against the 0123456788 account.

#PBS -l nodes=2:ppn=12,mem=44gb,walltime=00:50:00
This directive will allocate 2 Westmere nodes (12 cores/node) with an aggregate memory allocation of 44 giga-bytes; that is, the per node main memory allocation will be 22 giga-bytes. Memory specification per job is mandatory. The duration (wall-clock time) of execution is specified to be a maximum 50 minutes.

#PBS -l nodes=1:ppn=12:gpus_1,mem=20gb,walltime=01:30:00 -q gpu
This directive will allocate 1 Westmere node (12 cores/node) that has (at least) one GPU. The aggregate memory allocation will be 20 giga-bytes. Memory specification per job is mandatory. The duration (wall-clock time) of execution is specified to be a maximum 1.5 hours.

#PBS -l nodes=2:ppn=12:gpus_1+2:ppn=12:gpus_2,mem=88gb,walltime=04:00:00 -q gpu
This directive will allocate 4 Westmere nodes (12 cores/node), 2 with 1 GPU each, and 2 with 2 GPUs each. The aggregate memory allocation will be 88 giga-bytes; that is, the per node main memory allocation will be 22 giga-bytes. Memory specification per job is mandatory. The duration (wall-clock time) of execution is specified to be a maximum 4 hours.

You submit the job script for execution using the qsub jobfile command (see below). PBS then assigns it a job id, which you can use to track your job's progress through the system. The job's priority relative to other jobs will be determined based on several factors. This priority is used to order the jobs for consideration by the batch system for execution on the compute nodes.

Example Job Script

The following job file, call it mpi_sample.job, is meant for running, for up to 4 hours, a 40-way MPI program that uses 5 nodes simultaneously with 8 tasks on each. Aggregate maximum memory is 40 giga-bytes, that is, 1 giga-byte per MPI task/process. The -j oe specification will merge standard out and error into a single file, named mpi_sample.o####, where #### is the job's id. All commands are to be run under the bash shell. At job termination, the mpi_sample.o#### file is deposited in the directory the job was submitted from ($PBS_O_WORKDIR).

 #PBS -l nodes=5:ppn=8,walltime=04:00:00,mem=40gb
 #PBS -N mpi_sample
 #PBS -S /bin/bash
 #PBS -j oe
 #
 ## Some job PRE-processing may go here
 mkdir -p $SCRATCH/some/dir
 cd $SCRATCH/some/dir

 # Initialize the Intel compilers and OpenMPI to use for this job
 module load intel/compilers
 module load openmpi

 # $PBS_O_WORKDIR is the directory from which the job was submitted
 cp $PBS_O_WORKDIR/files.* .

 ## Issue the MPI command

 mpirun $PBS_O_WORKDIR/mympi arg1 arg2

 ## Some job POST-processing may go here
 cp ./files.*  $PBS_O_WORKDIR

An Explanation. Please note that the directive, #PBS -l nodes=5:ppn=8,walltime=04:00:00,mem=40gb, specifies in one-line format three different types of resources: nodes, memory, and wall-clock time. No space (or other whitespace) characters should be included on either side of a comma (","), the separation character between resource types. You could equally well have expressed the same thing by entering three separate resource directives:


#PBS -l nodes=5:ppn=8
#PBS -l mem=40gb
#PBS -l walltime=04:00:00

Job Submission: the qsub command

Use the qsub command to submit a job script as shown below:


$ qsub sample.job 
1234.login006.sc.cluster.tamu

When submitted, the job is assigned a unique id. You may refer to the job using only the numerical portion of its id (e.g., 1234) with the various batch system commands.
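
Once the job is in the system, you can track it with the commands described under Useful Commands below; for example, using the numerical id from the submission above:

$ qstat -a 1234
$ checkjob -v 1234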

Efficient use of the batch system

On a node powered by the Nehalem processor you can have up to 8 1-way/core programs (e.g., serial ones) executing concurrently. On a Westmere node that number can be up to 12. Depending upon the specific execution and performance characteristics of a program, however, the most efficient number of concurrent executions may well be fewer. For a serial program that makes intensive use of memory, for example, it may be best to run only 6 or 4 concurrent executions on a node. Here is one way, using backgrounded subshells, to run a serial program, serial.exe, 8 concurrent times on a node with just one job submission.


#PBS -l nodes=1:ppn=8,mem=20gb,walltime=04:00:00
#PBS . . .
# ... assumes all needed files are in $PBS_O_WORKDIR
#
module load intel/compilers
#
cd $PBS_O_WORKDIR
(./serial.exe < indata1 > outdata1) &
(./serial.exe < indata2 > outdata2) &
. . .                               &
. . .                               &
(./serial.exe < indata8 > outdata8)
#
wait 
# ------- End of multi-execution serial job ------

The above job will complete when the longest running of the 8 executions is done, say, in max_time hours. That execution wall-clock time, max_time, should be well below the sum of the wall-clock times of 8 separate 1-way submissions.
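
For a memory-intensive serial program, the same backgrounded-subshell pattern can be used with fewer concurrent copies. A sketch that still requests a whole Nehalem node but runs only 4 executions, so that each has more of the node's memory to itself (the mem and walltime values are placeholders):

#PBS -l nodes=1:ppn=8,mem=20gb,walltime=04:00:00
#PBS . . .
#
cd $PBS_O_WORKDIR
(./serial.exe < indata1 > outdata1) &
(./serial.exe < indata2 > outdata2) &
(./serial.exe < indata3 > outdata3) &
(./serial.exe < indata4 > outdata4) &
#
wait
# ------- End of 4-way multi-execution serial job ------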

A job file for running two OpenMP programs, a 2-way and a 10-way, may look something like this when directed to a Westmere node:


#PBS -l nodes=1:ppn=12,mem=20gb,walltime=04:00:00
#PBS . . .
# ... assumes all needed files are in $PBS_O_WORKDIR
#
module load intel/compilers
#
cd $PBS_O_WORKDIR
(export OMP_NUM_THREADS=2; ./omp_prog1.exe < indata1 > outdata1) &
(export OMP_NUM_THREADS=10; ./omp_prog2.exe < indata2 > outdata2)
#
wait    
# ------- End of multi-execution OpenMP job ------

When the subshell construction, (cmd1; ...), becomes too unwieldy, backgrounded shell functions are a handy way to proceed. See the example below with attached comments.

#PBS -l nodes=1:ppn=8,walltime=90:00:00,mem=16gb
#PBS  ...
# This job creates the subdirectories $SCRATCH/HIG21, ..., $SCRATCH/HIG28 and copies
# into each one the corresponding input file input.HIG21, ..., input.HIG28 from the
# job submission directory. Standard out and error are re-directed to HIG21.out and
# HIG21.err, ..., HIG28.out and HIG28.err. The same executable, myprog, is run off of
# the job submission dir $PBS_O_WORKDIR, getting its standard input, in each of the
# above subdirectories, from input.$subjob, where $subjob is one of HIG21, ..., HIG28.
 
 function runsubjob() {
     subjob=$1; mkdir -p $SCRATCH/$subjob; cd $SCRATCH/$subjob
     cp $PBS_O_WORKDIR/input.$subjob .
     # re-direct this subjob's standard out and error to its own files
     exec 1>$subjob.out 2>$subjob.err
     t_start=$(date)
     echo start subjob $subjob in $(pwd) at $t_start
     $PBS_O_WORKDIR/myprog < input.$subjob
     t_end=$(date)
     echo finished subjob $subjob at $t_end
     # convert the recorded times to seconds since the epoch and report the elapsed time
     t_start=$(date -d "$t_start" +%s); t_end=$(date -d "$t_end" +%s)
     (( t_elapsed = t_end - t_start )); echo $t_elapsed seconds elapsed
 }
 
 module load intel/compilers

 echo running 8 subjobs > jobsplit.out
 
 for subd in HIG21 HIG22 HIG23 HIG24 HIG25 HIG26 HIG27 HIG28
   do
     runsubjob $subd &
     echo $subd sub job running in background >> jobsplit.out
   done
 
 date >> jobsplit.out
 
 wait

One can come up with all sorts of variations of the above job files to suit specific needs. The important thing is to realize that within a single job file and, therefore, with a single job submission, you can increase your work efficiency many times over for one-node jobs.

Submission Options

Below are the commonly useful submission options. These options can be specified as #PBS directives in your job script (recommended) or on the command line of the qsub command. The qsub man page describes the available submission options in more detail.

-l billto=acctno
    Specifies the billing account to use for this job. Please consult the AMS documentation for more information.
-l nodes=X:ppn=Y:gpus=Z
    Number of nodes (X), cores per node (Y), and gpus per node (Z) to be assigned to the job. For Nehalem and Westmere nodes, the maximum value of Y is 8 and 12, respectively. When gpus are specified, X must be between 1 and 4, and Z must be set to 1 or 2.
-l walltime=HH:MM:SS
    The limit on how long the job can run.
-W x=NACCESSPOLICY:SINGLEJOB
    Specifies that PBS will not schedule any jobs other than the present one on the engaged nodes, regardless of the value of ppn. This is a useful option when, for performance reasons for example, sharing nodes with other jobs is undesirable. Usage per node will be assessed as 8 (or 12) * wall_clock_time.
-N jobname
    Name of the job. When used with the -j oe option, the job's output will be directed to a file named jobname.oXXXX, where XXXX is the job id.
-S shell
    Shell for interpreting the job script. The recommended shell is /bin/bash.
-q queue_name
    Directs the submitted job to the queue_name queue. On Eos this option should be exercised only by contributors or when submitting jobs to the gpu nodes.
-m [a|b|e|n]
    Controls when mail about the job is sent. If not specified, mail is sent only when the job is aborted.
      • a: mail is sent when the job is aborted by the batch system.
      • b: mail is sent when the job begins.
      • e: mail is sent when the job ends.
      • n: mail is never sent.
-M email_address(es)
    The email addresses to send mail about the job. Using an external email address (e.g., @tamu.edu, @gmail.com, etc.) is recommended.
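
As a sketch, several of these options are commonly combined at the top of a job script; the resource values, job name, and email address below are only placeholders:

#PBS -l nodes=1:ppn=8,mem=20gb,walltime=02:00:00
#PBS -N my_job
#PBS -S /bin/bash
#PBS -j oe
#PBS -m e
#PBS -M someone@tamu.edu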

Environment Variables

Below are some environment variables that can be accessed from a job script. See the qsub man page for the complete list of Torque environment variables.

$PBS_JOBID
    Contains the job id.
$PBS_O_WORKDIR
    Contains the directory from which the job was submitted.
$PBS_NODEFILE
    Stores the name of the file that contains the list of compute nodes assigned to the job. A useful command to capture the contents of the list may be cat $PBS_NODEFILE.
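
As an illustration, the following lines, placed in the commands section of a job script, save the assigned node list and count the allocated task slots (the NPROCS variable name is just a local choice; the node file typically contains one line per allocated core):

cat $PBS_NODEFILE > nodes.assigned      # save the list of assigned nodes
NPROCS=$(wc -l < $PBS_NODEFILE)         # count the allocated task slots
echo this job was allocated $NPROCS task slots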

Useful Commands

Below we list commonly used commands to track and interact with the batch system. You will find a fuller explanation in the corresponding man pages.

qsub jobfile
    Submits a job script (jobfile); it can also be used to start an interactive batch job.
qstat -a
    View information about queued jobs.
qstat -f jobid
    View detailed information (such as assigned nodes and resource usage) about a queued job.
showq
    View information about queued jobs. Useful options are -r (show running jobs), -i (show idle jobs that are eligible for scheduling), and -b (show blocked jobs that are ineligible for scheduling due to queue policies).
checkjob jobid
    View the status of your job with id = jobid. Use the -v option to view additional information about why a queued job is waiting.
qdel jobid
canceljob jobid (Maui)
    Terminates the job with id = jobid.
qlimit
    View the status of the queues.
jrpt jobid
    Lists resource use information at the OS process level. An in-house command, it is useful in tracking the efficient (or inefficient) use of resources, especially for parallel jobs. The listed data is most accurate when jobs allocate whole nodes. Periodic use may also be required in order to obtain a realistic "sample" for the listed data. Important: To invoke jrpt in the form we show, you must first add the following command to your .bashrc control file on Eos: alias jrpt='/g/software/PBSutils/jrpt/bin/jrpt'.
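
For example, after adding the alias above to your .bashrc, you would examine the resource use of a running job (using the id from the earlier qsub example) with:

$ jrpt 1234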