Texas A&M Supercomputing Facility, Texas A&M University

Compiling & Running Programs

Using the IBM Compilers

On hydra there are compilers for Fortran, C, and C++. These in turn have different driver names for the different programming paradigms (serial, MPI, OpenMP, etc.). The following table provides a brief overview of the commonly invoked compiler command (driver) names.

Language     Default File Suffix     Serial         MPI        OpenMP             Mixed
Fortran 77   .f                      xlf            mpxlf      xlf_r              mpxlf_r
Fortran 90   .f, .f90                xlf90          mpxlf90    xlf90_r            mpxlf90_r
Fortran 95   .f, .f90, .f95          xlf95          mpxlf95    xlf95_r            mpxlf95_r
C            .c                      xlc or cc      mpcc       xlc_r or cc_r      mpcc_r
C++          .C or .c                xlC            mpCC       xlC_r              mpCC_r

Serial Programs

Below we illustrate the use of a minimal number of options and environment settings consistent with good, reliable performance. Throughout, we use the Korn shell as the system's command interpreter.

Because the home directories are hosted over Ethernet on "remote" systems (4 Power5 520 workstations), compiling, linking, or executing sizable files there is not recommended. You will get better response time in the /scratch/username directory or, when running in batch mode, in /work/jobid.

To compile and link serial (sometimes called scalar) source codes you need to use one (or more) of the following commands:

xlf90 -q64 [-o exec_file] [options] files [-L path] -l libs   #  Fortran
xlc   -q64 [-o exec_file] [options] files [-L path] -l libs   #  C
xlC   -q64 [-o exec_file] [options] files [-L path] -l libs   #  C++

where files are the input source or object (.o suffix) files, and options are any compatible combination of compiler and linker options available. Linker options are specified to the right of compiler options. Once the executable, exec_file, has been generated, just run it as any other command.

Compiler Options

Listed below are brief explanations for some specific options of particular importance.

Option Description
-o exec_file Specifies that the name of the executable file be exec_file. The default name is a.out.
-c Skips the link step (no executable is generated) and writes a binary object file, suffixed .o, for each source file.
-q32 or -q64 Generates 32-bit or 64-bit executables, respectively; 32-bit is the default executable file format. This option is unrelated to program-defined data sizes. We recommend 64-bit as standard practice, since it allows a much larger memory allocation and addressing range. See also the comments below about the OBJECT_MODE environment variable.
-qsuffix=f=f90 Fortran 90 only. Specifies that the default suffix for source files is .f90, instead of IBM's default, .f.
-qfixed Specifies that the Fortran input source code is in fixed (default for f77) format.
-qfree Specifies that the Fortran input source code is in free format.

Instead of using the -q32 or -q64 options, you can also set the OBJECT_MODE environment variable:

Environment Variable Description
OBJECT_MODE Set to 32 or 64 to generate 32-bit or 64-bit binaries; it also selects whether 32-bit or 64-bit libraries are used when linking. Setting this environment variable combines the action of -q32 (or -q64) at compile time and -b32 (or -b64) at link time. 32-bit and 64-bit binaries cannot be mixed. See the Example section for an example.

Prefer 64-bit executables. Always build 64-bit executables if you have a choice. These, and not the 32-bit versions, include all the high-performance features of the Power5+ architecture and allow a very large memory address limit. With 32-bit executables you are limited to 4 gigabytes, a limit frequently exceeded when solving large problems. It is for these reasons that we consistently use the -q64 compile option below. Note that this 64-bit issue has nothing to do with numeric data precision.

Linking options

Option Description
-lname Searches the library called libname.a or libname.so for external routines that are referenced in the program. A library is searched when its name is encountered, so the placement of a -l operand is significant.
-Lpath Changes the library search algorithm of the loader: path names a directory to be searched before the default system library directories. You can specify multiple -L options on the command line; the directories are searched in left-to-right order.

Code optimization options

Option Description
-qarch=auto Produces object code containing instructions that will run on the hardware platform on which it is compiled
-qtune=auto Produces object code optimized for the platform on which it is compiled
-qhot Performs high-order transformations to maximize the efficiency of loops and array language. Some of the transformations may slightly change program semantics, but this can be avoided by also using the -qstrict option.
-O[n] Specifies the level of optimization. n can be 2, 3, 4, or 5; higher levels include and combine a progressively larger number of optimization types. Levels 3, 4, and 5 include the -qhot option mentioned above and are aggressive: on occasion (e.g., for numerically sensitive codes), they may alter program semantics. The potential performance gain, however, can be very substantial.

Debugging options

Option Description
-qreport=hotlist Produces a report showing how loops were transformed during optimization (i.e., on account of applying one of the -O[n] options). The output is directed to the .lst-suffixed file.
-qsource Produces a source listing, directed to a .lst-suffixed file
-g Generates symbol and source level line information in the targeted object files. Does not cause appreciable performance degradation nor does it affect compiler optimizations.

Object Files & Shared Objects

At times, you may need to compile a source file without a main function into an object file. This is useful for separating different segments of code. To compile object files, invoke the compiler with these options:

xlc -q64 -c myfunc1.c myfunc2.c     # writes myfunc1.o and myfunc2.o (no executable)

Note that multiple .c source files or .o object files can be compiled and linked in a single command.

You may also want to write your own shared library for use with other programs. In this case, you will need to generate a shared object file (.so, or a shared .a archive on AIX) so that programs you write later can link against this library and call its functions. To build a shared library with the XL compiler, compile with position-independent code and then create the shared object:

xlc -q64 -qpic -c mylib.c
xlc -q64 -qmkshrobj -o libmylib.so mylib.o

The output file could equally well have been named libmylib.a.

Example

The following compiles the C sources into a 64-bit executable, prog.exe, with level-3 optimization tuned for the hydra architecture, and then executes it serially:

export OBJECT_MODE=64                                                            # link only the 64-bit system libraries
xlc -q64 -qarch=auto -O3 -o prog.exe src1 src2 fun3.o ... [ [-L path] -l libs ]
./prog.exe [ input_parameters ]                                                  # run/execute

OpenMP and Multi-threaded (non-MPI) Programs

To compile and run standard OpenMP codes, or codes using pthreads, use a non-MPI compiler command whose name ends in _r:

xlf90_r -qsmp=omp -q64 -o exec_file [options] files [-L path] -l libs   #  Fortran
xlc_r   -qsmp=omp -q64 -o exec_file [options] files [-L path] -l libs   #  C
xlC_r   -qsmp=omp -q64 -o exec_file [options] files [-L path] -l libs   #  C++

where files are the input source or object (.o suffix) files, and options are any compatible combination of compiler and linker options available. Linker options are specified to the right of compiler options.

Compiler options

Listed below are brief explanations for the above arguments and options, as well as some specific options of particular importance.

Option Description
-qsmp=omp Enables shared-memory parallelization where it is specified through SMP (Symmetric Multiprocessing) directives, whether OpenMP ($OMP) or other IBM-specific ones
-qsmp Enables automatic multi-threaded parallelization of the code in the shared-memory model. This option is the same as -qsmp=auto. Its range of action includes the -qhot option; because the latter can change program semantics, -qsmp should be used with caution.
-qreport=smplist Produces a report showing how the program is parallelized. Potentially very useful.

Other important options that can be used include those detailed in the section for Serial Programs above (such as -o, -q32, -q64, -O[n], -L, -l, and other optimization and debugging options). These options carry the same meanings for the compilation of both serial and parallel programs.

Environment variables

After compilation, you must set the appropriate OpenMP and other environment variables needed by your program, such as OMP_NUM_THREADS, OMP_SCHEDULE, OMP_DYNAMIC, and AIXTHREAD_SCOPE. A more detailed list appears in the table below.

Environment Variable Description
OMP_NUM_THREADS Sets the number of threads to use during execution, unless that number is explicitly changed by calling the OpenMP subroutine, OMP_SET_NUM_THREADS.
OMP_SCHEDULE Sets the schedule type and (optionally) the chunk size for DO and standalone PARALLEL DO loops declared with a schedule of RUNTIME. For these loops, the schedule is set at run time when the system reads the value of this environment variable. Valid values for this variable are STATIC, DYNAMIC, and GUIDED. The default value for this environment variable is STATIC.
OMP_DYNAMIC=FALSE or TRUE
(default=TRUE)
Disables (=FALSE) or enables (=TRUE) dynamic adjustment of the number of threads available for the execution of parallel regions. Enabling dynamic thread adjustment may add a small amount of overhead.
AIXTHREAD_SCOPE=S or P
(default=P)
Sets the thread contention scope to be system (S) or process (P). When system contention scope is used, each user thread is directly mapped to one kernel thread. This is the appropriate setting for most scientific applications in which one wants the user threads to map one-to-one to processors. Process contention scope (P) is best when there are many more threads than processors. When process contention scope is used, user threads share a kernel thread with other (process contention scope) user threads in the process.
MALLOCMULTIHEAP=true
MALLOCMULTIHEAP=heaps:##[,considersize]
(default = not set)
Malloc multiheap feature: creates a fixed number of heaps. Each memory allocation request is serviced using one of the available heaps. When set to TRUE it enables the configuration of 32 memory heaps. To specify a lower number of heaps, use instead the heaps:## form. Thus, for each thread to allocate (in a round-robin way) and manage its own heap, use MALLOCMULTIHEAP=heaps:n, where n can be the number of threads/cpus desired. Round-robin allocation means that all heaps are used, needed or not. To bypass round-robin usage and instead allocate space from the first available heap, also use the "considersize" suboption. The considersize option may be slower but it helps reduce the working set size and the number of sbrk() calls. When the multiheap feature is not activated, only one thread at a time (i.e., serially) can issue malloc(), free(), or realloc() calls, all affecting a single heap.
XLSMPOPTS=stack=n
(SMP stack size (default = 4 MB/thread))
Sets a thread's run-time stack size, where n is in bytes. Remember, that for 32-bit applications the total stack size for all threads in a process cannot exceed 256 MB. This limitation does not apply to 64-bit applications.
SPINLOOPTIME=n
(default = 40)
Sets the number of times a user thread will spin-idle when it cannot acquire a lock (e.g., before it begins a parallel loop). When the spin count has been exhausted, the thread goes to sleep waiting for a lock to become available, unless the YIELDLOOPTIME environment variable is set to a number greater than zero. You want to spin rather than sleep if you are waiting for a previous parallel loop to complete, provided there is not too much sequential work between the parallelized loops. If YIELDLOOPTIME is set, then upon exhausting the spin count the thread issues the yield() system call and gives up the processor, but stays in a runnable state rather than going to sleep. On a quiet system, yielding is preferable to sleeping, since reactivating the thread after sleeping is more costly. On a busy system, SPINLOOPTIME should not be set too large, otherwise valuable processor time that could be shared with other jobs is consumed spinning. Some experimentation may be required to reach an optimal setting.
YIELDLOOPTIME=n
(default = 0)
Used only when SPINLOOPTIME is also set. It sets the number of times that the system yields a processor when trying to acquire a busy spin lock before going to sleep. The processor is yielded to another kernel thread, if one is available.

Now you are ready to invoke the program by entering the program name on the command line.

Examples

For example, to compile and link OpenMP programs:

Fortran Compiler:   xlf95_r

 xlf95_r -q64 -qarch=auto -qsmp=auto:omp -O3 -o prog.exe src1.f90 src2.f90 sub3.o ... [ [-L path] -l libs ]

Note that the .f90 suffix implies "free" (default) source form. To use fixed source form, specify -qfixed[=<right_margin>], where the right margin can be at most 132 and at least 72. Use this option when compiling Fortran 77 source.

C/C++ Compiler:   xlc_r

xlc_r -q64 -qarch=auto -qsmp=auto:omp -O3 -o prog.exe src1 src2 fun3.o ... [ [-L path] -l libs ]

To run the OpenMP program compiled above:

        export OBJECT_MODE=64             # link only 64-bit needed system libs 
        export OMP_NUM_THREADS=4          # run on 4 threads/processors
        export AIXTHREAD_SCOPE=S          # map 1-on-1 user threads to System threads
        export MALLOCMULTIHEAP=heaps:4    # generate 4 heaps (=$OMP_NUM_THREADS) 1 heap/thread
        export OMP_SCHEDULE=DYNAMIC       # schedule loop threads using OpenMP's DYNAMIC scheduling algorithm
        ./prog.exe [ input_parameters ]   # Run/Execute

Note that a number of environment variables can significantly affect the performance of your OpenMP codes. Other useful environment variable settings that may be needed are:

    export OMP_DYNAMIC=FALSE
    export XLSMPOPTS=stack=100000000
    export SPINLOOPTIME=10000
    export YIELDLOOPTIME=4000

Note that the maximum useful setting for the OMP_NUM_THREADS environment variable is 16. This reflects the fact that OpenMP operates, by definition, only within a shared-memory environment. Here that system is a single p575 node: 16 Power5+ processors and 32 gigabytes of shared memory. For interactive runs, please limit OMP_NUM_THREADS to 8, and then only for short runs.

MPI Programs

To compile MPI programs:

Fortran Compiler:   mpxlf95_r

mpxlf95_r -q64 -qarch=auto -O3 -o prog.exe src1.f90 src2.f90 sub3.o ... [ [-L path] -l libs ]

The .f90 suffix implies "free" (default) source form. To use fixed source form, specify -qfixed[=<right_margin>]. Use this option when compiling Fortran 77 source.

C/C++ Compiler:   mpcc_r

mpcc_r -q64 -qarch=auto -O3 -o prog.exe src1 src2 fun3.o ... [ [-L path] -l libs]

Execute your MPI program under the Parallel Operating Environment (POE) by invoking the poe command.

Parallel Operating Environment

POE, IBM's native MPI software stack, stands for Parallel Operating Environment. For interactive execution of POE (MPI) code on Hydra, you must first set up two files appropriately, namely:

  1. an .rhosts in your login directory, and
  2. a "host.list" file in the same directory where execution takes place.

The .rhosts file lists the host names of the hosts from which the user can request interactive execution of their MPI code; these are the two Hydra nodes that are configured to execute interactive code.

The host.list file lists the names of the hosts that will execute the MPI tasks of the interactive code. The file contains one host name per line, one line per MPI task; the total number of entries must be at least np.

Here are representative samples of the .rhosts and host.list files.

               .rhosts file (in $HOME)
              _________________________
             |hydra1.tamu.edu username |
             |f1n9 username            |
             |           :             |
             |_________________________|


               host.list file (in the execution directory)
              _____________
             |f1n9-s       |
             |f1n9-s       |
             |f1n9-s       |
             |f1n9-s       |
             |      :      |
             |_____________|

All of this business with .rhosts and host.list is relevant ONLY when using poe interactively. Here are two examples of interactive invocations that are consistent with the entries in the above files:

  poe ./prog.exe -resd no -procs 4 -euilib us -single_thread yes
  poe ./prog.exe -resd no -procs 8 -tasks_per_node 4 -nodes 2 -euilib us -single_thread yes

Almost all of poe's options can also be set through environment variables. For example, setting MP_PROCS to the needed number of CPUs is equivalent to setting the command option, -procs n. For the whole story see the poe man page. More information on some POE options:

Option Description
-procs nn Specifies that the number of tasks to run your program in parallel be set to nn. Typically, tasks are mapped 1-to-1 on processors.
-euidevice sn_single Specifies that each MPI task use only a single HPS adapter. LoadLeveler will assign tasks to adapters in a round robin manner. If a job has at least as many MPI tasks per node as there are adapters, all adapters on the node will be used.
-euidevice sn_all Specifies that the HPS scheduler allocate both (2) adapters on each node to each MPI task
-euilib us Specifies that MPI communication use IBM's "user-space" protocol
-euilib ip Specifies that MPI communication use the UDP/IP transport protocol. Comparatively very slow
-shared_memory yes Specifies that MPI will use shared memory protocol (NOT IP) for message passing between two or more tasks within the same IBM p575 node. Make sure that you set this option to "yes", because the default ("no") results in much lower performance.
-single_thread yes Specifies that each MPI task will make MPI calls from within a single execution thread, even if the task may spawn multiple threads during execution. This provides a hint to the transport protocol to use the more efficient locking mechanisms available in the IBM Power Architecture and avoid paying the penalty of the heavyweight pthread_mutex locks. However, if you use MPI-IO and/or MPI-1SC (MPI one-sided communication), set it to "no"
-infolevel n Specifies the level of message reporting. The default is 1 (warning and error). Higher levels (2,3,...,6) provide progressively more diagnostic information.
-wait_mode poll Directs that an MPI thread engage in polling, when blocked waiting for a message to arrive, in order to detect such arrivals. (Other nonoptimal values are yield, sleep, and nopoll)

To run one of the MPI programs compiled above:

poe prog.exe [ prog input_parameters ] 	
	-resd no			# Needed only for interactive use
	-shared_memory yes 		# Uses shared memory protocol within a node
	-procs np			# Use np processors
	-nodes n			# Max n=2 for interactive & 32 for batch
	-tasks_per_node nt		# nt=1 for serial & OpenMP codes 
	-euidevice sn_single		# Use HPS single plane. Batch use only
	-euilib us			# US protocol appropriate for HPS use
	-single_thread yes		# Useful when not using MPI-IO & MPI-1SC

Other useful POE options:

        -pgmmodel {spmd | mpmd}         # Default is spmd
        -cmdfile commands_file          # File containing names of executables, useful for MPMD
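For an MPMD run, the commands file named by -cmdfile simply lists one executable per MPI task, one per line. A hypothetical four-task example (the program names are ours, for illustration only):

```
./master.exe
./worker.exe
./worker.exe
./worker.exe
```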

Most of poe's command-line options, including the above, can also be set with environment variables:

# (Borne or Korn Shell)
export MP_PROCS=n   
export MP_SHARED_MEMORY=yes
export MP_INFOLEVEL=n (0,1,..,6)
export MP_WAIT_MODE=poll
export MP_RESD=no

# (C-Shell)
setenv MP_PROCS n   
setenv MP_SHARED_MEMORY yes
setenv MP_INFOLEVEL n
setenv MP_WAIT_MODE poll
setenv MP_RESD no

Example

Make sure the .rhosts and host.list files are available in the appropriate directories.

mpxlf90_r -qhot -o prog.exe prog.f
poe ./prog.exe -procs 2 -shared_memory yes

Note that due to CPU time or memory limits imposed on interactive processing on Hydra, you may not be able to interactively test MPI programs requiring a greater amount of such resources. In such cases you will need to submit your program for execution as a batch job.

Newer AIX Compilers

We have installed the latest AIX compilers on Hydra, in a non-default installation location. The default compilers remain the older ones (C/C++ v10.1 and Fortran v12.1).

To activate a new compiler, load the corresponding environment module. Activating a new compiler version also forces the AIX MPI compiler wrappers (e.g., mpcc_r, mpxlf_r) to use the new compilers.

Use module help ModName to obtain pointers to HTML and PDF documentation for these new compilers. To de-activate a new compiler use the standard module unload ModName command.

The latest AIX compilers support new architectures and, more importantly, offer better compliance with new and evolving standards. Please consult IBM's compiler documentation site for details.

Older AIX Compilers

The previous AIX compilers (C/C++ V8.0, Fortran V10.1) are still available on the Hydra cluster. These compilers may be used with older software that does not conform to the latest C++ or OpenMP standards.

To use the older compilers, simply load the appropriate compiler module:

Fortran V10.1:        module load xlf10.1
C V8.0:               module load xlc8.0
and use the same compiler commands as listed above.

To revert your environment back to the default compilers, unload or remove the compiler module. See the Application Environment Management page for more information on using environment modules.

Additional Information

Various environment variables can also be set to tune the performance of MPI programs. Consult the man pages (man mpxlf95_r, man mpcc_r, or man poe) for detailed information.

More in-depth information about the compilers and MPI can be found in IBM's online documentation.