Compiling & Running Programs

Last update on Monday, 04-Jun-2007 11:42:10 CDT.

Compilers

On agave there are compilers for FORTRAN, C, and C++. These in turn have different names for different programming paradigms (serial, MPI, OpenMP, etc.). The following table provides a brief overview of the commonly invoked compiler command (driver) names.

Default Language   Serial        File Suffix   MPI       OpenMP           Mixed (MPI + OpenMP)
Fortran 77         xlf           .f            mpxlf     xlf_r            mpxlf_r
Fortran 90         xlf90         .f or .f90    mpxlf90   xlf90_r          mpxlf90_r
Fortran 95         xlf95         .f            mpxlf95   xlf95_r          mpxlf95_r
C                  xlc or cc     .c            mpcc      xlc_r or cc_r    mpcc_r
C++                xlC           .C or .c      mpCC      xlC_r            mpCC_r

Serial Programs

To compile and link serial (sometimes called scalar) source codes you need to use one (or more) of the following commands:

Fortran:
xlf90 [-o exec_file] [options] files [-L path] -l libs
C:
xlc   [-o exec_file] [options] files [-L path] -l libs
C++:
xlC   [-o exec_file] [options] files [-L path] -l libs

where files are the input source or object (.o suffix) files, and options are any compatible combination of compiler and linker options available. Linker options are specified to the right of compiler options. Once the executable, exec_file, has been generated, just run it as any other command.

Compiler Options

Listed below are brief explanations for some specific options of particular importance.

Option Description
-o exec_file Specifies that the name of the executable file be exec_file. The default name is a.out.
-c Disables the load step (no executable generated) and writes the binary object file suffixed by .o.
-q32 or -q64 Generates 32-bit or 64-bit executables, respectively. 32-bit is the default executable file format. This option is unrelated to program-defined data sizes. We recommend 64-bit as standard practice, since it automatically allows for a much larger memory allocation and addressing range. See also the comments below about the OBJECT_MODE environment variable.
-qsuffix=f=f90 Fortran 90 only. Specifies that the default suffix for source files is .f90 instead of IBM's default, .f.
-qfixed Specifies that the Fortran input source code is in fixed format (the default for f77).
-qfree Specifies that the Fortran input source code is in free format.

Instead of using the -q32 or -q64 options, you can also set the OBJECT_MODE environment variable:

Environment Variable Description
OBJECT_MODE Can be set to 32 or 64 to generate 32-bit or 64-bit binaries. It also specifies whether 32-bit or 64-bit libraries are used when linking. Setting this environment variable combines the action of both -q32 (or -q64) and -b32 (or -b64). 32-bit and 64-bit binaries cannot be mixed. See the Example section below for an example.
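
As a rough illustration of the point that 32-bit and 64-bit objects and libraries cannot be mixed, the following sketch derives the explicit flags and the matching LAPACK link option from a single mode value. The helper names mode_flags and lapack_lib are hypothetical; the library names are the ones used in the serial examples on this page, and other libraries may follow a different naming scheme.

```shell
#!/bin/sh
# Sketch only: keep compiler flags and libraries consistent with one mode.
OBJECT_MODE=${OBJECT_MODE:-32}     # 32-bit is the compiler default

# Print the explicit options that a given OBJECT_MODE value stands for.
mode_flags() {
    case "$1" in
        32|64) echo "-q$1 -b$1" ;;
        *)     echo "unsupported mode: $1" >&2; return 1 ;;
    esac
}

# Print the LAPACK link option for the given mode
# (names taken from this page's serial examples).
lapack_lib() {
    case "$1" in
        32) echo "-llapack" ;;
        64) echo "-llapack64" ;;
        *)  return 1 ;;
    esac
}

echo "OBJECT_MODE=$OBJECT_MODE corresponds to: $(mode_flags "$OBJECT_MODE")"
echo "link with: -lessl -L/usr/local/lib $(lapack_lib "$OBJECT_MODE")"
```

Deriving everything from one value makes it harder to end up with a 64-bit object linked against a 32-bit library by accident.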

Linking options

Option Description
-lname Searches the library called libname.a or libname.so for external routines that are referenced in the program. A library is searched when its name is encountered, so the placement of a -l operand is significant.
-Lpath Changes the library search algorithm for the loader. For path, specify a directory that should be searched before the default system library directories. You can specify multiple -L options on the command line. The library search algorithm searches these directories in left-to-right order.
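
The left-to-right search order can be pictured with a small shell model. This is a simplified sketch, not the loader's actual algorithm: the function find_lib is hypothetical, it only looks for libname.a and libname.so, and it ignores the default system directories.

```shell
#!/bin/sh
# find_lib NAME DIR...
# Print the first lib<NAME>.a or lib<NAME>.so found, scanning DIRs in order.
find_lib() {
    name=$1
    shift
    for dir in "$@"; do
        for ext in a so; do
            if [ -f "$dir/lib$name.$ext" ]; then
                echo "$dir/lib$name.$ext"
                return 0
            fi
        done
    done
    return 1
}

# Demo: two scratch directories stand in for two -L paths.
base="${TMPDIR:-/tmp}/libsearch.$$"
mkdir -p "$base/first" "$base/second"
: > "$base/first/liblapack.a"
: > "$base/second/liblapack.a"
find_lib lapack "$base/first" "$base/second"   # prints the copy in "first"
rm -rf "$base"
```

When two directories hold a library of the same name, the one listed first on the command line wins, which is why the order of -L options matters.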

Code optimization options

Option Description
-qarch=pwr4 Generates object binaries appropriate for the POWER4 architecture
-qtune=pwr4 Specifies that the architecture for which the program is to be optimized be that of the IBM POWER4 processor
-qhot Performs high-order transformations to maximize the efficiency of loops and array language. Some of the transformations may slightly change program semantics, but this can be avoided by also using the -qstrict option.
-O[n] Specifies the level of optimization. n can be 2, 3, 4, or 5. Higher levels include and combine a progressively larger number of different types of optimizations. Levels 3, 4, and 5 include the -qhot option mentioned above. Levels 3, 4, and 5 are aggressive and may, on occasion (e.g., for numerically unstable codes), alter program semantics. The potential performance gain, however, can be very substantial.

Debugging options

Option Description
-qreport=hotlist Produces report showing how loops were transformed in the optimization process (i.e., on account of applying one of the -O[n] options). This output is directed to the .lst suffixed file.
-qsource Produces a source listing that it directs to a .lst suffixed file
-g Generates symbol and source level line information in the targeted object files. Does not cause appreciable performance degradation nor does it affect compiler optimizations.

Example

The following builds prog32.exe in 32-bit mode with level 3 optimizations for the POWER4 architecture, links it to IBM's engineering and scientific library (essl) as well as to the LAPACK linear algebra package (the 32-bit version), and then runs it serially.

agave % xlf90 -o prog32.exe -O3 files -qstrict -qarch=pwr4 -qtune=pwr4 \
       -lessl -L/usr/local/lib -llapack
agave % prog32.exe

The following generates the 64-bit version of the above example.

agave % export OBJECT_MODE=64   (Bourne or Korn Shell)
agave % xlf90 -o prog64.exe -O3 -qstrict files -qarch=pwr4 -qtune=pwr4 \
     -lessl -L/usr/local/lib -llapack64
agave % prog64.exe

Note that both libraries to be linked are serial versions.
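
For larger codes, the -c option described above lets you compile each source file to an object file and link in a separate step. The sketch below shows one way this might look for the 32-bit example. The source file names main.f and solver.f are hypothetical, and the script defaults to a dry run (commands are printed, not executed) so that it can be tried on a machine without the IBM compilers.

```shell
#!/bin/sh
# Sketch only: main.f and solver.f are hypothetical file names.
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}

run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

FC=xlf90
FFLAGS="-O3 -qstrict -qarch=pwr4 -qtune=pwr4"

build() {
    # Compile each source to an object file (.o) without linking.
    for src in main.f solver.f; do
        run $FC -c $FFLAGS "$src" || return 1
    done
    # Link the objects; library options go last, as in the example above.
    run $FC -o prog32.exe main.o solver.o -lessl -L/usr/local/lib -llapack
}

build
```

Set DRY_RUN=0 on agave to actually invoke the compiler.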

OpenMP and Multi-threaded (non-MPI) Programs

To compile and run standard OpenMP codes or codes using pthreads on agave use a non-MPI compiler that has the _r name ending.

Fortran:
xlf90_r  -qsmp=omp -q64 -o exec_file [options] files [-L path] -l libs
C:
xlc_r  -qsmp=omp -q64 -o exec_file [options] files [-L path] -l libs
C++:
xlC_r   -qsmp=omp -q64 -o exec_file [options] files [-L path] -l libs

where files are the input source or object (.o suffix) files, and options are any compatible combination of compiler and linker options available. Linker options are specified to the right of compiler options.

Compiler options

Listed below are brief explanations for the above arguments and options, as well as some specific options of particular importance.

Option Description
-qsmp=omp Enables shared memory parallelization where it is specified through SMP (Symmetric Multiprocessing) directives, whether $OMP (OpenMP) or other IBM-specific ones.
-qsmp Enables automatic multi-threaded code parallelization in the manner of the shared memory model. This option is the same as -qsmp=auto. Its range of action includes the -qhot option. Because the latter has the potential to change program semantics, -qsmp should be used with caution.
-qreport=smplist Produces a report showing how the program is parallelized. Potentially very useful.

Other important options that can be used include those detailed in the section for Serial Programs above (such as -o, -q32, -q64, -O[n], -L, -l, and other optimization and debugging options). These options carry the same meanings for the compilation of both serial and parallel programs.

Environment variables

After the compilation itself, you must set appropriate OpenMP and other environment variables, if needed by your program, such as OMP_NUM_THREADS, OMP_SCHEDULE, OMP_DYNAMIC, and AIXTHREAD_SCOPE. A fuller list is shown in the table below.

Environment Variable Description
OMP_NUM_THREADS Sets the number of threads to use during execution, unless that number is explicitly changed by calling the OpenMP subroutine, OMP_SET_NUM_THREADS.
OMP_SCHEDULE Sets the schedule type and (optionally) the chunk size for DO and standalone PARALLEL DO loops declared with a schedule of RUNTIME. For these loops, the schedule is set at run time when the system reads the value of this environment variable. Valid values for this variable are STATIC, DYNAMIC, and GUIDED. The default value for this environment variable is STATIC.
OMP_DYNAMIC=FALSE or TRUE (default=TRUE) Disables (=FALSE) or enables (=TRUE) dynamic adjustment of the number of threads available for the execution of parallel regions. Enabling dynamic thread adjustment may add a small amount of overhead.
AIXTHREAD_SCOPE=S or P (default=P) Sets the thread contention scope to be system (S) or process (P). When system contention scope is used, each user thread is directly mapped to one kernel thread. This is the appropriate setting for most scientific applications in which one wants the user threads to map one-to-one to processors. Process contention scope (P) is best when there are many more threads than processors. When process contention scope is used, user threads share a kernel thread with other (process contention scope) user threads in the process.
MALLOCMULTIHEAP=true or MALLOCMULTIHEAP=heaps:##[,considersize] (default = not set) Malloc multiheap feature: creates a fixed number of heaps. Each memory allocation request is serviced using one of the available heaps. When set to true, it enables the configuration of 32 memory heaps. To specify a lower number of heaps, use the heaps:## form instead. Thus, for each thread to allocate (in a round-robin way) and manage its own heap, use MALLOCMULTIHEAP=heaps:n, where n can be the number of threads/cpus desired. Round-robin allocation means that all heaps are used, needed or not. To bypass round-robin usage and instead allocate space from the first available heap, also use the "considersize" suboption. The considersize option may be slower, but it helps reduce the working set size and the number of sbrk() calls. When the multiheap feature is not activated, only one thread at a time (i.e., serially) can issue malloc(), free(), or realloc() calls, all affecting a single heap.
XLSMPOPTS=stack=n (SMP stack size; default = 4 MB/thread) Sets a thread's run-time stack size, where n is in bytes. Remember that for 32-bit applications the total stack size for all threads in a process cannot exceed 256 MB. This limitation does not apply to 64-bit applications.
SPINLOOPTIME=n (default = 40) Sets the number of times a user thread will spin-idle when it cannot acquire a lock (e.g., before it begins a parallel loop). When the spin count has been exhausted, the thread will go to sleep waiting for a lock to become available, unless the YIELDLOOPTIME environment variable is set to a number greater than zero. You want to spin rather than sleep if you are waiting for a previous parallel loop to complete, provided there is not too much sequential work between the parallelized loops. If YIELDLOOPTIME is set, upon exhausting the spin count, the thread issues the yield() system call and gives up the processor, but stays in a runnable state rather than going to sleep. On a quiet system, yielding is preferable to sleeping, since reactivating the thread after sleeping is more costly. On a busy system, SPINLOOPTIME should not be set too large; otherwise, valuable processor time that could be shared with other jobs is consumed spinning. Some experimentation may be required to reach an optimal condition.
YIELDLOOPTIME=n (default = 0) Used only when SPINLOOPTIME is also set. It sets the number of times that the system yields a processor when trying to acquire a busy spin lock before going to sleep. The processor is yielded to another kernel thread, if one is available.

Now you are ready to invoke the program by entering the program name on the command line.

Examples

Assume that the SCHEDULE clause on all loop parallelization directives has been set to RUNTIME and that the name of the executable is prog.exe. Then the following will execute prog.exe on four processors and use "dynamic" scheduling for the distribution of loop iterations across the four processors. Also assume that our program makes calls to ESSL, IBM's engineering and scientific library, as well as to LAPACK (the 64-bit version now), the scalar linear algebra library.

agave % xlf90_r -qsmp=omp -q64 -o prog.exe -O2 files -qtune=pwr4 \
        -lesslsmp -L/usr/local/lib -llapack64

agave % export OMP_SCHEDULE=DYNAMIC   (Bourne or Korn Shell)
agave % export OMP_NUM_THREADS=4
agave % export OMP_DYNAMIC=FALSE
agave % export AIXTHREAD_SCOPE=S
agave % export MALLOCMULTIHEAP=heaps:4
agave % export XLSMPOPTS=stack=100000000
agave % export OBJECT_MODE=64
agave % export SPINLOOPTIME=10000
agave % export YIELDLOOPTIME=4000

agave % prog.exe

Note that we link with the SMP version of the ESSL library and with the regular LAPACK library. The latter is available only in the scalar (non-parallel) version.
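
The example above runs in 64-bit mode, so the 256 MB total-stack limit for 32-bit applications does not apply to it. As a quick sanity check for 32-bit builds, the arithmetic can be sketched as follows. The function name fits_32bit_stack is hypothetical, and the sketch assumes that 256 MB means 256 * 1024 * 1024 bytes.

```shell
#!/bin/sh
# Sketch: check the 256 MB total-stack limit for 32-bit SMP programs.
LIMIT_32BIT=$((256 * 1024 * 1024))   # assumption: 1 MB = 1024*1024 bytes

# fits_32bit_stack THREADS STACK_BYTES
# Succeeds (exit 0) if THREADS * STACK_BYTES is within the limit.
fits_32bit_stack() {
    total=$(( $1 * $2 ))
    [ "$total" -le "$LIMIT_32BIT" ]
}

threads=4
stack=100000000      # per-thread stack size used in the example above

if fits_32bit_stack "$threads" "$stack"; then
    echo "ok for 32-bit: $((threads * stack)) bytes in total"
else
    echo "too large for 32-bit: $((threads * stack)) bytes in total"
fi
```

With four threads and the stack size from the example, the total is 400000000 bytes, which exceeds the limit; the same settings in a 32-bit build would need a smaller stack or fewer threads.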

MPI Programs

To compile and run standard MPI codes on agave you need to do three things:

  1. Your .rhosts file in your home directory and the host.list file in the directory where you plan to execute your MPI program must exist and include the entry, agave.tamu.edu.

                   .rhosts file                    host.list file
                   ______________                  ______________
      in $HOME    |agave.tamu.edu|   in execution |agave.tamu.edu|
      directory   |      :       |   directory    |agave.tamu.edu|
                  |      :       |                |agave.tamu.edu|
                  |              |                |       :      |
                  |______________|                |       :      |
                                                  |______________|
    

    The second file, host.list, must have multiple such line entries, at least as many as the maximum number of processors you plan to use.

  2. Compile and link your MPI code with the following commands:

    Fortran:
    mpxlf90_r -o exec_file [options] files -l libs
    C:
    mpcc_r -o exec_file [options] files -l libs
    C++:
    mpCC_r -cpp -o exec_file [options] files -l libs

    where files are the input source or object (.o suffix) files, and options are any compatible combination of compiler and linker options available. The compilation options listed in the section for serial programs can also be used for MPI programs.

    Note, the -cpp flag is required when using the MPI C++ bindings.

  3. Execute your MPI program under the Parallel Operating Environment (POE) by invoking the poe command.

    poe exec_file [exec_file options] [poe options] 

Some of the poe options of concern are listed below:

Option Description
-procs nn Specifies that the number of processors to run your program in parallel be set to nn.
-shared_memory yes Specifies that MPI will use shared memory protocol (NOT IP) for message passing between two or more tasks within the same IBM Regatta. Make sure that you set this option to "yes", because the default ("no") results in much lower performance.
-infolevel n Specifies the level of message reporting. The default is 1 (warning and error). Higher levels (2,3,...,6) provide progressively more diagnostic information.
-wait_mode poll Directs that an MPI thread engage in polling, when blocked waiting for a message to arrive, in order to detect such arrivals. (Other nonoptimal values are yield, sleep, and nopoll.)
-resd no Specifies that the Partition Manager should NOT connect to LoadLeveler when running an MPI program. (IBM's LoadLeveler has been retired from use as the system's batch facility.)

Most of poe's command-line options, including the above, can also be set with environment variables:

agave % export MP_PROCS=n   (Bourne or Korn Shell)
agave % export MP_SHARED_MEMORY=yes
agave % export MP_INFOLEVEL=n (0,1,..,6)
agave % export MP_WAIT_MODE=poll
agave % export MP_RESD=no

agave % setenv MP_PROCS n   (C-Shell)
agave % setenv MP_SHARED_MEMORY yes
agave % setenv MP_INFOLEVEL n
agave % setenv MP_WAIT_MODE poll
agave % setenv MP_RESD no
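
The five options listed above share a naming pattern: drop the leading dash, convert to upper case, and prefix MP_. The sketch below encodes that pattern in a hypothetical helper, poe_opt_to_env. The rule is inferred only from the options shown on this page and may not hold for every poe option.

```shell
#!/bin/sh
# poe_opt_to_env OPTION
# Map a poe command-line option name to its environment variable name,
# following the pattern seen in the options listed on this page.
poe_opt_to_env() {
    name=${1#-}      # strip the leading dash
    printf 'MP_%s\n' "$(printf '%s' "$name" | tr '[:lower:]' '[:upper:]')"
}

for opt in -procs -shared_memory -infolevel -wait_mode -resd; do
    echo "$opt  ->  $(poe_opt_to_env "$opt")"
done
```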

Examples

Make the .rhosts and host.list files available as specified in step 1.

agave % mpxlf90_r -qhot -o prog.exe prog.f
agave % poe prog.exe -procs 2 -shared_memory yes

agave % mpxlf90_r -O2 -q64 -o prog.exe prog.f -l pessl_r
agave % poe prog.exe -procs 4 -shared_memory yes

In the second example the program is linked to IBM's parallel and thread-safe engineering and scientific library.
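
Step 1 asks for a host.list file with at least as many agave.tamu.edu entries as the number of processors you plan to use. A minimal sketch for generating such a file is shown below. The function name make_host_list is hypothetical, and the demo writes to a scratch file rather than to your execution directory.

```shell
#!/bin/sh
# make_host_list N [HOST] [FILE]
# Write N lines containing HOST to FILE (default: host.list in the
# current directory). An existing FILE is overwritten.
make_host_list() {
    n=$1
    host=${2:-agave.tamu.edu}
    file=${3:-host.list}
    : > "$file"                  # create or truncate the file
    i=0
    while [ "$i" -lt "$n" ]; do
        echo "$host" >> "$file"
        i=$((i + 1))
    done
}

# Demo: build a four-entry list in a scratch file, show it, clean up.
demo="${TMPDIR:-/tmp}/host.list.$$"
make_host_list 4 agave.tamu.edu "$demo"
cat "$demo"
rm -f "$demo"
```

In the directory where you run poe, calling make_host_list 4 with no further arguments would create host.list with four entries, enough for the -procs 4 example above.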

Additional Information

More in-depth information about the compilers and MPI can be found here.