Compiling & Running Programs
Last update on Monday, 17-Sep-2007 17:32:33 CDT.
Using the IBM Compilers
Hydra provides IBM compilers for Fortran, C, and C++. Each compiler is invoked
under a different name depending on the programming paradigm (serial, MPI, OpenMP, etc.).
The following table provides a brief overview of commonly invoked
compiler command (driver) names.
| Language | Source suffix | Serial | MPI | Threaded (OpenMP) | MPI + threaded |
| Fortran 77 | .f | xlf | mpxlf | xlf_r | mpxlf_r |
| Fortran 90 | .f, .f90 | xlf90 | mpxlf90 | xlf90_r | mpxlf90_r |
| Fortran 95 | .f, .f90, .f95 | xlf95 | mpxlf95 | xlf95_r | mpxlf95_r |
| C | .c | xlc or cc | mpcc | xlc_r or cc_r | mpcc_r |
| C++ | .C or .c | xlC | mpCC | xlC_r | mpCC_r |
Serial Programs
Below we illustrate the use of a minimum number of options and environment settings consistent with anticipated good and reliable performance. Throughout, we use the korn shell as the system's command interpreter.
Home directories are served over Ethernet from "remote" systems (four Power5 520 workstations), so compiling, linking, or executing sizable files there is not recommended. You should get better response time working in /scratch/username or, when running in batch mode, in /work/jobid.
To compile and link serial (sometimes called scalar) source codes you need to use one (or more) of the following commands:
| Fortran: |
xlf90 [-o exec_file] [options] files [-L path] -l libs |
| C: |
xlc [-o exec_file] [options] files [-L path] -l libs |
| C++: |
xlC [-o exec_file] [options] files [-L path] -l libs |
where files are the input source or object (.o suffix) files, and options
are any compatible combination of compiler and linker options available. Linker
options are specified to the right of compiler options. Once the executable,
exec_file, has been generated, just run it as any other command.
Compiler Options
Listed below are brief explanations for some specific options of particular
importance.
-o exec_file |
Specifies that the name of the executable file
be exec_file. The default name is a.out. |
| -c |
Disables the load step (no executable
generated) and writes the binary object file
suffixed by .o. |
| -q32 or -q64 |
Generates 32-bit or 64-bit executables, respectively.
The 32-bit format is the default. This option is
unrelated to program-defined data sizes. We
recommend the use of 64-bit as a standard practice,
since it automatically allows for a much larger
memory allocation and addressing range.
See also comments below about the OBJECT_MODE environment
variable. |
| -qsuffix=f=f90 |
Fortran90 only. Specifies that the default
suffix for source files is .f90, instead of
IBM's default, .f |
| -qfixed |
Specifies that the Fortran input source code
is in fixed (default for f77) format. |
| -qfree |
Specifies that the Fortran input source code
is in free format. |
Instead of using the -q32 or -q64 options, you can also set the
OBJECT_MODE environment variable:
| OBJECT_MODE |
Can be set to 32 or 64 to generate 32-bit or 64-bit binaries, respectively. It also specifies
whether 32-bit or 64-bit libraries are used when linking. Setting this
environment variable combines the action of the compiler option -q32 (or -q64) and the
linker option -b32 (or -b64). 32-bit and 64-bit objects cannot be mixed in one executable.
See the Example section below. |
Prefer 64-bit executables. Build 64-bit executables whenever you have the choice. Only they can use all the high-performance features of the Power5+ architecture, and they allow a very large memory address range. A 32-bit executable is limited to 4 gigabytes of address space, a limit that large problems frequently exceed. For these reasons we consistently use the -q64 compiler option below. Note that the choice between 32-bit and 64-bit mode has nothing to do with numeric data precision.
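As a rough sketch of the OBJECT_MODE approach described above, the following Korn shell fragment selects 64-bit mode once for the session, so that later compile and link commands need no explicit -q64. The source file name prog.c is a placeholder that does not come from this page, and the compile step is skipped where xlc or that file is absent.

```shell
# Select 64-bit mode for every compiler and linker call in this session.
export OBJECT_MODE=64
echo "OBJECT_MODE is $OBJECT_MODE"

# prog.c is a hypothetical source file used only for illustration.
# The compile step runs only where the IBM compiler and the file exist.
if command -v xlc >/dev/null 2>&1 && [ -f prog.c ]; then
    # No -q64 here: the mode is taken from OBJECT_MODE.
    xlc -qarch=auto -O3 -o prog.exe prog.c
fi
```

An explicit -q32 or -q64 on the command line still takes precedence over the environment variable, so the variable is best treated as a session default.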
Linking options
-lname |
Searches the library called libname.a or libname.so
for external routines that are referenced in the
program. A library is searched when its name is
encountered, so the placement of a -l operand is
significant. |
-Lpath |
Changes the library search algorithm for the loader.
For directory, path, specify the path to a directory
that should be searched before using those of the
default system libraries. You can specify multiple -L
options on the command line. The library search
algorithm searches these directories in left to right
order. |
Code optimization options
| -qarch=auto |
Produces object code containing instructions that will run on the hardware platform on which it is compiled |
| -qtune=auto |
Produces object code optimized for the platform on which it is compiled |
| -qhot |
Performs high-order transformations to
maximize the efficiency of loops and array
language. Some of the transformations may
slightly change program semantics, but this
can be avoided by also using the -qstrict
option. |
-O[n] |
Specifies level of optimization. n can be 2,
3, 4, or 5. Higher levels include and combine
a progressively larger number of different
types of optimizations. Levels 3, 4, and 5
include the -qhot mentioned above. Levels
3, 4, and 5 are aggressive. On occasion
(e.g., for numerically unstable codes), they
may alter program semantics. The potential
performance gain, however, can be very substantial. |
Debugging options
| -qreport=hotlist |
Produces report showing how loops were
transformed in the optimization process (i.e.,
on account of applying one of the -O[n]
options). This output is directed to the .lst
suffixed file. |
| -qsource |
Produces a source listing that it directs to a
.lst suffixed file |
| -g |
Generates symbol and source level line
information in the targeted object files.
Does not cause appreciable performance
degradation nor does it affect compiler
optimizations. |
Object Files & Shared Objects
At times, you may need to compile a source file without a main function
into an object file. This is useful for separating different segments of
code. To compile object files, invoke the compiler with the -c option:
xlc -c myfunc1.c myfunc2.c
|
This produces one object file per source file (myfunc1.o and myfunc2.o).
When -c is given several source files, -o cannot combine them into a single
object file. Note that multiple .c source files or .o object files can also
be compiled and linked in a single command.
You may also want to write your own shared library for use with other
programs. In this case, you generate a shared object
file (.a or .so) so that programs that you write
later can link to this library and call its functions. To build a shared library libmylib.a, use the following command:
xlc -qpic -qmkshrobj mylib.c -o libmylib.a
|
Your output file, libmylib.a, could also have been named
libmylib.so.
Example
The following commands compile the C sources with level 3 optimization for the hydra architecture,
producing the 64-bit executable prog.exe, and then run it serially:
export OBJECT_MODE=64 # use only 64-bit system libraries when linking
xlc -q64 -qarch=auto -O3 -o prog.exe src1 src2 fun3.o ... [ [-L path] -l libs ]
./prog.exe [ input_parameters ] # Run/Execute
|
OpenMP and Multi-threaded (non-MPI) Programs
To compile and run standard OpenMP codes or codes using pthreads
use a non-MPI compiler that has the _r name ending.
| Fortran: |
xlf90_r -qsmp=omp -q64 -o exec_file [options] files [-L path] -l libs |
| C: |
xlc_r -qsmp=omp -q64 -o exec_file [options] files [-L path] -l libs |
| C++: |
xlC_r -qsmp=omp -q64 -o exec_file [options] files [-L path] -l libs |
where files are the input source or object (.o suffix) files, and
options are any compatible combination of compiler and linker options
available. Linker options are specified to the right of compiler
options.
Compiler options
Listed below are brief explanations for the above arguments
and options, as well as some specific options of particular
importance.
| -qsmp=omp |
Enables shared memory parallelization where it
is specified through SMP (Symmetric Multi-
processing) directives, whether $OMP (OpenMP)
or other IBM-specific ones |
| -qsmp |
Enables automatic multi-threaded code parallelization in the
manner of the shared memory model. This option is the same as
-qsmp=auto. Its range of action includes the
-qhot option. Because the latter has the
potential to change program semantics,
-qsmp should be used with caution. |
| -qreport=smplist |
Produces a report showing how the program is
parallelized. Potentially very useful. |
Other important options that can be used include those detailed
in the section for Serial Programs above (such as -o, -q32, -q64,
-O[n], -L, -l, and other optimization and debugging options).
These options carry the same meanings for the compilation of both
serial and parallel programs.
Environment variables
After the compilation itself, you must set appropriate OpenMP and
other environment variables, if needed by your program, such as OMP_NUM_THREADS,
OMP_SCHEDULE, OMP_DYNAMIC, and AIXTHREAD_SCOPE. A more detailed list
is shown in the table below.
| OMP_NUM_THREADS |
Sets the number of threads to use during
execution, unless that number is explicitly
changed by calling the OpenMP subroutine,
OMP_SET_NUM_THREADS. |
| OMP_SCHEDULE |
Sets the schedule type and (optionally) the
chunk size for DO and standalone PARALLEL
DO loops declared with a schedule of
RUNTIME. For these loops, the schedule is
set at run time when the system reads the
value of this environment variable. Valid
values for this variable are STATIC,
DYNAMIC, and GUIDED. The default value for
this environment variable is STATIC. |
| OMP_DYNAMIC=FALSE or TRUE (default=TRUE) |
Disables (=FALSE) or enables (=TRUE) dynamic
adjustment of the number of threads available for
the execution of parallel regions. Enabling dynamic
thread adjustment may add a small amount of overhead. |
| AIXTHREAD_SCOPE=S or P (default=P) |
Sets the thread contention scope to be system (S)
or process (P). When system contention scope is used,
each user thread is directly mapped to one kernel thread.
This is the appropriate setting for most scientific
applications in which one wants the user threads to map
one-to-one to processors. Process contention scope (P) is best
when there are many more threads than processors. When
process contention scope is used, user threads share
a kernel thread with other (process contention scope) user
threads in the process. |
| MALLOCMULTIHEAP=true
MALLOCMULTIHEAP=[heaps:##][,considersize] (default = not set) |
Malloc multiheap feature: creates a fixed number of heaps. Each
memory allocation request is serviced using one of the available heaps.
When set to TRUE it enables the configuration of 32 memory heaps. To
specify a lower number of heaps, use instead the heaps:## form. Thus,
for each thread to allocate (in a round-robin way) and manage its own
heap, use MALLOCMULTIHEAP=heaps:n, where n can be the number of
threads/cpus desired. Round-robin allocation means that all heaps are
used, needed or not. To bypass round-robin usage and instead allocate
space from the first available heap, also use the "considersize" suboption.
The considersize option may be slower but it helps reduce the working
set size and the number of sbrk() calls.
When the multiheap feature is not activated, only one thread at a
time (i.e., serially) can issue malloc(), free(), or realloc() calls, all
affecting a single heap. |
| XLSMPOPTS=stack=n (SMP stack size; default = 4 MB/thread) |
Sets a thread's run-time stack size, where n is in bytes. Remember
that for 32-bit applications the total stack size for all threads in a
that for 32-bit applications the total stack size for all threads in a
process cannot exceed 256 MB. This limitation does not apply to 64-bit
applications. |
| SPINLOOPTIME=n (default = 40) |
Sets the number of times a user thread will spin-idle
when it cannot acquire a lock (e.g., before it begins a parallel loop).
When the spin count has been exhausted, the thread will go to sleep
waiting for a lock to become available ... unless the YIELDLOOPTIME
environment variable is set to a number greater than zero. You want
to spin rather than sleep if you are waiting for a previous parallel
loop to complete, provided there is not too much sequential work
between the parallelized loops. If YIELDLOOPTIME is set, upon exhausting
the spin count, the thread issues the yield() system call, gives up
the processor, but stays in a runnable state rather than going to sleep.
On a quiet system, yielding is preferable to sleeping since reactivating
the thread after sleeping is more costly. On a busy system, SPINLOOPTIME
should not be set too large, otherwise valuable processor time that
could be shared with other jobs is consumed spinning. Some experimentation
may be required to reach an optimal condition. |
| YIELDLOOPTIME=n (default = 0) |
Used only when SPINLOOPTIME is also set. It sets the number of times
that the system yields a processor when trying to acquire a busy spin lock
before going to sleep. The processor is yielded to another kernel thread, if
one is available. |
Now you are ready to invoke the program by entering the program name
on the command line.
Examples
For example, to compile and link OpenMP programs:
Fortran Compiler: xlf95_r
xlf95_r -q64 -qarch=auto -qsmp=auto:omp -O3 -o prog.exe src1.f90 src2.f90 sub3.o ... [ [-L path] -l libs ]
|
Note that the .f90 suffix implies "free" (default) source form. To use
fixed source form, specify -qfixed[=<right_margin>], where right_margin can be 72 minimum and 132 maximum. Use this option when compiling Fortran 77 source.
C/C++ Compiler: xlc_r
xlc_r -q64 -qarch=auto -qsmp=omp -O3 -o prog.exe src1 src2 fun3.o ... [ [-L path] -l libs ]
|
To run the OpenMP program compiled above:
export OBJECT_MODE=64 # link only 64-bit needed system libs
export OMP_NUM_THREADS=4 # run on 4 threads/processors
export AIXTHREAD_SCOPE=S # map 1-on-1 user threads to System threads
export MALLOCMULTIHEAP=heaps:4 # generate 4 heaps (=$OMP_NUM_THREADS) 1 heap/thread
export OMP_SCHEDULE=DYNAMIC # schedule loop threads using OpenMP's DYNAMIC scheduling algorithm
./prog.exe [ input_parameters ] # Run/Execute
|
Note that a number of environment variables are important in affecting the performance of your OpenMP codes. Other useful environment variables settings that "may" be needed are:
export OMP_DYNAMIC=FALSE
export XLSMPOPTS=stack=100000000
export SPINLOOPTIME=10000
export YIELDLOOPTIME=4000
|
Note that the maximum useful setting for the OMP_NUM_THREADS environment variable is 16. OpenMP operates, by definition, only within a shared-memory environment, which here is a single p575 node: 16 Power5+ processors and 32 gigabytes of shared memory. For interactive runs, please limit OMP_NUM_THREADS to 8, and then only for short runs.
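The run-time settings above can be collected into a small Korn shell fragment. The sketch below assumes the requested thread count arrives in a variable named NTHREADS, a name chosen here purely for illustration. It caps the count at the 16 processors of one node and keeps the number of malloc heaps equal to the number of threads.

```shell
# NTHREADS is a hypothetical variable holding the requested thread count.
# It defaults to 4, the value used in the run example above.
NTHREADS=${NTHREADS:-4}

# One p575 node has 16 processors, so more threads than that are not useful.
MAX_THREADS=16
if [ "$NTHREADS" -gt "$MAX_THREADS" ]; then
    echo "Capping thread count at $MAX_THREADS" >&2
    NTHREADS=$MAX_THREADS
fi

export OMP_NUM_THREADS=$NTHREADS          # threads used in parallel regions
export AIXTHREAD_SCOPE=S                  # one kernel thread per user thread
export MALLOCMULTIHEAP=heaps:$NTHREADS    # one malloc heap per thread

echo "OMP_NUM_THREADS=$OMP_NUM_THREADS MALLOCMULTIHEAP=$MALLOCMULTIHEAP"
```

For an interactive run, setting NTHREADS to 8 or less before this fragment would respect the limit requested above.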
MPI Programs
For example, to compile MPI programs:
Fortran Compiler: mpxlf95_r
mpxlf95_r -q64 -qarch=auto -O3 -o prog.exe src1.f90 src2.f90 sub3.o ... [ [-L path] -l libs ]
|
The .f90 suffix implies "free" (default) source form. To use fixed source form, specify -qfixed[=<right_margin>]. Use this option when compiling Fortran 77 source.
C/C++ Compiler: mpcc_r
mpcc_r -q64 -qarch=auto -O3 -o prog.exe src1 src2 fun3.o ... [[-L path] -l libs]
|
Execute your MPI program under the Parallel Operating Environment (POE) by invoking the poe command.
Parallel Operating Environment
For interactive use of poe (Parallel Operating Environment), two files must be set up before execution: (1) a .rhosts file in your login directory, and (2) a host.list file in the directory where execution takes place. The host.list file contains one host name entry per line, f1n1 (hydra) and/or f1n10, each with "-s" appended; the total number of entries must be equal to or greater than np, the number of tasks. Here are representative samples of the .rhosts and host.list files.
.rhosts file
_________________________
in $HOME |hydra.tamu.edu username |
directory |hydra2.tamu.edu username |
|f1n1 username |
|f1n10 username |
| : |
|_________________________|
host.list file
______________
in execution |f1n1-s |
directory |f1n1-s |
|f1n1-s |
|f1n1-s |
|f1n10-s |
|f1n10-s |
|f1n10-s |
|f1n10-s |
| : |
| : |
|_____________|
|
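A host.list file like the sample above could be generated rather than typed by hand. The sketch below writes four entries for each of the two interactive nodes into the current directory. The variable names are illustrative; the node names and the -s suffix are taken from the sample.

```shell
# Illustrative settings: entries per node and the output file name.
TASKS_PER_NODE=4
HOSTFILE=host.list

# Start from an empty file, then append one line per task slot.
: > "$HOSTFILE"
for node in f1n1 f1n10; do
    i=0
    while [ "$i" -lt "$TASKS_PER_NODE" ]; do
        echo "${node}-s" >> "$HOSTFILE"
        i=$((i + 1))
    done
done

echo "Wrote $(grep -c . "$HOSTFILE") entries to $HOSTFILE"
```

Run it in the directory from which poe will be invoked, since that is where host.list is expected.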
All of this business with .rhosts and host.list is relevant ONLY when using poe interactively. Here are two examples of interactive invocations that are consistent with the entries in the above files:
poe ./prog.exe -resd no -procs 4 -euilib us -single_thread yes
poe ./prog.exe -resd no -procs 8 -tasks_per_node 4 -nodes 2 -euilib us -single_thread yes
|
Almost all of poe's options can also be set through environment variables. For example, setting MP_PROCS to the needed number of CPUs is equivalent to setting the command option, -procs n. For the whole story see the poe man page. More information on some POE options:
| -procs nn |
Specifies that the number of tasks
to run your program in parallel be set
to nn. Typically, tasks are mapped 1-to-1 on processors. |
| -euidevice sn_single |
Specifies that each MPI task use only a single HPS adapter. LoadLeveler will assign tasks to adapters in a round robin manner. If a job has at least as many MPI tasks per node as there are adapters, all adapters on the node will be used. |
| -euidevice sn_all |
The HPS scheduler allocates both (2) adapters on each node to an MPI task |
| -euilib us |
Specifies that MPI communication use IBM's "user-space" protocol |
| -euilib ip |
Specifies that MPI communication use the UDP/IP transport protocol. Comparatively very slow |
| -shared_memory yes |
Specifies that MPI will use shared
memory protocol (NOT IP) for message
passing between two or more tasks within
the same IBM p575 node. Make sure that
you set this option to "yes", because
the default ("no") results in much lower
performance. |
| -single_thread yes |
Specifies that each MPI task will make MPI calls from within a single execution thread, even if the task may spawn multiple threads during execution. This provides a hint to the transport protocol to make use of more efficient locking mechanisms available in the IBM Power Architecture and avoid the penalty of the heavyweight pthread_mutex locks. However, if you use MPI-IO and/or MPI-1SC (MPI one-sided communication), set it to "no". |
| -infolevel n |
Specifies the level of message
reporting. The default is 1 (warning and
error). Higher levels (2,3,...,6)
provide progressively more diagnostic
information. |
| -wait_mode poll |
Directs that an MPI thread engage in
polling, when blocked waiting for a
message to arrive, in order to detect
such arrivals. (Other nonoptimal values
are yield, sleep, and nopoll) |
To run one of the MPI programs compiled above (the options are listed one per line here for annotation; enter them on a single command line):
poe prog.exe [ prog input_parameters ]
-resd no # Needed only for interactive use
-shared_memory yes # Uses Shared Mem protocol within a node
-procs np # Use np processors
-nodes n # Max n=2 for interactive & 38 for batch
-tasks_per_node nt # nt=1 for serial & OpenMP codes
-euidevice sn_single # Use HPS single plane. Batch use only
-euilib us # US protocol appropriate for HPS use
-single_thread yes # Useful when not using MPI-IO & MPI-1SC
|
Other useful POE options:
-pgmmodel {spmd | mpmd} # Default is spmd
-cmdfile commands_file # file containing names of executables
|
Most of poe's command-line options, including the above, can also be set
with environment variables:
export MP_PROCS=n (Bourne or Korn Shell)
export MP_SHARED_MEMORY=yes
export MP_INFOLEVEL=n (0,1,..,6)
export MP_WAIT_MODE=poll
export MP_RESD=no
setenv MP_PROCS n (C-Shell)
setenv MP_SHARED_MEMORY yes
setenv MP_INFOLEVEL n
setenv MP_WAIT_MODE poll
setenv MP_RESD no
|
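As a sketch of the environment-variable form, the Korn shell fragment below sets the variables listed above for a 4-task interactive run and then invokes poe with no command-line options. The values are illustrative, and the poe call is guarded so that the fragment only reports its settings on a machine without POE.

```shell
export MP_PROCS=4              # same as -procs 4
export MP_SHARED_MEMORY=yes    # same as -shared_memory yes
export MP_INFOLEVEL=1          # same as -infolevel 1 (the default)
export MP_WAIT_MODE=poll       # same as -wait_mode poll
export MP_RESD=no              # same as -resd no (interactive use)

# Invoke poe only where it is installed; otherwise just show the settings.
if command -v poe >/dev/null 2>&1; then
    poe ./prog.exe
else
    echo "poe not found; MP_PROCS=$MP_PROCS MP_RESD=$MP_RESD"
fi
```

Keeping the settings in the environment leaves the poe command line short, which can be convenient when the same options are reused across several runs.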
Example
Make sure the .rhosts and host.list files are available in the appropriate directories.
mpxlf90_r -qhot -o prog.exe prog.f
poe ./prog.exe -procs 2 -shared_memory yes
|
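Before invoking poe interactively, it may help to confirm that host.list holds at least as many entries as the requested number of tasks. The function below is a hypothetical helper and not part of POE. It counts the non-empty lines of the file and compares that count with the task count.

```shell
# check_hostlist NP [FILE]
# Succeeds if FILE (default: ./host.list) has at least NP non-empty lines.
check_hostlist() {
    np=$1
    file=${2:-host.list}

    if [ ! -f "$file" ]; then
        echo "missing $file" >&2
        return 1
    fi

    # grep -c prints the number of non-empty lines (0 if there are none).
    entries=$(grep -c . "$file" || true)
    if [ "$entries" -lt "$np" ]; then
        echo "$file has $entries entries, need at least $np" >&2
        return 1
    fi
    return 0
}
```

A call such as check_hostlist 2 placed before the poe command above would stop the run early if the file is missing or too short.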
Note that due to CPU time or memory limits imposed on interactive
processing on Hydra, you may not be able to interactively test MPI
programs requiring a greater amount of such resources. In such cases
you will need to submit your program for execution as a batch job.
Additional Information
Various environment variables can also be set to tune performance
of MPI programs. Consult the man page (man mpxlf95_r, mpcc_r or poe) for detailed information.
More in-depth information about the compilers and MPI can be found
here.
|