Timing and Performance
In order to use allocated CPU time efficiently, it is in your best interest (and ours) that some effort be expended in ensuring that your code is well tuned and free of computational bottlenecks to the extent possible. When portions of code are poorly implemented, performance will suffer. You can determine where your computational bottlenecks lie by timing and profiling your code. You can also time your programs to determine the speedup of one version of your code relative to another version.
There are many ways to collect timing information for a code. Some methods measure the overall time of a program. Other methods perform more fine grained measurements and record the amounts and proportions of time spent in various parts of a program (in various functions, subroutines or other user instrumented sections of code). Some of these methods are intrusive and require that changes be made to the code to collect the timing information, while others do not require modifications to code. Some methods are also more portable across platforms than others. This means that if you want to collect timing information for a certain code on different platforms (SGI vs. IBM for instance) you may want to use a method or technique that will apply in both environments without requiring you to re-instrument your code.
When timing code for the purpose of determinig speedup, a simple formula can
be used. Suppose T and T_opt refer to time
measurements of your code clocked correspondingly for one version of the code
versus another (say, an optimized or a parallelized version). Then we
define
Speedup S = T / T_opt
as an indicator (NOT the only one!) of code performance. Under different circumstances
T and T_opt represent different types of time. The various
types of time are defined as follows:
| Type | Description |
|---|---|
| user | the amount of CPU time used by the user's program |
| sys | the amount of CPU time used by the operating system in support of the user's program |
| cpu | the total CPU time (user + sys) |
| wall | the wall clock time (also known as "real time") |
Typically the cpu time and the wall clock time for a program on any machine will differ, unless the system is idle or there is insignificant system usage (for instance, little or no disk usage from I/O operations or no swapping/paging). A program that spends most of its time using CPU cycles is said to be "CPU bound". A program that relies heavily on performing reads and writes to disk, on the other hand, is said to be "I/O bound".
In the case of serial code that is also predominantly CPU bound, it is appropriate for T and T_opt to be elapsed cpu times. However, when the execution profile of a code is not predominantly CPU bound, real time (wall-clock) for T and T_opt is in many cases the best way to go. When measuring parallel performance, T and T_opt should represent real time measurements. In this case, T is the best 1-cpu measurement, while T_opt is the real time performance of the parallelized version. Under all cases it is best that time measurements be repeated several times, especially on a busy system. Among a series of mesurements for the same code use the one with the smallest time value.
Timing Methods
Below we list some utilities and routines for carrying out time measurments, and after that illustrate their use. Further below, we show one way of doing basic code profiling.
| Routine | Source | Result Type | Approximate Resolution (microseconds) |
Approximate Overhead (microseconds) |
|---|---|---|---|---|
| gettimeofday | OS | wall | 1 | 0.3 |
| rtc | OS (AIX) | cpu | 0.1 | 0.2 |
| SYSTEM_CLOCK2 | XL Fortran | wall | 1 | 1 |
| dtime_, etime_ | XL Fortran | any | 10000 | 1 |
| CPU_TIME | XL Fortran | any | 10000 | 1 |
| clock_gettime | OS | wall | 1 | 0.3 |
| clock | OS | cpu | 10000 | 0.8 |
| times | OS | user, sys | 10000 | 0.2 |
| getrusage | OS | user, sys | 10000 | 0.9 |
- Determined on p690 running AIX5.1 with 64 bit kernel and 32 and 64 bit binaries. This information was obtained from NCSA's website.
- Need to compile with the
-qsclk=microcompiler switch.
To summarize the data in the table: for portability use the
gettimeofday function or the SYSTEM_CLOCK function.
For higher resolution timing use the rtc timer function and
routine.
/usr/bin/time
The quickest way to get timing of a code is run the code within the /usr/bin/time shell command. The command will return user time , system time and the total wall time . See the man pages on time to see more information on the command, especially on formatting the output. Note that the csh and tcsh shells have a built-in command also called time.
hydra% /usr/bin/time a.out
/usr/bin/timex
The AIX utility /usr/bin/timex , similar to /usr/bin/time , reports, in seconds, the elapsed time, user time, and system execution time for a command. With specified flags, the timex command lists or summarizes process accounting data for a command and all of its children. Command is the name of any executable file on the system. It also reports total system activity during the execution interval when called with the switch -s.
hydra% /usr/bin/timex a.out
gettimeofday
If portability is a primary concern, use the routine gettimeofday. This routine provides wall clock time . As usual see the man pages for particulars on usage. It offers both microsecond resolution and overhead. It can be used as an elapsed time as is shown in the following C code fragment.
#include <stddef.h> /* definition of NULL */
#include <sys/time.h> /* definition of timeval struct and protyping of gettimeofday */
double t1,t2,elapsed;
struct timeval tp;
int rtn;
....
....
rtn=gettimeofday(&tp, NULL);
t1=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
....
/* do some work */
....
rtn=gettimeofday(&tp, NULL);
t2=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
elapsed=t2-t1;
You can also make a C function that can be called from Fortran:
#include <stddef.h> /* defines NULL */
#include <sys/time.h>
double second_() /* note that AIX xlf does not require the trailing '_' */
{
struct timeval tp;
int rtn;
rtn=gettimeofday(&tp, NULL);
return ((double)tp.tv_sec+(1.e-6)*tp.tv_usec);
}
SYSTEM_CLOCK
IBM XL Fortran provides the standard timer routine, SYSTEM_CLOCK. Despite what the name suggests, it provides wall clock time . When compiled with the -qsclk=micro switch, it offers milli-second resolution and micro-second overhead. It can be used as an elapsed time as is shown in the following Fortran code fragment.
double precision elapsed
integer C1,C2,R,M
....
....
CALL SYSTEM_CLOCK (COUNT=C1,COUNT_RATE=R,COUNT_MAX=M)
....
! do some work
....
CALL SYSTEM_CLOCK (COUNT=C2,COUNT_RATE=R,COUNT_MAX=M)
elapsed= dble(C2-C1)/dble(R)
You can also make a Fortran function that can be called from C. When compiling with the C compiler, you will need to provide the following libraries to the linker: -lxlf -lxlf90.
double precision function second()
integer C,R,M
CALL SYSTEM_CLOCK (COUNT=C,COUNT_RATE=R,COUNT_MAX=M)
second = dble(C)/dble(R)
return
end
CPU_TIME
Another standard timer is the cpu_time routine. Despite the name suggests, it can provide any measure of time (which is non standard). Unfortunately it offers only 10 milli-second resolution. It can be used as an elapsed time as is shown in the following Fortran code fragment. It can also provide thread specific time.
double precision dummy,t1,t2,elapsed
call cpu_time(dummy) ! initialize
....
....
call cpu_time(t1)
....
! do some work
....
call cpu_time(t2)
elapsed=t2-t1
You can also make a Fortran function that can be called from C. When compiling with the C compiler, you will need to provide the following libraries to the linker: -lxlf -lxlf90.
double precision function second()
double precision cputime
call cpu_time(cputime)
second = cputime
return
end
To select what time is returned, set the XLFRTEOPTS option to usertime or systime or alltime for thread specific time or total_usertime or total_systime or total_alltime for a sum of parent and child processes. total_alltime is the default.
rtc (real time clock)
The highest resolution but least portable timer is the AIX real time clock. This routine provides wall clock time .
Fortran Example:
REAL(8) t1,t2,elapsed
REAL(8) rtc
....
....
t1=rtc()
....
! do some work
....
t2=rtc()
elapsed=t2-t1
C Example:
#include <sys/time.h>
double c_rtc()
{
timebasestruct_t start;
read_real_time(&start, TIMEBASE_SZ);
time_base_to_time(&start, TIMEBASE_SZ);
return(start.tb_high+0.000000001*start.tb_low);
/* For nanosecond periods, it would be better to work */
/* with the high and low bit separately... */
/* subtract the starting time from the ending time */
/* secs = finish.tb_high - start.tb_high; */
/* n_secs = finish.tb_low - start.tb_low; */
}
main(){
double t1,t2,elapsed;
..
..
t1 = c_rtc();
/* do some work */
t2 = c_rtc();
elapsed=t2-t1;
..
..
}
MPI_Wtime
This subroutine returns the current value of time as a double precision floating point number of seconds. This value represents elapsed time since some point in the past. This time in the past will not change during the life of the task. You are responsible for converting the number of seconds into other units if you prefer. Refer to the MPI_Wtime man page for more details.
C
#include "mpi.h"
int main(int argc, char **argv)
{
double start, finish;
MPI_Init(&argc, &argv); /* must initialize MPI before calling MPI_Wtime */
start = MPI_Wtime();
/* do some work */
finish = MPI_Wtime();
printf("%g seconds elapsed.\n", finish - start);
MPI_Finalize(); /* must finalize after use */
}
Profiling
gprof
A quick way to get more detailed information on functions and routines is to use the profile tool gprof. The first step is to compile the source code with the compiler flag for profiling: -pg. After compiling the code, the second step is to execute the code which will then generate a gmon.out file. To analyze the gmon.out file, use gprof . The results of the analyses will be dumped to stdout.
hydra% xlf -O -pg foo.f # or xlc -O -pg foo.c hydra% ./a.out hydra% gprof a.out gmon.out
The 'flat' profile will contain a useful breakdown of time spent in functions and subroutines. The 'call graph' profile contains inclusive and exclusive time spent in subroutines and functions. See the man page for information about the xlf and xlc compiler flags for profiling and see the man page on gprof for its options.
One option is to set the environment variable GMON_OUT_PREFIX to something. When profiling a threaded or MPI code, each process will generate a gmon file called $GMON_OUT_PREFIX.pid. Each gmon file can then be analyzed seperately or the aggregate sum can be produced by gmon and examined as a whole:
hydra% gprof -s $GMON_OUT_PREFIX.* hydra% gprof foo gmon.sum
tprof
tprof is a tick based profiling tool. It provides line level profile information.
HPM Toolkit
The Hardware Performance Monitor (HPM) Toolkit measures the performance of codes running on IBM POWER3, POWER4, and POWER5 systems. It has three parts:
- A utility that measures hardware performance
- An instrumentation library
- A graphical user interface
The hpmcount utility starts your job, and at completion it reports wall clock time, hardware performance counters information, derived hardware metrics, and resource usage statistics for your code as a whole.
The libhpm instrumentation library provides subroutines that allow you to instrument sections of your code to generate statistics and present summaries of that information for each instrumented section of code. This library supports serial and parallel (threaded, OpenMP, MPI, and hybrid) applications written in Fortran, C, and C++.
The hpmviz graphical user interface graphically displays the performance file generated by libhpm.
This collection of programs allows you to instrument your code with library calls that collect hardware data during the execution of your code. These data are statistically summarized and written to your current working directory at the end of your job run. You can set a number of options to control what hardware features are instrumented and sampled.