Timing and Performance

To use allocated CPU time efficiently, it is in your best interest (and ours) to spend some effort ensuring that your code is well tuned and as free of computational bottlenecks as possible. Timing and profiling your code will tell you where those bottlenecks lie. You can also time your program to measure the speedup of one version of the code relative to another.

There are many ways to collect timing information for a code. Some methods measure the overall run time of a program; others perform more fine-grained measurements, recording the amount and proportion of time spent in various parts of a program (in individual functions, subroutines, or other user-instrumented sections of code). Some of these methods are intrusive and require changes to the code to collect the timing information, while others require no modifications at all. Some methods are also more portable across platforms than others: if you want to collect timing information for a code on different platforms (SGI vs. IBM, for instance), you may want to use a technique that works in both environments without requiring you to re-instrument your code.

When timing code for the purpose of determining speedup, a simple formula can be used. Suppose T and T_opt are time measurements of two versions of your code (say, the original versus an optimized or parallelized version). Then we define

Speedup S = T / T_opt

as one indicator (NOT the only one!) of code performance. For example, if the original run takes T = 120 s and the optimized run takes T_opt = 30 s, the speedup is S = 4. Under different circumstances T and T_opt represent different types of time, defined as follows:

Type Description
user the amount of CPU time used by the user's program
sys the amount of CPU time used by the operating system in support of the user's program
cpu the total CPU time (user + sys)
wall the wall clock time (also known as "real time")

Typically the cpu time and the wall clock time for a program on any machine will differ, unless the system is idle or there is insignificant system usage (for instance, little or no disk usage from I/O operations or no swapping/paging). A program that spends most of its time using CPU cycles is said to be "CPU bound". A program that relies heavily on performing reads and writes to disk, on the other hand, is said to be "I/O bound".

In the case of serial code that is predominantly CPU bound, it is appropriate for T and T_opt to be elapsed cpu times. When the execution profile of a code is not predominantly CPU bound, however, wall-clock (real) time for T and T_opt is in many cases the better choice. When measuring parallel performance, T and T_opt should always be real time measurements: T is the best single-CPU measurement, while T_opt is the real time of the parallelized version. In all cases it is best to repeat time measurements several times, especially on a busy system, and among a series of measurements for the same code to use the one with the smallest time value.

Timing Methods

Below we list some utilities and routines for carrying out time measurements and then illustrate their use. Further below, we show one way of doing basic code profiling.

Routine           Source       Result Type   Approx. Resolution(1)   Approx. Overhead(1)
                                             (microseconds)          (microseconds)
gettimeofday      OS           wall                1                   0.3
rtc               OS (AIX)     wall                0.1                 0.2
SYSTEM_CLOCK(2)   XL Fortran   wall                1                   1
dtime_, etime_    XL Fortran   any             10000                   1
CPU_TIME          XL Fortran   any             10000                   1
clock_gettime     OS           wall                1                   0.3
clock             OS           cpu             10000                   0.8
times             OS           user, sys       10000                   0.2
getrusage         OS           user, sys       10000                   0.9

  1. Resolution and overhead determined on a p690 running AIX 5.1 with a 64-bit kernel and 32- and 64-bit binaries. This information was obtained from NCSA's website.
  2. Requires compilation with the -qsclk=micro compiler switch.

To summarize the data in the table: for portability, use the gettimeofday function or the SYSTEM_CLOCK routine. For the highest-resolution timing, use the rtc function.

/usr/bin/time

The quickest way to time a code is to run it under the /usr/bin/time shell command, which reports user time, system time, and total wall time. See the time man page for more information, especially on formatting the output. Note that the csh and tcsh shells have a built-in command also called time, so give the full path to run the standalone utility.

hydra% /usr/bin/time a.out

/usr/bin/timex

The AIX utility /usr/bin/timex, similar to /usr/bin/time, reports in seconds the elapsed time, user time, and system time consumed by a command, where the command is the name of any executable file on the system. With the appropriate flags, timex lists or summarizes process accounting data for the command and all of its children, and when called with the -s switch it also reports total system activity during the execution interval.

hydra% /usr/bin/timex a.out

gettimeofday

If portability is a primary concern, use the routine gettimeofday. This routine provides wall clock time; as usual, see the man pages for particulars on usage. It offers microsecond resolution with sub-microsecond overhead. It can be used to measure elapsed time, as shown in the following C code fragment.

#include <stddef.h>     /* definition of NULL */
#include <sys/time.h>   /* definition of timeval struct and prototyping of gettimeofday */

    double t1,t2,elapsed;
    struct timeval tp;
    int rtn;

    ....
    ....
    rtn=gettimeofday(&tp, NULL);
    t1=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
    ....
    /* do some work */
    ....
    rtn=gettimeofday(&tp, NULL);
    t2=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
    elapsed=t2-t1;

You can also make a C function that can be called from Fortran:

#include <stddef.h>  /* defines NULL */
#include <sys/time.h>

double second_()     /* note that AIX xlf does not require the trailing  '_' */
{
    struct timeval tp;
    int rtn;
    rtn=gettimeofday(&tp, NULL);

    return ((double)tp.tv_sec+(1.e-6)*tp.tv_usec);
}

SYSTEM_CLOCK

IBM XL Fortran provides the standard timer routine SYSTEM_CLOCK. Despite what the name suggests, it provides wall clock time. When compiled with the -qsclk=micro switch, it offers microsecond resolution and microsecond overhead. It can be used to measure elapsed time, as shown in the following Fortran code fragment.

       double precision elapsed
       integer C1,C2,R,M
       ....
       ....
       CALL SYSTEM_CLOCK (COUNT=C1,COUNT_RATE=R,COUNT_MAX=M)
       ....
!      do some work
       ....
       CALL SYSTEM_CLOCK (COUNT=C2,COUNT_RATE=R,COUNT_MAX=M)
       elapsed= dble(C2-C1)/dble(R)

You can also make a Fortran function that can be called from C. When compiling with the C compiler, you will need to provide the following libraries to the linker: -lxlf -lxlf90.

       double precision function second()
       integer C,R,M

       CALL SYSTEM_CLOCK (COUNT=C,COUNT_RATE=R,COUNT_MAX=M)

       second = dble(C)/dble(R)
       return
       end

CPU_TIME

Another standard timer is the CPU_TIME routine. Despite what the name suggests, it can be made to return several different measures of time (behavior that is non-standard), including thread-specific time. Unfortunately, it offers only 10-millisecond resolution. It can be used to measure elapsed time, as shown in the following Fortran code fragment.

       double precision dummy,t1,t2,elapsed

       call cpu_time(dummy) ! initialize
       ....
       ....
       call cpu_time(t1)
       ....
!      do some work
       ....
       call cpu_time(t2)
       elapsed=t2-t1

You can also make a Fortran function that can be called from C. When compiling with the C compiler, you will need to provide the following libraries to the linker: -lxlf -lxlf90.

       double precision function second()
       double precision cputime

       call cpu_time(cputime)

       second = cputime
       return
       end

To select which time is returned, set the cpu_time_type suboption of the XLFRTEOPTS environment variable to usertime, systime, or alltime for thread-specific time, or to total_usertime, total_systime, or total_alltime for a sum over parent and child processes. total_alltime is the default.
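For example, to make CPU_TIME report thread-specific user time before launching a run (a sketch using sh/ksh syntax; the cpu_time_type suboption name is the XL Fortran runtime option as we understand it):

```shell
# Select the time measure that CPU_TIME returns (XL Fortran runtime option),
# then run the program in the same environment.
export XLFRTEOPTS=cpu_time_type=usertime
./a.out
```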

rtc (real time clock)

The highest-resolution but least portable timer is the AIX real time clock routine, rtc, which provides wall clock time.

Fortran Example:

       REAL(8) t1,t2,elapsed
       REAL(8) rtc

       ....
       ....
       t1=rtc()
       ....
!      do some work
       ....
       t2=rtc()
       elapsed=t2-t1

C Example:

#include <sys/time.h>   /* read_real_time, time_base_to_time, timebasestruct_t */

double c_rtc(void)
{
   timebasestruct_t start;

   read_real_time(&start, TIMEBASE_SZ);
   time_base_to_time(&start, TIMEBASE_SZ);  /* convert to seconds/nanoseconds */

   /* For nanosecond-scale intervals, it is better to work with  */
   /* the high and low words separately and subtract the         */
   /* starting time from the ending time:                        */
   /*   secs   = finish.tb_high - start.tb_high;                 */
   /*   n_secs = finish.tb_low  - start.tb_low;                  */

   return (start.tb_high + 1.0e-9 * start.tb_low);
}

int main(void)
{
   double t1, t2, elapsed;

   ....
   ....
   t1 = c_rtc();

   /* do some work */

   t2 = c_rtc();
   elapsed = t2 - t1;
   ....
   ....
}

MPI_Wtime

This function returns the current value of time as a double-precision floating-point number of seconds. The value represents elapsed time since some point in the past that does not change during the life of the task. You are responsible for converting the number of seconds into other units if you prefer. Refer to the MPI_Wtime man page for more details.

C

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
  double start, finish;

  MPI_Init(&argc, &argv); /* must initialize MPI before calling MPI_Wtime */

  start = MPI_Wtime();

  /* do some work */

  finish = MPI_Wtime();
  printf("%g seconds elapsed.\n", finish - start);

  MPI_Finalize();  /* must finalize MPI after use */
  return 0;
}

Profiling

gprof

A quick way to get more detailed information on functions and routines is to use the profiling tool gprof. First, compile the source code with the compiler flag for profiling, -pg. Then execute the code, which generates a gmon.out file. Finally, analyze the gmon.out file with gprof; the results of the analysis are written to stdout.

hydra% xlf -O -pg foo.f  # or xlc -O -pg foo.c
hydra% ./a.out
hydra% gprof a.out gmon.out

The 'flat' profile will contain a useful breakdown of time spent in functions and subroutines. The 'call graph' profile contains inclusive and exclusive time spent in subroutines and functions. See the man page for information about the xlf and xlc compiler flags for profiling and see the man page on gprof for its options.

One useful option is to set the environment variable GMON_OUT_PREFIX. When profiling a threaded or MPI code, each process will then generate a gmon file named $GMON_OUT_PREFIX.<pid>. Each gmon file can be analyzed separately, or an aggregate sum can be produced with gprof -s and examined as a whole:

hydra% gprof -s $GMON_OUT_PREFIX.*
hydra% gprof foo gmon.sum

tprof

tprof is a tick-based profiling tool. It provides profile information down to the source-line level.

HPM Toolkit

The Hardware Performance Monitor (HPM) Toolkit measures the performance of codes running on IBM POWER3, POWER4, and POWER5 systems. It has three parts:

The hpmcount utility starts your job and, at completion, reports wall clock time, hardware performance counter information, derived hardware metrics, and resource usage statistics for your code as a whole.

The libhpm instrumentation library provides subroutines that allow you to instrument sections of your code to generate statistics and present summaries of that information for each instrumented section of code. This library supports serial and parallel (threaded, OpenMP, MPI, and hybrid) applications written in Fortran, C, and C++.

The hpmviz graphical user interface graphically displays the performance file generated by libhpm.

This collection of programs allows you to instrument your code with library calls that collect hardware data during the execution of your code. These data are statistically summarized and written to your current working directory at the end of your job run. You can set a number of options to control what hardware features are instrumented and sampled.