Timing and PerformanceLast update on Thursday, 07-Jun-2007 10:05:25 CDT. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
In order to use allocated CPU time efficiently, it is in your best interest (and ours) that some effort be expended in ensuring that your code is well tuned and free of computational bottlenecks to the extent possible. When portions of code are poorly implemented, performance will suffer. You can determine where your computational bottlenecks lie by timing and profiling your code. You can also time your programs to determine the speedup of one version of your code relative to another version. There are many ways to collect timing information for a code. Some methods measure the overall time of a program. Other methods perform more fine grained measurements and record the amounts and proportions of time spent in various parts of a program (in various functions, subroutines or other user instrumented sections of code). Some of these methods are intrusive and require that changes be made to the code to collect the timing information, while others do not require modifications to code. Some methods are also more portable across platforms than others. This means that if you want to collect timing information for a certain code on different platforms (SGI vs. IBM for instance) you may want to use a method or technique that will apply in both environments without requiring you to re-instrument your code. When timing code for the purpose of determinig speedup, a simple formula can
be used. Suppose
as an indicator (NOT the only one!) of code performance. Under different circumstances
Typically the cpu time and the wall clock time for a program on any machine will differ, unless the system is idle or there is insignificant system usage (for instance, little or no disk usage from I/O operations or no swapping/paging). A program that spends most of its time using CPU cycles is said to be "CPU bound"; a program that relies heavily on performing reads and writes to disk, on the other hand, is said to be "I/O bound". In the case of serial code that is also predominantly CPU bound, it is appropriate for T and T_opt to be elapsed cpu times. However, when the execution profile of a code is not predominantly CPU bound, real time (wall-clock) for T and T_opt is in many cases the best way to go. When measuring parallel performance, T and T_opt should represent real time measurements. In this case, T is the best 1-cpu measurement, while T_opt is the real time performance of the parallelized version. Under all cases it is best that time measurements be repeated several times, especially on a busy system. Among a series of mesurements for the same code use the one with the smallest time value. Below we list some utilities and routines for carrying out time measurments, and after that illustrate their use. Further below, we show one way of doing basic code profiling.
To summarize the data in the table: for portability use the
/usr/bin/timeThe quickest way to get timing of a code is run the code within the /usr/bin/time shell command. The command will return user time , system time and the total wall time . See the man pages on time to see more information on the command, especially on formatting the output. Note that the csh and tcsh shells have a built-in command also called time.
/usr/bin/timexThe AIX utility /usr/bin/timex , similar to /usr/bin/time , reports, in seconds, the elapsed time, user time, and system execution time for a command. With specified flags, the timex command lists or summarizes process accounting data for a command and all of its children. Command is the name of any executable file on the system. It also reports total system activity during the execution interval when called with the switch -s.
gprofA quick way to get more detailed information on functions and routines is to use the profile tool gprof. The first step is to compile to source code with the compiler flags for profiling: '-pg'. After compiling the code, the second step is to execute the code which will then generate a gmon.out file. To analyze the gmon.out file, use gprof. The results of the analyses will be dumped to stdout. The 'flat' profile will contain a useful breakdown of time spent in functions and subroutines. The 'call graph' profile contains inclusive and exclusive time spent in subroutines and functions. See the man pages on the XL compilers for information about the compiler flags for profiling and see the man page on gprof for its options.
gettimeofdayIf portability is a primary concern, use the routine gettimeofday. This routine provides wall clock time . As usual see the man pages for particulars on usage. It offers both microsecond resolution and overhead. It can be used as an elapsed time as is shown in the following C code fragment.
You can also make a C function that can be called from Fortran:
SYSTEM_CLOCKIBM XL Fortran provides the standard timer routine, SYSTEM_CLOCK. Despite what the name suggests, it provides wall clock time . When compiled with the -qsclk=micro switch, it offers milli-second resolution and micro-second overhead. It can be used as an elapsed time as is shown in the following Fortran code fragment.
You can also make a Fortran function that can be called from C. When compiling with the C compiler, you will need to provide the following libraries to the linker: -lxlf -lxlf90.
CPU_TIMEAnother standard timer is the cpu_time routine. Despite the name suggests, it can provide any measure of time (which is non standard). Unfortunately it offers only 10 milli-second resolution. It can be used as an elapsed time as is shown in the following Fortran code fragment. It can also provide thread specific time.
You can also make a Fortran function that can be called from C. When compiling with the C compiler, you will need to provide the following libraries to the linker: -lxlf -lxlf90.
To select what time is returned, set the XLFRTEOPTS option to usertime or systime or alltime for thread specific time or total_usertime or total_systime or total_alltime for a sum of parent and child processes. total_alltime is the default. rtc (real time clock)The highest resolution but least portable timer is the AIX real time clock. This routine provides wall clock time . Fortran
C
MPI_WtimeThis subroutine returns the current value of time as a double precision floating point number of seconds. This value represents elapsed time since some point in the past. This time in the past will not change during the life of the task. You are responsible for converting the number of seconds into other units if you prefer. Refer to the MPI_Wtime man page for more details. C
ProfilinggprofA quick way to get more detailed information on functions and routines is to use the profile tool gprof. The first step is to compile the source code with the compiler flag for profiling: -pg. After compiling the code, the second step is to execute the code which will then generate a gmon.out file. To analyze the gmon.out file, use gprof . The results of the analyses will be dumped to stdout.
The 'flat' profile will contain a useful breakdown of time spent in functions and subroutines. The 'call graph' profile contains inclusive and exclusive time spent in subroutines and functions. See the man page for information about the xlf and xlc compiler flags for profiling and see the man page on gprof for its options. One option is to set the environment variable GMON_OUT_PREFIX to something. When profiling a threaded or MPI code, each process will generate a gmon file called $GMON_OUT_PREFIX.pid. Each gmon file can then be analyzed seperately or the aggregate sum can be produced by gmon and examined as a whole:
tproftprof is a tick based profiling tool. It provides line level profile information. HPMThe Hardware Performance Monitor (HPM) Toolkit measures the performance of codes running on IBM POWER3, POWER4, and POWER5 systems. It has three parts:
The hpmcount utility starts your job, and at completion it reports wall clock time, hardware performance counters information, derived hardware metrics, and resource usage statistics for your code as a whole. The libhpm instrumentation library provides subroutines that allow you to instrument sections of your code to generate statistics and present summaries of that information for each instrumented section of code. This library supports serial and parallel (threaded, OpenMP, MPI, and hybrid) applications written in Fortran, C, and C++. The hpmviz graphical user interface graphically displays the performance file generated by libhpm. This collection of programs allows you to instrument your code with library calls that collect hardware data during the execution of your code. These data are statistically summarized and written to your current working directory at the end of your job run. You can set a number of options to control what hardware features are instrumented and sampled. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||