Profiling on the SGI Altix

Last update on Friday, 28-Jul-2006 16:20:28 CDT.

Tuning an applications involves determining the source of performance problems and then fixing those problems to increase the performance of your program. Performance gains usually involve one of three categories of measured time:

  • User time: amount of CPU time used by the user's program
  • System time: amount of CPU time used by the system in support of the user's program (e.g. I/O)
  • Wall clock time: elapsed real time

Tuning the performance of an application usually involves the following steps:

  1. Analyzing and identifying the performance problem
  2. Finding the location of the performance problem in the source code
  3. Fixing the performance problem

Time can be measured simply by using the /usr/bin/time command with your program. The following shows the usage of the command and some example output:

/usr/bin/time -p prog.exe
real 253.93
user 253.52
sys 0.14

Timing Routines

Timing routines are useful for obtaining timing information only for certain sections of code in your program. You must insert the timing routines at the beginning and end of regions of interest. The table below shows several common timing routines:

RoutineSourceTypeResolution (microseconds)
gettimeofdayOSwall clock>0.7
getrusageOSwall clock/user/system>0.4
omp_get_wtimeOpenMPwall clock1 (value from omp_get_tick)
MPI_WtimeMPI Librarywall clock0.04 (value from MPI_Wtick)

For the gettimeofday and getrusage routines, it is often practical to create wrapper routines to simplify the use of these routines.

Profiling with Sampling

There are three possible sampling methods:

  • Instruction Pointer (IP) Sampling (Most common)
  • Hardware Event Sampling
  • Call Stack Sampling

There are around 100 countable events with each of them separately configurable. There are a set of tools for performance analysis and bottleneck identification. You will need to load the histx module to be able to use the performance monitoring tools

Instruction Pointer Sampling

To get the instruction pointer (IP) sampling report for a program:

histx -o out prog.exe
iprep out.prog.exe.* > report.all

The first line samples the IP with the timer as the default event source at 1Khz (every 1000 cycles). The raw IP sampling reports are saved into files with names out.prog.exe.NNNN where NNNN refers to the Process ID. The second line generates a report from one or more raw IP sampling reports from histx. If your program was compiled with -g, you can also use the -l option with histx which will show line-level counts of the IP sampling.

The following shows an example report from iprep. The count column shows the totals of sampled events for each function. Self % is the percentage of samples of a function compared to the entire execution. Cum. % is the percentage of a function AND all descendants of that function.

Functions sorted by count
========= ====== == ====

       Count           Self  %   Cum.  %   Name
--------------------   -------   -------   --------------------
               62700    23.716    23.716   a.out:jacld_
               57815    21.868    45.584   a.out:blts_
               54970    20.792    66.376   a.out:buts_
               48626    18.392    84.768   a.out:jacu_
               32718    12.375    97.143   a.out:rhs_
                5928     2.242    99.385   a.out:ssor_
                1396     0.528    99.913   a.out:l2norm_

Hardware Event Sampling

One may obtain the values of the Itanium Processor Family (IPF) performance monitor event counts for the entire execution of an application. The Itanium-2 processor on cosmos can only count up to 4 events simultaneously. The list of events can be obtained with the following command:

lipfpm -l

The default event is CPU cycles (CPU_CYCLES). Some common events are:

  • L2 cache misses (L2_MISSES)
  • Retired 64-bit instructions (IA64_INST_RETIRED.THIS)
  • Retired floating point operations (FP_OPS_RETIRED)

Application Level Hardware Event Monitoring

The following example samples four hardware events using performance monitor counters.

lipfpm -f -e CPU_CYCLES -e L2_MISSES -e IA64_INST_RETIRED.THIS -e FP_OPS_RETIRED -o lipfpm.out ./myprog
cosmos% cat lipfpm.out.applu.ser.32152 

lipfpm summary
====== =======
CPU Cycles................................................. 327605356563
L2 Misses.................................................. 2156963019
Retired IA-64 Instructions -- Retired IA-64 Instructions... 405934784012
Retired FP Operations...................................... 164187161175
Average number of Itanium instructions per cycle........... 1.02391
Average MFLOP/s............................................ 651.05257
Average MB/s requested by L2............................... 1095.058235

The output is stored in lipfpm.out.myprog.PID where PID is the process ID of the running program, myprog. The -f option tells lipfpm to follow any forked processes which is necessary for most parallel programs.

Procedure Level Hardware Event Monitoring

Hardware event monitoring can also be performed at the procedural level. The Perl script, profile.pl, runs profile experiments using pfmon and dplace. Use of dplace controls which processors are used for measurement. Disabling of dplace for profile.pl requires careful consideration (see profile.pl man page). The default rate and event is 100 interrupts per second and the CPU_CYCLES event counter.

An example of the report produced by profile.pl from a program that is run on processor 0. This processor is a logical CPU since batch jobs use dynamic cpusets:

profile.pl -c0 ./myprog
user ticks:             25265           99.68 %
kernel ticks:           82              0.32 %
idle ticks:             0               0 %

Using /proc/ksyms as the kernel map file.
====================================================================
                           Kernel

      Ticks     Percent  Cumulative   Routine
                          Percent
--------------------------------------------------------------------
         81       98.78    98.78      ia64_spinlock_contention_end
          1        1.22   100.00      __insmod_st_S.data_L536
====================================================================

====================================================================
                            User

      Ticks     Percent  Cumulative   Routine
                          Percent
--------------------------------------------------------------------
       5980       23.67    23.67      jacld_
       5533       21.90    45.57      blts_
       5240       20.74    66.31      buts_
       4639       18.36    84.67      jacu_
       3131       12.39    97.06      rhs_
        563        2.23    99.29      ssor_
        141        0.56    99.85      l2norm_
         15        0.06    99.91      _fini
         11        0.04    99.95      erhs_
          4        0.02    99.97      exact_
          2        0.01    99.98      f_newfileop
          1        0.00    99.98      _end
          1        0.00    99.98      $_1$GetNextFormat$TAG$GLOB
          1        0.00    99.99      setiv_
          1        0.00    99.99      IA64_printReg_If
          1        0.00   100.00      IA64_Region_epilogue
          1        0.00   100.00      _dl_new_object
====================================================================

Call Stack Sampling

Sometimes, the control flow that led to the function being called needs to be understood. Call stack sampling can monitor events that match a particular call stack.

The following can be used to sample the call stack for a maximum of 5 frames.

histx -s callstack5 prog.exe
csrep prog.exe.* > report.all
  • Use the -s option to specify the sampling type
  • Specify the maximum number of stack frames to profile
  • Helps to identify the control flow leading to a function being called

The butterfly report for each function is organized as follows:

Frequency of appearance of each function (caller) that calls the given function

Frequency of appearance of the given function

Frequnecy of appearance of each function (callee) that is called by the given function.

The frequencies for caller or callee have two values. The first frequency is in relation to the number of callstacks containing a given function. The second frequency that is in parentheses is in relation to the total number of callstacks.

The following shows a sample output of csrep. The first report contains a listing of the frequency of appearance of each function found in the call stack. The second report contains a butterfly report for each function.

Appearance Frequency
========== =========
100.00% a.out:main
100.00% libc.so.6.1:__libc_start_main
100.00% a.out:_start
 99.94% a.out:ssor_
 23.72% a.out:jacld_
 21.87% a.out:blts_
 20.73% a.out:buts_
 18.40% a.out:jacu_
 12.38% a.out:rhs_
  0.55% a.out:l2norm_

Butterfly Report
========= ======
--------------------------------------------------------------------------------
100.00% (100.00%) libc.so.6.1:__libc_start_main+0x440 [libc-start.c:205]

....... (100.00%) a.out:main

 23.72% ( 23.72%) a.out:ssor_+0x940
 21.87% ( 21.87%) a.out:ssor_+0x9e0
 20.73% ( 20.73%) a.out:ssor_+0xa90
 18.40% ( 18.40%) a.out:ssor_+0x9f0
 12.34% ( 12.34%) a.out:ssor_+0x1360
  2.26% (  2.26%) a.out:ssor_
--------------------------------------------------------------------------------
100.00% (100.00%) a.out:_start+0x80

....... (100.00%) libc.so.6.1:__libc_start_main

 99.94% ( 99.93%) a.out:main+0x1b70
--------------------------------------------------------------------------------

....... (100.00%) a.out:_start

100.00% (100.00%) libc.so.6.1:__libc_start_main+0x440 [libc-start.c:205]
--------------------------------------------------------------------------------
100.00% ( 99.93%) a.out:main+0x1b70

....... ( 99.94%) a.out:ssor_

 24.29% ( 23.72%) a.out:jacld_
 22.39% ( 21.87%) a.out:blts_
 21.22% ( 20.73%) a.out:buts_
 18.84% ( 18.40%) a.out:jacu_
 12.67% ( 12.38%) a.out:rhs_
  0.56% (  0.55%) a.out:l2norm_

More Information

For information about performance profiling, please consult:

  • The Performance Monitoring Chapter in the SGI Altix Tutorials
  • The Performance Analysis and Debugging Chapter of the Linux Application Tuning Guide at SGI's Techpubs Library