|
Tuning an applications involves determining the source of performance problems and then
fixing those problems to increase the performance of your program. Performance gains usually
involve one of three categories of measured time:
- User time: amount of CPU time used by the user's program
- System time: amount of CPU time used by the system in support of the user's program (e.g. I/O)
- Wall clock time: elapsed real time
Tuning the performance of an application usually involves the following steps:
- Analyzing and identifying the performance problem
- Finding the location of the performance problem in the source code
- Fixing the performance problem
Time can be measured simply by using the /usr/bin/time command with your
program. The following shows the usage of the command and some example
output:
/usr/bin/time -p prog.exe
real 253.93
user 253.52
sys 0.14
|
Timing Routines
Timing routines are useful for obtaining timing information only for certain sections
of code in your program. You must insert the timing routines at the beginning and end
of regions of interest. The table below shows several common timing routines:
| gettimeofday | OS | wall clock | >0.7 |
| getrusage | OS | wall clock/user/system | >0.4 |
| omp_get_wtime | OpenMP | wall clock | 1 (value from omp_get_tick) |
| MPI_Wtime | MPI Library | wall clock | 0.04 (value from MPI_Wtick) |
For the gettimeofday and getrusage routines, it is often practical to create
wrapper routines to simplify the use of these routines.
Profiling with Sampling
There are three possible sampling methods:
- Instruction Pointer (IP) Sampling (Most common)
- Hardware Event Sampling
- Call Stack Sampling
There are around 100 countable events with each of them separately
configurable. There are a set of tools for performance analysis and bottleneck
identification. You will need to load the histx module to be able to use the
performance monitoring tools
Instruction Pointer Sampling
To get the instruction pointer (IP) sampling report for a program:
histx -o out prog.exe
iprep out.prog.exe.* > report.all
|
The first line samples the IP with the timer as the default event source at
1Khz (every 1000 cycles). The raw IP sampling reports are saved into files
with names out.prog.exe.NNNN where NNNN refers to the Process ID.
The second line generates a report from one or more raw IP sampling reports
from histx. If your program was compiled with -g, you can also use the -l
option with histx which will show line-level counts of the IP sampling.
The following shows an example report from iprep. The count column shows the totals
of sampled events for each function. Self % is the percentage of samples of a function
compared to the entire execution. Cum. % is the percentage of a function AND
all descendants of that function.
Functions sorted by count
========= ====== == ====
Count Self % Cum. % Name
-------------------- ------- ------- --------------------
62700 23.716 23.716 a.out:jacld_
57815 21.868 45.584 a.out:blts_
54970 20.792 66.376 a.out:buts_
48626 18.392 84.768 a.out:jacu_
32718 12.375 97.143 a.out:rhs_
5928 2.242 99.385 a.out:ssor_
1396 0.528 99.913 a.out:l2norm_
|
Hardware Event Sampling
One may obtain the values of the Itanium Processor Family (IPF) performance
monitor event counts for the entire execution of an application. The Itanium-2
processor on cosmos can only count up to 4 events simultaneously. The list of events
can be obtained with the following command:
The default event is CPU cycles (CPU_CYCLES). Some common events are:
- L2 cache misses (L2_MISSES)
- Retired 64-bit instructions (IA64_INST_RETIRED.THIS)
- Retired floating point operations (FP_OPS_RETIRED)
Application Level Hardware Event Monitoring
The following example samples four hardware events using performance monitor counters.
lipfpm -f -e CPU_CYCLES -e L2_MISSES -e IA64_INST_RETIRED.THIS -e FP_OPS_RETIRED -o lipfpm.out ./myprog
cosmos% cat lipfpm.out.applu.ser.32152
lipfpm summary
====== =======
CPU Cycles................................................. 327605356563
L2 Misses.................................................. 2156963019
Retired IA-64 Instructions -- Retired IA-64 Instructions... 405934784012
Retired FP Operations...................................... 164187161175
Average number of Itanium instructions per cycle........... 1.02391
Average MFLOP/s............................................ 651.05257
Average MB/s requested by L2............................... 1095.058235
|
The output is stored in lipfpm.out.myprog.PID where PID is the process ID of
the running program, myprog. The -f option tells lipfpm to follow any forked
processes which is necessary for most parallel programs.
Procedure Level Hardware Event Monitoring
Hardware event monitoring can also be performed at the procedural level.
The Perl script, profile.pl, runs profile experiments using pfmon and dplace.
Use of dplace controls which processors are used for measurement. Disabling of
dplace for profile.pl requires careful consideration (see profile.pl man page).
The default rate and event is 100 interrupts per second and the CPU_CYCLES
event counter.
An example of the report produced by profile.pl from a program that is run
on processor 0. This processor is a logical CPU since batch jobs use dynamic
cpusets:
profile.pl -c0 ./myprog
user ticks: 25265 99.68 %
kernel ticks: 82 0.32 %
idle ticks: 0 0 %
Using /proc/ksyms as the kernel map file.
====================================================================
Kernel
Ticks Percent Cumulative Routine
Percent
--------------------------------------------------------------------
81 98.78 98.78 ia64_spinlock_contention_end
1 1.22 100.00 __insmod_st_S.data_L536
====================================================================
====================================================================
User
Ticks Percent Cumulative Routine
Percent
--------------------------------------------------------------------
5980 23.67 23.67 jacld_
5533 21.90 45.57 blts_
5240 20.74 66.31 buts_
4639 18.36 84.67 jacu_
3131 12.39 97.06 rhs_
563 2.23 99.29 ssor_
141 0.56 99.85 l2norm_
15 0.06 99.91 _fini
11 0.04 99.95 erhs_
4 0.02 99.97 exact_
2 0.01 99.98 f_newfileop
1 0.00 99.98 _end
1 0.00 99.98 $_1$GetNextFormat$TAG$GLOB
1 0.00 99.99 setiv_
1 0.00 99.99 IA64_printReg_If
1 0.00 100.00 IA64_Region_epilogue
1 0.00 100.00 _dl_new_object
====================================================================
|
Call Stack Sampling
Sometimes, the control flow that led to the function being called needs to be understood.
Call stack sampling can monitor events that match a particular call stack.
The following can be used to sample the call stack for a maximum of 5 frames.
histx -s callstack5 prog.exe
csrep prog.exe.* > report.all
|
- Use the -s option to specify the sampling type
- Specify the maximum number of stack frames to profile
- Helps to identify the control flow leading to a function being called
The butterfly report for each function is organized as follows:
Frequency of appearance of each function (caller) that calls the given function
Frequency of appearance of the given function
Frequnecy of appearance of each function (callee) that is called by the given function.
|
The frequencies for caller or callee have two values. The first frequency is in relation
to the number of callstacks containing a given function. The second frequency that is in
parentheses is in relation to the total number of callstacks.
The following shows a sample output of csrep. The first report contains a listing of
the frequency of appearance of each function found in the call stack. The second report
contains a butterfly report for each function.
Appearance Frequency
========== =========
100.00% a.out:main
100.00% libc.so.6.1:__libc_start_main
100.00% a.out:_start
99.94% a.out:ssor_
23.72% a.out:jacld_
21.87% a.out:blts_
20.73% a.out:buts_
18.40% a.out:jacu_
12.38% a.out:rhs_
0.55% a.out:l2norm_
Butterfly Report
========= ======
--------------------------------------------------------------------------------
100.00% (100.00%) libc.so.6.1:__libc_start_main+0x440 [libc-start.c:205]
....... (100.00%) a.out:main
23.72% ( 23.72%) a.out:ssor_+0x940
21.87% ( 21.87%) a.out:ssor_+0x9e0
20.73% ( 20.73%) a.out:ssor_+0xa90
18.40% ( 18.40%) a.out:ssor_+0x9f0
12.34% ( 12.34%) a.out:ssor_+0x1360
2.26% ( 2.26%) a.out:ssor_
--------------------------------------------------------------------------------
100.00% (100.00%) a.out:_start+0x80
....... (100.00%) libc.so.6.1:__libc_start_main
99.94% ( 99.93%) a.out:main+0x1b70
--------------------------------------------------------------------------------
....... (100.00%) a.out:_start
100.00% (100.00%) libc.so.6.1:__libc_start_main+0x440 [libc-start.c:205]
--------------------------------------------------------------------------------
100.00% ( 99.93%) a.out:main+0x1b70
....... ( 99.94%) a.out:ssor_
24.29% ( 23.72%) a.out:jacld_
22.39% ( 21.87%) a.out:blts_
21.22% ( 20.73%) a.out:buts_
18.84% ( 18.40%) a.out:jacu_
12.67% ( 12.38%) a.out:rhs_
0.56% ( 0.55%) a.out:l2norm_
|
More Information
For information about performance profiling, please consult:
The Performance Monitoring Chapter in the SGI Altix Tutorials
The Performance Analysis and Debugging Chapter of the Linux Application Tuning Guide at SGI's Techpubs Library
|