Performance Analysis and Tuning Summary

The "Original Version" results shown in the tables were achieved using
automatic parallelization optimizations performed by the Power Fortran
Accelerator (PFA) through the -pfa compiler switch alone. The "Tuned
Version" consists entirely of compiled Fortran, with no use of
hand-coding or optimized math libraries. In some cases, DOACROSS
directives were inserted to bypass PFA and parallelize specific loops,
and in other cases code was reorganized so that PFA would perform a more
effective optimization.

The following is a summary of the performance limitations in the
original, untuned version, and the tuning remedies that were made
through source code changes and directives.

Name     Description, Dimensions, Tuning Notes
------   ----------------------------------------------------------

MXM      Matrix Multiply, Unrolled
         Dimension = (256x128)x(128x64)

         Factors limiting performance in untuned version:
         (1) Parallelized dimension is smallest (64).
         (2) Algorithm not blocked for optimal cache reuse.
         (3) Power-of-2 declared dimensions cause cache-mapping
             collisions.

         Tuning remedies for these factors:
         (1) Transposed problem, (64x128)x(128x256), parallelized
             dimension 256.
         (2) Algorithm blocked to reuse columns in primary 16 KB cache.
         (3) Leading declared dimension changed from 64 to 65.

CFFT2D   Complex 2D FFT
         Dimension = 128x256

         Factors limiting performance in untuned version:
         (1) Only innermost loops were automatically parallelizable.
         (2) Cache-line coherency conflicts between processors.
         (3) Power-of-2 declared dimensions cause cache-mapping
             collisions.

         Tuning remedies for these factors:
         (1) Loops renested, DOACROSS directive used to implement
             outer-loop parallelization.
         (2) Alleviated by remedy (1), but still causes degradation of
             performance on more than 16 processors. Therefore, if more
             than 16 processors are being used, we use chunk interleave
             scheduling that keeps the chunks larger than cache line size.
             As a result, only 16 processors have any work to do.
         (3) Leading declared dimension changed from 128 to 129.

CHOLSKY  Cholesky Decomposition/Solution of Banded Systems
         Dimension = 41, Bandwidth = 5, NRHS = 4,
         Number of Matrices = 251

         Factors limiting performance in untuned version:
         (1) Parallelized loop has only NRHS=4 iterations.
         (2) Cache-line coherency conflicts between processors.

         Tuning remedies for these factors:
         (1) Loops renested to parallelize loop with 251 iterations.
         (2) Matrices transposed to eliminate cache conflicts.

BTRIX    Block Tridiagonal Solver
         Block Size = 5x5, Number of Blocks = 30,
         Number of Systems = 30

         Factors limiting performance in untuned version:
         (1) Inner loops and loops with 5 iterations parallelized.
         (2) Cache-line coherency conflicts between processors caused
             by (1).

         Tuning remedies for these factors:
         (1) Used DOACROSS to parallelize on number of systems, an outer
             loop around the BTRIX call. No speedup from 16 to 20
             processors because of only 30 parallel iterations.
         (2) Alleviated by remedy (1).

GMTRY    Generate Solid-Related Matrix, Gaussian Eliminate
         Dimension = 500

         Factors limiting performance in untuned version:
         (1) Excellent performance as is.

         Tuning remedies for these factors:
         (1) No changes in tuned version.

EMIT     Emit Vortices, Pressure, Forces
         Dimensions = 100x5

         Factors limiting performance in untuned version:
         (1) Quite good as is. Parallel speedups limited by small
             problem size.

         Tuning remedies for these factors:
         (1) No changes in tuned version.

VPENTA   Vectorized Inversion of 3 Pentadiagonals
         Dimension = 128, NRHS = 128

         Factors limiting performance in untuned version:
         (1) Single loop nestings performing row-operations
             (non-stride-1) on some portions of kernel.
         (2) Power-of-2 declared dimensions cause cache-mapping
             collisions.

         Tuning remedies for these factors:
         (1) No simple remedy.
         (2) Leading declared dimension changed from 128 to 129.
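
The effect of changing a power-of-2 leading dimension, as done for MXM,
CFFT2D, and VPENTA, can be illustrated with a little address arithmetic.
The sketch below is written in C with Fortran-style column-major
addressing and counts how many distinct cache sets are touched when
walking across one row of an array. Only the 16 KB primary cache size
comes from the notes above; the direct-mapped organization, the 32-byte
line size, the 8-byte element size, and the function name are assumptions
made for illustration.

```c
#include <assert.h>
#include <string.h>

/* Assumed cache model: 16 KB, direct-mapped, 32-byte lines. */
#define CACHE_BYTES 16384
#define LINE_BYTES  32
#define ELEM_BYTES  8                          /* assumed 8-byte reals */
#define NUM_SETS    (CACHE_BYTES / LINE_BYTES) /* 512 sets */

/* Count the distinct cache sets touched by elements A(1,1), A(1,2), ...,
   A(1,ncols) of a column-major array whose declared leading dimension
   is lda. Consecutive elements of a row are lda elements apart. */
int distinct_sets_in_row(int lda, int ncols)
{
    unsigned char seen[NUM_SETS];
    int count = 0;

    assert(lda > 0 && ncols >= 0);
    memset(seen, 0, sizeof seen);

    for (int j = 0; j < ncols; j++) {
        long byte_offset = (long)j * lda * ELEM_BYTES;
        int set = (int)((byte_offset / LINE_BYTES) % NUM_SETS);
        if (!seen[set]) {
            seen[set] = 1;
            count++;
        }
    }
    return count;
}
```

Under these assumptions, distinct_sets_in_row(128, 128) returns 16, so
the 128 elements of a row compete for only 16 cache lines, while
distinct_sets_in_row(129, 128) returns 128, so each element of the row
maps to its own line.
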

Conclusions

Summarizing the Original Version results, on two kernels, MXM and GMTRY,
the automatic PFA parallelizations are excellent, although cache
blocking techniques that can be automated boost MXM performance
substantially. On two other kernels, EMIT and VPENTA, the parallel
performance from automatic parallelization was good, limited by problem
size, program structure, and memory system characteristics more than by
PFA optimization choices. On the remaining three kernels, CFFT2D,
CHOLSKY, and BTRIX, automatic parallel performance was poor, mostly
because of program structure and memory system characteristics, but
improved PFA capabilities could deliver part of the beneficial effect of
the manual tuning.

The Tuned Version results show that excellent parallel performance is
possible on all of these kernels with minor tuning. These tuning
techniques can be summarized as:

(1) Achieve parallelization on the outermost loops possible and the
    largest iteration counts possible through renesting or use of
    DOACROSS directives.
(2) Achieve memory locality in conjunction with (1) by transposing
    matrices and/or renesting loops so that inner loops operate on
    leftmost dimensions (stride-1) and outer loops operate on rightmost
    dimensions.
(3) Avoid power-of-2 declared leading dimensions of arrays.
(4) Use blocking algorithms where possible for reuse of data in caches.

The remaining limitations to parallel performance after the tuning
changes are primarily caused by the small problem sizes, highlighting
the importance of scaling up the size of problems as the number of
processors is increased.
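
Techniques (2) and (4) can be sketched together on a matrix multiply.
The example below is a minimal illustration, not the code used for the
Tuned Version: it is written in C with a column-major indexing macro so
that the leftmost index is stride-1 as in Fortran, and the names IDX,
matmul_ref, matmul_blocked, and the block-size parameter kb are all
hypothetical. The blocked routine reuses a group of kb columns of A for
every column of C before moving on, which is one way to keep those
columns resident in a small cache.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major (Fortran-style) addressing: the leftmost index i is
   stride-1, and ld is the declared leading dimension, which may be
   larger than the number of rows actually used. */
#define IDX(i, j, ld) ((size_t)(i) + (size_t)(j) * (size_t)(ld))

/* Unblocked reference: C(m x n) = A(m x k) * B(k x n). */
void matmul_ref(int m, int n, int k,
                const double *a, int lda,
                const double *b, int ldb,
                double *c, int ldc)
{
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++)
            c[IDX(i, j, ldc)] = 0.0;
        for (int p = 0; p < k; p++) {
            double bpj = b[IDX(p, j, ldb)];
            for (int i = 0; i < m; i++)        /* stride-1 in A and C */
                c[IDX(i, j, ldc)] += a[IDX(i, p, lda)] * bpj;
        }
    }
}

/* Blocked variant: a block of kb columns of A is reused for every
   column j of C before the next block of A is touched. */
void matmul_blocked(int m, int n, int k, int kb,
                    const double *a, int lda,
                    const double *b, int ldb,
                    double *c, int ldc)
{
    assert(kb > 0 && lda >= m && ldb >= k && ldc >= m);
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++)
            c[IDX(i, j, ldc)] = 0.0;

    for (int p0 = 0; p0 < k; p0 += kb) {
        int pend = (p0 + kb < k) ? p0 + kb : k;
        /* The j loop is the outermost loop with independent iterations
           here, so it is the natural candidate for parallelization. */
        for (int j = 0; j < n; j++) {
            for (int p = p0; p < pend; p++) {
                double bpj = b[IDX(p, j, ldb)];
                for (int i = 0; i < m; i++)    /* stride-1 in A and C */
                    c[IDX(i, j, ldc)] += a[IDX(i, p, lda)] * bpj;
            }
        }
    }
}
```

Both routines accept a leading dimension separate from the row count, so
technique (3) can be applied by the caller, for example by declaring a
64-row array with a leading dimension of 65. A suitable value of kb
depends on the column length and the cache size and would need to be
found by measurement.
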