Automatic Parallelization with the MIPSpro APO Compilers
By Dror Maydan, Manager, Compilers
With its Origin line of servers and supercomputers, Silicon Graphics continues its unmatched support for shared memory parallelism, delivering systems that allow single application binaries to scale from single processor O2 to 128-processor Origin systems. Cache-coherent, shared-memory systems bring with them the promise of scalability using incremental, easy-to-program methods. This article gives an overview of shared-memory programming environments and tools using the MIPSpro Auto-Parallelizing Option (APO) compilers. It does not cover parallel programming on UNICOS® systems nor does it discuss library-based approaches such as pthreads.
The MIPSpro compilers provide two complementary approaches to exploiting parallelism. The MIPSpro APO uses compiler technology to automatically detect and exploit parallelism in sequential Fortran (77 and 90), C, and C++ programs. The new OpenMP cross-vendor standard, pioneered by Silicon Graphics, allows developers to directly control the parallelism by inserting directives into their code. The two approaches can be used individually or can be freely mixed. In conjunction with these systems, Silicon Graphics provides a graphical tool, WorkShop Pro MPF, to assist programmers in evaluating and exploiting the parallelism in their code.
The APO transfers the burden of parallelizing loops within a code from the user to the compiler. In the ideal case, this allows a serial application to perform well on a parallel computer with little or no intervention from the user. Of course the compiler is not able to parallelize every loop that a human being can. There are codes, particularly those utilizing complex or irregular data structures and those requiring very coarse levels of parallelization, that cannot be effectively parallelized by the compiler. Nonetheless, for a large class of programs exhibiting data parallelism on regular data sets, the APO provides a very easy way for the user to gain parallel performance. As one example, on the SPEC95 floating point benchmarks, the APO is able to successfully parallelize six out of 10 programs.
The APO was released as a companion product to the MIPSpro 7.2 compilers. It replaces the preprocessors used in the earlier Power Fortran and Power C compilers (pfa and pca respectively). APO provides significantly more powerful optimizations, increasing the scope of programs that can be parallelized. In addition, APO is based on a fundamentally different model than the earlier preprocessors. The Power compilers were based on source to source preprocessing. One compiler, the preprocessor, would translate serial Fortran or C programs into new Fortran or C programs annotated with parallel directives. The resultant programs would be fed into the standard MIPSpro compilers. In other words, the user program would be translated by two separate compilers. In contrast, the APO is an integrated portion of the optimizer in the MIPSpro compilers. This approach offers several advantages to the customer. First, the customer gets a uniform technology across different languages. The same parallelizer is invoked whether the source program was in Fortran, C, or C++. With the preprocessor model, separate products were invoked for Fortran and C, and no product was available for C++. Second, the user gets a uniform interface into the compiler; there is one set of compiler options for the parallelizer and the rest of the optimizer. Finally, invoking only one compiler leads to a smoother compilation process, resulting not only in easier debugging but also improved performance, compile time, and robustness.
Using the APO
To invoke the APO, simply add -apo to the compile and link line.
Note: The -apo option was added with the 7.2.1 compilers. If you are using the 7.2 compilers, you must use either -pfa or -pca. For compatibility reasons, 7.2.1 users can still use either -pfa or -pca in lieu of -apo.
f77 -apo -O3 -c foo.f
f77 -mp foo.o -o foo
The -O3 option is not required, but the optimizations performed by the APO work best in conjunction with the -O3 optimizations.
The APO offers a set of listing and display options to help the user determine what was parallelized and the reasons why certain loops were not parallelized. Often when the APO fails to parallelize a loop, the user can use the listing mechanism as a guide to help modify the code. The listing mechanism is invoked by adding the list argument to apo on the compile line:
f77 -apo list -O3 -c foo.f
generates a file foo.l that describes what the parallelizer did. As an example, consider the following foo.f:
 1.       SUBROUTINE sub(arr, n)
 2.       REAL*8 arr(n)
 3.       DO i = 2, n
 4.         arr(i) = arr(i) + arr(i-1)
 5.       END DO
 6.       DO i = 1, n
 7.         arr(i) = arr(i) + 7.0
 8.         CALL foo(a)
 9.       END DO
10.       DO i = 1, n
11.         arr(i) = arr(i) + 7.0
12.       END DO
13.       END
The compiler will generate the following foo.l file:
Parallelization Log for Subprogram sub_
 3: Not Parallel
    Array dependence from arr on line 4 to arr on line 4.
 6: Not Parallel
    Call foo on line 8.
10: PARALLEL (Auto) __mpdo_sub_1
The first loop, on line 3, is not parallelized because it is not legal to parallelize it: each iteration reads arr(i-1), the value written by the previous iteration. The second loop, on line 6, is not parallelized because the compiler is not able to analyze the effects of the call to foo. If the user knows that the loop is indeed parallel in spite of the call, the user can help the APO by placing a C*$* ASSERT CONCURRENT CALL directive in front of the loop. Alternatively, the user can choose to manually parallelize the loop using OpenMP directives.
With the next version of the compiler (7.3, due next year), the APO will automatically be able to parallelize loops containing call statements when used in conjunction with the interprocedural analysis (-IPA) option. IPA allows the compiler to perform optimizations and analyses that span multiple procedures or subroutines. Currently, the APO only takes advantage of this capability via inlining. With 7.3, the APO will be able to analyze the regions of memory accessed via a call, and will therefore be able to parallelize loops containing calls even if those calls are not inlined. This should greatly expand the ability of the APO to find and exploit coarse-grain parallelism.
Finally, the loop on line 10 is parallelized by the APO. The (Auto) means that the APO decided to parallelize the loop; the user did not parallelize it using an OpenMP directive. When parallelizing a loop, the APO encloses the parallel loop in a new subroutine, in this case __mpdo_sub_1. Knowing the name of the subroutine allows the user to manually map profiling or debugging information back to the original source positions.
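To see why the loop on line 3 carries a dependence while the loop on line 10 does not, it can help to run each loop body in a different iteration order and compare results. The following is a minimal C sketch of that idea, not anything the APO itself does; the function names are hypothetical, the arrays are 0-based rather than 1-based, and reversing the iteration order merely stands in for "some order other than the serial one" that a parallel run might produce.

```c
#include <stddef.h>

/* Analogue of the loop on line 3: iteration i reads arr[i-1], which the
   previous iteration wrote, so the result depends on iteration order. */
void prefix_forward(double *arr, size_t n) {
    for (size_t i = 1; i < n; i++)
        arr[i] = arr[i] + arr[i - 1];
}

/* Same statement, iterations visited from n-1 down to 1. */
void prefix_reversed(double *arr, size_t n) {
    for (size_t i = n; i-- > 1; )
        arr[i] = arr[i] + arr[i - 1];
}

/* Analogue of the loop on line 10: iteration i touches only arr[i],
   so any iteration order gives the same result. */
void add7_forward(double *arr, size_t n) {
    for (size_t i = 0; i < n; i++)
        arr[i] = arr[i] + 7.0;
}

/* Same statement, iterations visited from n-1 down to 0. */
void add7_reversed(double *arr, size_t n) {
    for (size_t i = n; i-- > 0; )
        arr[i] = arr[i] + 7.0;
}
```

Starting from an array of four ones, prefix_forward yields 1, 2, 3, 4 while prefix_reversed yields 1, 2, 2, 2; the two add7 variants agree on every element. A loop whose result survives reordering in this way is a candidate for parallel execution, although one reordering is only a spot check and not a proof of independence.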
In addition to the listing mechanism, for Fortran 77 and C programs, the APO allows the user to visualize the code after parallelization. This feature is enabled via the -mplist option. Given the following foo.f:
      SUBROUTINE trivial(a)
      REAL a(10000)
      DO i = 1,10000
        a(i) = 0.0
      END DO
      END
f77 -apo -mplist -O3 -c foo.f
produces the following transformation file named foo.w2f.f:
C ***********************************************************
C Fortran file translated from WHIRL Sun Dec 7 16:53:44 1997
C ***********************************************************
        SUBROUTINE trivial(a)
        IMPLICIT NONE
        REAL*4 a(10000_8)
C
C       **** Variables and functions ****
C
        INTEGER*4 i
C
C       **** statements ****
C
C       PARALLEL DO will be converted to SUBROUTINE __mpdo_trivial_1
C$OMP PARALLEL DO private(i), shared(a)
        DO i = 1, 10000, 1
          a(i) = 0.0
        END DO
        RETURN
        END ! trivial
Future versions of the compiler will support this feature for C++ and Fortran 90 as well.
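The comment in the transformed file says that the PARALLEL DO will be converted to a subroutine. The general idea behind such a conversion can be sketched in C as follows. This is a conceptual illustration under stated assumptions, not the code the MIPSpro compilers generate: the name mpdo_trivial_1 only echoes the listing, the (lo, hi) interface and the block partitioning are assumptions, and a serial loop stands in for the multiprocessing runtime that would hand each block to a different thread.

```c
#include <stddef.h>

/* Hypothetical outlined routine: the original loop body, parameterized by
   a half-open iteration range [lo, hi) so it can run on one block. */
static void mpdo_trivial_1(float *a, size_t lo, size_t hi) {
    for (size_t i = lo; i < hi; i++)
        a[i] = 0.0f;
}

/* Stand-in for the runtime: split n iterations into nthreads blocks and
   call the outlined routine once per block. Here the calls run one after
   another; a real runtime would run them concurrently.
   Assumes nthreads > 0. */
void trivial(float *a, size_t n, size_t nthreads) {
    size_t chunk = (n + nthreads - 1) / nthreads;  /* ceiling division */
    for (size_t t = 0; t < nthreads; t++) {
        size_t lo = t * chunk;
        size_t hi = (lo + chunk < n) ? lo + chunk : n;
        if (lo < hi)
            mpdo_trivial_1(a, lo, hi);
    }
}
```

Because each block covers a disjoint range of i, the calls do not interfere with one another, which is what makes it safe to run them on different processors. It also shows why a profiler or debugger reports time in a routine with a generated name rather than in the original subroutine.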
The APO is a good solution for users who are not able to devote large resources to parallelizing their codes or for users who have reasonably straightforward, data-parallel codes. For users who want more control and portability, OpenMP provides a simple, easy-to-use, directive-based approach to parallelization. Since OpenMP was the topic of an article in the September/October 1997 Developer News, we will not go into much detail about the programming model in this article. In brief, OpenMP supports fine-grain parallelism through the !$OMP PARALLEL DO directive and coarse-grain parallelism through the parallel region (!$OMP PARALLEL) directive. It is supported by Silicon Graphics, Intel, IBM, Compaq, HP, Sun, and a host of third-party compiler, tool, and application vendors. The Silicon Graphics 7.2.1 compilers support the full specification in Fortran 77 and Fortran 90. Work is continuing among the vendors to define a binding for C and C++, and we expect to deliver an implementation shortly after the definition is finalized. In the meantime, our C and C++ compilers support much of the same functionality through older, proprietary directives (refer to the C Language Reference Manual or the MIPSpro C and C++ Pragmas book).
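As an illustration of the fine-grain form, the loop containing a call from the earlier listing can be marked parallel by hand. The sketch below is written in C and assumes the `#pragma omp parallel for` spelling of the C/C++ OpenMP binding, which is not part of the 7.2.1 compilers described here; in Fortran the equivalent directive is !$OMP PARALLEL DO. The callee foo is a hypothetical stand-in with no side effects, which is the property the programmer is vouching for by adding the directive.

```c
/* Hypothetical callee: reads only its argument and touches no shared
   state, so calls from different iterations cannot interfere. */
static double foo(double x) {
    return x + 7.0;
}

/* The directive asserts that the iterations are independent. A compiler
   without OpenMP support ignores the pragma and runs the loop serially,
   producing the same result. */
void add_seven(double *arr, int n) {
    int i;
#pragma omp parallel for
    for (i = 0; i < n; i++)
        arr[i] = foo(arr[i]);
}
```

The loop index is private to each thread and arr is shared, matching the private(i), shared(a) clauses shown in the -mplist output. The directive is a promise rather than a check: if foo did modify shared data, the compiler would still parallelize the loop and the results could be wrong.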
To use OpenMP (or the older directives in C/C++), simply compile and link with the -mp flag. These directives are fully compatible with the APO. It is perfectly possible and reasonable to parallelize portions of your program with OpenMP directives and rely on the APO to parallelize other portions of your program. OpenMP directives are also compatible with the older-style directives. You can parallelize one subroutine (or one loop) with OpenMP and another with the old style directives. The only restriction is that you cannot use both styles of directives on the same loop nest. However, the compiler does provide two options to turn off recognition of OpenMP or the old-style directives:
f77 -mp -MP:open_mp=off
f77 -mp -MP:old_mp=off
These options allow you to switch between old- and new-style directives at compile time.
Parallel programs, whether parallelized by the APO or with OpenMP, are invoked as if they were serial programs. To control the number of processors used, set the OMP_NUM_THREADS environment variable:
setenv OMP_NUM_THREADS P
where P is a positive integer. For testing purposes you can set P higher than the number of processors on the machine, but for performance reasons it is almost always better to use no more processors than the machine has available.
With the 7.2.1 compilers, the libmp multiprocessing runtime library has changed to use "dynamic threads" by default. Previously, if the user asked for P processors, the user got P processors. While this always worked, it could lead to large inefficiencies when there were fewer than P idle processors on the machine. With dynamic threads, the system will automatically cut back on the number of processors used whenever the load on the system is too high. If you wish to guarantee that you get P processors, you can disable dynamic threads using
setenv OMP_DYNAMIC FALSE
Whether you rely on the APO, OpenMP, or some mixture of the two, WorkShop Pro MPF (a companion product to the APO) gives you a graphical interface to analyze and optimize your Fortran 77 parallel programs. Future versions will support Fortran 90, C, and C++ as well.
To use the product, compile your program, say foo.f, using
f77 -O3 -apo keep -c foo.f
The keep option causes the APO to create informational files used by the MPF tools. Given these files, invoke the tool using
cvpav -f foo.f
Figure 1 shows the tool applied to our earlier example. The tool lists all the loops and MP constructs (parallel regions, loops, critical sections) in the code. We have selected the middle loop, the one that could not be parallelized because of the call statement. The tool informs us that the loop cannot be parallelized due to a "call to foo."
Figure 1: WorkShop Pro MPF Main Window
In Figure 2 we see the code after parallelization (similar to the view given by -mplist). The source code for the loop we have selected is highlighted. WorkShop Pro MPF works well together with profiling. By running a profiling experiment, the tool can order loops by their importance, allowing users to concentrate their attention on the important pieces of code. The tool also includes an editing function, allowing users to add OpenMP constructs to their code through a menu system.
Figure 2: Transformed Source
SGI offers a suite of compilers and tools to help users exploit the shared-memory parallelism available on our hardware. In the future, look to us to extend the power of the APO to parallelize loops across procedure boundaries and to add OpenMP and tools support for additional programming languages.
For information beyond the scope of this article, the APO is described in more detail in the MIPSpro Auto-Parallelizing Option Programmer's Guide. OpenMP is described in the September/October 1997 edition of Developer News. Further information can also be found at
For information on WorkShop Pro MPF, refer to Developer Magic: WorkShop Pro MPF User's Guide.