Texas A&M Supercomputing Facility, Texas A&M University

Compiling and Running CUDA Programs

Last modified: Wednesday April 10, 2013 3:00 PM

Compiling OpenACC with PGI Compilers

The OpenACC Application Program Interface provides a simple way of implementing accelerated applications, which in this document means CUDA applications. OpenACC defines a collection of compiler directives that mark loops and regions of code in standard C, C++ and Fortran to be offloaded from a host CPU to an attached accelerator. This lets programmers create high-level host+accelerator programs without explicitly managing data and devices, as CUDA C/C++/Fortran requires. All of these details are implicit in the programming model and are managed by the OpenACC-enabled compilers and runtimes.

Sample OpenACC Programs

Sample programs are given below to illustrate the flavor of OpenACC. The OpenACC directives are the lines beginning with !$acc in the Fortran program and #pragma in the C program.

daxpy.f90
        program daxpy
        implicit none
        real*8::a
        real*8, allocatable::x(:), y(:)
        integer::n, i

        n = 10000
        allocate(x(n), y(n))

        call random_number(x)
        call random_number(y)
        call random_number(a)

        !$acc kernels copyin(x, a) copy(y)
        do i = 1,n
          y(i) = y(i)+a*x(i)
        enddo
        !$acc end kernels

        deallocate(x,y)

        end program daxpy
daxpy.c
        #include <stdlib.h>
        #include <time.h>
        #define N 10000

        int main(void){
            double *x, *y, a;
            int i;

            x = (double *)malloc(N*sizeof(double));
            y = (double *)malloc(N*sizeof(double));

            // assign random values to x, y, and a
            srand(time(NULL));
            a = (double)rand();
            for (i = 0; i < N; i++){
                x[i] = (double)rand()/(double)RAND_MAX;
                y[i] = (double)rand()/(double)RAND_MAX;
            }

            // x and y are pointers, so the sections [0:N] give the transfer length
            #pragma acc kernels copyin(x[0:N], a) copy(y[0:N])
            {
                for (i = 0; i < N; i++){
                    y[i] += a*x[i];
                }
            }

            free(x);
            free(y);
            return 0;
        }
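
A quick way to gain confidence in an accelerated loop is to compare its result against a plain host loop. The following is a minimal sketch of that idea, not part of the original samples: the function names daxpy_acc and daxpy_max_error and the deterministic test data are illustrative assumptions. If the file is compiled without -acc, the directives are ignored and the loop simply runs on the host.

```c
#include <assert.h>
#include <stdlib.h>

/* y = y + a*x inside an OpenACC kernels region. x and y are plain
   pointers, so the sections [0:n] state how many elements to transfer. */
void daxpy_acc(int n, double a, const double *x, double *y)
{
    int i;
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    {
        /* Assumption: x and y do not overlap, so iterations are independent. */
        #pragma acc loop independent
        for (i = 0; i < n; i++)
            y[i] += a * x[i];
    }
}

/* Runs daxpy_acc on deterministic data and returns the largest absolute
   difference from a plain host loop, or -1.0 if allocation fails. */
double daxpy_max_error(int n)
{
    double a = 2.0, max_err = 0.0;
    double *x, *y, *ref;
    int i;

    assert(n > 0);
    x   = (double *)malloc(n * sizeof(double));
    y   = (double *)malloc(n * sizeof(double));
    ref = (double *)malloc(n * sizeof(double));
    if (!x || !y || !ref) {
        free(x); free(y); free(ref);
        return -1.0;
    }

    for (i = 0; i < n; i++) {
        x[i]   = (double)i / n;
        y[i]   = 1.0 - x[i];
        ref[i] = y[i] + a * x[i];   /* host reference result */
    }

    daxpy_acc(n, a, x, y);

    for (i = 0; i < n; i++) {
        double d = y[i] - ref[i];
        if (d < 0.0) d = -d;
        if (d > max_err) max_err = d;
    }

    free(x); free(y); free(ref);
    return max_err;
}
```

A caller would typically treat a small return value from daxpy_max_error(10000), within floating-point rounding, as agreement between the device and host results. A tolerance is used rather than exact equality because the accelerator may round intermediate results differently from the host.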

OpenACC is fully supported by the PGI compilers on Eos. The PGI compiler module file must be loaded before using the compilers.

    module load pgi/compilers

The command-line forms for invoking the PGI compilers on OpenACC codes are as follows:

    pgfortran -acc [options] -o acc_prog.exe acc_prog.f90 ...
    pgcc      -acc [options] -o acc_prog.exe acc_prog.c ...
    pgCC      -acc [options] -o acc_prog.exe acc_prog.c++ ...

Common Options

The options mentioned previously also apply to OpenACC code. In addition, the table below lists some options specific to OpenACC code. Among them, -Minfo=accel is highly recommended at compile time: its compiler feedback helps check whether the device code was generated correctly, whether data are placed correctly, and whether the device code is reasonably optimized.

Option                          Description
-acc                            Enable OpenACC pragmas and directives to explicitly parallelize regions of code for execution by accelerator devices.
-ta=nvidia,cc20                 Compile the accelerator regions for NVIDIA GPUs with compute capability 2.0. This is the supported value on Eos.
-Minfo=accel                    Emit information about accelerator region targeting.
-Minline[=option[,option,...]]  Needed when a routine is referenced inside an accelerator region; the routine must be inlined.

Examples

    pgfortran -acc -o daxpy.exe daxpy.f90
    pgfortran -acc -o daxpy.exe -ta=nvidia,cc20 -v daxpy.f90
    pgfortran -acc -Minfo -o daxpy.exe daxpy.f90
    pgcc      -acc -Minfo=accel -fast -O3 -o daxpy.exe daxpy.c
    pgCC      -acc -mp -Minfo -fast -O3 -o acc_mp_prog.exe acc_mp_prog.cpp
    pgfortran -acc -o acc_prog.exe -Minline=sub1 acc_prog.f90
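
The -Minline requirement is easiest to see with a loop that calls a small helper. The sketch below is illustrative only: the names axpy_term and daxpy_with_call are hypothetical and do not come from the sample programs. Because axpy_term is referenced inside the accelerator region, the compiler has to inline it, which is what an option of the form -Minline=axpy_term would request, following the pattern of -Minline=sub1 above.

```c
#include <assert.h>

/* Small helper referenced from inside the accelerator region below.
   It must be inlined for the region to compile for the device. */
static double axpy_term(double a, double x, double y)
{
    return y + a * x;
}

/* y = y + a*x, with the arithmetic delegated to axpy_term. */
void daxpy_with_call(int n, double a, const double *x, double *y)
{
    int i;

    assert(n >= 0);
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    {
        /* Assumption: x and y do not overlap, so iterations are independent. */
        #pragma acc loop independent
        for (i = 0; i < n; i++)
            y[i] = axpy_term(a, x[i], y[i]);
    }
}
```

Adding -Minfo=accel to the same compile line is a reasonable way to check in the compiler feedback that the region was actually generated for the device.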

Environment Variables

Environment Variable  Description
PGI_ACC_NOTIFY        If set to 1, the runtime prints a line of output each time a kernel is launched on the GPU.
                      If set to 2, the runtime prints a line of output for each data transfer.
                      If set to 3, the runtime prints a line of output for each kernel launch and each data transfer.
PGI_ACC_TIME          If set to 1, the runtime summarizes the time taken for data movement between the host and the GPU, and for computation on the GPU.
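
These variables are easier to interpret on a program with more than one accelerator region. The following sketch uses a hypothetical function, scale_then_shift, with two separate kernels regions. Each call should therefore give the runtime two kernel launches and a copy in and out of v for each region to report. The exact text the runtime prints is not reproduced here, since it depends on the compiler and runtime version.

```c
#include <assert.h>

/* Applies v = a*v and then v = v + b in two separate accelerator regions.
   Each region copies v to the device and back, so PGI_ACC_NOTIFY=2 or 3
   has transfers to report for both regions. */
void scale_then_shift(int n, double a, double b, double *v)
{
    int i;

    assert(n >= 0);

    /* Region 1: scale in place. */
    #pragma acc kernels copy(v[0:n])
    {
        #pragma acc loop independent
        for (i = 0; i < n; i++)
            v[i] = a * v[i];
    }

    /* Region 2: shift in place. */
    #pragma acc kernels copy(v[0:n])
    {
        #pragma acc loop independent
        for (i = 0; i < n; i++)
            v[i] = v[i] + b;
    }
}
```

Setting PGI_ACC_NOTIFY to 3 or PGI_ACC_TIME to 1 in the shell before running an executable built from such code with -acc is how the reports described in the table are enabled. Repeated transfers of the same array between regions, as in this sketch, are the kind of cost that the PGI_ACC_TIME summary makes visible.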