Compiling and Running CUDA Programs


Compiling CUDA C/C++ with nvcc

CUDA C is the C interface to the CUDA parallel computing platform. It consists of a minimal set of extensions to the C programming language that allow users to program the GPU directly in a high-level language, together with a runtime library of C functions that execute on the host to allocate and deallocate device memory, transfer data between host memory and device memory, manage systems with multiple devices, and so on. The runtime is built on top of a lower-level C API, the CUDA driver API, which is also accessible by the application.
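
As a small illustration of the runtime library, the sketch below uses the documented runtime functions cudaGetDeviceCount, cudaGetDeviceProperties, and cudaSetDevice to enumerate the visible GPUs and select one. This is an illustrative fragment, separate from the daxpy example that follows.

        #include <stdio.h>
        #include <cuda_runtime.h>

        int main(void){
            int count, i;
            struct cudaDeviceProp prop;
            // ask the runtime how many CUDA-capable devices are visible
            cudaGetDeviceCount(&count);
            for (i = 0; i < count; i++){
                // query the properties (name, compute capability) of device i
                cudaGetDeviceProperties(&prop, i);
                printf("device %d: %s (compute capability %d.%d)\n",
                       i, prop.name, prop.major, prop.minor);
            }
            // make device 0 current for subsequent runtime calls
            cudaSetDevice(0);
            return 0;
        }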

CUDA C++ supports only a subset of C++ for device code, as described in the CUDA C Programming Guide.

Sample CUDA C/C++ Source Code: daxpy.cu

daxpy.cu is a sample CUDA program that illustrates the typical five-step pattern we discussed earlier: allocate device memory, copy input data from host to device, launch the kernel, copy results back from device to host, and deallocate device memory. In the example, the kernel declaration and launch use keywords from the CUDA extensions, while the memory-management calls come from the runtime library.

        #include <stdlib.h>
        #include <time.h>
        #define N 1024

        // declare the kernel
        __global__ void daxpy(int n, double a, double *x, double *y){
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n){
                y[i] += a*x[i];
            }
        }

        int main(void){
            double *x, *y, a, *dx, *dy;
            int i;
            x = (double *)malloc(sizeof(double)*N);
            y = (double *)malloc(sizeof(double)*N);
            // initialize the scalar a and the vectors x and y
            a = 2.0;
            srand(time(NULL));
            for (i = 0; i < N; i++){
                x[i] = (double)rand() / RAND_MAX;
                y[i] = (double)rand() / RAND_MAX;
            }
            // allocate device memory for x and y
            cudaMalloc((void **)&dx, N*sizeof(double));
            cudaMalloc((void **)&dy, N*sizeof(double));
            // copy host memory to device memory
            cudaMemcpy(dx, x, N*sizeof(double), cudaMemcpyHostToDevice);
            cudaMemcpy(dy, y, N*sizeof(double), cudaMemcpyHostToDevice);
            // launch the kernel: N/64 blocks of 64 threads each (N threads total)
            daxpy<<<N/64,64>>>(N, a, dx, dy);
            // copy device memory to host memory
            cudaMemcpy(y, dy, N*sizeof(double), cudaMemcpyDeviceToHost);
            // deallocate device memory
            cudaFree(dx);
            cudaFree(dy);
            free(x);
            free(y);
            return 0;
        }
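
The sample above omits error checking for brevity. In practice, every runtime call returns a cudaError_t, and kernel launch errors can be retrieved with cudaGetLastError. Below is a minimal sketch of one common checking idiom; the CHECK macro name is our own, not part of CUDA.

        #include <stdio.h>
        #include <stdlib.h>
        #include <cuda_runtime.h>

        // abort with a readable message if a runtime call fails;
        // the macro name CHECK is illustrative, not part of CUDA
        #define CHECK(call)                                               \
            do {                                                          \
                cudaError_t err = (call);                                 \
                if (err != cudaSuccess){                                  \
                    fprintf(stderr, "CUDA error at %s:%d: %s\n",          \
                            __FILE__, __LINE__, cudaGetErrorString(err)); \
                    exit(EXIT_FAILURE);                                   \
                }                                                         \
            } while (0)

        // usage:
        //     CHECK(cudaMalloc((void **)&dx, N*sizeof(double)));
        //     daxpy<<<N/64,64>>>(N, a, dx, dy);
        //     CHECK(cudaGetLastError());        // catch launch errors
        //     CHECK(cudaDeviceSynchronize());   // catch asynchronous errors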

CUDA source files, i.e., programs that use the CUDA C/C++ extensions and the runtime library, must have a .cu suffix and must be compiled with nvcc, a compiler driver provided in the NVIDIA CUDA Toolkit. nvcc hides the intricate details of CUDA compilation from users: under the hood, host code is compiled with gcc, while device functions are compiled with NVIDIA's proprietary compilers and assemblers.
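
One way to observe the device-side stage directly is to stop compilation after device code generation: the nvcc option -ptx (--ptx), documented in the nvcc manual, writes the intermediate PTX to a text file (daxpy.ptx for the sample above) that can then be inspected:

    nvcc -ptx daxpy.cu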

Before using nvcc, the module file for CUDA toolkit must be loaded:

    module load cuda/toolkit

The command line form for invoking nvcc is:

    nvcc [options] -o cuda_prog.exe file1 file2 ...

where file1, file2, ... are any appropriate source, assembly, object, object library, or other (linkable) files that are linked to generate the executable file cuda_prog.exe.
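
For example, a program split across two CUDA source files and a precompiled object file might be built as follows (the file names are illustrative):

    nvcc -o cuda_prog.exe main.cu kernels.cu util.o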

Common Options

The list below shows a very limited subset of the available options. For further details, see the nvcc documentation in the references.

-h, --help
    Print help information for this tool.

-v, --verbose
    List the compilation commands generated by this compiler driver, but do not suppress their execution.

-arch gpuarch, --gpu-architecture gpuarch
    Specify the name of the NVIDIA GPU architecture to compile for. This can be either a real GPU architecture or a virtual PTX architecture. PTX code is an intermediate format that can be further compiled and optimized, depending on the PTX version, for a specific class of actual GPUs.

    The architecture specified by this option is the architecture assumed by the compilation chain up to the PTX stage, while the architecture(s) specified with the -code option are assumed by the last, potentially runtime, compilation stage.

    Currently supported compilation architectures are: virtual architectures compute_10, compute_11, compute_12, compute_13, compute_20, compute_30, and compute_35; and GPU architectures sm_10, sm_11, sm_12, sm_13, sm_20, sm_21, sm_30, and sm_35.

    Note:
    1. Eos GPU nodes support virtual architectures up to compute_20 and GPU architectures up to sm_20.
    2. The default virtual architecture is compute_10 and the default GPU architecture is sm_10. Double precision requires compute_13/sm_13 or above.

-code gpuarch, --gpu-code gpuarch
    Specify the name of the NVIDIA GPU to generate code for.

    nvcc embeds a compiled code image in the executable for each specified code architecture: a true binary load image for each real architecture, and PTX code for each virtual architecture.

    At runtime, embedded PTX code is dynamically compiled by the CUDA runtime system if no binary load image is found for the current GPU.

    Architectures specified for the -arch and -code options may be virtual as well as real, but the code architectures must be compatible with the arch architecture. When the -code option is used, the value of the -arch option must be a virtual PTX architecture. For instance, arch=compute_13 is not compatible with code=sm_10, because the earlier compilation stages would assume the availability of compute_13 features that are not present on sm_10.

    This option defaults to the value of the -arch option. Currently supported GPU architectures: sm_10, sm_11, sm_12, sm_13, sm_20, sm_21, sm_30, and sm_35.

    Note:
    1. Eos GPU nodes support GPU architectures up to sm_20.
    2. The default GPU architecture is sm_10. Double precision requires sm_13 or above.

-Xcompiler options,..., --compiler-options options,...
    Pass the specified options directly to the host compiler/preprocessor.

    Note: Compiler optimization for device code is turned on automatically. The optimization level for host code is determined by the host compiler. Since the default optimization level of gcc is -O0 (do not optimize), host code can be optimized by passing an optimization flag through nvcc, e.g., -Xcompiler "-O3".

-Xptxas -v
    Print a summary of the number of registers used and the amount of memory needed per compiled device function. (-Xptxas passes the given options, here -v, directly to the PTX assembler.)

Examples

    nvcc -o daxpy.exe daxpy.cu 
    nvcc -arch=compute_20 -code=sm_20 -o daxpy.exe daxpy.cu 
    nvcc -arch=sm_20 -o daxpy.exe daxpy.cu 
    nvcc -arch=sm_20 -Xptxas -v -o daxpy.exe daxpy.cu 
    nvcc -arch=sm_20 -o daxpy.exe -Xcompiler "-O3" daxpy.cu
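
To build one executable that runs across GPU generations, the -gencode (--generate-code) option can be repeated to embed several code images. For instance, the following embeds an sm_20 binary plus compute_20 PTX that the runtime can compile just in time for newer GPUs:

    nvcc -gencode arch=compute_20,code=sm_20 -gencode arch=compute_20,code=compute_20 -o daxpy.exe daxpy.cu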