
pbdR

The "Programming with Big Data in R" (pbdR) project enables high level distributed data parallelism in R. pbdR can utilize thousands of cores to speedup R programs and/or handle datasets that are too large for a single node. pbdR contains a number of packages, for eos users, the following are of most use : Some of the other packages installed that will not be discussed here are:

pbdMPI

The pbdMPI package provides MPI functionality to R programs. The function calls in pbdMPI closely mirror their MPI counterparts, so prior experience with MPI programming, while not required, is a big advantage.

Below is the canonical "hello world" program using pbdMPI.

   # load the pbdMPI library
   library(pbdMPI, quiet = TRUE)

   # set up the environment on all processes
   init()

   # every rank prints the string
   comm.print("Hello world", all.rank = TRUE, quiet = TRUE)

   # clean up the environment
   finalize()
This will print out the string "Hello world" N times (where N is the number of MPI tasks).

NOTE: the operations init() and finalize() are required by pbdMPI
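By default comm.print() prints only on rank 0; the all.rank = TRUE argument above makes every rank print. As a minimal sketch (the file name and message are only illustrative), each task can also report its own rank and the total number of tasks with comm.rank() and comm.size():

   # helloranks.r -- sketch: per-rank output
   library(pbdMPI, quiet = TRUE)
   init()

   # comm.rank() gives this task's rank, comm.size() the number of tasks
   msg <- paste("Hello from rank", comm.rank(), "of", comm.size())
   comm.print(msg, all.rank = TRUE, quiet = TRUE)

   finalize()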

To execute the program, use the following syntax (assuming 2 MPI tasks):

   mpirun -np 2 Rscript helloworld.r
NOTE: the Rscript command is required to run pbdR scripts

Common pbdMPI operations

The most commonly used pbdMPI operations either communicate data among tasks (e.g. bcast(), reduce(), allreduce(), gather(), scatter()) or enforce synchronization (barrier()). The two examples that follow demonstrate reduction and broadcast; first, a short sketch of gather() and barrier() is shown below.
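In this sketch (variable names are illustrative), each rank contributes its rank number, rank 0 collects the values with gather(), and barrier() forces all ranks to synchronize before finishing:

   library(pbdMPI, quiet = TRUE)
   init()

   # each rank contributes its own rank number; gather() collects them on rank 0
   vals <- gather(comm.rank())

   # comm.print() prints on rank 0 by default, where the gathered list lives
   comm.print(vals)

   # barrier() blocks until every rank reaches this point
   barrier()

   finalize()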

Example 1: Computing sum of variable in every process

In this example every process creates a random number. The allreduce() reduction operator is then used to compute the sum of these numbers, and every process prints the result.
   # load the pbdMPI library
   library(pbdMPI, quiet = TRUE)

   # set up the environment on all processes
   init()

   # ensure every process uses a different random seed
   comm.set.seed(diff=TRUE)

   # create random number, assign to n
   n <- sample(1:10, size=1)

   # call the reduction operator
   sm <- allreduce(n, op='sum')

   # every process prints out the sum
   comm.print(sm, all.rank=TRUE)

   # clean up the env
   finalize()
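allreduce() returns the result on every rank. When only one rank needs the result, reduce() can be used instead; below is a minimal sketch (the choice of op="max" is only illustrative):

   # load the pbdMPI library
   library(pbdMPI, quiet = TRUE)
   init()

   comm.set.seed(diff=TRUE)
   n <- sample(1:10, size=1)

   # compute the maximum across all ranks; only rank 0 (the default
   # destination) receives the result
   mx <- reduce(n, op="max")

   # comm.print() prints on rank 0 by default
   comm.print(mx)

   finalize()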

Example 2: Broadcast a matrix from one process to all other processes

In this example process 0 creates a matrix and broadcasts it to all other processes. Process 1 then prints the broadcast matrix.
   # load the pbdMPI library
   library(pbdMPI, quiet = TRUE)

   # set up the environment on all processes
   init()

   # process 0 creates a matrix
   if (comm.rank()==0) {
     x <- matrix(1:4, nrow=2)
   } else {
     x <- NULL
   }

   # process 0 broadcasts the matrix to the other processes
   y <- bcast(x,rank.source=0)

   # process 1 prints the matrix
   comm.print(y, rank.print=1)

   # clean up the environment
   finalize()
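The opposite pattern, sending different pieces of data from one rank to the others, is handled by scatter(). Below is a minimal sketch, assuming the data can be arranged as a list with one element per rank:

   # load the pbdMPI library
   library(pbdMPI, quiet = TRUE)
   init()

   # every rank builds the same list of chunks (one element per rank);
   # only the copy on the source rank (0 by default) is actually distributed
   chunks <- split(1:10, rep(1:comm.size(), length.out=10))

   # each rank receives its own element of the list
   my.chunk <- scatter(chunks, rank.source=0)

   # every rank prints its chunk
   comm.print(my.chunk, all.rank=TRUE)

   finalize()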

Additional information

This is a very basic introduction to pbdMPI. For more detailed information, see the pbdMPI package documentation and vignettes and the pbdR project website.

pbdDMAT

The pbdDMAT package provides functionality to create and operate on distributed matrices. Matrices that are too large to store on a single node can be distributed among a (large) number of nodes. pbdDMAT has over 100 methods with syntax identical to the corresponding R operators, so for many programs the only change needed may be to create a distributed matrix instead of a regular one. The pbdDMAT package is built on top of the pbdMPI package described in the previous section.
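As a minimal sketch of this "identical syntax" idea (the sizes are illustrative, and the use of the ddmatrix() character method with "rnorm" for random data is an assumption; see the ddmatrix() help for supported generators), the usual t(), %*%, arithmetic, and sum() calls work on distributed matrices just as they do on regular ones:

   # load the pbdDMAT library
   library(pbdDMAT, quiet=TRUE)
   init.grid()

   # two random distributed matrices
   x.dmat <- ddmatrix("rnorm", nrow=500, ncol=500)
   y.dmat <- ddmatrix("rnorm", nrow=500, ncol=500)

   # ordinary matrix syntax, applied to distributed matrices
   z.dmat <- t(x.dmat) %*% y.dmat + 2 * x.dmat

   # sum() reduces to a single number; print it on rank 0
   comm.print(sum(z.dmat))

   finalize()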

Example 1: Create distributed matrices

This example creates a 100x100 distributed matrix of zeros with ddmatrix() and a 100x100 distributed identity matrix with the diag() function.
   # load the pbdDMAT library
   library(pbdDMAT, quiet=TRUE)

   # initialize the process grid (i.e. set up the env and the data distribution)
   init.grid()

   # create DISTRIBUTED matrix of 100x100, assign to zero.dmat
   zero.dmat <- ddmatrix(0, nrow=100, ncol=100)

   # create a 100x100 distributed identity matrix with diag()
   id.dmat <- diag(1, nrow=100, ncol=100, type="ddmatrix")

   # clean up the env
   finalize()

NOTE: the operations init.grid() and finalize() are required by pbdDMAT
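diag() can also be used in the other direction, to pull the diagonal out of an existing distributed matrix. Below is a minimal sketch of that use (treat the exact return type as an assumption; see the pbdDMAT documentation for details):

   # load the pbdDMAT library
   library(pbdDMAT, quiet=TRUE)
   init.grid()

   # a 100x100 distributed identity matrix
   id.dmat <- diag(1, nrow=100, ncol=100, type="ddmatrix")

   # diag() on a ddmatrix extracts its diagonal as an ordinary vector
   d <- diag(id.dmat)

   # print the first few diagonal entries on rank 0
   comm.print(head(d))

   finalize()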

Example 2: Create a distributed matrix and extract a submatrix

This example creates a distributed matrix, extracts a submatrix from it (which is also distributed), and then creates a non-distributed copy of the result.
   # load the pbdDMAT library
   library(pbdDMAT, quiet=TRUE)

   # initialize the process grid (i.e. set up the env and the data distribution)
   init.grid()

   # create a 10x3 distributed matrix from the values 1:30
   x.dmat <- ddmatrix(1:30, nrow=10)

   # extract rows 1, 3, 5, 7, 9 and drop column 3; the result is still distributed
   y.dmat <- x.dmat[c(1, 3, 5, 7, 9), -3]

   # create a non-distributed version of matrix
   y <- as.matrix(y.dmat)

   # print results
   comm.print(y)

   # clean up the env
   finalize()
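The reverse conversion is also possible: an ordinary matrix held on one rank can be distributed with as.ddmatrix(). Below is a minimal sketch, assuming the matrix fits in a single node's memory before it is distributed:

   # load the pbdDMAT library
   library(pbdDMAT, quiet=TRUE)
   init.grid()

   # rank 0 holds an ordinary R matrix; the other ranks pass NULL
   if (comm.rank()==0) {
     m <- matrix(1:30, nrow=10)
   } else {
     m <- NULL
   }

   # distribute the matrix across the process grid
   m.dmat <- as.ddmatrix(m)

   # operate on the distributed copy, e.g. form the 3x3 cross product
   xtx.dmat <- t(m.dmat) %*% m.dmat

   # bring the small result back as a regular matrix and print it on rank 0
   comm.print(as.matrix(xtx.dmat))

   finalize()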

Additional information

This is a very basic introduction to pbdDMAT. For more detailed information, see the pbdDMAT package documentation and vignettes and the pbdR project website.