# pbdR

The "Programming with Big Data in R" (pbdR) project enables high level distributed data parallelism in R. pbdR can utilize thousands of cores to speedup R programs and/or handle datasets that are too large for a single node. pbdR contains a number of packages, for eos users, the following are of most use : Some of the other packages installed that will not be discussed here are:## pbdMPI

The pbdMPI package provides MPI functionality to R programs. The subroutine/function calls in pbdMPI are very similar to the ones in MPI. Experience in programming with MPI is not required to use pbdMPI, but it is a big advantage. Below is the canonical "hello world" program using pbdMPI.

```r
# load the pbdMPI library
library(pbdMPI, quiet = TRUE)

# set up the correct env on all processes
init()

comm.print("Hello world", all.rank = TRUE, quiet = TRUE)

# clean up the env
finalize()
```

This will print out the string "Hello world" *N* times (where *N* is the number of MPI tasks).

**NOTE**: the operations *init()* and *finalize()* are required by pbdMPI.

To execute the program, use the following syntax (assuming 2 MPI tasks):

```bash
mpirun -np 2 Rscript helloworld.r
```

**NOTE**: the Rscript command is required to run pbdR scripts

### Common pbdMPI operations

Below is a list of the most common operations. These are used to communicate data among tasks and to enforce synchronization.

- **Gather**: create a new object on some process containing a copy of a variable from all processes.
  - *gather(x)* --> gather to one
  - *allgather(x)* --> gather to all
- **Broadcast**: one process has a variable with a certain value that every other process needs.
  - *bcast(x)*
- **Barrier**: a process will wait at the barrier until all other processes have reached this barrier.
  - *barrier()*
- **Reduction**: apply a reduction operator (e.g. sum) to a copy of a certain variable on every process.
  - *reduce(x, op='sum')* --> reduce to one
  - *allreduce(x, op='sum')* --> reduce to all
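As a minimal sketch of the gather and barrier operations listed above (assuming an MPI environment and that the script is launched with mpirun as shown earlier), each rank contributes its rank number and rank 0 collects them:

```r
# load the pbdMPI library
library(pbdMPI, quiet = TRUE)
init()

# every rank contributes its own rank number
r <- comm.rank()

# gather to one: rank 0 receives a list with one entry per rank
g <- gather(r)
comm.print(g, rank.print = 0)

# gather to all: every rank receives the full list
ga <- allgather(r)

# wait until all ranks have reached this point
barrier()

finalize()
```

Run it the same way as the hello world program, e.g. `mpirun -np 2 Rscript gather.r`.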

#### Example 1: Computing sum of variable in every process

In this example every process will create a random number. The reduce operator will be used to compute the sum of these numbers.

```r
# load the pbdMPI library
library(pbdMPI, quiet = TRUE)

# set up the correct env on all processes
init()

# make sure all processes use a different random seed
comm.set.seed(diff = TRUE)

# create a random number, assign to n
n <- sample(1:10, size = 1)

# call the reduction operator
sm <- allreduce(n, op = 'sum')

# every process prints out the sum
comm.print(sm, all.rank = TRUE)

# clean up the env
finalize()
```
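For contrast with *allreduce* above, a minimal sketch (assuming the same MPI setup) using *reduce*, where only rank 0 receives the sum:

```r
library(pbdMPI, quiet = TRUE)
init()

comm.set.seed(diff = TRUE)
n <- sample(1:10, size = 1)

# reduce to one: only rank 0 holds the result
sm <- reduce(n, op = 'sum')

# print from rank 0, which owns the reduced value
comm.print(sm, rank.print = 0)

finalize()
```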

#### Example 2: broadcast a matrix from one process to all other processes

In this example process 0 will create a matrix, then broadcast it to all other processes. Process 1 will print the broadcast matrix.

```r
# load the pbdMPI library
library(pbdMPI, quiet = TRUE)

# set up the correct env on all processes
init()

# process 0 creates a matrix
if (comm.rank() == 0) {
    x <- matrix(1:4, nrow = 2)
} else {
    x <- NULL
}

# process 0 broadcasts the matrix to the other processes
y <- bcast(x, rank.source = 0)

# print out the matrix from process 1
comm.print(y, rank.print = 1)

# clean up the env
finalize()
```

#### Additional information

This is a very basic intro to pbdMPI. For more detailed information please check out the pbdMPI package documentation and vignettes.

## pbdDMAT

The pbdDMAT package provides functionality to create and operate on distributed matrices. Matrices that are too large to store on a single node can be distributed among a (large) number of nodes. pbdDMAT has over 100 methods with identical syntax to R operators, so for many programs the only change needed might be to create a distributed matrix instead of a regular matrix. The pbdDMAT package is built upon the pbdMPI package described in the previous section.

#### Example 1: Create a distributed matrix and get the diagonal

This example creates a distributed matrix and uses the diag function to create a distributed identity matrix.

```r
# load the pbdDMAT library
library(pbdDMAT, quiet = TRUE)

# initialize the grid (i.e. set up the env and the distribution of data)
init.grid()

# create a DISTRIBUTED 100x100 matrix of zeros, assign to zero.dmat
zero.dmat <- ddmatrix(0, nrow = 100, ncol = 100)

# create a DISTRIBUTED 100x100 identity matrix using diag
id.dmat <- diag(1, nrow = 100, ncol = 100, type = "ddmatrix")

# clean up the env
finalize()
```

**NOTE**: the operations *init.grid()* and *finalize()* are required by pbdDMAT.
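Complementing the example above, diag can also extract the diagonal from an existing distributed matrix. A minimal sketch, assuming the same grid setup:

```r
library(pbdDMAT, quiet = TRUE)
init.grid()

# a distributed 4x4 matrix with known values
x.dmat <- ddmatrix(1:16, nrow = 4)

# extract the diagonal of the distributed matrix
d <- diag(x.dmat)

# print the extracted diagonal
comm.print(d)

finalize()
```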

#### Example 2: Create a distributed matrix, extract a submatrix

This example creates a distributed matrix. Next it will extract a submatrix (which is also distributed), and then a non-distributed copy will be created.

```r
# load the pbdDMAT library
library(pbdDMAT, quiet = TRUE)

# initialize the grid (i.e. set up the env and the distribution of data)
init.grid()

# create a distributed matrix
x.dmat <- ddmatrix(1:30, nrow = 10)

# perform an extract operation on the distributed matrix
y.dmat <- x.dmat[c(1, 3, 5, 7, 9), -3]

# create a non-distributed version of the matrix
y <- as.matrix(y.dmat)

# print the results
comm.print(y)

# clean up the env
finalize()
```
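To illustrate the "identical syntax" point made earlier, a minimal sketch (assuming the same grid setup) that applies the usual R operators to distributed matrices exactly as one would to regular matrices:

```r
library(pbdDMAT, quiet = TRUE)
init.grid()

# two distributed matrices: 2x3 and 3x2
a.dmat <- ddmatrix(1:6, nrow = 2)
b.dmat <- ddmatrix(1:6, nrow = 3)

# the usual R operators work on ddmatrix objects
c.dmat <- a.dmat %*% b.dmat   # distributed matrix multiply
t.dmat <- t(c.dmat)           # distributed transpose
s <- sum(c.dmat)              # reduction to a regular scalar

comm.print(s)

finalize()
```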