Vector Update: 2D Form
With the DAXPY operation in the inner loop, we should consider further optimizations such as outer-loop unrolling and blocking.
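The textbook case for outer-loop unrolling is a loop nest whose inner loop is a DAXPY into the same vector, such as a column-oriented matrix-vector product y = y + A*x. The C sketch below is a hypothetical illustration, not the code discussed on this slide. The names matvec_daxpy and matvec_daxpy_unroll2 are made up, arrays use Fortran-style column-major layout with 0-based indices, and the unroll factor of 2 is arbitrary. Unrolling the outer j loop and jamming the two copies into one inner loop means each y[i] is loaded and stored once per two columns instead of once per column.

```c
#include <stddef.h>

/* Column-major element (i, j) of an n-row array, mirroring Fortran a(i,j)
   with 0-based indices. Uses the local names `a` and `n`. */
#define A(i, j) a[(size_t)(j) * (size_t)n + (size_t)(i)]

/* Reference: y = y + A*x written as one DAXPY per column j. */
void matvec_daxpy(int n, int m, const double *a, const double *x, double *y) {
    for (int j = 0; j < m; ++j)
        for (int i = 0; i < n; ++i)
            y[i] += A(i, j) * x[j];
}

/* Outer (j) loop unrolled by 2 and jammed: each y[i] is loaded and
   stored once per pair of columns. */
void matvec_daxpy_unroll2(int n, int m, const double *a, const double *x,
                          double *y) {
    int j = 0;
    for (; j + 1 < m; j += 2) {
        const double x0 = x[j], x1 = x[j + 1];
        for (int i = 0; i < n; ++i)
            y[i] += A(i, j) * x0 + A(i, j + 1) * x1;
    }
    for (; j < m; ++j)               /* cleanup column when m is odd */
        for (int i = 0; i < n; ++i)
            y[i] += A(i, j) * x[j];
}
```

Larger unroll factors (4 is common) cut the traffic on y further at the cost of more registers; blocking the i loop keeps the active piece of y in cache when n is large. Note that the jammed version changes the order of floating-point additions, so results can differ from the reference in the last bits.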
- Hand tuning was necessary.
- The compiler would not perform loop interchange, because in the original code the loops are not perfectly nested.
- With the DAXPY formulation, we can consider a 2-dimensional implementation of that code:
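real*8 dp(ni,nj), p(ni,nib)
! 1-D index computed as in the original code
id1 = nma1 + (i-1)*nru + (l-1)*nra*nrub
! 2-D DAXPY-style update of element (ii,jj), scaled by dp_temp
dp(ii,jj) = dp(ii,jj) + p(ii,jj)*dp_temp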
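A C rendering of the 2-D form may make the loop structure easier to see. This is a sketch under assumptions: dp_temp is treated as a single scalar for the whole update, the function name update_2d is hypothetical, p is assumed to have at least nj columns (nib >= nj), indices are 0-based, and storage is column-major as in Fortran, so the inner ii loop walks memory with stride 1. Because the two loops are now perfectly nested, the outer jj loop becomes a candidate for unrolling or blocking.

```c
#include <stddef.h>

/* Hypothetical C version of
       dp(ii,jj) = dp(ii,jj) + p(ii,jj)*dp_temp
   Both arrays have leading dimension ni (column-major), so element
   (ii,jj) lives at [ii + jj*ni]. Assumes dp_temp is one scalar. */
void update_2d(int ni, int nj, double dp_temp, const double *p, double *dp) {
    for (int jj = 0; jj < nj; ++jj)          /* outer loop over columns */
        for (int ii = 0; ii < ni; ++ii)      /* inner stride-1 DAXPY */
            dp[(size_t)jj * ni + ii] += p[(size_t)jj * ni + ii] * dp_temp;
}
```

Keeping ii innermost matches Fortran's column-major order; swapping the loops would make the inner loop stride ni and usually hurts cache behavior.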