Outer Loop Unrolling
- the unroll factor should match the number of array elements in a cache line
- mostly a 1st level cache optimization
- if the data fits into the 2nd level cache, this is a good optimization to use
A(I) is constant for the inner loop J
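C(J) is traversed in full on each I iteration
B(I,J) is traversed poorly (the inner loop J strides through memory, so each access touches a different cache line)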
Unrolling the outer loop will load the complete cache line of B into the registers
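The loop these notes describe is not shown, so the following is only a sketch under assumptions: it takes the loop to be a matrix-vector update A(I) = A(I) + B(I,J)*C(J), written in C with B stored column-major (as in Fortran) so that B(I,J) and B(I+1,J) are adjacent in memory. The names matvec_plain, matvec_unrolled and B_AT are hypothetical, and the unroll factor of 4 is chosen for brevity; on a machine with 64-byte cache lines, a factor of 8 would cover a full line of doubles.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: column-major, as in Fortran, so B(i,j) and B(i+1,j)
   sit next to each other in memory. */
#define B_AT(b, n, i, j) ((b)[(size_t)(j) * (n) + (i)])

/* Baseline: the inner loop over j steps through B with stride n,
   so each access lands on a different cache line. */
void matvec_plain(size_t n, double *a, const double *b, const double *c)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            a[i] += B_AT(b, n, i, j) * c[j];
}

/* Outer loop unrolled by 4: each j iteration now consumes four
   adjacent elements of B, and c[j] is loaded once for four updates. */
void matvec_unrolled(size_t n, double *a, const double *b, const double *c)
{
    assert(a && b && c);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        /* A(I)..A(I+3) are constant over the inner loop: keep them in registers. */
        double a0 = a[i], a1 = a[i + 1], a2 = a[i + 2], a3 = a[i + 3];
        for (size_t j = 0; j < n; j++) {
            const double cj = c[j];
            a0 += B_AT(b, n, i,     j) * cj;
            a1 += B_AT(b, n, i + 1, j) * cj;
            a2 += B_AT(b, n, i + 2, j) * cj;
            a3 += B_AT(b, n, i + 3, j) * cj;
        }
        a[i] = a0; a[i + 1] = a1; a[i + 2] = a2; a[i + 3] = a3;
    }
    /* Remainder rows when n is not a multiple of 4. */
    for (; i < n; i++)
        for (size_t j = 0; j < n; j++)
            a[i] += B_AT(b, n, i, j) * c[j];
}
```

Both functions accumulate each A(I) in the same order over J, so they should produce the same result; whether the unrolled version is actually faster depends on the compiler, the cache line size, and the problem size.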