The experiments were conducted on a Pentium III 730 MHz desktop system with 256
KB of cache. The comments in matmul.c
suggested anomalous results
around the powers of two, and this was immediately evident when I compared the
basic (unblocked) multiply on my workstation to the sample timing data that was
provided in the assignment tarball:
(Timings for this experiment.)

I did not spend time optimizing the basic code since the blocked code was the focus of this assignment. First, I tried a basic comparison of the blocked code to the unblocked code on my machine, to find the optimum block size:
Timings made for this experiment (not all shown above):
Since the best overall timings were seen with a block size of 64, and larger block sizes did not bring a significant improvement over it, 64 is used as the initial block size for the next phase of the investigation.
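For reference, the two routines being compared have roughly the following shape. This is a sketch in the spirit of the provided matmul.c, not a copy of it; I assume square matrices stored column-major with leading dimension lda, so element (i, j) of X is X[i + j*lda]:

    /* Sketch of the two routines being compared (not the exact code from
     * the tarball).  Matrices are n-by-n, column-major, lda = n. */
    #define BLOCK_SIZE 64
    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    /* Unblocked multiply: C += A * B with a plain triple loop. */
    void basic_multiply(int lda, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < lda; i++)
            for (int j = 0; j < lda; j++) {
                double cij = C[i + j*lda];
                for (int k = 0; k < lda; k++)
                    cij += A[i + k*lda] * B[k + j*lda];
                C[i + j*lda] = cij;
            }
    }

    /* Blocked multiply: the same update applied one BLOCK_SIZE sub-block
     * at a time, so each block update works on a cache-sized piece of
     * A, B, and C. */
    void blocked_multiply(int lda, const double *A, const double *B, double *C)
    {
        for (int j0 = 0; j0 < lda; j0 += BLOCK_SIZE)
            for (int k0 = 0; k0 < lda; k0 += BLOCK_SIZE)
                for (int i0 = 0; i0 < lda; i0 += BLOCK_SIZE) {
                    int N = MIN(BLOCK_SIZE, lda - j0);
                    int K = MIN(BLOCK_SIZE, lda - k0);
                    int M = MIN(BLOCK_SIZE, lda - i0);
                    for (int i = i0; i < i0 + M; i++)
                        for (int j = j0; j < j0 + N; j++) {
                            double cij = C[i + j*lda];
                            for (int k = k0; k < k0 + K; k++)
                                cij += A[i + k*lda] * B[k + j*lda];
                            C[i + j*lda] = cij;
                        }
                }
    }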
The matrices are stored in column-major format; elements of a column are
physically contiguous in memory. This means that multiple elements of a column
can be accessed from a single cache line, while each element of a row must
(generally) be loaded from a separate cache line. If we traverse each block in
column order, cache misses are minimized. It is a bit tricky to rearrange
basic_dgemm
so that each block is traversed in column order. For
each column of the C-block, the corresponding column of the B-block must be
loaded, and every column of the A-block must be loaded. The following algorithm
sketches the idea more explicitly:
Inputs: row and col indices of the upper-left corner of the current C-block.

    for j = col to (col + blockWidth)
        cptr = C + j * lda + row
        bptr = B + j * lda
        for k = col to (col + blockWidth)
            aptr = A + k * lda + row
            for i = 0 to (blockHeight)
                cptr[i] += aptr[i] * bptr[k]
Here, dereferencing of the ptr
variables causes cache lines
corresponding to the columns to be loaded from main memory. It might be
necessary to adjust the blocking factor again once this optimization is made.
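In C, one block update along these lines looks roughly as follows. This is an illustrative sketch, not the exact code in matmul.c; I give the K dimension its own offset kk and extent blockDepth, which the pseudocode above folds into col and blockWidth:

    /* One block update in column order (illustrative names).  All matrices
     * are column-major with leading dimension lda.  (row, col) is the
     * upper-left corner of the C-block; kk is the first column of the
     * A-block (equivalently the first row of the B-block). */
    void column_order_block(int lda, int row, int col, int kk,
                            int blockHeight, int blockWidth, int blockDepth,
                            const double *A, const double *B, double *C)
    {
        for (int j = col; j < col + blockWidth; j++) {
            double *cptr = C + j*lda + row;        /* column j of the C-block */
            const double *bcol = B + j*lda + kk;   /* column j of the B-block */
            for (int k = 0; k < blockDepth; k++) {
                /* column kk+k of the A-block */
                const double *aptr = A + (kk + k)*lda + row;
                /* walking i down the column touches consecutive addresses,
                 * so each cache line is fully used before it is evicted */
                for (int i = 0; i < blockHeight; i++)
                    cptr[i] += aptr[i] * bcol[k];
            }
        }
    }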
The following figure shows the results:
Varying the block size did not improve this result:
The main effect of this optimization seems to be to flatten the curve, smoothing out the performance loss at powers of two while reducing overall performance slightly. The marked drop at the far right of the graph is probably the result of the limited cache size.
The next optimization I tried was to use a floating point register instead of dereferencing the B matrix directly. It was a very simple optimization, but I was surprised by the results:
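The change itself is tiny: pull the B element out of the innermost loop into a local double, so the compiler can keep it in a floating-point register instead of re-reading bptr[k] on every multiply-add. Roughly (function and parameter names here are illustrative):

    /* Inner two loops of the block update, with the B element hoisted out
     * of the i loop.  "register" is only a hint; the point is that B(k, j)
     * is read from memory once per k iteration, not once per multiply-add. */
    void block_inner(int lda, int blockHeight, int blockDepth,
                     const double *acol, const double *bcol, double *cptr)
    {
        for (int k = 0; k < blockDepth; k++) {
            const double *aptr = acol + k*lda;   /* column k of the A-block */
            register double b_kj = bcol[k];      /* load B(k, j) exactly once */
            for (int i = 0; i < blockHeight; i++)
                cptr[i] += aptr[i] * b_kj;       /* no B dereference in here */
        }
    }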
The next optimization I tried was to replace integer multiplication with addition. I did not expect much from this, either, and it had no noticeable effect. The next thing was to try unrolling the loop and using constant offsets rather than pointer updates. The result was encouraging:
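Both changes together look roughly like this illustrative sketch: the unroll factor of 4 is arbitrary, and blockHeight is assumed to be a multiple of 4, so the real code needs a cleanup loop for the remainder:

    /* The same inner loops with both changes: the k*lda multiply is gone
     * (acol is advanced by lda instead), and the i loop is unrolled by four
     * using constant offsets rather than a pointer updated between the
     * floating-point operations. */
    void block_inner_unrolled(int lda, int blockHeight, int blockDepth,
                              const double *acol, const double *bcol,
                              double *cptr)
    {
        for (int k = 0; k < blockDepth; k++) {
            register double b_kj = bcol[k];
            for (int i = 0; i < blockHeight; i += 4) {
                cptr[i]     += acol[i]     * b_kj;
                cptr[i + 1] += acol[i + 1] * b_kj;
                cptr[i + 2] += acol[i + 2] * b_kj;
                cptr[i + 3] += acol[i + 3] * b_kj;
            }
            acol += lda;   /* integer add in place of the k*lda multiply */
        }
    }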
At last, a result faster than the basic algorithm! One advantage of this approach is the relatively small size of the unrolled loop, which allows it to benefit many "fringe" block sizes as well. Encouraged by this result, I implemented a more aggressive version of this unrolling strategy:
The strategy is to create a bank of if
statements, beginning with
one unrolled to 64 operations. That is followed by a block with 32 operations,
then one with 16 operations, then one with 8 operations, then 4, then 2, then
finally a for
loop to handle odd block sizes, and also ensure
correctness in case someone goes and changes the block size to a number greater
than 127. With this strategy, the "normal" case with a block factor of 64 is
executed with no branches and only constant offsets to operands. That is, no
pointer arithmetic is done between floating-point computations, so this strategy
is essentially a combination of two of the PHiPAC optimizations, but it made
more sense to code it this way from the beginning than to go back and recode
hundreds of lines. The "fringe" blocks are nearly optimal as well, since they
use a logarithmic number of branches and pointer computations. Alternative
block sizes did not significantly improve the performance of this optimization
strategy.
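A scaled-down skeleton of the branch bank for one column update follows. The real version starts at 64 and writes every term out; only the 8-, 4-, and 2-term branches are shown here, and the function name is illustrative:

    /* Each branch peels off the largest remaining power-of-two chunk of the
     * column with fully unrolled, constant-offset updates.  With the full
     * 64/32/16/8/4/2 bank, a height-64 column takes the first branch and
     * nothing else; the trailing loop catches odd sizes and any height
     * larger than the biggest branch. */
    void column_update(int height, const double *aptr, double b_kj, double *cptr)
    {
        int i = 0;
        if (height - i >= 8) {
            cptr[i]     += aptr[i]     * b_kj;
            cptr[i + 1] += aptr[i + 1] * b_kj;
            cptr[i + 2] += aptr[i + 2] * b_kj;
            cptr[i + 3] += aptr[i + 3] * b_kj;
            cptr[i + 4] += aptr[i + 4] * b_kj;
            cptr[i + 5] += aptr[i + 5] * b_kj;
            cptr[i + 6] += aptr[i + 6] * b_kj;
            cptr[i + 7] += aptr[i + 7] * b_kj;
            i += 8;
        }
        if (height - i >= 4) {
            cptr[i]     += aptr[i]     * b_kj;
            cptr[i + 1] += aptr[i + 1] * b_kj;
            cptr[i + 2] += aptr[i + 2] * b_kj;
            cptr[i + 3] += aptr[i + 3] * b_kj;
            i += 4;
        }
        if (height - i >= 2) {
            cptr[i]     += aptr[i]     * b_kj;
            cptr[i + 1] += aptr[i + 1] * b_kj;
            i += 2;
        }
        for (; i < height; i++)   /* odd sizes, or very large block factors */
            cptr[i] += aptr[i] * b_kj;
    }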
Finally, I tried interleaving the unrolled loops to increase the likelihood of multiple cache misses occurring in parallel, hopefully decreasing the overall latency. This was not in the PHiPAC paper, and it seemed to have only a marginal effect:
While interleaving had little impact in this case, the marginal improvement seen here suggests that it can be an effective technique, or at least an optimization worth trying.
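For concreteness, one way to arrange the interleaving is to process two columns of the A-block per pass of the unrolled loop, alternating their updates so that misses on both columns can be outstanding at once. A sketch of that arrangement (illustrative only; blockDepth is assumed even and blockHeight a multiple of 4, so the real code needs cleanup cases):

    /* Interleaved version of the unrolled inner loop: columns k and k+1 of
     * the A-block are processed in the same pass, with their updates
     * alternated so that cache misses on the two columns can overlap. */
    void block_inner_interleaved(int lda, int blockHeight, int blockDepth,
                                 const double *acol, const double *bcol,
                                 double *cptr)
    {
        for (int k = 0; k < blockDepth; k += 2) {
            const double *a0 = acol;          /* column k   of the A-block */
            const double *a1 = acol + lda;    /* column k+1 of the A-block */
            register double b0 = bcol[k];
            register double b1 = bcol[k + 1];
            for (int i = 0; i < blockHeight; i += 4) {
                cptr[i]     += a0[i]     * b0;
                cptr[i]     += a1[i]     * b1;
                cptr[i + 1] += a0[i + 1] * b0;
                cptr[i + 1] += a1[i + 1] * b1;
                cptr[i + 2] += a0[i + 2] * b0;
                cptr[i + 2] += a1[i + 2] * b1;
                cptr[i + 3] += a0[i + 3] * b0;
                cptr[i + 3] += a1[i + 3] * b1;
            }
            acol += 2 * lda;
        }
    }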
The comparison system was a Pentium III 863 MHz desktop system with 256 KB of cache. The difference seen was roughly in line with the change in clock speed (863/730 ≈ 1.18): about a 20% increase: