ECS 289 Project A
Clark Crawford
October 2003

The experiments were conducted on a Pentium III 730 MHz desktop system with 256 KB of cache. The comments in matmul.c suggested anomalous results around powers of two, and the anomaly was immediately evident when I compared the basic (unblocked) multiply on my workstation against the sample timing data provided in the assignment tarball:

(Timings for this experiment.) I did not spend time optimizing the basic code, since the blocked code was the focus of this assignment. First, I compared the blocked code against the unblocked code on my machine to find the optimal block size:

Timings taken for this experiment (not all of them are shown above):

Since the best overall timings were seen with a block size of 64, and larger block sizes did not bring a significant improvement over it, I used 64 as the initial block size for the next phase of the investigation.

Block Caching Optimization

The matrices are stored in column-major format; the elements of a column are physically contiguous in memory. This means that several elements of a column can be read from a single cache line, while consecutive elements of a row generally fall on separate cache lines. If each block is traversed in column order, cache misses are minimized. It is a bit tricky to rearrange basic_dgemm so that each block is traversed in column order: for each column of the C-block, the corresponding column of the B-block must be loaded, along with every column of the A-block. The following algorithm sketches the idea more explicitly:

Inputs: row and col indices of the upper-left corner of the current C-block,
        and kk, the first column of the current A-block (equivalently, the
        first row of the current B-block).

for j = col to (col + blockWidth)
    cptr = C + j * lda + row
    bptr = B + j * lda + kk
    for k = 0 to (blockDepth)
        aptr = A + (kk + k) * lda + row
        for i = 0 to (blockHeight)
            cptr[i] += aptr[i] * bptr[k]
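
Translated into C, the same sketch looks roughly as follows. This is a hedged version: the function name do_block_col, its parameter layout, and the use of kk for the inner-dimension offset are my own shorthand, not the exact interface of basic_dgemm in matmul.c.

    /* Multiply one block: C(row:row+M, col:col+N) +=
     * A(row:row+M, kk:kk+K) * B(kk:kk+K, col:col+N).  All matrices are
     * column-major with leading dimension lda, so every inner loop walks
     * down a physically contiguous column. */
    static void do_block_col(const int lda, const int M, const int N, const int K,
                             const double *A, const double *B, double *C,
                             const int row, const int col, const int kk)
    {
        for (int j = col; j < col + N; ++j) {
            double *cptr = C + j * lda + row;           /* column j of the C-block */
            const double *bptr = B + j * lda + kk;      /* column j of the B-block */
            for (int k = 0; k < K; ++k) {
                const double *aptr = A + (kk + k) * lda + row;  /* one column of the A-block */
                for (int i = 0; i < M; ++i)
                    cptr[i] += aptr[i] * bptr[k];
            }
        }
    }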

Here, dereferencing the ptr variables causes the cache lines holding those columns to be loaded from main memory. It might be necessary to adjust the blocking factor again once this optimization is in place. The following figure shows the results:

Varying the block size did not improve this result:

The main effect of this optimization seems to be to flatten the curve: the performance loss at powers of two is smoothed out, but overall performance drops slightly. The marked drop at the far right of the graph is probably a result of the limited cache size.

Register Optimization

The next optimization I tried was to hold the current element of B in a floating-point variable (which the compiler can keep in a register) instead of dereferencing the B matrix inside the innermost loop. It was a very simple optimization, but I was surprised by the results:
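
For reference, the change itself amounts to hoisting the B element out of the innermost loop of the sketch above; the temporary name bkj is mine.

    /* Before: B is re-read through bptr on every iteration of the i loop. */
    for (int i = 0; i < M; ++i)
        cptr[i] += aptr[i] * bptr[k];

    /* After: the B element is read once per column of A and held in a local
     * variable, which the compiler can keep in a floating-point register. */
    const double bkj = bptr[k];
    for (int i = 0; i < M; ++i)
        cptr[i] += aptr[i] * bkj;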

Loop Unrolling

The next optimization I tried was to replace the integer multiplications in the index calculations with additions. I did not expect much from this either, and it had no noticeable effect. The next step was to unroll the inner loop and use constant offsets rather than per-element pointer updates. The result was encouraging:
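
The unrolled inner loop looks roughly like the sketch below. The unroll factor of 4 is illustrative only, and for brevity the sketch assumes the block height M is a multiple of 4.

    /* Unrolled inner loop over local copies of the column pointers:
     * constant offsets from c and a, with the pointers advanced once per
     * unrolled step rather than once per element. */
    double *c = cptr;
    const double *a = aptr;
    for (int i = 0; i < M; i += 4) {
        c[0] += a[0] * bkj;
        c[1] += a[1] * bkj;
        c[2] += a[2] * bkj;
        c[3] += a[3] * bkj;
        c += 4;
        a += 4;
    }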

At last, a result faster than the basic algorithm! One advantage of this approach is that the unrolled loop stays relatively small, which allows it to benefit many "fringe" block sizes as well. Encouraged by this result, I implemented the aggressive version of this unrolling strategy:

The strategy is to create a bank of if statements, beginning with one unrolled to 64 operations, followed by a block of 32 operations, then 16, then 8, then 4, then 2, and finally a for loop to handle odd block sizes and to ensure correctness should someone change the block size to a number greater than 127. With this strategy, the "normal" case with a block factor of 64 executes with no branches and only constant offsets to its operands; no pointer arithmetic is done between floating-point computations. This is essentially a combination of two of the PHiPAC optimizations, but it made more sense to code it this way from the beginning than to go back and recode hundreds of lines. The "fringe" blocks are handled nearly optimally as well, since they require only a logarithmic number of branches and pointer computations. Alternative block sizes did not significantly improve the performance of this optimization strategy.
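
A compressed sketch of the idea follows. To keep it short, the bank starts at 8 rather than 64, but the structure is the same: each if peels off one power-of-two chunk with constant offsets, and the trailing loop catches whatever is left.

    /* Bank of if statements over local copies of the column pointers:
     * peel off unrolled chunks in decreasing power-of-two sizes (the real
     * code starts at 64; only 8/4/2 are shown here), then fall through to
     * a plain loop for odd or oversized blocks. */
    double *c = cptr;
    const double *a = aptr;
    int i = 0;
    if (M - i >= 8) {
        c[0] += a[0] * bkj;  c[1] += a[1] * bkj;
        c[2] += a[2] * bkj;  c[3] += a[3] * bkj;
        c[4] += a[4] * bkj;  c[5] += a[5] * bkj;
        c[6] += a[6] * bkj;  c[7] += a[7] * bkj;
        c += 8;  a += 8;  i += 8;
    }
    if (M - i >= 4) {
        c[0] += a[0] * bkj;  c[1] += a[1] * bkj;
        c[2] += a[2] * bkj;  c[3] += a[3] * bkj;
        c += 4;  a += 4;  i += 4;
    }
    if (M - i >= 2) {
        c[0] += a[0] * bkj;  c[1] += a[1] * bkj;
        c += 2;  a += 2;  i += 2;
    }
    for (; i < M; ++i)       /* odd sizes, or blocks larger than the bank covers */
        *c++ += *a++ * bkj;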

Interleaving

Finally, I tried interleaving the unrolled loops to increase the likelihood of multiple cache misses being serviced in parallel, and thus, I hoped, to decrease overall latency. This optimization was not in the PHiPAC paper, and it seemed to have only a marginal effect:
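
A rough sketch of the kind of interleaving meant here: two columns of the A-block are processed per pass, and their updates are alternated in the unrolled body so that loads from the two columns can miss the cache at the same time. The two-column grouping and the names a0/a1 are illustrative, and the sketch assumes K is even and M is a multiple of 2.

    /* Interleaved version of the k and i loops from the block sketch:
     * columns k and k+1 of the A-block are consumed together, so their
     * independent loads can overlap in the memory system. */
    for (int k = 0; k < K; k += 2) {
        const double *a0 = A + (kk + k) * lda + row;
        const double *a1 = A + (kk + k + 1) * lda + row;
        const double b0 = bptr[k], b1 = bptr[k + 1];
        for (int i = 0; i < M; i += 2) {
            const double x0 = a0[i],     y0 = a1[i];      /* loads from both columns issue early */
            const double x1 = a0[i + 1], y1 = a1[i + 1];
            cptr[i]     += x0 * b0 + y0 * b1;
            cptr[i + 1] += x1 * b0 + y1 * b1;
        }
    }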

While interleaving had little impact in this case, the marginal improvement shown here suggests that it can be an effective technique, or at least an optimization worth trying.

Comparison

The comparison system was a Pentium III 863 MHz desktop with 256 KB of cache. The difference observed corresponded more or less to the change in clock speed: about a 20% increase: