Instructor: Jeremy Johnson

Due date: (Friday Oct. 27 @ 11:59pm)

- Determine the architectural features of your computing platform (clock speed, machine type, amount of memory, cache parameters, number of functional units, pipeline information). You should use various microbenchmark programs to gather this information (e.g. those that come with PAPI such as mem_info, MOB, Calibrator, lmbench, X-Ray). You should confirm as much of the data as possible using CPUID or an architecture manual.
- Carefully time and plot the running time for the Numerical Recipes matrix multiplication routine discussed in class. Present the data in terms of Mflops. Make sure to provide relevant compiler information (type, version, flags). Compare optimized and unoptimized versions. Also measure instructions, L1 data cache misses, and other interesting metrics available through PAPI. Plot the normalized data (i.e. divided by the number of operations, as in Mflops) and compare to Mflops.
- Experiment with the different loop orders (i.e. ijk, ikj, jik, jki, kij, kji) and the blocked version. You should measure and plot time, instructions, and other interesting metrics available through PAPI, comparing them to the data in part 2.
- Install ATLAS on your computing platform, perform the same timings/measurements as you did for Numerical Recipes, and compare the results.
- In the last part of the assignment, you will implement the mini MMM from the Yotov et al. paper.
- (By definition) Implement MMM by definition (triple loop) using the ijk loop order. This is your code0. You may either use the Numerical Recipes code or your own.
- (Blocking) Block into micro MMMs with MU = NU = 2, KU = 1. The inner triple loop has the kij order. Unroll (by hand) the innermost i- and j-loops so that adds and mults alternate, and apply scalar replacement. This is your code1.
- (Unrolling) Unroll the innermost k-loop by a factor of 2 and of 4 (KU = 2 and KU = 4, which doubles and quadruples the loop body), again doing scalar replacement. Assume that 4 divides NB. This gives you code2 and code3.
- (Performance plot, search for best block size NB) Determine the L1 cache size C1 (in doubles, i.e., 8B units) of your computer. Measure the performance (in Mflops) of your four codes for all NB divisible by 4 with 16 ≤ NB ≤ min(80, sqrt(C1)). Create a plot: the x-axis shows NB, the y-axis shows performance (so there will be 4 lines in it). Discuss the plot, including: Which NB and which code yield the maximum performance? What is the percentage of peak performance in this case?
- Does performance improve if, in the best code so far, you switch the outermost loop order from ijk to jik?
- Implement an MMM for multiplying two square n x n matrices assuming NB divides n, blocked into NB x NB blocks using your best mini-MMM code from the previous part. Create a performance plot comparing this code and code0 (by definition) above for an interesting range of sizes n (up to sizes where the matrices do not fit into the L2 cache). x-axis shows n; y-axis performance in Mflops. Discuss the plot.
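
As a starting point for the timing parts above, here is a minimal sketch of a by-definition MMM in ijk order together with a Mflops conversion. It assumes row-major storage, the accumulate convention C += A * B, and an operation count of 2n^3 (n^3 multiplies plus n^3 adds). The function names `mmm_ijk`, `mflops`, and `time_mmm_ijk` are hypothetical and are not taken from the Numerical Recipes code. The standard `clock()` call measures CPU time at coarse resolution, so PAPI or a cycle counter is probably the better choice for the actual measurements.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* By-definition MMM in ijk order: C += A * B.
   All matrices are n x n, row-major (an assumption of this sketch). */
void mmm_ijk(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = C[i * n + j];
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

/* Convert the time of one n x n MMM into Mflops,
   counting 2*n^3 floating-point operations. */
double mflops(size_t n, double seconds)
{
    assert(seconds > 0.0);
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / (seconds * 1.0e6);
}

/* Time `reps` back-to-back calls with clock() and return Mflops per call.
   Returns 0.0 if the run was below the timer resolution; in that case
   increase `reps`. */
double time_mmm_ijk(size_t n, const double *A, const double *B,
                    double *C, int reps)
{
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        mmm_ijk(n, A, B, C);
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return 0.0;
    return mflops(n, seconds / reps);
}
```

Repeating the call inside the timed region is one simple way to keep small problem sizes above the timer resolution. The same `mflops` helper can be reused unchanged for the other loop orders and for ATLAS, since the operation count does not depend on the implementation.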
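
The register-blocking step (MU = NU = 2, KU = 1) can be sketched as follows. This is one possible reading of the description above, not the paper's reference code: it assumes row-major NB x NB blocks, the convention C += A * B, and that 2 divides NB. The name `mini_mmm_2x2` is hypothetical. The i- and j-loops of the micro MMM are fully unrolled by hand, and the 2 x 2 tile of C plus the operands from A and B are held in local scalars (scalar replacement), so each k iteration performs four multiply-add pairs without touching C in memory.

```c
#include <assert.h>
#include <stddef.h>

/* Mini MMM sketch: C += A * B on row-major nb x nb blocks, tiled into
   2 x 2 micro MMMs (MU = NU = 2, KU = 1). Requires nb to be even. */
void mini_mmm_2x2(size_t nb, const double *A, const double *B, double *C)
{
    assert(nb % 2 == 0);
    for (size_t i = 0; i < nb; i += 2)
        for (size_t j = 0; j < nb; j += 2) {
            /* Scalar replacement: load the 2 x 2 tile of C once. */
            double c00 = C[i * nb + j];
            double c01 = C[i * nb + j + 1];
            double c10 = C[(i + 1) * nb + j];
            double c11 = C[(i + 1) * nb + j + 1];

            for (size_t k = 0; k < nb; k++) {
                /* One column slice of A and one row slice of B. */
                double a0 = A[i * nb + k];
                double a1 = A[(i + 1) * nb + k];
                double b0 = B[k * nb + j];
                double b1 = B[k * nb + j + 1];

                /* Unrolled i- and j-loops: each statement is one
                   multiply followed by one add. */
                c00 += a0 * b0;
                c01 += a0 * b1;
                c10 += a1 * b0;
                c11 += a1 * b1;
            }

            /* Write the tile back once the k-loop has finished. */
            C[i * nb + j]           = c00;
            C[i * nb + j + 1]       = c01;
            C[(i + 1) * nb + j]     = c10;
            C[(i + 1) * nb + j + 1] = c11;
        }
}
```

Checking the result against the by-definition code on a small block before timing is a cheap way to catch indexing mistakes. Unrolling the k-loop by 2 or 4 would repeat the body of the k-loop with k+1, k+2, and so on, reusing the same four accumulators.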