- (By definition) Implement MMM by definition (triple loop) using the ijk loop order. This is your code0.
You may use either the Numerical Recipes code
(from assignment 1) or your own.
- (Blocking) Block into micro MMMs with MU = NU = 2, KU=1. The inner triple loop has the kij order.
Unroll (by hand) the innermost i- and j-loops so that adds and mults alternate, and apply
scalar replacement. This is your code1.
- (Unrolling) Unroll the innermost k-loop by factors of 2 and 4 (KU=2 and 4, which doubles and quadruples
the loop body) again doing scalar replacement. Assume that 4 divides NB. This gives you code2 and
- (Performance plot, search for best block size NB) Determine the L1 cache size C1 (in doubles, i.e.,
8B units) of your computer. Measure the performance (in Mflops) of your four codes for all NB with
16 ≤ NB ≤ min(80, sqrt(C1)) such that 4 divides NB. Create a plot: the x-axis shows NB, the y-axis shows
performance (so there will be 4 lines in it). Discuss the plot, including: Which NB and which code
yield the maximum performance? What is the percentage of peak performance in this case?
- Does performance improve if you switch the outermost loop order of the best code so far from ijk to jik?
- Implement an MMM for multiplying two square n x n matrices assuming NB divides n, blocked into
NB x NB blocks using your best mini-MMM code from the previous part. Create a performance plot
comparing this code and code0 (by definition) above for an interesting range of sizes n (up to sizes
where the matrices do not fit into the L2 cache). x-axis shows n; y-axis performance in Mflops.
Discuss the plot.
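
The first two codes and the blocked driver from the steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference solution: the function names (mmm_ijk, mini_mmm_2x2, mmm_blocked), the row-major layout, the leading-dimension parameter ld, and the choice to accumulate into C (C += A*B) are all assumptions made for the example.

```c
#include <assert.h>

/* code0 (sketch): C += A*B by definition, ijk loop order.
   Assumption: n x n matrices stored row-major. */
void mmm_ijk(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* code1 (sketch): mini-MMM on one nb x nb block with MU = NU = 2, KU = 1.
   ld is the row length of the full matrices (assumed name).
   The 2 x 2 micro-MMM is unrolled by hand in kij order, with
   alternating mults and adds and scalar replacement for C, A and B. */
void mini_mmm_2x2(int nb, int ld, const double *A, const double *B, double *C)
{
    assert(nb % 2 == 0);
    for (int i = 0; i < nb; i += 2)
        for (int j = 0; j < nb; j += 2) {
            /* scalar replacement: keep the 2 x 2 tile of C in locals */
            double c00 = C[i * ld + j];
            double c01 = C[i * ld + j + 1];
            double c10 = C[(i + 1) * ld + j];
            double c11 = C[(i + 1) * ld + j + 1];
            for (int k = 0; k < nb; k++) {
                double a0 = A[i * ld + k];
                double a1 = A[(i + 1) * ld + k];
                double b0 = B[k * ld + j];
                double b1 = B[k * ld + j + 1];
                double t;
                t = a0 * b0; c00 += t;   /* mult, add */
                t = a0 * b1; c01 += t;
                t = a1 * b0; c10 += t;
                t = a1 * b1; c11 += t;
            }
            C[i * ld + j]           = c00;
            C[i * ld + j + 1]       = c01;
            C[(i + 1) * ld + j]     = c10;
            C[(i + 1) * ld + j + 1] = c11;
        }
}

/* Blocked MMM (sketch): n x n matrices split into nb x nb blocks,
   each block product handled by the mini-MMM. Assumes nb divides n. */
void mmm_blocked(int n, int nb, const double *A, const double *B, double *C)
{
    assert(n % nb == 0);
    for (int i = 0; i < n; i += nb)
        for (int j = 0; j < n; j += nb)
            for (int k = 0; k < n; k += nb)
                mini_mmm_2x2(nb, n, &A[i * n + k], &B[k * n + j],
                             &C[i * n + j]);
}
```

Because all three routines accumulate into C, the output matrix should be zeroed before each call when the results are compared against each other. Unrolling the k-loop for KU = 2 or 4 would repeat the loop body of mini_mmm_2x2 for k, k+1, and so on, with separate scalars for each loaded element.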
You should prepare a summary report of your experiments and performance data. Make
sure all relevant information (e.g., compiler version and flags, and machine type:
memory, clock speed, cache info) is provided for each set of data. Make sure all plots are
labeled and easy to read. Raw data and all source files (along with compilation
instructions, i.e., a makefile) should be provided. The report should be submitted as a
pdf file. Make sure that any code that is timed is correct.
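
One possible way to collect the performance numbers and check correctness is sketched below. It is only an illustration: the helper names (mmm_mflops, max_abs_diff, time_mmm) and the function-pointer type mmm_fn are hypothetical, the flop count assumes 2n^3 operations per n x n product, and clock() measures CPU time at a coarse resolution, so small sizes need many repetitions.

```c
#include <assert.h>
#include <time.h>

/* Assumed signature for any n x n MMM routine being timed. */
typedef void (*mmm_fn)(int n, const double *A, const double *B, double *C);

/* Mflops for one n x n MMM, assuming 2*n^3 floating-point operations. */
double mmm_mflops(int n, double seconds)
{
    return 2.0 * n * n * n / (seconds * 1e6);
}

/* Largest absolute elementwise difference between two n x n matrices;
   compare a timed code against code0 with this before trusting timings. */
double max_abs_diff(int n, const double *X, const double *Y)
{
    double max = 0.0;
    for (int i = 0; i < n * n; i++) {
        double d = X[i] - Y[i];
        if (d < 0.0)
            d = -d;
        if (d > max)
            max = d;
    }
    return max;
}

/* Average CPU seconds per call of f over reps repetitions.
   Note: if f accumulates into C, C grows across repetitions; that does
   not change the operation count, but C should be reset before any
   correctness check. */
double time_mmm(mmm_fn f, int n, const double *A, const double *B,
                double *C, int reps)
{
    assert(reps > 0);
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        f(n, A, B, C);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC / reps;
}
```

The percentage of peak is then the measured Mflops divided by the theoretical peak of the machine, which has to be derived from its documented clock speed and floating-point units.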
Students should submit their solution electronically using BbVista.
Submit a gzipped tar file, called A2.tar.gz (the tar file should contain a directory called A1 which
contains the files). The tar file should contain source code, instructions on how to run your programs,
sample input and output files, and a README file. The README file should describe all files that are
included, contain instructions on how to build and use the code, and outline how the code works.