Assignment 2

CS 540/ECE 621 High Performance Computing
Instructor: Jeremy Johnson 
Due date: (Mon. Nov. 24 in class)

In this assignment, you will implement the mini-MMM from the Yotov et al. paper discussed in .
  1. (By definition) Implement MMM by definition (triple loop) using the ijk loop order. This is your code0. You may use either the Numerical Recipes code (from Assignment 1) or your own.
  2. (Blocking) Block into micro-MMMs with MU = NU = 2 and KU = 1. The inner triple loop has the kij order. Unroll (by hand) the innermost i- and j-loops so that adds and mults alternate, and apply scalar replacement. This is your code1.
  3. (Unrolling) Unroll the innermost k-loop by factors of 2 and 4 (KU = 2 and KU = 4, which doubles and quadruples the loop body, respectively), again applying scalar replacement. Assume that 4 divides NB. This gives you code2 and code3.
  4. (Performance plot, search for best block size NB) Determine the L1 cache size C1 (in doubles, i.e., in 8-byte units) of your computer. Measure the performance (in Mflops) of your four codes for all NB with 16 ≤ NB ≤ min(80, sqrt(C1)) such that 4 divides NB. Create a plot: the x-axis shows NB and the y-axis shows performance (so the plot will contain 4 lines). Discuss the plot, including: Which NB and which code yield the maximum performance? What percentage of peak performance is achieved in this case?
  5. Does performance improve if, in the best code so far, you switch the outermost loop order from ijk to jik?
  6. Implement an MMM for multiplying two square n x n matrices, assuming NB divides n, blocked into NB x NB blocks and using your best mini-MMM code from the previous part. Create a performance plot comparing this code with code0 (by definition) above for an interesting range of sizes n (up to sizes where the matrices no longer fit into the L2 cache). The x-axis shows n; the y-axis shows performance in Mflops. Discuss the plot.
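
As a starting point for steps 1 and 2, the following is a minimal sketch in C, not a reference solution. It assumes row-major storage, computes C += A*B, and uses the hypothetical function names mmm_ijk and mini_mmm_2x2x1; none of these choices is prescribed by the assignment. The second function shows one possible reading of "MU = NU = 2, KU = 1 with scalar replacement": a 2 x 2 tile of C is held in local variables while the k-loop runs, and each mult is followed by its add.

```c
#include <assert.h>

/* Sketch of code0: C += A * B by definition, ijk loop order.
   Assumption: all matrices are n x n and stored row-major. */
void mmm_ijk(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* Sketch of code1: mini-MMM on NB x NB blocks with MU = NU = 2, KU = 1.
   The outer loops run in ijk order; the 2 x 2 micro-MMM (i- and j-loops)
   is unrolled by hand inside the k-loop. */
void mini_mmm_2x2x1(int NB, const double *A, const double *B, double *C)
{
    assert(NB % 2 == 0); /* MU = NU = 2 must divide NB */

    for (int i = 0; i < NB; i += 2)
        for (int j = 0; j < NB; j += 2) {
            /* Scalar replacement: load the 2 x 2 tile of C once. */
            double c00 = C[i * NB + j];
            double c01 = C[i * NB + j + 1];
            double c10 = C[(i + 1) * NB + j];
            double c11 = C[(i + 1) * NB + j + 1];

            for (int k = 0; k < NB; k++) {
                /* Scalar replacement for the operands of this k step. */
                double a0 = A[i * NB + k];
                double a1 = A[(i + 1) * NB + k];
                double b0 = B[k * NB + j];
                double b1 = B[k * NB + j + 1];
                double t;

                /* Alternating mults and adds. */
                t = a0 * b0; c00 += t;
                t = a0 * b1; c01 += t;
                t = a1 * b0; c10 += t;
                t = a1 * b1; c11 += t;
            }

            /* Write the tile back once after the k-loop. */
            C[i * NB + j]           = c00;
            C[i * NB + j + 1]       = c01;
            C[(i + 1) * NB + j]     = c10;
            C[(i + 1) * NB + j + 1] = c11;
        }
}
```

Both functions accumulate into C, so C should be zeroed before the call if a plain product is wanted. Note that an optimizing compiler may reschedule the mult/add sequence, so the order written in the source is not guaranteed to be the order executed.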


You should prepare a summary report of your experiments and performance data. Make sure all relevant information (e.g., compiler version and flags, and machine type, including memory, clock speed, and cache info) is provided for each set of data. Make sure all plots are labeled and easy to read. Raw data and all source files (along with compilation instructions, i.e., a makefile) should be provided. The report should be submitted as a PDF file. Make sure that any code that is timed is correct.
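
One way to check correctness and to turn timings into Mflops is sketched below in C. It is only an illustration under stated assumptions: the helper names (mmm_ref, max_abs_diff, time_kernel, mflops, percent_of_peak) are hypothetical, the flop count is taken as 2n^3 for an n x n MMM, timing uses the standard clock() function (CPU time, which has coarse resolution on some systems), and the peak rate must be supplied by you from your machine's specifications.

```c
#include <time.h>

/* Signature assumed for every kernel under test: C += A * B, n x n. */
typedef void (*mmm_fn)(int n, const double *A, const double *B, double *C);

/* Plain triple loop used as the trusted reference in this sketch. */
void mmm_ref(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* Largest absolute entrywise difference between two n x n matrices.
   Compare a kernel's result against the reference before timing it. */
double max_abs_diff(int n, const double *X, const double *Y)
{
    double worst = 0.0;
    for (int i = 0; i < n * n; i++) {
        double d = X[i] - Y[i];
        if (d < 0.0)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}

/* Average CPU seconds per call over reps calls. C keeps accumulating
   across calls, which does not matter for timing purposes. */
double time_kernel(mmm_fn fn, int n, const double *A, const double *B,
                   double *C, int reps)
{
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        fn(n, A, B, C);
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC / reps;
}

/* Mflops for one n x n MMM, counting n^3 mults and n^3 adds. */
double mflops(int n, double seconds)
{
    return 2.0 * n * n * n / seconds / 1.0e6;
}

/* Percentage of the machine's peak rate (peak supplied by the caller). */
double percent_of_peak(double measured_mflops, double peak_mflops)
{
    return 100.0 * measured_mflops / peak_mflops;
}
```

For small block sizes a single call finishes faster than the clock can resolve, so reps should be large enough that the total measured time is well above the timer granularity. Because different loop orders and unrolling factors can change floating-point rounding, a small tolerance on max_abs_diff is usually more appropriate than an exact comparison.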

Students should submit their solutions electronically using BbVista. Submit a gzipped tar file called A2.tar.gz (the tar file should contain a directory called A1, which contains the files). The tar file should contain source code, instructions on how to run your programs, sample input and output files, and a README file. The README file should describe all included files, contain instructions on how to build and use the code, and outline how the code works.