Assignment 1

CS 540 High Performance Computing
Instructor: Jeremy Johnson
Due date: (Sat. Oct. 24 by 11:59pm)

In this assignment you will empirically explore the performance of matrix multiplication. You will start with the code from Numerical Recipes (included here), explore several variants, and finally compare against high-quality code from ATLAS or Intel's MKL library.

Numerical Recipes

Here is the code from Numerical Recipes, along with a modified version that uses block matrix multiplication.

Assignment Tasks

  1. Determine the architectural features of your computing platform (clock speed, machine type, amount of memory, cache parameters, number of functional units, pipeline information). You can gather this information using programs that use the CPUID instruction, such as PAPI_mem_info, or microbenchmark programs such as the cache program discussed in lecture 4 (in this regard also see MOB, Calibrator, lmbench). You should confirm as much of the data as possible using an architecture manual.
  2. Carefully time and plot the running time for the Numerical Recipes matrix multiplication routine discussed in class. Present the data in terms of MFLOPS. Make sure to provide relevant compiler information (type, version, flags). Compare optimized and unoptimized versions. Also measure instructions, L1 data cache misses, and other interesting metrics available through PAPI. Plot the normalized data (i.e. divided by the number of operations, as in MFLOPS) and compare to the peak MFLOPS obtainable on your machine.
  3. Experiment with the different loop orders (i.e. ijk, ikj, jik, jki, kij, kji) and the blocked version. You should measure and plot time, instructions, and other interesting metrics available through PAPI, comparing them to the data in part 2.
  4. Install ATLAS or MKL on your computing platform, perform the same timings/measurements as you did for the Numerical Recipes code, and compare the results. [For extra credit, install both and compare.]
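
As a starting point for tasks 2 and 3, the following is a minimal sketch of a timing harness that reports MFLOPS for two loop orders. It is not the Numerical Recipes routine: it assumes square matrices stored as flat row-major arrays, and the names matmul_ijk, matmul_ikj, mflops, time_matmul, and report_mflops are hypothetical. It uses the standard clock() function, which measures CPU time at coarse resolution, so a PAPI timer or more repetitions may be needed for small sizes.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* All names below are illustrative, not taken from Numerical Recipes. */
typedef void (*matmul_fn)(int, const double *, const double *, double *);

/* C = A * B for n x n row-major matrices, ijk loop order. */
void matmul_ijk(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Same product, ikj loop order: the inner loop has unit stride in B and C. */
void matmul_ikj(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n * n; i++)
        c[i] = 0.0;
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double aik = a[i * n + k];
            for (int j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}

/* An n x n multiply performs 2*n^3 floating-point operations. */
double mflops(int n, double seconds)
{
    return 2.0 * n * n * n / (seconds * 1.0e6);
}

/* Average CPU seconds per call over reps calls; -1.0 if allocation fails. */
double time_matmul(matmul_fn f, int n, int reps)
{
    assert(n > 0 && reps > 0);
    size_t count = (size_t)n * (size_t)n;
    double *a = malloc(count * sizeof *a);
    double *b = malloc(count * sizeof *b);
    double *c = malloc(count * sizeof *c);
    if (!a || !b || !c) {
        free(a); free(b); free(c);
        return -1.0;
    }
    for (size_t i = 0; i < count; i++) {
        a[i] = (double)(i % 7) + 1.0;
        b[i] = (double)(i % 5) - 2.0;
    }
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        f(n, a, b, c);
    clock_t stop = clock();
    volatile double sink = c[0];   /* read the result so the work is observable */
    (void)sink;
    free(a); free(b); free(c);
    return (double)(stop - start) / CLOCKS_PER_SEC / reps;
}

/* Print one data point; a time of zero means the run was too short to measure. */
void report_mflops(const char *label, matmul_fn f, int n, int reps)
{
    double secs = time_matmul(f, n, reps);
    if (secs > 0.0)
        printf("%s n=%d %.6f s/call %.1f MFLOPS\n", label, n, secs, mflops(n, secs));
    else
        printf("%s n=%d not measurable (raise n or reps, or allocation failed)\n", label, n);
}
```

A driver could call, for example, report_mflops("ijk", matmul_ijk, 512, 3) and report_mflops("ikj", matmul_ikj, 512, 3) over a range of sizes, once for a build with optimization disabled and once with it enabled. The remaining loop orders follow the same pattern with the three loops permuted.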


You should prepare a summary report of your experiments and performance data. Make sure all relevant information (e.g. compiler version and flags, machine type, memory, clock speed, cache info) is provided for each set of data. Make sure all plots are labeled and easy to read. Raw data and all source files (along with compilation instructions, i.e. a makefile) should be provided. Make sure that any code that is timed is correct.
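
One way to test that timed code is correct is to compare each variant against a straightforward reference on the same input. The sketch below does this for a blocked multiply; it is an illustration under stated assumptions, not the modified Numerical Recipes code. It assumes flat row-major storage, and the names matmul_ref, matmul_blocked, max_abs_diff, and check_blocked, the block size parameter bs, and the tolerance tol are all hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* All names below are illustrative, not taken from Numerical Recipes. */

/* Reference: C = A * B for n x n row-major matrices, ijk order. */
void matmul_ref(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

static int min_int(int x, int y) { return x < y ? x : y; }

/* Blocked: C = A * B using bs x bs tiles; n need not be a multiple of bs. */
void matmul_blocked(int n, int bs, const double *a, const double *b, double *c)
{
    assert(bs > 0);
    for (int i = 0; i < n * n; i++)
        c[i] = 0.0;
    for (int ii = 0; ii < n; ii += bs)
        for (int kk = 0; kk < n; kk += bs)
            for (int jj = 0; jj < n; jj += bs) {
                int imax = min_int(ii + bs, n);
                int kmax = min_int(kk + bs, n);
                int jmax = min_int(jj + bs, n);
                for (int i = ii; i < imax; i++)
                    for (int k = kk; k < kmax; k++) {
                        double aik = a[i * n + k];
                        for (int j = jj; j < jmax; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
}

/* Largest absolute elementwise difference between two n x n matrices. */
double max_abs_diff(int n, const double *x, const double *y)
{
    double worst = 0.0;
    for (int i = 0; i < n * n; i++) {
        double d = x[i] - y[i];
        if (d < 0.0)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}

/* 1 if blocked and reference agree within tol, 0 if not, -1 on allocation failure. */
int check_blocked(int n, int bs, double tol)
{
    size_t count = (size_t)n * (size_t)n;
    double *a = malloc(count * sizeof *a);
    double *b = malloc(count * sizeof *b);
    double *c_ref = malloc(count * sizeof *c_ref);
    double *c_blk = malloc(count * sizeof *c_blk);
    int ok = -1;
    if (a && b && c_ref && c_blk) {
        /* Deterministic, non-symmetric test data. */
        for (size_t i = 0; i < count; i++) {
            a[i] = 0.1 * (double)(i % 7 + 1);
            b[i] = 0.3 * (double)(i % 5) - 0.5;
        }
        matmul_ref(n, a, b, c_ref);
        matmul_blocked(n, bs, a, b, c_blk);
        ok = max_abs_diff(n, c_ref, c_blk) <= tol;
    }
    free(a); free(b); free(c_ref); free(c_blk);
    return ok;
}
```

A tolerance is used rather than exact equality because reordering the additions changes floating-point rounding. Sizes that are not a multiple of the block size, such as check_blocked(17, 4, 1e-9), exercise the edge tiles.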

The report should be concise and easy to read. It should be organized as follows: 1) Introduction to what was done, 2) Summary of the main results you learned, 3) Description of computing platform, 4) Summary of experiments performed, tests for correctness, and data collected, including plots. Include key observations for the plots that summarize what is shown. Indicate the source files for the code timed, along with the timing programs, how to run the experiments, and the files where the data is available. 5) Conclusion. Submit the report as a pdf file.

Students should submit their solution electronically using BbVista. Submit a gzipped tar file called A1.tar.gz (the tar file should contain a directory called A1 which contains the files). The tar file should contain source code, instructions on how to run your programs, sample input and output files, and a README file. The README file should describe all files that are included, contain instructions on how to build and use the code, and outline how the code works.