CS 540 High Performance Computing
Instructor: Jeremy Johnson
Due date: (Sat. Oct. 24 by 11:59pm)
In this assignment you will empirically explore the performance of matrix multiplication.
You will start with the code from Numerical Recipes
(included here), explore several variants,
and finally compare against high-quality code from ATLAS or
Intel's MKL library.
Here is the code from Numerical Recipes, along with a modified version that uses block matrix multiplication.
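To illustrate what block matrix multiplication means, here is a minimal sketch in C. It is not the Numerical Recipes code or the course's modified version. It assumes square matrices stored as flat row-major arrays of doubles, and the function name `matmul_blocked` and the block-size parameter `bs` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Return the smaller of two sizes (used to clip a tile at the matrix edge). */
static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/*
 * C = A * B for n x n row-major matrices, computed tile by tile.
 * Each bs x bs tile of A and B is reused while it is (hopefully) still
 * in cache. bs does not have to divide n; edge tiles are clipped.
 */
void matmul_blocked(size_t n, size_t bs,
                    const double *a, const double *b, double *c)
{
    assert(n > 0 && bs > 0);

    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;

    for (size_t ii = 0; ii < n; ii += bs) {
        size_t i_end = min_size(ii + bs, n);
        for (size_t kk = 0; kk < n; kk += bs) {
            size_t k_end = min_size(kk + bs, n);
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t j_end = min_size(jj + bs, n);
                /* Multiply one tile of A by one tile of B into a tile of C. */
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
        }
    }
}
```

A reasonable starting point for `bs` is a value for which three tiles fit in the L1 data cache, but the best value is machine dependent and is something to measure.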
- Determine the architectural features of your computing platform (clock
speed, machine type, amount of memory, cache parameters, number of
functional units, pipeline information). You can gather this information
using programs that use the CPUID instruction such as PAPI_mem_info,
microbenchmark programs such as the cache program discussed in
lecture 4 (in this regard also see
You should confirm as much of the data as possible using an architecture
- Carefully time and plot the running time of the Numerical Recipes matrix multiplication routine
discussed in class. Present the data in terms of MFLOPS. Make sure to provide relevant compiler
information (type, version, flags). Compare optimized and unoptimized versions. Also measure
instructions, L1 data cache misses, and other interesting metrics available through PAPI.
Plot the normalized data (i.e. divided by the number of operations, as in MFLOPS) and compare to the peak MFLOPS
obtainable on your machine.
- Experiment with the different loop orders (i.e. ijk, ikj, jik, jki, kij, kji) and the blocked version.
You should measure and plot time, instructions, and other interesting metrics available through PAPI,
comparing them to the data in part 2.
- Install ATLAS or MKL on your computing platform, perform the same timings/measurements as you did for
the Numerical Recipes code, and compare. [For extra credit, install both and compare.]
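The loop-order experiment and the MFLOPS normalization described in the items above can be sketched as follows. This is a minimal illustration in C, not the Numerical Recipes routine. It assumes square matrices stored as flat row-major arrays of doubles, counts 2n^3 floating-point operations per multiplication (n^3 multiplies and n^3 adds), and times with the POSIX `clock_gettime` call. The names `matmul_ijk`, `matmul_ikj`, `mflops`, and `time_matmul` are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* exposes clock_gettime under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* C = A * B, ijk order: the inner loop walks a column of B with stride n. */
void matmul_ijk(size_t n, const double *a, const double *b, double *c)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* C = A * B, ikj order: the inner loop walks rows of B and C with unit stride. */
void matmul_ikj(size_t n, const double *a, const double *b, double *c)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++)
            c[i * n + j] = 0.0;
        for (size_t k = 0; k < n; k++) {
            double aik = a[i * n + k];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}

/* Convert an elapsed time into MFLOPS, counting 2*n^3 operations. */
double mflops(size_t n, double seconds)
{
    assert(seconds > 0.0);
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / (seconds * 1.0e6);
}

typedef void (*matmul_fn)(size_t, const double *, const double *, double *);

/* Wall-clock time, in seconds, of one call to the given multiplication routine. */
double time_matmul(matmul_fn f, size_t n,
                   const double *a, const double *b, double *c)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    f(n, a, b, c);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec)
         + 1.0e-9 * (double)(t1.tv_nsec - t0.tv_nsec);
}
```

A driver would typically call `time_matmul` over a range of n, repeat each measurement several times, and report `mflops(n, seconds)` for each loop order. For very small n a single call may be too short to time reliably, so timing a loop of many calls is safer. The remaining four loop orders follow by permuting the three loops.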
You should prepare a summary report of your experiments and performance data.
Make sure all relevant information (e.g. compiler version and flags, machine type,
memory, clock speed, cache info) for each set of data is
provided. Make sure all plots are labeled and easy to read. Raw data and
all source files (along with compilation instructions - i.e. a makefile) should be provided. Make sure that any code that is timed is correct.
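One way to check that a timed routine is correct is to compare its output with the output of a trusted reference, such as the unmodified routine, on the same inputs. Optimized libraries and some compiler flags may change the order of floating-point operations, so exact equality is not guaranteed and a tolerance is usually needed. The sketch below is one possible check in C. The names `max_rel_diff` and `results_match` are hypothetical, and the choice of tolerance is left to you.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value without pulling in the math library. */
static double abs_d(double v) { return v < 0.0 ? -v : v; }

/*
 * Largest relative difference between two n x n results stored as flat
 * arrays. Differences are scaled by the larger magnitude of the two
 * entries, with a floor of 1.0 so entries near zero do not blow up.
 * A NaN anywhere is returned immediately so that it cannot pass unnoticed.
 */
double max_rel_diff(size_t n, const double *x, const double *y)
{
    assert(n > 0);
    double worst = 0.0;
    for (size_t i = 0; i < n * n; i++) {
        double ax = abs_d(x[i]);
        double ay = abs_d(y[i]);
        double scale = ax > ay ? ax : ay;
        if (scale < 1.0)
            scale = 1.0;
        double d = abs_d(x[i] - y[i]) / scale;
        if (d != d)
            return d;           /* NaN compares unequal to itself */
        if (d > worst)
            worst = d;
    }
    return worst;
}

/* Nonzero if the two results agree to within the relative tolerance tol. */
int results_match(size_t n, const double *x, const double *y, double tol)
{
    return max_rel_diff(n, x, y) <= tol;
}
```

Running this check once per variant and matrix size, outside the timed region, keeps the test from affecting the measurements.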
The report should be concise and easy to read. It should be organized
as follows: 1) Introduction to what was done, 2) Summary of the main
results you learned, 3) Description of the computing platform, 4) Summary
of experiments performed, tests for correctness, and data collected, including
plots. Include key observations for the plots that summarize what is shown.
Indicate the source files for the code timed along with the timing programs, how to
run the experiments, and the files where the data is available.
5) Conclusion. Submit the report as a PDF file.
Students should submit their solution electronically using BbVista.
Submit a gzipped tar file called A1.tar.gz (the tar file should contain a directory called A1 which
contains the files). The tar file should contain source code, instructions on how to run your programs,
sample input and output files, and a README file. The README file should describe all files that are
included, contain instructions on how to build and use the code, and outline how the code works.