# Lecture 7: Short Vector Instructions (SIMD Computation)

### Background Material

• Jeremy Johnson and Markus Püschel (2000), "In Search of the Optimal Walsh-Hadamard Transform," Proc. ICASSP 2000, pp. 3347-3350. (wht.ps,wht.pdf)
• Lecture 6 on the WHT.

• J. Johnson, R.W. Johnson, D. Rodriguez, and R. Tolimieri (1990), "A Methodology for Designing, Modifying, and Implementing Fourier Transform Algorithms on Various Architectures," Circuits, Systems, and Signal Processing 9, 449-500. (fft.ps,fft.pdf)

### Topics

• Overview of SSE
• Example programs (using gcc intrinsics)
1. uloop.c - unrolled version of loop.c
2. vloop.c - vectorized version of loop.c
3. vloop.s - assembly code for vloop.c
4. testloops.c - test program for vloop.c
5. vinner.c - vectorized inner product.
6. testinners.c - test program for inner product.
7. tshuffle.c - test program to illustrate the shuffle instruction.
8. vinner2.c - alternative vector version of inner product.
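The course's vinner.c is not reproduced here. The following is a rough sketch of the same idea, assuming x86 SSE and the `xmmintrin.h` intrinsics that gcc provides. The hypothetical function `vdot` accumulates four partial products per iteration and then combines the four lanes with two `_mm_shuffle_ps` calls, which is the kind of shuffle tshuffle.c illustrates. It assumes `n` is a multiple of 4 and uses unaligned loads, so the arrays need not be 16-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical SSE inner product; assumes n is a multiple of the vector length 4. */
float vdot(const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
    __m128 acc = _mm_setzero_ps();            /* four running partial sums */
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));

    /* Horizontal sum with two shuffles:
       (a0,a1,a2,a3) + (a1,a0,a3,a2) = (a0+a1, a0+a1, a2+a3, a2+a3),
       then adding the halves swapped puts the total in every lane. */
    __m128 t = _mm_add_ps(acc, _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(2, 3, 0, 1)));
    t = _mm_add_ps(t, _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 0, 3, 2)));
    return _mm_cvtss_f32(t);                  /* extract lane 0 */
}
```

Because the four lanes are summed separately, the rounding order differs from a sequential scalar loop, so for general floating-point data the result can differ from the scalar version in the last bits.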
• Implementing the WHT using vector instructions
• Review of vector instructions: short vector (SSE) and long vector (older vector supercomputers such as the Cray machines)
• Implementing a small WHT using vector instructions [e.g. WHT_4]
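For WHT_4 = (WHT_2 tensor I_2)(I_2 tensor WHT_2), all four inputs fit in one SSE register. A minimal sketch, assuming the gcc SSE intrinsics from `xmmintrin.h` (the function name `wht4_sse` is hypothetical): each stage builds the two butterfly operands with `_mm_shuffle_ps` and applies the signs by multiplying with a vector of ±1.

```c
#include <xmmintrin.h>

/* Hypothetical sketch: WHT_4 = (WHT_2 tensor I_2)(I_2 tensor WHT_2) in one register. */
void wht4_sse(float x[4])
{
    __m128 v = _mm_loadu_ps(x);

    /* I_2 tensor WHT_2: butterflies on adjacent pairs (x0,x1) and (x2,x3). */
    __m128 a = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 0, 0)); /* (x0,x0,x2,x2) */
    __m128 b = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 1, 1)); /* (x1,x1,x3,x3) */
    v = _mm_add_ps(a, _mm_mul_ps(b, _mm_setr_ps(1.0f, -1.0f, 1.0f, -1.0f)));

    /* WHT_2 tensor I_2: butterflies on elements two apart, (y0,y2) and (y1,y3). */
    a = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 1, 0));        /* (y0,y1,y0,y1) */
    b = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 2, 3, 2));        /* (y2,y3,y2,y3) */
    v = _mm_add_ps(a, _mm_mul_ps(b, _mm_setr_ps(1.0f, 1.0f, -1.0f, -1.0f)));

    _mm_storeu_ps(x, v);
}
```

For x = {1, 2, 3, 4} this produces {10, -2, -4, 0}, the product of the 4 x 4 Hadamard matrix with x.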
• Vectorized WHT factorizations (WHT_8):
1. WHT_8 = (WHT_2 tensor I_4)(I_2 tensor WHT_2 tensor I_2)(I_4 tensor WHT_2)
2. WHT_8 = (WHT_2 tensor I_4)L^8_2(WHT_2 tensor I_4)L^8_4 L^8_4 (WHT_2 tensor I_4)L^8_2
3. WHT_8 = (WHT_2 tensor I_4)L^8_2 (WHT_2 tensor I_4)L^8_2 (WHT_2 tensor I_4)L^8_2 [from 2, since L^8_4 L^8_4 = L^8_2]
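Factorization 3 suits SSE because every stage has the same shape: a stride permutation L^8_2 followed by WHT_2 tensor I_4, which is one vector add and one vector subtract on 4-float vectors. The sketch below assumes vector length 4, holds x[0..3] and x[4..7] in two registers, and reads L^8_2 as the stride-2 gather (x0, x2, x4, x6, x1, x3, x5, x7), the convention under which the factorizations above hold. The name `wht8_sse` is hypothetical. Factors act right to left, so each round shuffles first and then does the butterfly.

```c
#include <xmmintrin.h>

/* Hypothetical sketch: WHT_8 = [(WHT_2 tensor I_4) L^8_2]^3 with 4-float vectors. */
void wht8_sse(float x[8])
{
    __m128 lo = _mm_loadu_ps(x);       /* x[0..3] */
    __m128 hi = _mm_loadu_ps(x + 4);   /* x[4..7] */
    for (int r = 0; r < 3; ++r) {
        /* L^8_2: stride-2 gather into (x0,x2,x4,x6 | x1,x3,x5,x7). */
        __m128 ev = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2, 0, 2, 0));
        __m128 od = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3, 1, 3, 1));
        /* WHT_2 tensor I_4: one vector add and one vector subtract. */
        lo = _mm_add_ps(ev, od);
        hi = _mm_sub_ps(ev, od);
    }
    _mm_storeu_ps(x, lo);
    _mm_storeu_ps(x + 4, hi);
}
```

Applied to {1, 2, ..., 8} it returns {36, -4, -8, 0, -16, 0, 0, 0}. Since (L^8_2)^3 is the identity, the output comes back in natural order with no final permutation.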
• General vector formula using vectors of maximal length
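One natural general form, offered here as an assumption about what the lecture means, extends factorization 3 to N = 2^n as WHT_N = ((WHT_2 tensor I_(N/2)) L^N_2)^n. Applying L^N_2 n times brings each bit of the index into the leading position once and then returns to the natural order. Every stage then works on two vectors of the maximal length N/2. The scalar C below (hypothetical name `wht_const_geom`, with a caller-supplied scratch buffer) keeps that stage structure, and its inner loops are the ones that would be vectorized.

```c
#include <stddef.h>

/* Hypothetical sketch: WHT of size N = 2^n as n rounds of (WHT_2 tensor I_(N/2)) L^N_2.
   tmp is scratch space for N floats. */
void wht_const_geom(float *x, float *tmp, unsigned n)
{
    size_t N = (size_t)1 << n, h = N / 2;
    for (unsigned r = 0; r < n; ++r) {
        /* L^N_2: stride-2 gather, evens to the first half and odds to the second. */
        for (size_t i = 0; i < h; ++i) {
            tmp[i] = x[2 * i];
            tmp[h + i] = x[2 * i + 1];
        }
        /* WHT_2 tensor I_(N/2): add and subtract two vectors of length N/2. */
        for (size_t i = 0; i < h; ++i) {
            x[i] = tmp[i] + tmp[h + i];
            x[h + i] = tmp[i] - tmp[h + i];
        }
    }
}
```

With n = 3 this performs the same arithmetic as factorization 3 for WHT_8.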
• Formula for a given vector length
• Loop interleaving and vectorizing (I_m tensor WHT_N tensor I_(pv)).
1. (I_m tensor WHT_N tensor I_(pv)) = (I_m tensor WHT_N tensor I_p tensor I_v)
2. = (I_m tensor (L^(Np)_N (I_p tensor WHT_N) L^(Np)_p) tensor I_v) [using WHT_N tensor I_p = L^(Np)_N (I_p tensor WHT_N) L^(Np)_p]
3. = (I_m tensor ((L^(Np)_N tensor I_v) (I_p tensor WHT_N tensor I_v) (L^(Np)_p tensor I_v))) [distributing tensor I_v over the product. This is a doubly nested loop of calls to WHT_N tensor I_v. The stride permutations operate on vectors of size v and give the addressing of the vectors used in WHT_N tensor I_v.]
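Here is a sketch of the last formula for single-precision SSE (v = 4). It assumes the data is a contiguous array of m*N*p*v floats and that N is a power of two. The function names are hypothetical, and the parameter `n` stands for N. As the bracketed note says, the stride permutations are not carried out as data movement. They only determine which vectors are loaded, so each WHT_N tensor I_v call reads N vectors that are p*v floats apart. Inside that call, every add and subtract of an ordinary in-place radix-2 WHT becomes a 4-wide vector operation.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical: WHT_N tensor I_4 on n length-4 vectors that start at x and are
   `stride` floats apart; n must be a power of two. */
static void wht_n_i4(float *x, size_t n, size_t stride)
{
    for (size_t h = 1; h < n; h *= 2)              /* one radix-2 stage per h */
        for (size_t i = 0; i < n; i += 2 * h)
            for (size_t j = i; j < i + h; ++j) {
                float *pa = x + j * stride, *pb = x + (j + h) * stride;
                __m128 va = _mm_loadu_ps(pa), vb = _mm_loadu_ps(pb);
                _mm_storeu_ps(pa, _mm_add_ps(va, vb));   /* vector butterfly */
                _mm_storeu_ps(pb, _mm_sub_ps(va, vb));
            }
}

/* Hypothetical: I_m tensor WHT_N tensor I_p tensor I_4 as a doubly nested loop of
   WHT_N tensor I_4 calls; the stride permutations become the addressing below. */
void wht_im_wn_ipv(float *x, size_t m, size_t n, size_t p)
{
    const size_t v = 4;                            /* SSE vector length in floats */
    for (size_t i = 0; i < m; ++i)                 /* I_m: blocks of n*p*v floats */
        for (size_t j = 0; j < p; ++j)             /* I_p: j-th vector of each row */
            wht_n_i4(x + i * n * p * v + j * v, n, p * v);
}
```

For example, `wht_im_wn_ipv(x, 1, 2, 2)` computes WHT_2 tensor I_8 on 16 floats, and `wht_im_wn_ipv(x, 2, 4, 1)` computes I_2 tensor WHT_4 tensor I_4 on 32 floats.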