Lecture 3: Short Vector Instructions (SIMD Computation)
Background Material
- Jeremy Johnson and Markus Püschel (2000), "In Search of the Optimal Walsh-Hadamard
Transform," Proc. ICASSP 2000, pp. 3347-3350.
(wht.ps,wht.pdf)
- Lecture 6 on the WHT.
Reading
- J. Johnson, R.W. Johnson, D. Rodriguez, and R. Tolimieri (1990),
"A Methodology for Designing, Modifying, and Implementing Fourier
Transform Algorithms on Various Architectures," Circuits, Systems, and
Signal Processing 9, 449-500.
(fft.ps,fft.pdf)
Topics
- Overview of SSE
- Example programs (using gcc intrinsics)
- uloop.c - unrolled version of loop.c
- vloop.c - vectorized version of loop.c
- vloop.s - assembly code for vloop.c
- testloops.c - test program for vloop.c
- vinner.c - vectorized inner product.
- testinners.c - test program for inner product.
- tshuffle.c - test program to illustrate shuffle instruction.
- vinner2.c - alternative vector version of inner product.
- Implementing the WHT using vector instructions
- Review of vector instructions (short vector [SSE] and long vector [old
vector supercomputers such as Cray])
- Implementing the WHT using vector instructions [e.g. WHT_4]
- Vectorized WHT factorization.
- WHT_8 = (WHT_2 tensor I_4)(I_2 tensor WHT_2 tensor I_2)(I_4 tensor WHT_2)
- WHT_8 = (WHT_2 tensor I_4)L^8_2(WHT_2 tensor I_4)L^8_4 L^8_4
(WHT_2 tensor I_4)L^8_2
- WHT_8 = (WHT_2 tensor I_4)L^8_2
(WHT_2 tensor I_4)L^8_2
(WHT_2 tensor I_4)L^8_2 [using L^8_4 L^8_4 = L^8_2]
- General vector formula using vector of maximal length
- Formula with given vector length
- Loop interleaving and vectorizing (I_m tensor WHT_N tensor I_pv).
- (I_m tensor WHT_N tensor I_pv) =
(I_m tensor WHT_N tensor I_p tensor I_v)
- = (I_m tensor [L^(Np)_N (I_p tensor WHT_N) L^(Np)_p] tensor I_v)
- = (I_m tensor [(L^(Np)_N tensor I_v) (I_p tensor WHT_N tensor I_v)
(L^(Np)_p tensor I_v)]) [this is a doubly nested loop
of calls to WHT_N tensor I_v. The stride permutations
operate on vectors of size v and give the addressing
of the vectors used in WHT_N tensor I_v]
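The loop-interleaving formula above can be sketched in C as follows. This is a minimal illustration, not one of the course programs: it assumes v = 4 (single-precision SSE), an in-place iterative WHT in natural (Hadamard) order, and unaligned loads and stores. The names wht_vec, wht_interleaved and wht_tensor_ref are hypothetical. The two outer loops run over I_m and I_p, and the stride permutations appear only as the base address and the stride p*v passed to the kernel for WHT_N tensor I_v.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics (gcc: compile with -msse if needed) */

enum { V = 4 };  /* assumed vector length v: four floats per SSE register */

/* WHT_N tensor I_v, in place: N vectors of V floats each, with consecutive
   vectors `stride` floats apart.  N must be a power of two. */
void wht_vec(size_t N, float *x, size_t stride)
{
    assert(N != 0 && (N & (N - 1)) == 0);
    for (size_t h = 1; h < N; h *= 2)
        for (size_t k0 = 0; k0 < N; k0 += 2 * h)
            for (size_t k = k0; k < k0 + h; k++) {
                float *pa = x + k * stride;
                float *pb = x + (k + h) * stride;
                __m128 a = _mm_loadu_ps(pa);
                __m128 b = _mm_loadu_ps(pb);
                _mm_storeu_ps(pa, _mm_add_ps(a, b));  /* butterfly: a + b */
                _mm_storeu_ps(pb, _mm_sub_ps(a, b));  /* butterfly: a - b */
            }
}

/* I_m tensor WHT_N tensor I_(p*V), in place on m*N*p*V floats:
   a doubly nested loop of calls to the WHT_N tensor I_v kernel. */
void wht_interleaved(size_t m, size_t N, size_t p, float *x)
{
    for (size_t i = 0; i < m; i++)          /* I_m */
        for (size_t j = 0; j < p; j++)      /* I_p */
            wht_vec(N, x + i * N * p * V + j * V, p * V);
}

/* Scalar reference for I_m tensor WHT_N tensor I_s, computed directly from
   WHT_N[k][kk] = (-1)^popcount(k & kk).  Used only for checking. */
void wht_tensor_ref(size_t m, size_t N, size_t s, const float *x, float *y)
{
    for (size_t i = 0; i < m; i++)
        for (size_t k = 0; k < N; k++)
            for (size_t r = 0; r < s; r++) {
                float acc = 0.0f;
                for (size_t kk = 0; kk < N; kk++) {
                    int sign = 1;
                    for (size_t bits = k & kk; bits; bits &= bits - 1)
                        sign = -sign;
                    acc += (float)sign * x[(i * N + kk) * s + r];
                }
                y[(i * N + k) * s + r] = acc;
            }
}
```

Calling wht_interleaved(m, N, p, x) on an array of m*N*p*4 floats should give the same result as wht_tensor_ref(m, N, 4*p, x, y). Because the kernel only ever touches whole vectors, no shuffle instructions are needed here; the permutations are absorbed into the addressing.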
Lecture notes and slides
Programs and References
Tasks
- Implement and time a vector version of WHT_4 using SSE2.
- Implement and time a vector version of WHT_8 using SSE.
- Implement and time a vector version of the recursive WHT algorithm.
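As a possible starting point for the size-8 task, the factorization from the Topics list, three applications of (WHT_2 tensor I_4)L^8_2, maps onto SSE as sketched below. This is an illustrative sketch under stated assumptions, not a reference solution: it uses single-precision data, unaligned loads and stores, and the hypothetical names wht8_stage, wht8_sse and wht8_ref. L^8_2 is taken to be the stride-2 gather (even-indexed elements first, then odd-indexed ones), which two shuffle instructions realize; WHT_2 tensor I_4 is then one vector add and one vector subtract.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics (gcc: compile with -msse if needed) */

/* One stage (WHT_2 tensor I_4) L^8_2 applied to the 8-vector (a ; b). */
void wht8_stage(__m128 *a, __m128 *b)
{
    /* L^8_2: gather the even-indexed, then the odd-indexed elements. */
    __m128 even = _mm_shuffle_ps(*a, *b, _MM_SHUFFLE(2, 0, 2, 0)); /* a0 a2 b0 b2 */
    __m128 odd  = _mm_shuffle_ps(*a, *b, _MM_SHUFFLE(3, 1, 3, 1)); /* a1 a3 b1 b3 */
    /* WHT_2 tensor I_4: top half plus/minus bottom half. */
    *a = _mm_add_ps(even, odd);
    *b = _mm_sub_ps(even, odd);
}

/* In-place WHT_8 on 8 floats: three identical stages. */
void wht8_sse(float *x)
{
    assert(x != NULL);
    __m128 a = _mm_loadu_ps(x);
    __m128 b = _mm_loadu_ps(x + 4);
    for (int stage = 0; stage < 3; stage++)
        wht8_stage(&a, &b);
    _mm_storeu_ps(x, a);
    _mm_storeu_ps(x + 4, b);
}

/* Scalar reference from the definition WHT_8[i][j] = (-1)^popcount(i & j).
   Used only for checking the vector version. */
void wht8_ref(const float *x, float *y)
{
    for (int i = 0; i < 8; i++) {
        float acc = 0.0f;
        for (int j = 0; j < 8; j++) {
            int sign = 1;
            for (int bits = i & j; bits; bits &= bits - 1)
                sign = -sign;
            acc += (float)sign * x[j];
        }
        y[i] = acc;
    }
}
```

All three stages are the same code, so the only data movement is the pair of shuffles per stage. For timing, the call to wht8_sse would typically be repeated many times on data already in cache and compared against a scalar version such as wht8_ref.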
Created: Nov. 5, 2008 by jjohnson AT cs DOT drexel DOT edu