Lecture 8: Shared Memory Parallel Programming
- Material from lectures 5, 6, and 7
- Timing and performance analysis of sequential programs
- Multi-core processors
- Introduction to shared memory parallel programming with Pthreads
Introduction to parallel program design paradigms
- race conditions
- performance issues - synchronization overhead, contention and
granularity, load balance, cache coherency and false sharing.
Introduction to shared memory parallel programming with OpenMP
- Data parallelism (static scheduling)
- Task parallelism with workers
- Divide and conquer parallelism (fork/join)
- Parallel loops
- Parallel regions
- ipdps.ppt - slides on parallel WHT (from IPDPS 02 paper)
- Example programs with UNIX processes
- Pthreads programs
- thr.c - sample program with fork/join
- badcount.c - shared memory program with race condition.
- goodcount.c - shared memory program with mutex.
- bettercount.c - improved shared memory program with mutex.
- timefj.c - program to time fork/join (fork/wait) vs. a
procedure call.
(wtime.c is used for timing)
- OpenMP programs
- hellob.c - hello with barrier
- Parallel programming examples
- icount.c - iterative program to add the elements of an array.
- rcount.c - recursive program to add the elements of an array.
- pcount.c - parallel divide and conquer program to add the elements of an array.
- dcount.c - parallel dynamic worker program to add the elements of an array.
- scount.c - parallel static task program to add the elements of an array.
- cfft.c - divide and conquer radix two FFT. Relies on complex.c (don't forget to
link with the math library, -lm).
- testcfft.c - test program for cfft. Input: n = log of size (N = 2^n) and index i with
0 ≤ i < N. The output should be the ith column of the N-point DFT.
- timecfft.c - timing program for cfft using
- pcfft.c - parallel divide and conquer FFT
- pcffts.c - parallel divide and conquer FFT version 2 (with stride parameter)
- timepcfft.c - timing program for pcfft using
- timepcffts.c - timing program for pcffts using
- Implement and time a Parallel radix 2 divide and conquer WHT using
Pthreads. Compute speedup compared to your sequential code. What was
the smallest input size for which you obtained speedup?
- Implement and time parallel multiple WHTs, i.e. I_M tensor W_N
- Implement and time a Parallel WHT,
i.e. W_MN = (W_M tensor I_N)(I_M tensor W_N), using OpenMP.
Both the divide and conquer parts should be parallelized. To
improve performance (remove false sharing) loop interleaving should
be used for (W_M tensor I_N).
Related Links and Info
Created: Nov. 18, 2008 by jjohnson AT cs DOT drexel DOT edu
- Multi-core page from Wikipedia
- Multi-core processors
- Xeon 5000 series - Intel quad-core
(kodiak.cs.drexel.edu) information.
- Intel Nehalem microarchitecture
- Hyper-threading page from Wikipedia
- Pthreads page from Wikipedia
- Pthreads tutorial.
- OpenMP specifications
- OpenMP tutorial.