Project 1: Memory Hierarchies and Performance of Serial Applications

Due: EOD, 23 January

Learning goals

In this project you will explore how memory hierarchies affect performance for serial applications.

arithmetic intensity: how to compute; divides; memory access
at least 2 different system architectures. Compare and contrast.

All of these aspects of serial applications are transferable to parallel applications, as we shall see later in the course.

Warm-up

Review the section in HPSC on computing arithmetic intensity for given compute kernels. Then, as a group, compute the arithmetic intensities of the following kernels in units of FLOPs/byte, assuming 8 bytes per float.

  Y[j] += Y[j] + A[j][i] * B[i]

  s += A[i] * A[i]

  s += A[i] * B[i]

  Y[i] = A[i] + C*B[i]

Included a table in your project report listing the arithmetic intensities for these kernels.

Part 1: Matrix-matrix Multiplication

In this first part of the project, you will test the performance of the basic matrix-matrix multiplication operation as a function of matrix size on multiple compute platforms. Complete the following using at least two different compute architectures (e.g., your laptop and HPCC, or two different node architectures on HPCC).

With your group, write a program that multiplies two matrices together (see: http://mathworld.wolfram.com/MatrixMultiplication.html). Use GitHub to manage the code development and version history.
For a given matrix size N, what is the total number of floating point operations performed by this operator?
Using the supplied C routine get_walltime.c, or any other accurate means of measuring time, compute the performance in Mflop/s of the matrix-matrix multiply for N=100. Be sure to perform enough repeat calculations of the timing to overcome any statistical noise in the measurement.
For the system you are running on, determine the clock speed of the processor and the cache size/layout. Use this information to estimate the theoretical peak performance of the system, assuming that the processor is capable of one flop per clock cycle (generally NOT true on modern architectures). How does the performance you measured in (3) compare to the theoretical peak performance of your system?
Now repeat the performance measurement for a range of matrix size N from 1 to 10,000,000. Make a plot of the resulting measured Gflop/s vs. N. On this plot place a horizontal line representing the theoretical peak performance based upon your system’s clock speed.
How does the measured performance for multiple N’s compare to peak? Are there any “features” in your plot? Explain them in the context of the hardware architecture of your system. Include in your write-up a description of your system’s architecture (processor, cache, etc.).

To your project git repo, commit your code for performing the matrix-matrix multiply performance measurements, the plots of your results, and a brief write-up (in plain text or markdown) addressing the above questions and discussing your results. Pay particular attention to the comparison between different architectures and give explanations for them.

Part 2: The Roofline Model

In this part, you will explore the roofline model for analyzing the interplay between arithmetic intensity and memory bandwidth for architectures with complex memory hierarchies. Complete the following exercises on the SAME compute architectures that you used in Part 1 above.

Reference the materials on the Roofline Performance model at https://crd.lbl.gov/divisions/amcr/computer-science-amcr/par/research/roofline/. In particular, look through “Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures” and the slides at https://crd.lbl.gov/assets/pubs_presos/parlab08-roofline-talk.pdf.
Clone the CS Roofline Toolkit, git clone https://bitbucket.org/berkeleylab/cs-roofline-toolkit.git. Modify one of the config files in Empirical_Roofline_Tool-1.1.0/Config as necessary for the machine you are using.
Run the ERT in serial mode on your local machine. Report the peak performances and bandwidths (for all caches levels as well as DRAM). Where is the “ridge point” of the roofline for the various cases?
Consider the four FP kernels in “Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures” (see their Table 2). Assuming the high end of operational (i.e., “arithmetic”) intensity, how would these kernels perform on the platforms you are testing? What optimization strategy would you recommend to increase performance of these kernels?
Address the same questions in (4) for the four kernels given in the Warm-up above.
Compare your results for the roofline model to what you obtained for the matrix-matrix multiplication operation from Part 1. How are the rooflines of memory bandwidth related to the features in the algorithmic performance as a function of matrix size?

To your project write-up, add your plots of the roofline model for the systems you tested, and responses addressing the above questions.

CMSE 822: Parallel Computing

Michigan State University - Computational Mathematics, Science, and Engineering 822: Parallel Computing