Project 2: Getting Started with MPI
Due: EOD, 9 February
Learning goals
In this project you will explore using basic MPI collectives on HPCC. After finishing this project, you should
- understand how to parallelize a simple serial code with MPI,
- understand how to use basic MPI collectives,
- be able to run MPI-parallel applications on HPCC,
- develop your theoretical understanding of key parallel computing concepts such as:
  - functional parallelism,
  - collective communication,
  - parallel scaling, and
  - parallel efficiency.
Part 1: Warm-up Exercises
As a group, complete the following exercises from HPSC.
- Exercise 2.18
- Exercise 2.19
- Exercise 2.21
- Exercise 2.22
- Exercise 2.23
- Exercise 2.27
Include your responses to these exercises in your project write-up.
Part 2: Setup on HPCC
The following is a very quick tutorial on the basics of using HPCC for this class.
- Log in to the HPCC gateway: `ssh <netid>@hpcc.msu.edu`
- Then log in to the AMD-20 cluster from the gateway: `module load powertools amd20`
- The default environment and software stack should be sufficient for this exercise, but if you run into issues compiling and running the code, try issuing `module purge` followed by `module load intel/2021a`.
- When using HPCC for development and exercises in this class, please do NOT just use the head node, `dev-amd20`. We will swamp the node and no one will get anything done. Instead, request an interactive job using the SLURM scheduler. An easy way to do this is to set up an alias command like so: `alias devjob='salloc -n 4 --time 1:30:00'`
- Run `devjob` to request 4 tasks for 90 minutes. This should be sufficient for most of the work we do during class, though for your projects you will at times require more resources. The above `module` and `alias` commands can be added to your `.bashrc` so that they are automatically executed when you log in.
Part 3: MPI Basics
- Clone your Project 2 repo from GitHub on HPCC.
- In the project directory you will find a simple “Hello World!” source file in C++. Compile and run the code, e.g., `g++ hello.cpp` followed by `./a.out`.
- Now run the executable `a.out` using the MPI parallel run command and explain the output: `mpiexec -n 4 ./a.out`
- Add the commands `MPI_Init` and `MPI_Finalize` to your code. Put three different print statements in your code: one before the init, one between init and finalize, and one after the finalize. Recompile and run the executable, both in serial and with `mpiexec`, and explain the output. (A minimal sketch of what this might look like appears after this list.)
- Complete Exercises 2.3, 2.4, and 2.5 in the Parallel Programming book.
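The sketch below shows one way the hello-world program might look after adding the init/finalize step above; the file name `hello_mpi.cpp` and the exact messages are illustrative, not part of the repo.

```cpp
// hello_mpi.cpp -- illustrative sketch of "Hello World!" with MPI_Init/MPI_Finalize.
#include <cstdio>
#include <mpi.h>

int main(int argc, char **argv) {
    std::printf("Before MPI_Init\n");      // printed by every process launched by mpiexec

    MPI_Init(&argc, &argv);                // start the MPI environment

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this process's ID within MPI_COMM_WORLD
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of ranks
    std::printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                        // shut down the MPI environment

    std::printf("After MPI_Finalize\n");
    return 0;
}
```

On most MPI installations this is compiled with the wrapper compiler, e.g. `mpicxx hello_mpi.cpp`, and then run both as `./a.out` and as `mpiexec -n 4 ./a.out` so the serial and parallel outputs can be compared.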
Part 4: Eat Some Pi
Pi is the ratio of a circle’s circumference to its diameter. As such, the value of pi can be computed as follows. Consider a circle of radius r inscribed in a square of side length 2r. Randomly generate points within the square and determine how many of them also fall inside the circle. If f = nc/ns is the number of points in the circle divided by the number of points in the square, then pi can be approximated as pi ≈ 4f, since the ratio of the circle’s area to the square’s area is pi r^2 / (2r)^2 = pi/4. Note that the more points generated, the better the approximation.
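For reference, a minimal serial sketch of this dart-throwing estimate is shown below. It is not the `ser_pi_calc` program from the repo; for simplicity it samples the unit square and counts hits inside the quarter circle of radius 1 (which gives the same ratio, pi/4), and the dart count and seed are arbitrary choices.

```cpp
// Serial Monte Carlo estimate of pi (illustrative sketch, not ser_pi_calc).
#include <cstdio>
#include <cstdlib>

int main() {
    const long n_darts = 1000000;  // total number of random points ("darts"); arbitrary
    long n_circle = 0;             // darts that land inside the circle

    std::srand(12345);             // fixed seed so repeated runs are reproducible
    for (long i = 0; i < n_darts; ++i) {
        // Random point in the unit square [0,1] x [0,1].
        double x = std::rand() / (double)RAND_MAX;
        double y = std::rand() / (double)RAND_MAX;
        if (x * x + y * y <= 1.0)  // inside the quarter circle of radius 1
            ++n_circle;
    }

    // A fraction f = n_circle/n_darts of roughly pi/4 lands inside, so pi ~ 4f.
    std::printf("pi ~ %.6f\n", 4.0 * n_circle / (double)n_darts);
    return 0;
}
```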
- Look at the C program `ser_pi_calc`. Extend this program using collective MPI routines to compute pi in parallel using the method described above. Feel free to use C++ if you prefer. (A hedged sketch of one possible approach appears at the end of this part.)
- For the first iteration, perform the same number of “rounds” on each MPI rank. Measure the total runtime using `MPI_Wtime()`. Vary the number of ranks used from 1 to 4. How does the total runtime change?
- Now, divide the number of “rounds” up amongst the ranks using the appropriate MPI routines to decide how to distribute the work. Again, run the program on 1 to 4 ranks. How does the runtime vary now?
- Now let’s change the number of “darts” and ranks. Use your MPI program to compute pi using total dart counts of 1E3, 1E6, and 1E9. For each dart count, run your code on HPCC with processor counts of 1, 2, 4, 8, 16, 32, and 64. Keep track of the resulting values of pi and the runtimes. Use non-interactive jobs and modify the `submitjob.sb` script as necessary.
- For each processor count, plot the resulting errors in your computed values of pi, compared to the true value, as functions of the number of darts used to compute it. Use log-log scaling on this plot. What is the rate of convergence of your algorithm for computing pi? Does this value make sense? Does the error or rate of convergence to the correct answer vary with processor count? Should it?
- For each dart count, make a plot of runtime versus processor count. Each line represents a “strong scaling” study for your code. For each dart count, also plot the “ideal” scaling line. Calculate the parallel scaling efficiency of your code for each dart count (the single-rank runtime divided by p times the runtime on p ranks). Does the parallel performance vary with dart count? Explain your answer.
- Going further: try running your code on different node types on HPCC with varied core counts. In particular, try to ensure that your runs utilize multiple nodes so that the communication network is used. Do you see a change in the communication cost when the job is run on more than one node?
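The sketch below shows one way the parallel extension referenced above might be organized: the darts are divided across the ranks, each rank counts its own hits, a single `MPI_Reduce` collects the totals on rank 0, and `MPI_Wtime()` brackets the work. Dart counts, seeds, and variable names are illustrative; treat this as a starting-point sketch under those assumptions, not the required solution.

```cpp
// Parallel Monte Carlo estimate of pi (illustrative sketch).
#include <cstdio>
#include <cstdlib>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long total_darts = 1000000;           // adjust for the 1E3 / 1E6 / 1E9 study
    long my_darts = total_darts / size;         // divide the darts evenly among the ranks
    if (rank < total_darts % size) ++my_darts;  // spread any remainder over the low ranks

    std::srand(12345 + rank);                   // give each rank a different stream of points
    double t_start = MPI_Wtime();

    long my_hits = 0;                           // this rank's darts inside the quarter circle
    for (long i = 0; i < my_darts; ++i) {
        double x = std::rand() / (double)RAND_MAX;
        double y = std::rand() / (double)RAND_MAX;
        if (x * x + y * y <= 1.0) ++my_hits;
    }

    long total_hits = 0;
    // Collective reduction: sum every rank's hit count onto rank 0.
    MPI_Reduce(&my_hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    double t_elapsed = MPI_Wtime() - t_start;

    if (rank == 0)
        std::printf("pi ~ %.6f  (%.3f s on %d ranks)\n",
                    4.0 * total_hits / (double)total_darts, t_elapsed, size);

    MPI_Finalize();
    return 0;
}
```

For the first iteration of Part 4 (the same number of “rounds” on every rank), the division step would simply be `my_darts = total_darts`, so each rank throws the full set of darts.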
What to turn in
To your git project repo, commit your final working code for the above exercises and a concise write-up that includes responses to the warm-up exercises, performance and accuracy data for your calculations of pi, the plots made in Part 4, and detailed responses to the questions posed concerning your results.