Project 2: Getting Started with MPI
Due: EOD, 9 February
Learning goals
In this project you will explore using basic MPI collectives on HPCC. After finishing this project, you should
- understand how to parallelize a simple serial code with MPI,
- understand how to use basic MPI collectives,
- be able to run MPI-parallel applications on HPCC,
- develop your theoretical understanding of key parallel computing concepts such as:
  - functional parallelism,
  - collective communication,
  - parallel scaling, and
  - parallel efficiency.
Part 1: Warm-up Exercises
As a group, complete the following exercises from HPSC.
- Exercise 2.18
- Exercise 2.19
- Exercise 2.21
- Exercise 2.22
- Exercise 2.23
- Exercise 2.27
Include your responses to these exercises in your project write-up.
Part 2: Setup on HPCC
The following is a very quick tutorial on the basics of using HPCC for this class.
- Log in to the HPCC gateway: `ssh <netid>@hpcc.msu.edu`
- Then log in to the AMD-20 cluster from the gateway: `module load powertools amd20`
- The default environment and software stack should be sufficient for this exercise, but if you run into issues compiling and running the code, try issuing `module purge` followed by `module load intel/2021a`.
- When using HPCC for development and exercises in this class, please do NOT just use the head node, `dev-amd20`. We will swamp the node and no one will get anything done. Instead, request an interactive job using the SLURM scheduler. An easy way to do this is to set up an alias command like so: `alias devjob='salloc -n 4 --time 1:30:00'`
- Run `devjob` to request 4 tasks for 90 minutes. This should be sufficient for most of the work we do during class, though for your projects you will at times require more resources. The above `module` and `alias` commands can be added to your `.bashrc` so that they are automatically executed when you log in.
Part 3: MPI Basics
- Clone your Project 2 repo from GitHub on HPCC.
- In the project directory you will find a simple “Hello World!” source file in C++. Compile and run the code, e.g., `g++ hello.cpp` followed by `./a.out`.
- Now run the executable `a.out` using the MPI parallel run command and explain the output: `mpiexec -n 4 ./a.out`
- Add the commands `MPI_Init` and `MPI_Finalize` to your code. Put three different print statements in your code: one before the init, one between init and finalize, and one after the finalize. Recompile and run the executable, both in serial and with `mpiexec`, and explain the output. (A minimal sketch of what this might look like appears after this list.)
- Complete Exercises 2.3, 2.4, and 2.5 in the Parallel Programming book.
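The sketch below shows one way the hello-world program might look after adding the init/finalize step above; the file name `hello_mpi.cpp` and the exact messages are illustrative, not part of the repo.

```cpp
// hello_mpi.cpp -- illustrative sketch of "Hello World!" with MPI_Init/MPI_Finalize.
#include <cstdio>
#include <mpi.h>

int main(int argc, char **argv) {
    std::printf("Before MPI_Init\n");      // printed by every process launched by mpiexec

    MPI_Init(&argc, &argv);                // start the MPI environment

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this process's ID within MPI_COMM_WORLD
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of ranks
    std::printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                        // shut down the MPI environment

    std::printf("After MPI_Finalize\n");
    return 0;
}
```

On most MPI installations this is compiled with the wrapper compiler, e.g. `mpicxx hello_mpi.cpp`, and then run both as `./a.out` and as `mpiexec -n 4 ./a.out` so the serial and parallel outputs can be compared.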
Part 4: Eat Some Pi
Pi is the ratio of a circle’s circumference to its diameter. As such, the value of pi can be computed as follows. Consider a circle of radius r inscribed in a square of side length 2r. Randomly generate points within the square and determine how many of them also fall inside the circle. If f = nc/ns is the number of points in the circle divided by the number of points in the square, then pi can be approximated as pi ≈ 4f, since the ratio of the circle’s area to the square’s area is pi r^2 / (2r)^2 = pi/4. Note that the more points generated, the better the approximation.
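For reference, a minimal serial sketch of this dart-throwing estimate is shown below. It is not the `ser_pi_calc` program from the repo; for simplicity it samples the unit square and counts hits inside the quarter circle of radius 1 (which gives the same ratio, pi/4), and the dart count and seed are arbitrary choices.

```cpp
// Serial Monte Carlo estimate of pi (illustrative sketch, not ser_pi_calc).
#include <cstdio>
#include <cstdlib>

int main() {
    const long n_darts = 1000000;  // total number of random points ("darts"); arbitrary
    long n_circle = 0;             // darts that land inside the circle

    std::srand(12345);             // fixed seed so repeated runs are reproducible
    for (long i = 0; i < n_darts; ++i) {
        // Random point in the unit square [0,1] x [0,1].
        double x = std::rand() / (double)RAND_MAX;
        double y = std::rand() / (double)RAND_MAX;
        if (x * x + y * y <= 1.0)  // inside the quarter circle of radius 1
            ++n_circle;
    }

    // A fraction f = n_circle/n_darts of roughly pi/4 lands inside, so pi ~ 4f.
    std::printf("pi ~ %.6f\n", 4.0 * n_circle / (double)n_darts);
    return 0;
}
```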
- Look at the C program `ser_pi_calc`. Extend this program using collective MPI routines to compute pi in parallel using the method described above. Feel free to use C++ if you prefer. (A hedged sketch of one possible approach appears at the end of this part.)
- For the first iteration, perform the same number of “rounds” on each MPI rank. Measure the total runtime using `MPI_Wtime()`. Vary the number of ranks used from 1 to 4. How does the total runtime change?
- Now, divide the number of “rounds” up amongst the ranks using the appropriate MPI routines to decide how to distribute the work. Again, run the program on 1 to 4 ranks. How does the runtime vary now?
- Now let’s change the number of “darts” and ranks. Use your MPI program to compute pi using total dart counts of 1E3, 1E6, and 1E9. For each dart count, run your code on HPCC with processor counts of 1, 2, 4, 8, 16, 32, and 64. Keep track of the resulting values of pi and the runtimes. Use non-interactive jobs and modify the `submitjob.sb` script as necessary.
- For each processor count, plot the resulting errors in your computed values of pi, compared to the true value, as functions of the number of darts used to compute it. Use log-log scaling on this plot. What is the rate of convergence of your algorithm for computing pi? Does this value make sense? Does the error or rate of convergence to the correct answer vary with processor count? Should it?
- For each dart count, make a plot of runtime versus processor count. Each line represents a “strong scaling” study for your code. For each dart count, also plot the “ideal” scaling line. Calculate the parallel scaling efficiency of your code for each dart count (the single-rank runtime divided by p times the runtime on p ranks). Does the parallel performance vary with dart count? Explain your answer.
- Going further: try running your code on different node types on HPCC with varied core counts. In particular, try to ensure that your runs utilize multiple nodes so that the communication network is used. Do you see a change in the communication cost when the job is run on more than one node?
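The sketch below shows one way the parallel extension referenced above might be organized: the darts are divided across the ranks, each rank counts its own hits, a single `MPI_Reduce` collects the totals on rank 0, and `MPI_Wtime()` brackets the work. Dart counts, seeds, and variable names are illustrative; treat this as a starting-point sketch under those assumptions, not the required solution.

```cpp
// Parallel Monte Carlo estimate of pi (illustrative sketch).
#include <cstdio>
#include <cstdlib>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long total_darts = 1000000;           // adjust for the 1E3 / 1E6 / 1E9 study
    long my_darts = total_darts / size;         // divide the darts evenly among the ranks
    if (rank < total_darts % size) ++my_darts;  // spread any remainder over the low ranks

    std::srand(12345 + rank);                   // give each rank a different stream of points
    double t_start = MPI_Wtime();

    long my_hits = 0;                           // this rank's darts inside the quarter circle
    for (long i = 0; i < my_darts; ++i) {
        double x = std::rand() / (double)RAND_MAX;
        double y = std::rand() / (double)RAND_MAX;
        if (x * x + y * y <= 1.0) ++my_hits;
    }

    long total_hits = 0;
    // Collective reduction: sum every rank's hit count onto rank 0.
    MPI_Reduce(&my_hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    double t_elapsed = MPI_Wtime() - t_start;

    if (rank == 0)
        std::printf("pi ~ %.6f  (%.3f s on %d ranks)\n",
                    4.0 * total_hits / (double)total_darts, t_elapsed, size);

    MPI_Finalize();
    return 0;
}
```

For the first iteration of Part 4 (the same number of “rounds” on every rank), the division step would simply be `my_darts = total_darts`, so each rank throws the full set of darts.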
What to turn in
To your git project repo, commit your final working code for the above exercises and a concise write-up that includes responses to the warm-up exercises, performance and accuracy data for your calculations of pi, the plots made in Part 4, and detailed responses to the questions posed concerning your results.