Using MPI


CAVEAT EMPTOR: There is no "quick" tutorial on MPI jobs. Either you've done this before or you're going to have to dig in with an HPC expert friend of yours. If you don't have an MPI guru friend, feel free to contact Jeff Gardner or the PACS staff and we can direct you to expertise.


Learning about Parallel Programming

Learning how to parallelize jobs isn't impossible, but this cluster demands a certain level of savvy in order to optimize internodal communication. Most likely the code base from your last MPI run will work as-is, but in order to really see your code fly (and use this machine to its fullest extent) we recommend a medley of OpenMP (or threading of some sort) for on-node work and OpenMPI (or mvapich) for communication between nodes.
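As a rough illustration of that hybrid approach (the file name and comments below are ours, not part of the Athena documentation; OpenMP is typically enabled with -fopenmp for the gcc wrappers or -openmp for Intel compilers of this vintage, so check your compiler's manual), a minimal MPI+OpenMP program looks something like this:

 /* hello_hybrid.c: a minimal hybrid MPI+OpenMP sketch (illustration only). */
 #include <stdio.h>
 #include <mpi.h>
 #include <omp.h>
 
 int main(int argc, char **argv)
 {
     int rank, nranks;
 
     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &nranks);
 
     /* Each MPI process spawns a team of OpenMP threads for on-node work,
        so only traffic between nodes has to cross the network. */
     #pragma omp parallel
     printf("rank %d of %d, thread %d of %d\n",
            rank, nranks, omp_get_thread_num(), omp_get_num_threads());
 
     MPI_Finalize();
     return 0;
 }

A common layout on an 8-core node is one MPI process per node with 8 OpenMP threads per process, but the right balance depends on your code.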

Sites with very helpful information:

https://wiki.rocksclusters.org/wiki/index.php/Using_MPI-Selector_with_the_Cisco-OFED_Roll

http://www.openmp.org/blog/

https://computing.llnl.gov/tutorials/openMP/

http://ait.web.psi.ch/services/linux/hpc/hpc_user_cookbook/parallel_computing/openmp/

Compiling and Running MPI Programs on Athena

The following instructions assume some familiarity with the basics of MPI execution and contain site-specific information.

Step 1: Selecting your MPI

ClusterCorp's mpi-selector-menu command sets up your MPI environment for you and sets your paths correctly. This is fairly well documented at the "Using MPI-Selector" URL above. (You can even speed this process up by simply using mpi-selector; a non-interactive example follows the session below.) We recommend using openmpi with the intel compilers, although we have not done extensive tests with mvapich. Also, the PGI compilers are not installed. To select your MPI environment, do the following:

[richardc@athena0 ~]$ mpi-selector-menu 
Current system default: <none>
Current user default:   <none> 

    "u" and "s" modifiers can be added to numeric and "U"
    commands to specify "user" or "system-wide". 

1. mvapich_gcc-0.9.9
2. mvapich_intel-0.9.9
3. mvapich_pgi-0.9.9
4. openmpi_gcc-1.2.2
5. openmpi_intel-1.2.2
6. openmpi_pgi-1.2.2
U. Unset default
Q. Quit

Selection (1-6[us], U[us], Q): 5
Operator on the per-user or system-wide default (u/s)? u

Now my user default environment is intel compilers with openmpi libraries.
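If you would rather skip the menu (say, from a setup script), mpi-selector itself takes command-line flags. The flags shown here are the usual OFED ones; run mpi-selector --help on Athena to confirm them before relying on this:

 [richardc@athena0 ~]$ mpi-selector --list
 [richardc@athena0 ~]$ mpi-selector --set openmpi_intel-1.2.2 --user
 [richardc@athena0 ~]$ mpi-selector --query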

IMPORTANT: Once you have selected your MPI environment, log out then log back in again for the changes to take effect!

Step 2: Compiling your MPI program

Once you have run the MPI selector and logged back in, you should have the appropriate MPI compiler wrappers in your path (e.g. "mpicc", "mpic++", "mpif77", "mpif90"). Compile your program by invoking the appropriate wrapper:

 [richardc@athena0 ~]$ mpicc -o my_mpi_program -O2 my_mpi_program.c -lm
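For reference, a bare-bones my_mpi_program.c that the command above could compile might look like the following (this is a generic MPI hello-world of our own, not code from the Athena documentation):

 /* my_mpi_program.c: a bare-bones MPI example (illustration only). */
 #include <stdio.h>
 #include <mpi.h>
 
 int main(int argc, char **argv)
 {
     int rank, nranks, len;
     char name[MPI_MAX_PROCESSOR_NAME];
 
     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &nranks);
     MPI_Get_processor_name(name, &len);
 
     printf("Hello from rank %d of %d on %s\n", rank, nranks, name);
 
     MPI_Finalize();
     return 0;
 }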

Step 3: Running your MPI program

You should only run MPI programs from within a PBS batch script. See the Submitting Jobs section for more information on how to write and submit PBS batch scripts. To run on more than a single processor core, your script will have to specify the number of nodes and cores that you want to use.

Specifying nodes and cores

Each Athena node has 8 cores (i.e. two 4-core processors). For parallel programs, we recommend that you request entire nodes, 8 cores per node. You specify this with the PBS resource parameter -l nodes=<nodes>:ppn=<cores_per_node>, where <nodes> is the number of nodes you want and <cores_per_node> is the number of processor cores you want on each node. We recommend ppn=8.

Note that PBS also has a -l ncpus=<num_cores> option. We strongly recommend against using this, since it will give you <num_cores> cores that are randomly distributed throughout the system, often on nodes that are running other serial jobs. Remember, there is only one network interface card on each node, so everyone on that node gets to share it.
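As a concrete illustration (the node count, walltime, and script name here are made up), the recommended whole-node request can be written as directives inside your batch script:

 #PBS -l nodes=4:ppn=8
 #PBS -l walltime=01:00:00

or, equivalently, on the qsub command line at submission time:

 qsub -l nodes=4:ppn=8 -l walltime=01:00:00 my_job.csh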

Starting your MPI job with your batch script

In your PBS batch script, you will need to invoke your executable with the "mpirun" command, being sure to pass the PBS node file on the command line:

 #!/bin/csh
 #PBS -l nodes=16:ppn=8
 #PBS -l walltime=00:10:00
 cd $PBS_O_WORKDIR
 mpirun -np 128 -hostfile $PBS_NODEFILE ./my_mpi_program argument1 argument2

Note that if you do not use -hostfile $PBS_NODEFILE, all of your MPI processes will spawn on a single node. This will probably not yield the performance that you had hoped for. It may even take down the node if all those extra processes cause an over-commitment of memory. In this case, we will hunt you down and tie you to a more traditional-type rack (generously provided by Microsoft*) in the Athena machine room that slowly pulls your arms and legs out of their sockets.**

Here is an example MPI PBS script that derives all of its critical information (path, size, etc) from PBS environment variables:

 #!/bin/csh
 
 # Keep only the numeric part of the job ID (everything before the first ".")
 set pbsjobid = `echo $PBS_JOBID | awk -F . '{print $1}'`
 # PBS writes one line per allocated core to $PBS_NODEFILE
 set numprocs = `wc -l < $PBS_NODEFILE`
 
 echo This is job $pbsjobid
 echo The master node of this job is `hostname`
 echo The working directory is $PBS_O_WORKDIR
 echo This job is running on $numprocs processors
 echo The nodefile for this job is stored in nodefile.$pbsjobid
 
 cd $PBS_O_WORKDIR
 # Keep a copy of the nodefile for later reference
 cp $PBS_NODEFILE nodefile.$pbsjobid
 
 mpirun -np $numprocs -hostfile $PBS_NODEFILE ./tstmpi
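To actually launch the job, save the script (the file name below is made up) and hand it to PBS; qstat lets you keep an eye on it:

 [richardc@athena0 ~]$ qsub my_mpi_job.csh
 [richardc@athena0 ~]$ qstat -u richardc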


* It was included with the SQL Server license.


** Just kidding about the rack thing.***


*** Sort of.

Additional InfiniBand Resources

Cisco IB information on compiling and linking:

http://www.cisco.com/en/US/products/ps6428/products_user_guide_chapter09186a00807a3453.html

http://www.cisco.com/en/US/products/ps6428/prod_release_note09186a00808d74f9.html