How to mix MPI and CUDA in a single program

MPI is a well-known programming model for distributed-memory computing. If you have access to GPU resources, MPI can be used to distribute tasks across computers, each of which uses both its CPU and its GPU to process the task it receives.

My toy problem at hand is to use a mix of MPI and CUDA to handle traditional sparse matrix-vector multiplication. The program can be structured as:

Each node uses both CPU and GPU resources

  1. Read a sparse matrix from disk, and split it into sub-matrices.
  2. Use MPI to distribute the sub-matrices to processes.
  3. Each process calls a CUDA kernel to handle the multiplication. The result of the multiplication is copied back to each computer's host memory.
  4. Use MPI to gather the partial results from each of the processes, and assemble the final result.

One of the options is to put both the MPI and CUDA code in a single file. This program can be compiled using nvcc, which internally uses gcc/g++ to compile your C/C++ code, and linked against your MPI library:

nvcc -I/usr/mpi/gcc/openmpi-1.4.6/include -L/usr/mpi/gcc/openmpi-1.4.6/lib64 -lmpi -o program

The downside is that it might end up as a plate of spaghetti if you have a seriously long program.

Another, cleaner option is to keep the MPI and CUDA code in two separate files: main.c and These two files can be compiled using mpicc and nvcc respectively into object files (.o), then combined into a single executable using mpicc. This second option is the mirror image of the compilation above: since mpicc does the final link, you now have to link against the CUDA runtime library.

module load openmpi cuda #(optional) load modules on your node
mpicc -c main.c -o main.o
nvcc -arch=sm_20 -c -o multiply.o
mpicc main.o multiply.o -lcudart -L/apps/CUDA/cuda-5.0/lib64/ -o program

And finally, you can request two processes and two GPUs to test your program on the cluster using a PBS script like:

#PBS -l nodes=2:ppn=2:gpus=2
mpiexec -np 2 ./program
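A fuller submission script might look like the following sketch. The job name, walltime, and module names are placeholders; check your own site's documentation for the exact values:

```shell
#!/bin/bash
#PBS -N mpi_cuda_spmv          # job name (arbitrary)
#PBS -l nodes=2:ppn=2:gpus=2   # 2 nodes, 2 processes and 2 GPUs per node
#PBS -l walltime=00:10:00      # adjust to your cluster's limits

cd $PBS_O_WORKDIR              # run from the directory the job was submitted from
module load openmpi cuda       # module names vary between sites

mpiexec -np 2 ./program
```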

The main.c file, containing the call into the CUDA file, would look like:

#include "mpi.h"

/* Defined in and exposed with extern "C" there. */
void call_me_maybe(void);

int main(int argc, char *argv[])
{
        int myRank, numProcs;

        /* It's important to put this call at the beginning of the
           program, after variable declarations. */
        MPI_Init(&argc, &argv);

        /* Get the number of MPI processes and the rank of this process. */
        MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
        MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

        /* ==== Call function 'call_me_maybe' from the CUDA file ==== */
        call_me_maybe();

        /* ... */

        MPI_Finalize();
        return 0;
}


And in, define call_me_maybe() with the extern "C" keyword to make it accessible from main.c (without an additional #include):

/* */
#include <cuda.h>
#include <cuda_runtime.h>

__global__ void __multiply__ (const float *x, float *y)
{
    /* ... the actual multiplication ... */
}

extern "C" void call_me_maybe()
{
    /* ... Load CPU data into GPU buffers ... */

    __multiply__ <<< ...block configuration... >>> (x, y);

    /* ... Transfer data from GPU back to CPU ... */
}

Rock on \m/

  • Melissa Wiederrecht

    Thank you Anh! I just happen to be working on exactly that problem you describe and couldn’t remember how to compile everything all together.

    BTW, your post showed up right on the first page of Google when I searched “compile mpi in .cu file”. Nice website you have here!

    • Hi Melissa,

      Thanks for coming by and letting me know that my site appeared on the first page!
      I hope you find the info useful.

      Merry Xmas!


  • Karishma Bansole

Sir, this post is very helpful to me. I want to learn more about mixed MPI-CUDA programming; can you suggest any reference book for learning it?

    Thank you

  • M.usman ashraf

    Nice example to work for CUDA with MPI.
    As exascale computing requires a more powerful parallel programming approach, this model could be considered a step of research toward exascale computing.

  • Madhura

    Good example of mixed MPI-CUDA programming.
    Sir, please tell me on which cluster you deployed this program?
    Is it possible to run the program on a Beowulf cluster?

    • I’m running it on my university’s internal cluster. I don’t know enough about Beowulf to comment, but this combination should run on both a single computer and a cluster: it solely depends on the user.

  • Fabian

    Hi everyone,
    I’m working on a cluster, and I use this command to launch the app after the respective compilation and linking steps:
    “mpiexec --hostfile hostlist -np 2 ./program”
    where “hostlist”(a list of nodes I want to run my app) contains:

    And I get the following error:
    “./program: error while loading shared libraries: cannot open shared object file: No such file or directory”

    I don’t know what I’m doing wrong, can anybody help me?, please.

  • Evgeny

    Thank you very much for the first example! This was extremely useful, as we have to do some computations with MPI+CUDA. Unfortunately, I couldn’t compile multiple files, but I compiled all in one file and was sooooo happy!

  • Ian Snow

    Beautiful tips!