MPI and UPC Users Guide

MPI use depends upon the type of MPI being used. There are three fundamentally different modes of operation used by these various MPI implementations.

  1. Slurm directly launches the tasks and performs initialization of communications (UPC, Quadrics MPI, MPICH2, MPICH-GM, MPICH-MX, MVAPICH, MVAPICH2, some MPICH1 modes, and OpenMPI version 1.5 or higher).
  2. Slurm creates a resource allocation for the job and then mpirun launches tasks using Slurm's infrastructure (LAM/MPI and HP-MPI).
  3. Slurm creates a resource allocation for the job and then mpirun launches tasks using some mechanism other than Slurm, such as SSH or RSH (BlueGene MPI and some MPICH1 modes). These tasks are initiated outside of Slurm's monitoring or control. Slurm's epilog should be configured to purge these tasks when the job's allocation is relinquished.

Note: Slurm is not directly launching the user application in case 3, which may prevent the desired behavior of binding tasks to CPUs and/or accounting. Some versions of some MPI implementations work, so testing your particular installation may be required to determine the actual behavior.
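For case 3, the epilog might be configured along the following lines. This is only a sketch: the script path is arbitrary, it assumes SLURM_JOB_UID is exported to the epilog environment (see the Prolog and Epilog section of the slurm.conf man page for your version), and it is only safe if a user cannot have more than one job on a node at a time.

# slurm.conf (illustrative)
Epilog=/etc/slurm/epilog.purge_tasks

# /etc/slurm/epilog.purge_tasks (illustrative sketch)
#!/bin/bash
# Kill any processes left behind by the job's user on this node.
pkill -KILL -U "$SLURM_JOB_UID"
exit 0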

Two Slurm parameters control which MPI implementation will be supported. Proper configuration is essential for Slurm to establish the proper environment for the MPI job, such as setting the appropriate environment variables. The MpiDefault configuration parameter in slurm.conf establishes the system default MPI to be supported. The srun option --mpi= (or the equivalent environment variable SLURM_MPI_TYPE) can be used to specify when a different MPI implementation is to be supported for an individual job.
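For example, a site could make pmi2 the system default and override it for an individual job (the plugin names shown are only illustrative and must correspond to MPI plugins available on your system):

# slurm.conf
MpiDefault=pmi2

# per-job override on the srun command line
$ srun --mpi=pmix -n16 a.out

# or equivalently via the environment variable
$ SLURM_MPI_TYPE=pmix srun -n16 a.out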

Links to instructions for using several varieties of MPI with Slurm are provided below.


OpenMPI

The current versions of Slurm and Open MPI support task launch using the srun command. This relies upon Slurm managing reservations of communication ports for use by Open MPI version 1.5 (or higher).

If OpenMPI is configured with --with-pmi pointing to either pmi or pmi2, the OMPI jobs can be launched directly using the srun command. This is the preferred way. If pmi2 support is enabled, the command line option '--mpi=pmi2' has to be specified on the srun command line.
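A sketch of such an Open MPI build follows; the Slurm installation path is an assumption and must be adjusted for your site:

$ ./configure --with-pmi=/usr/local/slurm [other configure options]
$ make && make install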

Starting with Open MPI version 2.0, PMIx is natively supported. To launch an Open MPI application using PMIx, the '--mpi=pmix' option has to be specified on the srun command line.

For older versions of OMPI not compiled with PMI support, the system administrator must specify the range of ports to be reserved in the slurm.conf file using the MpiParams parameter. For example: MpiParams=ports=12000-12999

Alternatively, tasks can be launched using the srun command plus the option --resv-ports, or using the environment variable SLURM_RESV_PORT, which is equivalent to always including --resv-ports on srun's execute line. The ports reserved on every allocated node will be identified in an environment variable available to the tasks, as shown here: SLURM_STEP_RESV_PORTS=12000-12015

$ salloc -n4 sh   # allocates 4 processors and spawns shell for job
> srun a.out
> exit   # exits shell spawned by initial salloc command

or

> srun -n 4 a.out

or using the pmi2 support

> srun --mpi=pmi2 -n 4 a.out

or using the pmix support

> srun --mpi=pmix -n 4 a.out

If the ports reserved for a job step are found by the Open MPI library to be in use, a message of this form will be printed and the job step will be re-launched:
srun: error: sun000: task 0 unable to claim reserved port, retrying
After three failed attempts, the job step will be aborted. Repeated failures should be reported to your system administrator, who can rectify the problem by cancelling the processes holding those ports.

NOTE: Some kernels and system configurations have resulted in a locked memory limit too small for proper Open MPI functionality, resulting in application failure with a segmentation fault. This may be fixed by configuring the slurmd daemon to execute with a larger limit. For example, add "LimitMEMLOCK=infinity" to your slurmd.service file.
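A minimal sketch of such a change using a systemd drop-in file (the override path is an assumption; adjust for your distribution):

# /etc/systemd/system/slurmd.service.d/memlock.conf
[Service]
LimitMEMLOCK=infinity

# then reload systemd and restart slurmd
$ systemctl daemon-reload
$ systemctl restart slurmd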


Intel MPI

Intel® MPI Library for Linux OS supports the following methods of launching the MPI jobs under the control of the Slurm job manager:

  • The mpirun command over the MPD Process Manager (PM)
  • The mpirun command over the Hydra PM
  • The mpiexec.hydra command (Hydra PM)
  • The srun command (Slurm, recommended)

This description provides detailed information on all of these methods.

    The mpirun Command over the MPD Process Manager

    Slurm is supported by the mpirun command of the Intel® MPI Library 3.1 Build 029 for Linux OS and later releases.

    When launched within a session allocated using the Slurm commands sbatch or salloc, the mpirun command automatically detects and queries certain Slurm environment variables to obtain the list of the allocated cluster nodes.

    Use the following commands to start an MPI job within an existing Slurm session over the MPD PM:

    export I_MPI_PROCESS_MANAGER=mpd
    mpirun -n <num_procs> a.out
    

    The mpirun Command over the Hydra Process Manager

    Slurm is supported by the mpirun command of the Intel® MPI Library 4.0 Update 3 through the Hydra PM by default. The behavior of this command is analogous to the MPD case described above.

    Use one of the following commands to start an MPI job within an existing Slurm session over the Hydra PM:

    mpirun -n <num_procs> a.out
    

    or

    mpirun -bootstrap slurm -n <num_procs> a.out
    

    We recommend that you use the second command. It uses the srun command rather than the default ssh based method to launch the remote Hydra PM service processes.

    The mpiexec.hydra Command (Hydra Process Manager)

    Slurm is supported by the Intel® MPI Library 4.0 Update 3 directly through the Hydra PM.

    Use the following command to start an MPI job within an existing Slurm session:

    mpiexec.hydra -bootstrap slurm -n <num_procs> a.out
    

    The srun Command (Slurm, recommended)

    This advanced method is supported by the Intel® MPI Library 4.0 Update 3. This method is the best integrated with Slurm and supports process tracking, accounting, task affinity, suspend/resume and other features. Use the following commands to allocate a Slurm session and start an MPI job in it, or to start an MPI job within a Slurm session already created using the sbatch or salloc commands:

    • Set the I_MPI_PMI_LIBRARY environment variable to point to the Slurm Process Management Interface (PMI) library:

      export I_MPI_PMI_LIBRARY=/path/to/slurm/pmi/library/libpmi.so

    • Use the srun command to launch the MPI job:

      srun -n <num_procs> a.out
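    Putting these steps together, an interactive session might look like the following sketch (the library path and task count are placeholders):

    $ salloc -n4 sh    # allocates 4 processors and spawns shell for job
    > export I_MPI_PMI_LIBRARY=/path/to/slurm/pmi/library/libpmi.so
    > srun -n4 a.out
    > exit             # exits shell spawned by initial salloc command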

    Above information used by permission from Intel. For more information see Intel MPI Library.


    LAM/MPI

    LAM/MPI relies upon the Slurm salloc or sbatch command to allocate resources. In either case, specify the maximum number of tasks required for the job. Then execute the lamboot command to start the lamd daemons. lamboot utilizes Slurm's srun command to launch these daemons. Do not directly execute the srun command to launch LAM/MPI tasks. For example:

    $ salloc -n16 sh  # allocates 16 processors
                      # and spawns shell for job
    > lamboot
    > mpirun -np 16 foo args
    1234 foo running on adev0 (o)
    2345 foo running on adev1
    etc.
    > lamclean
    > lamhalt
    > exit            # exits shell spawned by
                      # initial salloc command
    

    Note that any direct use of srun will only launch one task per node when the LAM/MPI plugin is configured as the default plugin. To launch more than one task per node using the srun command, the --mpi=none option would be required to explicitly disable the LAM/MPI plugin if that is the system default.
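    For example (hostname stands in here for any non-LAM/MPI program):

    $ srun --mpi=none -n16 hostname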


    HP-MPI

    HP-MPI uses the mpirun command with the -srun option to launch jobs. For example:

    $MPI_ROOT/bin/mpirun -TCP -srun -N8 ./a.out
    


    MPICH2

    MPICH2 jobs can be launched using the srun command with PMI version 1 or 2, or using mpiexec. All modes of operation are described below.

    MPICH2 with srun and PMI version 2

    MPICH2 must be built specifically for use with Slurm and PMI2 using a configure line similar to that shown below.

    ./configure --with-slurm=<PATH> --with-pmi=pmi2
    

    The PATH must point to the Slurm installation directory, in other words the parent directory of bin and lib. In addition, if Slurm is not configured with MpiDefault=pmi2, then the srun command must be invoked with the option --mpi=pmi2 as shown in the example below.

    srun -n4 --mpi=pmi2 ./a.out
    

    The PMI2 support in Slurm works only if the MPI implementation supports it, in other words if the MPI has the PMI2 interface implemented. The --mpi=pmi2 option will load the library lib/slurm/mpi_pmi2.so, which provides the server-side functionality, but the client side must implement PMI2_Init() and the other interface calls.

    You can refer to the mpich2-1.5 implementation as an example and configure MPICH to use PMI2 with the --with-pmi=pmi2 configure option.

    To check if the MPI version you are using supports PMI2, check for PMI2_* symbols in the MPI library.
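    For example (the library name and path are illustrative):

    $ nm -D /path/to/libmpich.so | grep ' PMI2_'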

    Slurm provides a version of the PMI2 client library in the contribs directory. This library gets installed in the Slurm lib directory. If your MPI implementation supports PMI2 and you wish to use the Slurm-provided library, you have to link it explicitly:

    $ mpicc -L<path_to_pmi2_lib> -lpmi2 ...
    $ srun -n20 a.out
    

    MPICH2 with srun and PMI version 1

    Link your program with Slurm's implementation of the PMI library so that tasks can communicate host and port information at startup. (The system administrator can add these options to the mpicc and mpif77 commands directly, so the user will not need to bother.) For example:

    $ mpicc -L<path_to_slurm_lib> -lpmi ...
    $ srun -n20 a.out
    
    NOTES:
    • Some MPICH2 functions are not currently supported by the PMI library integrated with Slurm
    • Set the environment variable PMI_DEBUG to a numeric value of 1 or higher for the PMI library to print debugging information. Use srun's -l option for better clarity, as shown in the example after this list.
    • Set the environment variable SLURM_PMI_KVS_NO_DUP_KEYS for improved performance with MPICH2 by eliminating a test for duplicate keys.
    • The following environment variables can be used to tune performance depending upon network performance: PMI_FANOUT, PMI_FANOUT_OFF_HOST, and PMI_TIME. See the srun man page, INPUT ENVIRONMENT VARIABLES section, for more information.
    • Information about building MPICH2 for use with Slurm is described on the MPICH2 FAQ web page and below.
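    For example, a debugging run with labeled output might look like this (the variable values are illustrative):

    $ PMI_DEBUG=1 srun -l -n20 a.out
    $ PMI_FANOUT=64 srun -n20 a.out    # illustrative tuning value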

    MPICH2 with mpiexec

    Build MPICH2 with its default settings and do not add any Slurm-specific flags (e.g. "./configure -prefix ..."; do NOT pass the --with-slurm, --with-pmi, or --enable-pmiport options).
    Do not add -lpmi to your application (it would force Slurm's PMI 1 interface, which does not support PMI_Spawn_multiple).
    Launch the application using salloc to create the job allocation and mpiexec to launch the tasks. A simple example is shown below.

    salloc -N 2 mpiexec my_application

    All MPI_Comm_spawn calls now work fine, going through Hydra's PMI 1.1 interface.


    MPICH-GM

    MPICH-GM jobs can be launched directly by the srun command. Slurm's mpichgm MPI plugin must be used to establish communications between the launched tasks. This can be accomplished either by using the Slurm configuration parameter MpiDefault=mpichgm in slurm.conf or by using srun's --mpi=mpichgm option.

    $ mpicc ...
    $ srun -n16 --mpi=mpichgm a.out
    

    MPICH-MX

    MPICH-MX jobs can be launched directly by the srun command. Slurm's mpichmx MPI plugin must be used to establish communications between the launched tasks. This can be accomplished either by using the Slurm configuration parameter MpiDefault=mpichmx in slurm.conf or by using srun's --mpi=mpichmx option.

    $ mpicc ...
    $ srun -n16 --mpi=mpichmx a.out
    

    MVAPICH

    MVAPICH jobs can be launched directly by the srun command. Slurm's mvapich MPI plugin must be used to establish communications between the launched tasks. This can be accomplished either by using the Slurm configuration parameter MpiDefault=mvapich in slurm.conf or by using srun's --mpi=mvapich option.

    $ mpicc ...
    $ srun -n16 --mpi=mvapich a.out
    
    NOTE: If MVAPICH is used in the shared memory model, with all tasks running on a single node, then use the mpich1_shmem MPI plugin instead.
    NOTE (for system administrators): Configure PropagateResourceLimitsExcept=MEMLOCK in slurm.conf and start the slurmd daemons with an unlimited locked memory limit. For more details, see MVAPICH documentation for "CQ or QP Creation failure".
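    A minimal sketch of that configuration follows; how slurmd's own limit is raised depends on how it is started (the systemd drop-in shown mirrors the Open MPI note above and is an assumption):

    # slurm.conf
    PropagateResourceLimitsExcept=MEMLOCK

    # systemd drop-in for slurmd (illustrative)
    [Service]
    LimitMEMLOCK=infinity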


    MVAPICH2

    MVAPICH2 supports launching multithreaded programs by Slurm as well as mpirun_rsh. Please note that if you intend to use srun, you need to build MVAPICH2 with Slurm support with a command line of this sort:

    $ ./configure --with-pmi=pmi2 --with-pm=slurm
    

    Use of Slurm's pmi2 plugin provides substantially higher performance and scalability than Slurm's pmi plugin. If pmi2 is not configured to be Slurm's default MPI plugin at your site, this can be specified using the srun command's "--mpi=pmi2" option as shown below, or with the environment variable setting of "SLURM_MPI_TYPE=pmi2".

    $ srun -n16 --mpi=pmi2 a.out
    

    For more information, please see the MVAPICH2 User Guide.


    BlueGene MPI

    IBM BlueGene/Q Systems rely upon Slurm to create a job's resource allocation.

    BlueGene/Q

    The BlueGene/Q systems support the ability to allocate different portions of a BlueGene block to different users and different jobs, so Slurm must be directly involved in each task launch request.

    The following is subject to change in order to support debugging. In order to accomplish this, Slurm's srun command is executed to launch tasks. The srun command creates a job step allocation and is linked with IBM's runjob libraries, which launch the tasks within the allocated resources.

    See BlueGene/Q User and Administrator Guide for more information.


    MPICH1

    MPICH1 development ceased in 2005. It is recommended that you convert to MPICH2 or some other MPI implementation. If you still want to use MPICH1, note that it has several different programming models. If you are using the shared memory model (DEFAULT_DEVICE=ch_shmem in the mpirun script), then initiate the tasks using the srun command with the --mpi=mpich1_shmem option.

    $ srun -n16 --mpi=mpich1_shmem a.out
    

    NOTE: Using a configuration of MpiDefault=mpich1_shmem will result in one task being launched per node with the expectation that the MPI library will launch the remaining tasks based upon environment variables set by Slurm. Non-MPI jobs started in this configuration will lack the mechanism to launch more than one task per node unless srun's --mpi=none option is used.

    If you are using MPICH P4 (DEFAULT_DEVICE=ch_p4 in the mpirun script), then it is recommended that you apply the patch in the Slurm distribution's file contribs/mpich1.slurm.patch. Follow directions within the file to rebuild MPICH. Applications must be relinked with the new library. Initiate tasks using the srun command with the --mpi=mpich1_p4 option.

    $ srun -n16 --mpi=mpich1_p4 a.out
    

    Note that Slurm launches one task per node and the MPICH library linked within your applications launches the other tasks, with shared memory used for communications between them. The only real anomaly is that all output from all spawned tasks on a node appears to Slurm as coming from the one task that it launched. If the srun --label option is used, the task ID labels will be misleading.

    Other MPICH1 programming models currently rely upon the Slurm salloc or sbatch command to allocate resources. In either case, specify the maximum number of tasks required for the job. You may then need to build a list of hosts to be used and use that as an argument to the mpirun command. For example:

    $ cat mpich.sh
    #!/bin/bash
    srun hostname -s | sort -u >slurm.hosts
    mpirun [options] -machinefile slurm.hosts a.out
    rm -f slurm.hosts
    $ sbatch -n16 mpich.sh
    sbatch: Submitted batch job 1234
    

    Note that in this example, mpirun uses the rsh command to launch tasks. These tasks are not managed by Slurm since they are launched outside of its control.


    Quadrics MPI

    Quadrics MPI relies upon Slurm to allocate resources for the job and srun to initiate the tasks. One would build the MPI program in the normal manner then initiate it using a command line of this sort:

    $ srun [options] <program> [program args]
    

    UPC (Unified Parallel C)

    Berkeley UPC (and likely other UPC implementations) rely upon Slurm to allocate resources and launch the application's tasks. The UPC library then reads Slurm environment variables in order to determine the job's task count and location. One would build the UPC program in the normal manner then initiate it using a command line of this sort:

    $ srun -N4 -n16 a.out
    

    Last modified 23 June 2015