Slurm MPI Plugin API


This document describes Slurm MPI selection plugins and the API that defines them. It is intended as a resource to programmers wishing to write their own Slurm MPI selection plugins.

Slurm MPI selection plugins are Slurm plugins that implement the API described herein and determine which version of MPI is used during execution of a new Slurm job. They are intended to provide a mechanism for both selecting MPI versions for pending jobs and performing any MPI-specific tasks for job launch or termination. The plugins must conform to the Slurm Plugin API with the following specifications:

const char plugin_type[]
The major type must be "mpi". The minor type can be any recognizable abbreviation for the type of MPI. We recommend, for example:

  • openmpi — For use with older versions of OpenMPI.
  • pmi2 — For use with MPI2 and MVAPICH2.
  • pmix — Exascale PMI implementation (currently supported by OpenMPI starting from version 2.0)
  • none — For use with most other versions of MPI.

const char plugin_name[]
Some descriptive name for the plugin. There is no requirement with respect to its format.

const uint32_t plugin_version
If specified, identifies the version of Slurm used to build this plugin; any attempt to load the plugin from a different version of Slurm will result in an error. If not specified, the plugin may be loaded by Slurm commands and daemons from any version; however, this may result in difficult-to-diagnose failures due to changes in the arguments to plugin functions or changes in other Slurm functions used by the plugin.
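As an illustration of the three identification symbols described above, here is a minimal sketch. It assumes the "major/minor" form of plugin_type used by the general Slurm Plugin API; the minor type "example", the plugin name, and the helper plugin_major_is_mpi() are invented for this example. The SLURM_VERSION_NUMBER fallback is only a stand-in so the fragment compiles without the Slurm headers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in so the sketch compiles on its own; a real plugin would take
 * this value from the Slurm headers it is built against. */
#ifndef SLURM_VERSION_NUMBER
#define SLURM_VERSION_NUMBER 0
#endif

/* Identification symbols. "example" is a made-up minor type. */
const char plugin_name[] = "Example MPI plugin (illustrative)";
const char plugin_type[] = "mpi/example";
const uint32_t plugin_version = SLURM_VERSION_NUMBER;

/* Hypothetical helper, not part of the plugin API: returns 1 when the
 * major type (the text before the '/') is "mpi", and 0 otherwise. */
int plugin_major_is_mpi(const char *type)
{
    assert(type != NULL);
    return strncmp(type, "mpi/", 4) == 0;
}
```

Leaving plugin_version out would let any Slurm version load the plugin, at the cost of the hard-to-diagnose failures described above.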

A simplified flow of logic follows:
srun selects the MPI type to use with --mpi=MPITYPE.
srun calls
mpi_p_thr_create((srun_job_t *)job);
which sets up the correct environment for the specified MPI type.
The slurmd daemon runs
mpi_p_init((stepd_step_rec_t *)job, (int)rank);
which configures slurmd to use the correct MPI type and to interact with srun.
The slurmstepd process runs
p_mpi_hook_slurmstepd_prefork(const stepd_step_rec_t *job, char ***env);
which executes immediately before fork/exec of tasks.

Data Objects

These functions are expected to read and/or modify data structures directly in the memory of the slurmd daemon and of srun. Slurmd is a multi-threaded program with independent read and write locks on each data structure type. Therefore the type of operations permitted on various data structures is identified for each function.

API Functions

The following functions must appear. Functions which are not implemented should be stubbed.

int init (void)

Called when the plugin is loaded, before any other functions are called. Put global initialization here.

Returns: SLURM_SUCCESS on success, or SLURM_ERROR on failure.

void fini (void)

Called when the plugin is removed. Clear any allocated storage here.

Returns: None.

Note: These init and fini functions are not the same as those described in the dlopen (3) system library. The C run-time system co-opts those symbols for its own initialization. The system _init() is called before the Slurm init(), and the Slurm fini() is called before the system's _fini().
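The init/fini pairing can be sketched as follows. This is only an illustration: struct example_state and its field are hypothetical, and the SLURM_SUCCESS and SLURM_ERROR definitions are local stand-ins so the fragment compiles without the Slurm headers.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-ins for the Slurm return codes so the sketch compiles alone. */
#ifndef SLURM_SUCCESS
#define SLURM_SUCCESS 0
#endif
#ifndef SLURM_ERROR
#define SLURM_ERROR -1
#endif

/* Hypothetical plugin-wide state; name and contents are illustrative. */
struct example_state {
    int launches_seen;
};
static struct example_state *state = NULL;

/* Called once when the plugin is loaded, before any other function. */
int init(void)
{
    assert(state == NULL);  /* nothing should exist before load */
    state = calloc(1, sizeof(*state));
    return state ? SLURM_SUCCESS : SLURM_ERROR;
}

/* Called when the plugin is removed; releases what init() allocated. */
void fini(void)
{
    free(state);
    state = NULL;
}
```

Resetting the pointer in fini() keeps a later reload from seeing stale state.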

int mpi_p_init (stepd_step_rec_t *job, int rank);

Description: Used by slurmd to configure its environment for the selected MPI type.

Arguments: job    (input) Pointer to the slurmd_job that is running. Cannot be NULL.
rank    (input) Primarily there for MVAPICH. Used to send the rank of the mpirun job. This can be 0 if no rank information is needed for the MPI type.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR.

int mpi_p_thr_create (srun_job_t *job);

Description: Used by srun to spawn the thread for the MPI processes. Most of the real processing happens here.

Arguments: job    (input) Pointer to the srun_job that is running. Cannot be NULL.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return -1.

int mpi_p_single_task ();

Description: Tells the system whether or not multiple tasks can run at the same time.

Arguments: none

Returns: false if multiple tasks can run and true if only a single task can run at one time.

int p_mpi_hook_slurmstepd_prefork(const stepd_step_rec_t *job, char ***env);

Description: Used by slurmstepd process immediately prior to fork and exec of user tasks.

Arguments: job   (input) Pointer to the slurmd structure for the job that is running.
env   (input) Environment variables for tasks to be spawned.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return -1.
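The following sketch shows one way a prefork hook might use its env argument. It rests on assumptions not stated above: that the triple pointer allows the hook to replace the environment array, and that the array is heap-allocated and NULL-terminated. The helper env_append() and the variable EXAMPLE_MPI_READY are invented for illustration, and the typedef is an opaque stand-in for Slurm's real step record.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-ins for the Slurm return codes so the sketch compiles alone. */
#ifndef SLURM_SUCCESS
#define SLURM_SUCCESS 0
#endif
#ifndef SLURM_ERROR
#define SLURM_ERROR -1
#endif

/* Opaque stand-in; the real definition lives in the Slurm source tree. */
typedef struct stepd_step_rec stepd_step_rec_t;

/* Hypothetical helper: append a copy of "NAME=value" to a NULL-terminated,
 * heap-allocated environment array. Returns 0 on success, -1 on failure. */
static int env_append(char ***env, const char *entry)
{
    size_t n = 0;
    size_t len;
    char **grown;
    char *copy;

    assert(env != NULL && entry != NULL);
    while (*env && (*env)[n])
        n++;

    len = strlen(entry) + 1;
    copy = malloc(len);
    if (!copy)
        return -1;
    memcpy(copy, entry, len);

    /* Room for the existing entries, the new one, and the terminator. */
    grown = realloc(*env, (n + 2) * sizeof(char *));
    if (!grown) {
        free(copy);
        return -1;
    }
    grown[n] = copy;
    grown[n + 1] = NULL;
    *env = grown;
    return 0;
}

int p_mpi_hook_slurmstepd_prefork(const stepd_step_rec_t *job, char ***env)
{
    (void)job;  /* unused in this sketch */
    /* EXAMPLE_MPI_READY is an invented variable name. */
    if (env_append(env, "EXAMPLE_MPI_READY=1") != 0)
        return SLURM_ERROR;  /* -1, as documented above */
    return SLURM_SUCCESS;
}
```

Because the hook runs immediately before fork/exec, anything it adds here would be visible to every task spawned afterwards.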

int mpi_p_exit();

Description: Cleans up anything that needs cleaning up after execution.

Arguments: none

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR, causing slurmctld to exit.
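Since functions that are not implemented should be stubbed, a plugin with no MPI-specific work to do might define the MPI entry points roughly as below. This is a sketch under stated assumptions: the two structs are placeholder stand-ins for Slurm's real types, and the return-code macros are local definitions so the fragment compiles without the Slurm headers. init() and fini() are omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the Slurm return codes so the sketch compiles alone. */
#ifndef SLURM_SUCCESS
#define SLURM_SUCCESS 0
#endif
#ifndef SLURM_ERROR
#define SLURM_ERROR -1
#endif

/* Placeholder stand-ins for Slurm's real structures. */
typedef struct stepd_step_rec { int placeholder; } stepd_step_rec_t;
typedef struct srun_job { int placeholder; } srun_job_t;

int mpi_p_init(stepd_step_rec_t *job, int rank)
{
    assert(job != NULL);  /* documented as "Cannot be NULL" */
    (void)rank;           /* may be 0 when no rank is needed */
    return SLURM_SUCCESS;
}

int mpi_p_thr_create(srun_job_t *job)
{
    assert(job != NULL);  /* documented as "Cannot be NULL" */
    return SLURM_SUCCESS; /* no helper thread is started */
}

int mpi_p_single_task(void)
{
    return 0;             /* false: multiple tasks may run at once */
}

int p_mpi_hook_slurmstepd_prefork(const stepd_step_rec_t *job, char ***env)
{
    (void)job;
    (void)env;
    return SLURM_SUCCESS; /* nothing to prepare before fork/exec */
}

int mpi_p_exit(void)
{
    return SLURM_SUCCESS; /* nothing to clean up */
}
```

A real plugin would replace the bodies it needs and leave the remaining stubs in place so that every required symbol is still present.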

Last modified 15 September 2017