Slurm Process Tracking Plugin API

Overview

This document describes Slurm process tracking plugins and the API that defines them. It is intended as a resource to programmers wishing to write their own Slurm process tracking plugins. Note that the process tracking plugin is designed for use with Slurm job steps; a separate job_container plugin is designed for use with Slurm jobs.

Slurm process tracking plugins are Slurm plugins that implement the Slurm process tracking API described herein. They must conform to the Slurm Plugin API with the following specifications:

const char plugin_type[]
The major type must be "proctrack". The minor type can be any recognizable abbreviation for the type of proctrack. We recommend, for example:

  • cgroup — Use Linux cgroups for process tracking.
  • linuxproc — Perform process tracking based upon a scan of the Linux process table and use the parent process ID to determine what processes are members of a Slurm job. NOTE: This mechanism is not entirely reliable for process tracking.
  • lua — Use site-defined Lua script for process tracking. Sample Lua scripts can be found with the Slurm distribution in the directory contribs/lua. The default installation location of the Lua scripts is the same location as the Slurm configuration file, slurm.conf.
  • pgid — Use process group ID to determine what processes are members of a Slurm job. NOTE: This mechanism is not entirely reliable for process tracking.
  • rms — Use a Quadrics RMS kernel patch to establish what processes are members of a Slurm job. NOTE: This requires a kernel patch that records every process creation and termination.
  • sgi_job — Use SGI's Process Aggregates (PAGG) kernel module. NOTE: This kernel module records every process creation and termination.

const char plugin_name[]
Some descriptive name for the plugin. There is no requirement with respect to its format.

const uint32_t plugin_version
If specified, identifies the version of Slurm used to build this plugin; any attempt to load the plugin from a different version of Slurm will result in an error. If not specified, the plugin may be loaded by Slurm commands and daemons from any version. However, this may result in difficult-to-diagnose failures due to changes in the arguments to plugin functions or changes in other Slurm functions used by the plugin.

The programmer is urged to study src/plugins/proctrack/pgid/proctrack_pgid.c for an example implementation of a Slurm proctrack plugin.
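
As a rough illustration, the three identification symbols described above might be declared as follows. This is a minimal sketch, not taken from any shipped plugin: the minor type "example" and the plugin name are hypothetical, and SLURM_VERSION_NUMBER is assumed to come from Slurm's headers at build time (a placeholder value is defined here only so that the fragment compiles on its own).

```c
#include <stdint.h>

/* Assumption: a real plugin obtains SLURM_VERSION_NUMBER from Slurm's
 * headers. The fallback below is a placeholder for this sketch only. */
#ifndef SLURM_VERSION_NUMBER
#define SLURM_VERSION_NUMBER 0
#endif

/* Free-form descriptive name (hypothetical). */
const char plugin_name[] = "Example process tracking plugin";

/* Major type "proctrack", then a slash, then a hypothetical minor type. */
const char plugin_type[] = "proctrack/example";

/* Ties the plugin to the Slurm version it was built against. */
const uint32_t plugin_version = SLURM_VERSION_NUMBER;
```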

Data Objects

The implementation must support a container ID of type uint64_t. This container ID is maintained by the plugin directly in the slurmd job structure using the field named cont_id.

The implementation must maintain (though not necessarily directly export) an enumerated errno to allow Slurm to discover as practically as possible the reason for any failed API call. These values must not be used as return values in integer-valued functions in the API. The proper error return value from integer-valued functions is SLURM_ERROR. The implementation should endeavor to provide useful and pertinent information by whatever means is practical. Successful API calls are not required to reset errno to a known value.
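
The error convention described above can be sketched as follows: the function returns only SLURM_SUCCESS or SLURM_ERROR, and the reason for a failure is reported through errno. The helper example_validate_id() is hypothetical and is not part of the API. SLURM_SUCCESS and SLURM_ERROR are defined locally as stand-ins for the values a real plugin would take from Slurm's headers.

```c
#include <errno.h>
#include <stdint.h>

/* Local stand-ins for this sketch; a real plugin uses Slurm's headers. */
#define SLURM_SUCCESS 0
#define SLURM_ERROR  -1

/* Hypothetical helper: reject a zero container ID. The errno value
 * carries the reason; the return value is only success or failure. */
int example_validate_id(uint64_t id)
{
    if (id == 0) {
        errno = EINVAL;         /* reason for the failure */
        return SLURM_ERROR;     /* never return the errno value itself */
    }
    return SLURM_SUCCESS;       /* errno is left untouched on success */
}
```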

API Functions

The following functions must appear. Functions which are not implemented should be stubbed.

int init (void)

Description:
Called when the plugin is loaded, before any other functions are called. Put global initialization here.

Returns:
SLURM_SUCCESS on success, or
SLURM_ERROR on failure.

void fini (void)

Description:
Called when the plugin is removed. Clear any allocated storage here.

Returns: None.

Note: These init and fini functions are not the same as those described in the dlopen(3) system library documentation. The C run-time system co-opts those symbols for its own initialization. The system _init() is called before the Slurm init(), and the Slurm fini() is called before the system's _fini().
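
A plugin with no global state to manage could stub these two functions roughly as shown below. This is only a sketch: the example_ready flag is hypothetical and stands in for whatever global initialization a real plugin performs, and the SLURM_* values are local stand-ins for the definitions in Slurm's headers.

```c
#include <stdbool.h>

/* Local stand-ins for this sketch; a real plugin uses Slurm's headers. */
#define SLURM_SUCCESS 0
#define SLURM_ERROR  -1

/* Hypothetical global state owned by the plugin. */
static bool example_ready = false;

/* Called once when the plugin is loaded. */
int init(void)
{
    example_ready = true;
    return SLURM_SUCCESS;
}

/* Called once when the plugin is removed; release global state here. */
void fini(void)
{
    example_ready = false;
}
```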

int proctrack_p_create (stepd_step_rec_t *job);

Description: Create a container. The caller should ensure that a valid proctrack_p_destroy() is called. This function must put the container ID directly in the job structure's variable cont_id.

Argument: job    (input/output) Pointer to a slurmd job structure.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
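
The following sketch shows one way a process-group-based plugin might satisfy this contract by storing the process group ID as the container ID. The stepd_step_rec_t definition here is a minimal stand-in, not the real slurmd structure: only cont_id is named in this document, and the pgid field is an assumption made for illustration. The SLURM_* values are also local stand-ins.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>

/* Local stand-ins for this sketch; a real plugin uses Slurm's headers. */
#define SLURM_SUCCESS 0
#define SLURM_ERROR  -1

/* Minimal stand-in for the slurmd job structure (assumed layout). */
typedef struct {
    pid_t pgid;         /* assumed field: process group of the job step */
    uint64_t cont_id;   /* container ID, written by the plugin */
} stepd_step_rec_t;

int proctrack_p_create(stepd_step_rec_t *job)
{
    /* A process group ID of 0 or 1 cannot identify a job step. */
    if (!job || job->pgid <= 1) {
        errno = EINVAL;
        return SLURM_ERROR;
    }
    /* Store the container ID directly in the job structure. */
    job->cont_id = (uint64_t) job->pgid;
    return SLURM_SUCCESS;
}
```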

int proctrack_p_add (stepd_step_rec_t *job, pid_t pid);

Description: Add a specific process ID to a given job step's container.

Arguments:
job    (input) Pointer to a slurmd job structure.
pid    (input) The ID of the process to add to this job's container.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int proctrack_p_signal (uint64_t id, int signal);

Description: Signal all processes in a given job step container.

Arguments:
id   (input) Job step container's ID.
signal   (input) Signal to be sent to processes. Note that a signal of zero just tests for the existence of processes in a given job step container.

Returns: SLURM_SUCCESS if the signal was sent. If the signal cannot be sent, the function should return SLURM_ERROR and set its errno to an appropriate value to indicate the reason for failure.
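
Assuming the container ID is a process group ID (as in a pgid-style plugin), signal delivery could be sketched with POSIX kill() and a negative PID, which addresses every process in the group. This is an illustration under that assumption, not the behavior of any particular shipped plugin. The SLURM_* values are local stand-ins.

```c
#define _POSIX_C_SOURCE 200809L  /* request POSIX declarations such as kill() */
#include <errno.h>
#include <limits.h>
#include <signal.h>
#include <stdint.h>
#include <sys/types.h>

/* Local stand-ins for this sketch; a real plugin uses Slurm's headers. */
#define SLURM_SUCCESS 0
#define SLURM_ERROR  -1

int proctrack_p_signal(uint64_t id, int signal)
{
    /* Refuse IDs that are not usable process group IDs. Negating 0 or 1
     * would target the caller's own group or every process. */
    if (id <= 1 || id > (uint64_t) INT_MAX) {
        errno = EINVAL;
        return SLURM_ERROR;
    }
    /* A negative PID sends the signal to the whole process group.
     * With signal 0 nothing is delivered; only existence is checked. */
    if (kill(-(pid_t) id, signal) < 0)
        return SLURM_ERROR;     /* errno was set by kill() */
    return SLURM_SUCCESS;
}
```

Calling proctrack_p_signal(id, 0) in this sketch fails with errno set to ESRCH when no process remains in the group, which matches the existence test described for a signal of zero.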

int proctrack_p_destroy (uint64_t id);

Description: Destroy or otherwise invalidate a job step container. This does not imply the container is empty, just that it is no longer needed.

Arguments: id    (input) Job step container's ID.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

uint64_t proctrack_p_find (pid_t pid);

Description: Given a process ID, return its job step container ID.

Arguments: pid    (input) A process ID.

Returns: The job step container ID with this process or zero if none is found.
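
Under the same assumption that the container ID is a process group ID, the lookup could be sketched with POSIX getpgid(). Any lookup failure is mapped to zero, the "not found" value described above.

```c
#define _POSIX_C_SOURCE 200809L  /* request the POSIX declaration of getpgid() */
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

uint64_t proctrack_p_find(pid_t pid)
{
    /* getpgid() returns (pid_t) -1 if the process does not exist. */
    pid_t pgid = getpgid(pid);

    if (pgid <= 0)
        return 0;               /* zero means no container was found */
    return (uint64_t) pgid;
}
```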

int proctrack_p_get_pids (uint64_t cont_id, pid_t **pids, int *npids);

Description: Given a job step container ID, return the process IDs of all processes in the container.

Arguments:
cont_id    (input) A job step container ID.
pids    (output) Array of process IDs in the container.
npids    (output) Count of process IDs in the container.

Returns: SLURM_SUCCESS if successful, SLURM_ERROR otherwise.
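
The output-argument convention can be sketched with a small in-memory table. Everything named example_* below is hypothetical bookkeeping invented for this illustration; a real plugin would query its tracking mechanism instead. The sketch allocates the array with malloc() and leaves the caller to release it with free(), which is an assumption of this example and not a statement about Slurm's own allocator. The SLURM_* values are local stand-ins.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>

/* Local stand-ins for this sketch; a real plugin uses Slurm's headers. */
#define SLURM_SUCCESS 0
#define SLURM_ERROR  -1
#define EXAMPLE_MAX_ENTRIES 64  /* hypothetical fixed capacity */

/* Hypothetical record of which process belongs to which container. */
typedef struct {
    uint64_t cont_id;
    pid_t pid;
} example_entry_t;

static example_entry_t example_table[EXAMPLE_MAX_ENTRIES];
static int example_count = 0;

/* Hypothetical helper: remember that pid belongs to cont_id. */
int example_track(uint64_t cont_id, pid_t pid)
{
    if (example_count >= EXAMPLE_MAX_ENTRIES) {
        errno = ENOMEM;
        return SLURM_ERROR;
    }
    example_table[example_count].cont_id = cont_id;
    example_table[example_count].pid = pid;
    example_count++;
    return SLURM_SUCCESS;
}

/* Returns int so that SLURM_ERROR (-1) can be reported. */
int proctrack_p_get_pids(uint64_t cont_id, pid_t **pids, int *npids)
{
    pid_t *out = NULL;
    int i, n = 0;

    /* First pass: count the processes in this container. */
    for (i = 0; i < example_count; i++)
        if (example_table[i].cont_id == cont_id)
            n++;

    if (n > 0) {
        out = malloc((size_t) n * sizeof(pid_t));
        if (!out) {
            errno = ENOMEM;
            return SLURM_ERROR;
        }
        /* Second pass: copy the matching process IDs. */
        n = 0;
        for (i = 0; i < example_count; i++)
            if (example_table[i].cont_id == cont_id)
                out[n++] = example_table[i].pid;
    }

    *pids = out;    /* NULL when the container has no processes */
    *npids = n;
    return SLURM_SUCCESS;
}
```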

Last modified 27 March 2015