Slurm Profile Accounting Plugin API (AcctGatherProfileType)

Overview

This document describes Slurm profile accounting plugins and the API that defines them. It is intended as a resource to programmers wishing to write their own Slurm profile accounting plugins.

A profiling plugin allows more detailed information to be gathered about the execution of jobs than can reasonably be kept in the accounting database. (Not all jobs need be profiled.) A separate User Guide documents how to use the hdf5 version of the plugin.

The plugin provides an API for making calls to store data at various points in a step's lifecycle. It collects data periodically from potentially several sources. The periodic samples are eventually consolidated into one time series dataset for each node of a job.

The plugin's primary work is done within slurmstepd on the compute nodes. It assumes a shared file system, presumably on the management network. This avoids having to transfer files back to the controller at step end. Data is typically gathered at job_acct_gather interval or acct_gather_energy interval and the volume is not expected to be burdensome.

The hdf5 implementation records I/O counts from the network interface (InfiniBand), I/O counts from the Lustre parallel file system, local disk I/O counts, CPU and memory utilization for each task, and a record of energy use.

This implementation stores this data in an HDF5 file for each step on each node of the job. A separate program (sh5util) is provided to consolidate all the node-step files into one container for the job. HDF5 is a well-known structured file format that allows different types of related data to be stored in one file. Its internal structure resembles a file system, with groups being similar to directories and datasets being similar to files. There are commodity programs, notably HDFView, for viewing and manipulating these files. sh5util also provides some capability for extracting subsets of data for import into other analysis tools like spreadsheets.

This plugin is incompatible with --enable-front-end. If you need to simulate a large configuration, please use --enable-multiple-slurmd.

Slurm profile accounting plugins must conform to the Slurm Plugin API with the following specifications:

const char plugin_name[]="full text name"

A free-formatted ASCII text string that identifies the plugin.

const char plugin_type[]="major/minor"

The major type must be "acct_gather_profile". The minor type can be any suitable name for the type of profile accounting. We currently use:

  • none — No profile data is gathered.
  • hdf5 — Gets profile data about energy use, I/O sources (Lustre, network), and task data such as local disk I/O, CPU and memory usage.

const uint32_t plugin_version
If specified, identifies the version of Slurm used to build this plugin, and any attempt to load the plugin from a different version of Slurm will result in an error. If not specified, then the plugin may be loaded by Slurm commands and daemons from any version; however, this may result in difficult-to-diagnose failures due to changes in the arguments to plugin functions or changes in other Slurm functions used by the plugin.
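For illustration, a plugin's identification symbols might look like the following sketch. The name string is an assumption, and SLURM_VERSION_NUMBER is normally supplied by Slurm's headers; it is hard-coded here only so the sketch is self-contained.

```c
/* Illustrative plugin identification symbols for an hdf5-style
 * profiling plugin. SLURM_VERSION_NUMBER is a stand-in for the
 * macro that Slurm's headers provide in a real build. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SLURM_VERSION_NUMBER 0x0f0802   /* stand-in for the real macro */

const char plugin_name[] = "AcctGatherProfile hdf5 plugin";
const char plugin_type[] = "acct_gather_profile/hdf5";
const uint32_t plugin_version = SLURM_VERSION_NUMBER;
```

Note that the major type before the "/" matches the required "acct_gather_profile" string, with the minor type naming the implementation.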

The programmer is urged to study src/plugins/acct_gather_profile/acct_gather_profile_hdf5.c and src/common/slurm_acct_gather_profile.c for a sample implementation of a Slurm profile accounting plugin.

API Functions

All of the following functions are required. Functions which are not implemented must be stubbed.

int init (void)

Description:
Called when the plugin is loaded, before any other functions are called. Put global initialization here.

Returns:
SLURM_SUCCESS on success, or
SLURM_ERROR on failure.

void fini (void)

Description:
Called when the plugin is removed. Clear any allocated storage here.

Returns: None.

Note: These init and fini functions are not the same as those described in the dlopen (3) system library. The C run-time system co-opts those symbols for its own initialization. The system _init() is called before the SLURM init(), and the SLURM fini() is called before the system's _fini().
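A minimal skeleton of these two entry points might look like the following; SLURM_SUCCESS is hard-coded as a stand-in for the value defined in Slurm's headers, and the plugin_ready flag is purely illustrative.

```c
/* Skeleton init()/fini() pair for a profiling plugin. */
#include <assert.h>

#define SLURM_SUCCESS 0   /* stand-in; provided by Slurm headers in a real build */

static int plugin_ready = 0;   /* illustrative global state */

extern int init(void)
{
    /* Global initialization goes here (open logs, allocate state, ...). */
    plugin_ready = 1;
    return SLURM_SUCCESS;
}

extern void fini(void)
{
    /* Release anything allocated in init(). */
    plugin_ready = 0;
}
```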

void acct_gather_profile_g_child_forked(void)

Description:
Called from slurmstepd between fork() and exec() of the application. Close any open files here.

Returns: None.

void acct_gather_profile_g_conf_options(s_p_options_t **full_options, int *full_options_cnt)

Description:
Defines configuration options in acct_gather.conf

Arguments:
full_options (out) -- option definitions.
full_options_cnt (out) -- number of options in full_options.

Returns: None.
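As a sketch of what an implementation might do, the function below appends this plugin's keys (the ProfileDir and ProfileDefaultProfile parameters described later) to the shared option table. The s_p_options_t and type enum here are simplified stand-ins; the real definitions live in src/common/parse_config.h, and the _p_ name mirrors the plugin-side naming convention.

```c
/* Sketch: append this plugin's acct_gather.conf keys to the option
 * table shared by all acct_gather plugins. The types below are
 * simplified stand-ins for Slurm's real definitions. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum { S_P_STRING, S_P_UINT32 } s_p_type_t;                 /* stand-in */
typedef struct { const char *key; s_p_type_t type; } s_p_options_t; /* stand-in */

void acct_gather_profile_p_conf_options(s_p_options_t **full_options,
                                        int *full_options_cnt)
{
    static const s_p_options_t options[] = {
        { "ProfileDir",            S_P_STRING },
        { "ProfileDefaultProfile", S_P_STRING },
    };
    int n = (int)(sizeof(options) / sizeof(options[0]));

    /* Grow the shared table and append our entries. */
    *full_options = realloc(*full_options,
                            (size_t)(*full_options_cnt + n) * sizeof(s_p_options_t));
    memcpy(*full_options + *full_options_cnt, options,
           (size_t)n * sizeof(options[0]));
    *full_options_cnt += n;
}
```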

void acct_gather_profile_g_conf_set(s_p_hashtbl_t *tbl)

Description:
Set configuration options from acct_gather.conf

Arguments:
tbl -- hash table of options.

Returns: None.

void *acct_gather_profile_g_conf_get(void)

Description:
Gets the configuration options read from acct_gather.conf.

Returns:
Pointer to a slurm_acct_gather_conf_t structure on success, or
NULL on failure.

int acct_gather_profile_p_node_step_start(stepd_step_rec_t* job)

Description:
Called once per step on each node from slurmstepd, before launching tasks.
Provides an opportunity to create files and other node-step level initialization.

Arguments:
job -- stepd_step_rec_t structure containing information about the step.

Returns:
SLURM_SUCCESS on success, or
SLURM_ERROR on failure.
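For example, node-step initialization might build the path of the per-node-step profile file under the shared ProfileDir. The "<node>_<jobid>_<stepid>.h5" naming scheme below is an assumption for illustration, not the hdf5 plugin's exact format.

```c
/* Sketch: build the path of the per-node-step profile file under
 * the shared ProfileDir. The naming scheme is illustrative only. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static void build_profile_path(char *buf, size_t len,
                               const char *profile_dir,
                               const char *node_name,
                               uint32_t job_id, uint32_t step_id)
{
    /* One file per step per node, all under the shared ProfileDir. */
    snprintf(buf, len, "%s/%s_%u_%u.h5",
             profile_dir, node_name, job_id, step_id);
}
```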

int acct_gather_profile_p_node_step_end(stepd_step_rec_t* job)

Description:
Called once per step on each node from slurmstepd, after all tasks end.
Provides an opportunity to close files, etc.

Arguments:
job -- stepd_step_rec_t structure containing information about the step.

Returns:
SLURM_SUCCESS on success, or
SLURM_ERROR on failure.

int acct_gather_profile_p_task_start(stepd_step_rec_t* job, uint32_t taskid)

Description:
Called once per task from slurmstepd, BEFORE node step start is called.
Provides an opportunity to gather beginning values from node counters (bytes_read ...)

Arguments:
job -- stepd_step_rec_t structure containing information about the step.
taskid -- Slurm taskid.

Returns:
SLURM_SUCCESS on success, or
SLURM_ERROR on failure.

int acct_gather_profile_p_task_end(stepd_step_rec_t* job, pid_t taskpid)

Description:
Called once per task from slurmstepd.
Provides an opportunity to put final data for a task.

Arguments:
job -- stepd_step_rec_t structure containing information about the step.
taskpid -- task process id (pid_t).

Returns:
SLURM_SUCCESS on success, or
SLURM_ERROR on failure.

int acct_gather_profile_p_add_sample_data(uint32_t type, void *data)

Description:
Put data at the Node Samples level. Typically called from a routine that runs at either the job_acct_gather interval or the acct_gather_energy interval.
All samples in the same group will eventually be consolidated into one time series.

Arguments:

type -- identifies the type of data.
data -- data structure to be put to the file.

Returns:
SLURM_SUCCESS on success, or
SLURM_ERROR on failure.
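A sampling source might use this call along the following lines, filling a sample structure (the profile_energy_t layout from the Data Types section) and handing it to the plugin. The ACCT_GATHER_PROFILE_ENERGY constant and the mock implementation below are stand-ins; a real plugin would append the record to the node-step file.

```c
/* Sketch of how a sampling source hands one record to the profile
 * plugin. Constants and the mock add_sample_data() are stand-ins. */
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define SLURM_SUCCESS 0                 /* stand-in */
#define ACCT_GATHER_PROFILE_ENERGY 1    /* stand-in type identifier */
#define TOD_LEN 24

typedef struct profile_energy {
    char     tod[TOD_LEN];   /* not used in node-step */
    time_t   time;
    uint64_t watts;
    uint64_t cpu_freq;
} profile_energy_t;

static uint64_t last_watts_seen;   /* lets the sketch observe the call */

/* Mock of acct_gather_profile_p_add_sample_data(): a real plugin
 * would write the record into the node-step HDF5 file instead. */
static int add_sample_data(uint32_t type, void *data)
{
    if (type == ACCT_GATHER_PROFILE_ENERGY) {
        profile_energy_t *es = data;
        last_watts_seen = es->watts;
    }
    return SLURM_SUCCESS;
}
```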

Parameters

These parameters can be used in the slurm.conf to configure the plugin and the frequency at which to gather node profile data.

AcctGatherProfileType
Specifies which plugin should be used.

The acct_gather.conf provides profile configuration options.

ProfileDir
Path to a location in a shared file system in which to write profile data. There is no default, as there is no standard location for a shared file system. If this parameter is not specified, no profiling will occur.
ProfileDefaultProfile
Default setting for --profile command line option for srun, salloc, sbatch.

The default profile value is none, which means no profiling will be done for jobs. The hdf5 plugin also supports:

  • energy — Sample energy use for the node.
  • lustre — Sample I/O to the Lustre file system for the node.
  • network — Sample I/O through the network (InfiniBand) interface for the node.
  • task — Sample local disk I/O, CPU and memory use for each task.
  • all — All of the above.

Use caution when setting the default to a value other than none, as a file will be created for every job. This option is provided primarily for test systems.

Most of the sources of profile data are associated with various acct_gather plugins. The acct_gather.conf file has settings for the various sampling mechanisms that can be used to change the frequency at which samples occur.
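A hypothetical configuration tying these pieces together (the shared-file-system path is an assumed example):

```
# slurm.conf
AcctGatherProfileType=acct_gather_profile/hdf5
JobAcctGatherFrequency=30

# acct_gather.conf
ProfileDir=/app/slurm/profile_data
ProfileDefaultProfile=none
```

With ProfileDefaultProfile=none, profiling occurs only for jobs that request it explicitly, e.g. srun --profile=energy.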

Data Types

A plugin-like structure is implemented to generalize HDF5 data operations from various sources. A C typedef is defined for each datatype. These declarations are in src/common/slurm_acct_gather_profile.h so the datatypes are common to all profile plugins.

The operations are defined via structures of function pointers, declared in src/plugins/acct_gather_profile/common/profile_hdf5.h, and should work with any plugin implementation based on HDF5, not only hdf5.

Functions must be implemented to perform the various operations for each datatype. The API for the plugin includes an argument for the datatype so that the implementation of that API can call the specific operation for that datatype.

Groups in the HDF5 file containing a dataset will include an attribute for the datatype so that the program that merges step files into the job can discover the type of the group and do the right thing.

For example, the typedef for the energy sample datatype:

typedef struct profile_energy {
    char     tod[TOD_LEN];	// Not used in node-step
    time_t   time;
    uint64_t watts;
    uint64_t cpu_freq;
} profile_energy_t;

A factory method is implemented for each type to construct a structure with functions implementing various operations for the type. The following structure of functions is required for each type.

/*
 * Structure of function pointers of common operations on a
 * profile data type. (Some may be stubs, particularly if the data type
 * does not represent a time series.)
 *	dataset_size -- size of one dataset (structure size).
 *      create_memory_datatype -- creates hdf5 memory datatype
 *          corresponding to the datatype structure.
 *      create_file_datatype -- creates hdf5 file datatype
 *          corresponding to the datatype structure.
 *      create_s_memory_datatype -- creates hdf5 memory datatype
 *          corresponding to the summary datatype structure.
 *      create_s_file_datatype -- creates hdf5 file datatype
 *          corresponding to the summary datatype structure.
 *      init_job_series -- allocates a buffer for a complete time
 *          series (in job merge) and initializes each member
 *      merge_step_series -- merges all the individual time samples
 *          into a single data set with one item per sample.
 *          Data items can be scaled (e.g. subtracting beginning time)
 *          differenced (to show counts in interval) or other things
 *          appropriate for the series.
 *      series_total -- accumulate or average members in the entire
 *          series to be added to the file as totals for the node or
 *          task.
 *      extract_series -- format members of a structure for putting
 *          to a file data extracted from a time series to be imported into
 *          another analysis tool (e.g. format as comma-separated values).
 *      extract_totals -- format members of a structure for putting
 *          to a file data extracted from a time series total to be imported
 *          into another analysis tool (e.g. format as comma-separated values).
 */
typedef struct profile_hdf5_ops {
    int   (*dataset_size) ();
    hid_t (*create_memory_datatype) ();
    hid_t (*create_file_datatype) ();
    hid_t (*create_s_memory_datatype) ();
    hid_t (*create_s_file_datatype) ();
    void* (*init_job_series) (int, int);
    void  (*merge_step_series) (hid_t, void*, void*, void*);
    void* (*series_total) (int, void*);
    void  (*extract_series) (FILE*, bool, int, int, char*,
				       char*, void*);
    void  (*extract_totals) (FILE*, bool, int, int, char*,
				       char*, void*);
} profile_hdf5_ops_t;

Note there are two different data types for supporting time series.
1) A primary type is defined for gathering data in the node step file. It is typically named profile_{series_name}_t.
2) Another type is defined for summarizing series totals. It is typically named profile_{series_name}_s_t. It does not have a 'factory'. It is only used in the functions of the primary data type, and the primary type's structure has operations to create the appropriate HDF5 objects.

When adding a new type, the profile_factory function has to be modified to return an ops for the type.
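A sketch of that dispatch is shown below, with a simplified ops structure (only dataset_size is shown) and an assumed type identifier; the real definitions live in src/plugins/acct_gather_profile/common/profile_hdf5.h.

```c
/* Sketch of a profile_factory dispatching on the series type. The
 * type identifier and the reduced ops struct are stand-ins for the
 * real definitions. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define TOD_LEN 24
#define ACCT_GATHER_PROFILE_ENERGY 1    /* stand-in */

typedef struct profile_energy {
    char     tod[TOD_LEN];
    time_t   time;
    uint64_t watts;
    uint64_t cpu_freq;
} profile_energy_t;

typedef struct profile_ops {            /* simplified profile_hdf5_ops_t */
    int (*dataset_size)(void);
} profile_ops_t;

static int energy_dataset_size(void)
{
    return (int) sizeof(profile_energy_t);
}

/* Return the ops table for a series type; a new type adds a case here. */
static profile_ops_t *profile_factory(uint32_t type)
{
    static profile_ops_t energy_ops = { energy_dataset_size };

    switch (type) {
    case ACCT_GATHER_PROFILE_ENERGY:
        return &energy_ops;
    default:
        return NULL;
    }
}
```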

Interaction between the datatype and HDF5

  • The profile_{type}_t structure is used by callers of the add_sample_data functions.
  • HDF5 needs a memory_datatype to transform this structure into its dataset object in memory. The create_memory_datatype function creates the appropriate object.
  • HDF5 needs a file_datatype to transform the dataset into how it will be written to the HDF5 file (or to transform what it reads from a file into a dataset). The create_file_datatype function creates the appropriate object.

Last modified 27 March 2015