Slurm Burst Buffer Guide
- Configuration (for system administrators)
- Job Submission Commands
- Persistent Burst Buffer Creation and Deletion Directives
- Interactive Job Options
- Status Commands
- Advanced Reservations
- Job Dependencies
Slurm includes support for burst buffers, a shared high-speed storage resource. Slurm provides support for allocating these resources, staging files in, scheduling compute nodes for jobs using these resources, then staging files out. Burst buffers can also be used as temporary storage during a job's lifetime, without file staging. Another typical use case is for persist storage, not associated with any specific job. This support is provided using a plugin mechanism so that a various burst buffer infrastructures may be easily configured. Two plugins are provided initially:
- datawarp - Uses Cray APIs to perform underlying management functions
- generic - Uses system administrator defined scripts to perform underlying management functions, currently under development
Additional plugins may be provided in future releases of Slurm.
Slurm's mode of operation follows this general pattern:
- Read configuration information and initial state information
- After expected start times for pending jobs are established, allocate burst buffers to those jobs expected to start earliest and start stage-in of required files
- After stage-in has completed, jobs can be allocated compute nodes and begin execution
- After job has completed execution, begin file stage-out from burst buffer
- After file stage-out has completed, burst buffer can be released and the job record purged
Burst buffer support in Slurm is enabled by specifying the plugin(s) to loaded for managing these resources using the BurstBufferType configuration parameter in the slurm.conf file. Multiple plugin names may be specified in a comma separated list. Detailed logging of burst buffer specific actions may be generated for debugging purposes by using the DebugFlags=BurstBuffer configuration parameter. The BurstBuffer DebugFlags (like many other DebugFlags) can result in very verbose logging and is only intended for diagnostic purposes rather than for use in a production system.
# Excerpt of example slurm.conf file BurstBufferType=burst_buffer/generic # DebugFlags=BurstBuffer # Commented out
Burst buffer specific options should be defined in a burst_buffer.conf file. If multiple burst buffer plugins are configured, an independent configuration file can be specified for each plugin with a file name including the plugin name. For example files named "burst_buffer_datawarp.conf" and "burst_buffer_generic.conf" can be used independently by the datawarp and generic burst buffer plugins respectively. This file can contain information about who can or can not use burst buffers, timeouts, and paths of scripts used to perform various functions, etc. TRES limits can be configured to establish limits by association, QOS, etc. The size of a job's burst buffer requirements can be used as a factor in setting the job priority as described in the multifactor priority document.
Note that sample scripts for performing burst buffer operations for the burst_buffer/generic plugin are included with the Slurm distribution. These scripts were intended for Slurm development and testing purposes and are not currently intended for customer use. Contributions of usable scripts/programs are invited. For more information about the functionality that might be included in the scripts, please see The Ramdisk Storage Accelerator by Timothy B. Wickberg. Note that these scripts/programs are executed as SlurmUser rather than user root. Please see the burst_buffer.conf man page for more configuration information.
# Excerpt of burst_buffer.conf file for generic plugin AllowUsers=alan,brenda Flags=EnablePersistent,PrivateData Granularity=1GB StageInTimeout=30 StageOutTimeout=30 # GetSysState=/usr/local/slurm/15.08/sbin/GSS StartStageIn=/usr/local/slurm/15.08/sbin/SSI StartStageOut=/usr/local/slurm/15.08/sbin/SSO StopStageIn=/usr/local/slurm/15.08/sbin/PSI StopStageOut=/usr/local/slurm/15.08/sbin/PSO
Note for Cray systems: The JSON-C library must be installed in order to build Slurm's burst_buffer/datawarp plugin, which must parse JSON format data. See Slurm's JSON installation information for details.
The normal mode of operation is for batch jobs to specify burst buffer requirements within the batch script. Batch script lines containing a prefix of "#BB" identify the job's burst buffer space requirements, files to be staged in, files to be staged out, etc. when using the burst_buffer/generic plugin. The prefix of "#DW" (for "DataWarp") is used for burst buffer directives when using the burst_buffer/datawarp plugin. Please reference Cray documentation for details about the DataWarp options. For DataWarp systems, the prefix of "#BB" can be used to create or delete persistent burst buffer storage (NOTE: The "#BB" prefix is used since the command is interpreted by Slurm and not by the Cray Datawarp software). Interactive jobs (those submitted using the salloc and srun commands) can specify their burst buffer space requirements using the "--bb" or "--bbf" command line options, as described later in this document. All of the "#SBATCH", "#DW" and "#BB" directives should be placed at the top of the script (before any non-comment lines). All of the persistent burst buffer creations and deletions happen before the job's compute portion happens. In a similar fashion, you can't stage files in/out at various points in the script execution. All file stage-in happens prior to the job's compute portion and all file stage-out happens after computation.
The salloc and srun commands can create and use job-specific burst buffers. For both commands, the "--bb" or "--bbf" option is used to specify the job's burst buffer requirements. Note the burst buffer may not be accessible from a login node, but require that salloc spawn a shell on one of it's allocated compute nodes. See the description of SallocDefaultCommand in the slurm.conf man page for more information about how to spawn a remote shell.
A basic validation is performed on the job's burst buffer options at job submit time. If the options are invalid, the job will be rejected and an error message will be returned directly to the user. Note that unrecognized options may be ignored in order to support backward compatibility (i.e. a job submission would not fail in the case of an option being specified that is recognized by some versions of Slurm, but not recognized by other versions). If the job is accepted, but later fails (e.g. some problem staging files), the job will be held and its "Reason" field will be set to error message provided by the underlying infrastructure.
Users may also request to be notified by email upon completion of burst buffer stage out using the "--mail-type=stage_out" or "--mail-type=all" option. The subject line of the email will be of this form:
SLURM Job_id=12 Name=my_app Staged Out, StageOut time 00:05:07
These options are used for both the burst_buffer/datawarp and burst_buffer/generic plugins to create and delete persistent burst buffers, which have a lifetime independent of the job.
- #BB create_persistent name=<name> capacity=<number> [access=<access>] [pool=<pool> [type=<type>]
- #BB destroy_persistent name=<name> [hurry]
The persistent burst buffer name may not start with a numeric value (numeric names are reserved for job-specific burst buffers). The capacity (size) specification can include a suffix of "N" (nodes), "K|KiB", "M|MiB", "G|GiB", "T|TiB", "P|PiB" (for powers of 1024) and "KB", "MB", "GB", "TB", "PB" (for powers of 1000). NOTE: Usually Slurm interprets KB, MB, GB, TB, PB, TB units as powers of 1024, but for Burst Buffers size specifications Slurm supports both IEC/SI formats. This is because the CRAY API for supports both formats. The access parameter identifies the buffer access mode. Supported access modes for the burst_buffer/datawarp plugin include: striped, private, and ldbalance. The pool parameter identifies the resource pool from in the burst buffer should be created. The default and available pools are configuration dependent. The type parameter identifies the buffer type. Supported type modes for the burst_buffer/datawarp plugin include: cache and scratch. Multiple persistent burst buffers may be created or deleted within a single job. A sample batch script follows:
#!/bin/bash #BB create_persistent name=alpha capacity=32GB access=striped type=scratch #DW jobdw type=scratch capacity=1GB access_mode=striped #DW stage_in type=file source=/home/alan/data.in destination=$DW_JOB_STRIPED/data #DW stage_out type=file destination=/home/alan/data.out source=$DW_JOB_STRIPED/data /home/alan/a.out
Persistent burst buffers can be created and deleted by a job requiring no compute resources. Submit a job with the desired burst buffer directives and specify a node count of zero (e.g. "sbatch -N0 setup_buffers.bash"). Attempts to submit a zero size job without burst buffer directives or with job-specific burst buffer directives will generate an error. Note that zero size jobs are not supported for job arrays or heterogeneous job allocations.
NOTE: The ability to create and destroy persistent burst buffers may be limited by the "Flags" option in the burst_buffer.conf file. See the burst_buffer.conf man page for more information. By default only privileged users (Slurm operators and administrators) can create or destroy persistent burst buffers.
Interactive jobs may include directives for creating job-specific burst buffers as well as file staging. These options may be specified using either the "--bb" or "--bbf" option of the salloc or srun command. Note that support for creation and destruction of persistent burst buffers using the "--bb" option is not provided. The "--bbf" option take as an argument a filename and that file should contain a collection of burst buffer operations identical to that used for batch jobs. This file may contain file staging directives. Alternately the "--bb" option may be used to specify burst buffer directives as the option argument. The format of those directives can either be identical to those used in a batch script OR a very limited set of directives can be uses, which are translated to the equivalent script for later processing. Multiple directives should be space separated.
If a swap option is specified, the job must also specify the required node count. The capacity (size) specification can include a suffix of "N" (nodes), "K|KiB", "M|MiB", "G|GiB", "T|TiB", "P|PiB" (for powers of 1024) and "KB", "MB", "GB", "TB", "PB" (for powers of 1000). NOTE: Usually Slurm interprets KB, MB, GB, TB, PB, TB units as powers of 1024, but for Burst Buffers size specifications Slurm supports both IEC/SI formats. This is because the CRAY API for supports both formats. A sample command line follows and we also show the equivalent burst buffer script generated by the options:
# Sample execute line: srun --bb="capacity=1G access=striped type=scratch" a.out # Equivalent script as generated by Slurm's burst_buffer/datawarp plugin #DW jobdw capacity=1GiB access_mode=striped type=scratch
Slurm's current burst buffer state information is available using the scontrol show burst command or by using the sview command's Burst Buffer tab. A sample scontrol output is shown below. The scontrol "-v" option may be used for a more verbose output format.
$ scontrol show burst Name=generic DefaultPool=ssd Granularity=100G TotalSpace=50T UsedSpace=42T StageInTimeout=30 StageOutTimeout=30 Flags=EnablePersistent,PrivateData AllowUsers=alan,brenda CreateBuffer=/usr/local/slurm/17.11/sbin/CB DestroyBuffer=/usr/local/slurm/17.11/sbin/DB GetSysState=/usr/local/slurm/17.11/sbin/GSS StartStageIn=/usr/local/slurm/17.11/sbin/SSI StartStageIn=/usr/local/slurm/17.11/sbin/SSO StopStageIn=/usr/local/slurm/17.11/sbin/PSI StopStageIn=/usr/local/slurm/17.11/sbin/PSO Allocated Buffers: JobID=18 CreateTime=2017-08-19T16:46:05 Pool=dwcache Size=10T State=allocated UserID=alan(1000) JobID=20 CreateTime=2017-08-19T16:46:45 Pool=dwcache Size=10T State=allocated UserID=alan(1000) Name=DB1 CreateTime=2017-08-19T16:46:45 Pool=dwcache Size=22T State=allocated UserID=brenda(1001) Per User Buffer Use: UserID=alan(1000) Used=20T UserID=brenda(1001) Used=22T
Access to the Cray burst buffer status tool, dwstat, is available from the scontrol command using the "scontrol show bbstat ..." or "scontrol show dwstat ..." command. Options following "bbstat" or "dwstat" on the scontrol execute line are passed directly to the dwstat command as shown below. See Cray DataWarp documentation for details about dwstat options and output.
/opt/cray/dws/default/bin/dwstat $ scontrol show dwstat pool units quantity free gran' wlm_pool bytes 7.28TiB 7.28TiB 1GiB' $ scontrol show dwstat sessions sess state token creator owner created expiration nodes 832 CA--- 783000000 tester 12345 2015-09-08T16:20:36 never 20 833 CA--- 784100000 tester 12345 2015-09-08T16:21:36 never 1 903 D---- 1875700000 tester 12345 2015-09-08T17:26:05 never 0 $ scontrol show dwstat configurations conf state inst type access_type activs 715 CA--- 753 scratch stripe 1 716 CA--- 754 scratch stripe 1 759 D--T- 807 scratch stripe 0 760 CA--- 808 scratch stripe 1
Burst buffer resources can be placed in an advanced reservation using the
The argument consists of four elements:
plugin is the burst buffer plugin name, currently either "datawarp" or "generic". If no plugin is specified, the reservation applies to all configured burst buffer plugins.
pool specifies a Cray generic burst buffer resource pool. if "type" is not specified, the number is a measure of storage space.
units may be "N" (nodes), "K|KiB", "M|MiB", "G|GiB", "T|TiB", "P|PiB" (for powers of 1024) and "KB", "MB", "GB", "TB", "PB" (for powers of 1000). The default units are bytes for reservations of storage space. NOTE: Usually Slurm interprets KB, MB, GB, TB, PB, TB units as powers of 1024, but for Burst Buffers size specifications Slurm supports both IEC/SI formats. This is because the CRAY API for supports both formats.
Jobs using this reservation are not restricted to these burst buffer resources, but may use these reserved resources plus any which are generally available. Some examples follow.
$ scontrol create reservation starttime=now duration=60 \ users=alan flags=any_nodes \ burstbuffer=datawarp:100G,generic:20G $ scontrol create reservation StartTime=noon duration=60 \ users=brenda NodeCnt=8 \ BurstBuffer=datawarp:20G $ scontrol create reservation StartTime=16:00 duration=60 \ users=joseph flags=any_nodes \ BurstBuffer=datawarp:pool_test:4G
If two jobs use burst buffers and one is depend on the other (e.g. "sbatch --dependency=afterok:123 ...") then the second job will not begin until the first job completes and it's burst buffer stage out completes. If the second job does not use a burst buffer, but is dependent upon the first job's completion, then it will not wait for the stage out operation of the first job to complete. The second job can be made to wait for the first job's stage out operation to complete using the "afterburstbuffer" dependency option (e.g. "sbatch --dependency=afterburstbuffer:123 ...").
Last modified 7 March 2019