Heterogeneous Job Support

Overview

Slurm version 17.11 and later supports the ability to submit and manage heterogeneous jobs, in which each component has virtually all job options available including partition, account and QOS (Quality Of Service). For example, part of a job might require four cores and 4 GB for each of 128 tasks while another part of the job would require 16 GB of memory and one CPU.

Submitting Jobs

The salloc, sbatch and srun commands can all be used to submit heterogeneous jobs. Resource specifications for each component of the heterogeneous job should be separated by the ":" character. For example:

$ sbatch --cpus-per-task=4 --mem-per-cpu=1g  --ntasks=128 : \
         --cpus-per-task=1 --mem-per-cpu=16g --ntasks=1 my.bash

Options specified for one component of a heterogeneous job (or job step) will be used for subsequent components to the extent that this is expected to be helpful. Propagated options can be reset as desired for each component (e.g. a different account name could be specified for each job component). For example, --immediate and --job-name are propagated, while --ntasks and --mem-per-cpu are reset to default values for each component. A list of propagated options follows.

  • --account
  • --acctg-freq
  • --begin
  • --checkpoint
  • --checkpoint-dir
  • --cluster-constraint
  • --clusters
  • --comment
  • --deadline
  • --delay-boot
  • --dependency
  • --distribution
  • --epilog (option available only in srun)
  • --error
  • --export
  • --export-file
  • --exclude
  • --get-user-env
  • --gid
  • --hold
  • --ignore-pbs
  • --immediate
  • --input
  • --job-name
  • --kill-on-bad-exit (option available only in srun)
  • --label (option available only in srun)
  • --max-exit-timeout (option available only in srun)
  • --mcs-label
  • --msg-timeout (option available only in srun)
  • --no-allocate (option available only in srun)
  • --no-requeue
  • --nice
  • --no-kill
  • --open-mode (option available only in srun)
  • --output
  • --parsable
  • --priority
  • --profile
  • --propagate
  • --prolog (option available only in srun)
  • --pty (option available only in srun)
  • --qos
  • --quiet
  • --quit-on-interrupt (option available only in srun)
  • --reboot
  • --reservation
  • --requeue
  • --signal
  • --slurmd-debug (option available only in srun)
  • --task-epilog (option available only in srun)
  • --task-prolog (option available only in srun)
  • --time
  • --test-only
  • --time-min
  • --uid
  • --unbuffered (option available only in srun)
  • --verbose
  • --wait
  • --wait-all-nodes
  • --wckey
  • --workdir

The task distribution specification applies separately within each job component. Consider for example a heterogeneous job with each component being allocated 4 CPUs on 2 nodes. In our example, job component zero is allocated 2 CPUs on node "nid00001" and 2 CPUs on node "nid00002". Job component one is allocated 2 CPUs on node "nid00003" and 2 CPUs on node "nid00004". A task distribution of "cyclic" will distribute the first 4 tasks in a cyclic fashion on nodes "nid00001" and "nid00002", then distribute the next 4 tasks in a cyclic fashion on nodes "nid00003" and "nid00004" as shown below.

Node nid00001  Node nid00002  Node nid00003  Node nid00004
Rank 0         Rank 1         Rank 4         Rank 5
Rank 2         Rank 3         Rank 6         Rank 7
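
The placement described above can be reproduced with a short shell sketch. It models only the arithmetic (ranks are numbered consecutively across components and dealt round-robin over each component's own nodes); the function name cyclic_layout is hypothetical and is not part of Slurm.

```shell
# cyclic_layout FIRST_RANK NTASKS NODE...
# Print "rank node" pairs, dealing ranks round-robin over the given nodes.
cyclic_layout() {
    first=$1; ntasks=$2; shift 2
    nnodes=$#
    i=0
    while [ "$i" -lt "$ntasks" ]; do
        # Select positional parameter number (i mod nnodes) + 1.
        idx=$(( i % nnodes + 1 ))
        eval "node=\${$idx}"
        echo "$(( first + i )) $node"
        i=$(( i + 1 ))
    done
}

# Component 0: ranks 0-3 on nid00001 and nid00002.
cyclic_layout 0 4 nid00001 nid00002
# Component 1: ranks 4-7 on nid00003 and nid00004.
cyclic_layout 4 4 nid00003 nid00004
```

Each component is laid out independently, which is why rank 4 starts again on the first node of the second component rather than continuing the cycle from the first component.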

Some options should be specified only in the first job component. For example, specifying a batch job output file in the second job component's options will result in the first job component (where the batch script executes) using the default output file name.

Environment variables used to specify default options for the job submit command will be applied to every component of the heterogeneous job (e.g. SBATCH_ACCOUNT).

Batch job options can be included in the submitted script for multiple heterogeneous job components. Each component should be separated by a line containing the directive "#SBATCH packjob" as shown below.

$ cat new.bash
#!/bin/bash
#SBATCH --cpus-per-task=4 --mem-per-cpu=16g --ntasks=1
#SBATCH packjob
#SBATCH --cpus-per-task=2 --mem-per-cpu=1g  --ntasks=8

srun run.app

$ sbatch new.bash

The above is equivalent to the following:

$ cat my.bash
#!/bin/bash
srun run.app

$ sbatch --cpus-per-task=4 --mem-per-cpu=16g --ntasks=1 : \
         --cpus-per-task=2 --mem-per-cpu=1g  --ntasks=8 my.bash

The batch script will be executed in the first node in the first component of the heterogeneous job. For the above example, that will be the job component with 1 task, 4 CPUs and 64 GB of memory (16 GB for each of the 4 CPUs).
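
The memory figure quoted above follows from multiplying the per-CPU memory by the CPUs per task and the task count. The sketch below only illustrates that arithmetic; component_mem_gb is a hypothetical helper name, and it assumes memory is given in whole gigabytes.

```shell
# component_mem_gb CPUS_PER_TASK MEM_PER_CPU_GB NTASKS
# Print the total memory (in GB) requested by one job component.
component_mem_gb() {
    echo $(( $1 * $2 * $3 ))
}

component_mem_gb 4 16 1   # first component:  prints 64
component_mem_gb 2 1 8    # second component: prints 16
```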

If a heterogeneous job is submitted to run in multiple clusters not part of a federation (e.g. "sbatch --cluster=alpha,beta ...") then the entire job will be sent to the cluster expected to be able to start all components at the earliest time.

A resource limit test is performed when a heterogeneous job is submitted in order to immediately reject jobs that will not be able to start with current limits. The individual components of the heterogeneous job are validated, like all regular jobs. The heterogeneous job as a whole is also tested, but in a more limited fashion with respect to quality of service (QOS) limits. Each component of a heterogeneous job counts as a "job" with respect to resource limits.

Burst Buffers

A burst buffer can either be persistent or linked to a specific job ID (at least in the Cray implementation). Since a heterogeneous job consists of multiple job IDs, a job-specific burst buffer will be associated with only one heterogeneous job component. Only a persistent burst buffer can be accessed by all components of a heterogeneous job. A sample batch script demonstrating this for a Cray system is shown below.

#!/bin/bash
#SBATCH --nodes=1 --constraint=haswell
#BB create_persistent name=alpha capacity=10 access=striped type=scratch
#DW persistentdw name=alpha
#SBATCH packjob
#SBATCH --nodes=16 --constraint=knl
#DW persistentdw name=alpha
...

NOTE: Cray's DataWarp interface directly reads the job script, but has no knowledge of Slurm's "packjob" directive, so Slurm internally rebuilds the script for each job component so that only that job component's burst buffer directives are included in that script. The batch script of the first job component will be modified in order to replace the "#DW" directives of other job components with "#EXCLUDED DW" to prevent their interpretation by Cray infrastructure. Since the batch script will only be executed by the first job component, the scripts of subsequent job components will not include commands from the original script. These scripts are built and managed by Slurm for internal purposes (and are visible from various Slurm commands) from a user script as shown above. An example is shown below:

Rebuilt script for first job component

#!/bin/bash
#SBATCH --nodes=1 --constraint=haswell
#BB create_persistent name=alpha capacity=10 access=striped type=scratch
#DW persistentdw name=alpha
#SBATCH packjob
#SBATCH --nodes=16 --constraint=knl
#EXCLUDED DW persistentdw name=alpha
...


Rebuilt script for second job component

#!/bin/bash
#SBATCH --nodes=16 --constraint=knl
#DW persistentdw name=alpha
exit 0
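
The rewriting of the first component's script can be imitated with a small awk filter. This is an illustrative sketch only, not Slurm's internal code: the function name exclude_other_dw is hypothetical, and it assumes that components are separated by a line starting with "#SBATCH packjob" and that DataWarp directives start with "#DW ".

```shell
# exclude_other_dw: read a batch script on stdin and print it with every
# "#DW" directive that follows the first "#SBATCH packjob" line disabled.
exclude_other_dw() {
    awk '
        /^#SBATCH +packjob/ { later = 1 }
        later && /^#DW /    { sub(/^#DW /, "#EXCLUDED DW ") }
        { print }
    '
}

exclude_other_dw <<'EOF'
#!/bin/bash
#SBATCH --nodes=1 --constraint=haswell
#BB create_persistent name=alpha capacity=10 access=striped type=scratch
#DW persistentdw name=alpha
#SBATCH packjob
#SBATCH --nodes=16 --constraint=knl
#DW persistentdw name=alpha
EOF
```

The directives of the first component pass through unchanged, while the "#DW" line after the separator is printed as "#EXCLUDED DW persistentdw name=alpha".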

Managing Jobs

Information maintained in Slurm for a heterogeneous job includes:

  • job_id: Each component of a heterogeneous job will have its own unique job_id.
  • pack_job_id: This identification number applies to all components of the heterogeneous job. All components of the same job will have the same pack_job_id value and it will be equal to the job_id of the first component. We refer to this as the "pack leader".
  • pack_job_id_set: Regular expression identifying all job_id values associated with the job.
  • pack_job_offset: A unique sequence number applied to each component of the heterogeneous job. The first component will have a pack_job_offset value of 0, the next a value of 1, etc.

job_id  pack_job_id  pack_job_offset  pack_job_id_set
123     123          0                123-127
124     123          1                123-127
125     123          2                123-127
126     123          3                123-127
127     123          4                123-127

Table 1: Example job IDs

The smap, squeue and sview commands report the components of a heterogeneous job using the format "<pack_job_id>+<pack_job_offset>". For example, "123+4" would represent heterogeneous job ID 123 and its fifth component (note: the first component has a pack_job_offset value of 0).
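
Scripts that post-process such output may need to split the "<pack_job_id>+<pack_job_offset>" notation. A minimal sketch using shell parameter expansion follows; split_pack_id is a hypothetical helper name, not a Slurm command.

```shell
# split_pack_id ID
# Print "<pack_job_id> <pack_job_offset>" for input of the form "N+M".
# A plain job ID (no "+") is printed unchanged.
split_pack_id() {
    case $1 in
        *+*) echo "${1%%+*} ${1#*+}" ;;
        *)   echo "$1" ;;
    esac
}

split_pack_id 123+4   # prints "123 4"
split_pack_id 93      # prints "93"
```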

A request for a specific job ID that identifies the first component of a heterogeneous job (i.e. the "pack leader") will return information about all components of that job. For example:

$ squeue --job=93
JOBID PARTITION  NAME  USER ST  TIME  NODES NODELIST
 93+0     debug  bash  adam  R 18:18      1 nid00001
 93+1     debug  bash  adam  R 18:18      1 nid00011
 93+2     debug  bash  adam  R 18:18      1 nid00021

A request to cancel or otherwise signal a pack leader will be applied to all components of that pack job. A request to cancel a specific component of the pack job using the "#+#" notation will apply only to that specific component. For example:

$ squeue --job=93
JOBID PARTITION  NAME  USER ST  TIME  NODES NODELIST
 93+0     debug  bash  adam  R 19:18      1 nid00001
 93+1     debug  bash  adam  R 19:18      1 nid00011
 93+2     debug  bash  adam  R 19:18      1 nid00021
$ scancel 93+1
$ squeue --job=93
JOBID PARTITION  NAME  USER ST  TIME  NODES NODELIST
 93+0     debug  bash  adam  R 19:38      1 nid00001
 93+2     debug  bash  adam  R 19:38      1 nid00021
$ scancel 93
$ squeue --job=93
JOBID PARTITION  NAME  USER ST  TIME  NODES NODELIST

While a heterogeneous job is in pending state, only the entire job can be cancelled rather than its individual components. A request to cancel an individual component of a heterogeneous job that is in pending state will return an error. After the job has begun execution, an individual component can be cancelled.

Email notification for job state changes (the --mail-type option) is only supported for a pack leader. Requests for email notifications for other components of a heterogeneous job will be silently ignored.

Requests to modify an individual component of a job using the scontrol command must specify the job ID with the "#+#" notation. A request to modify a job by specifying the pack_job_id will modify all components of a heterogeneous job. For example:

# Change the account of component 2 of heterogeneous job 123:
$ scontrol update jobid=123+2 account=abc

# Change the time limit of all components of heterogeneous job 123:
$ scontrol update jobid=123 timelimit=60

The following operations can only be requested for a pack leader and will be applied to all components of that heterogeneous job. Requests to perform them on individual components of the heterogeneous job will return an error.

  • requeue
  • resume
  • suspend

The sbcast command supports heterogeneous job allocations. By default, sbcast will copy files to all nodes in the job allocation. The -j/--jobid option can be used to copy files to individual components as shown below.

$ sbcast --jobid=123   data /tmp/data
$ sbcast --jobid=123.0 app0 /tmp/app0
$ sbcast --jobid=123.1 app1 /tmp/app1

The srun command's --bcast option will transfer files to the nodes associated with the application to be launched as specified by the --pack-group option.

Slurm has a configuration option to control the behavior of some commands with respect to heterogeneous jobs. By default, a request to cancel, hold or release a job ID that is not the pack_job_id, but that of a job component, will operate only on that one component of the heterogeneous job. If the SchedulerParameters configuration parameter includes the option "whole_pack", then the operation will apply to all components of the job if any job component is specified to be operated upon. In the example below, the scancel command will cancel all components of job 93 if SchedulerParameters=whole_pack is configured; otherwise only job 93+1 will be cancelled. If a specific heterogeneous job component is specified (e.g. "scancel 93+1"), then only that one component will be affected.

$ squeue --job=93
JOBID PARTITION  NAME  USER ST  TIME  NODES NODELIST
 93+0     debug  bash  adam  R 19:18      1 nid00001
 93+1     debug  bash  adam  R 19:18      1 nid00011
 93+2     debug  bash  adam  R 19:18      1 nid00021
$ scancel 94   # Job ID 94 is equivalent to 93+1
# Cancel 93+0, 93+1 and 93+2 if SchedulerParameters includes "whole_pack"
# Cancel only 93+1 if SchedulerParameters does not include "whole_pack"

Accounting

Slurm's accounting database records the pack_job_id and pack_job_offset fields. The sacct command reports jobs using the format "<pack_job_id>+<pack_job_offset>" and can accept a job ID specification for filtering using the same format. If a pack_job_id value is specified as a job filter, then information about all components of that job will be reported as shown below.

$ sacct -j 67767
  JobID JobName Partition Account AllocCPUS     State ExitCode 
------- ------- --------- ------- --------- --------- -------- 
67767+0     foo     debug    test         2 COMPLETED      0:0 
67767+1     foo     debug    test         4 COMPLETED      0:0 

$ sacct -j 67767+1
  JobID JobName Partition Account AllocCPUS     State ExitCode 
------- ------- --------- ------- --------- --------- -------- 
67767+1     foo     debug    test         4 COMPLETED      0:0 

Launching Applications (Job Steps)

The srun command is used to launch applications. By default, the application is launched only on the first component of a heterogeneous job, but options are available to support different behaviors.

srun's "--pack-group" option defines which pack job component(s) are to have applications launched for them. The --pack-group option takes an expression defining which component(s) are to launch an application for an individual execution of the srun command. The expression can contain one or more component index values in a comma separated list. Ranges of index values can be specified in a hyphen separated list. By default, an application is launched only on component number zero. Some examples follow:

  • --pack-group=2
  • --pack-group=0,4
  • --pack-group=1,3-5
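
To make the expression syntax concrete, the sketch below expands a --pack-group style expression into the individual component index values it names. It is an illustration only, not how srun parses the option; expand_pack_group is a hypothetical helper name, and the input is assumed to be well formed with ascending ranges.

```shell
# expand_pack_group EXPR
# Print the component index values named by EXPR, separated by spaces.
expand_pack_group() {
    result=""
    for part in $(echo "$1" | tr ',' ' '); do
        case $part in
            *-*) lo=${part%-*}; hi=${part#*-} ;;   # range such as 3-5
            *)   lo=$part; hi=$part ;;             # single index
        esac
        i=$lo
        while [ "$i" -le "$hi" ]; do
            result="$result${result:+ }$i"
            i=$(( i + 1 ))
        done
    done
    echo "$result"
}

expand_pack_group 2       # prints "2"
expand_pack_group 0,4     # prints "0 4"
expand_pack_group 1,3-5   # prints "1 3 4 5"
```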

IMPORTANT: The ability to execute a single application across more than one job allocation does not work in Slurm version 17.11 except in the case of OpenMPI with Slurm's mpi/pmi2 plugin or a non-MPI job with Slurm's mpi/none. We expect support for other MPI implementations in a future release of Slurm. Until that time, an appropriate error message will be reported unless Slurm's SchedulerParameters configuration parameter includes the option "enable_hetero_steps".

By default, the applications launched by a single execution of the srun command (even for different components of the heterogeneous job) are combined into one MPI_COMM_WORLD with non-overlapping task IDs.

As with the salloc and sbatch commands, the ":" character is used to separate multiple components of a heterogeneous job. This makes it possible to execute different applications and arguments for each job component. This convention also means that the stand-alone ":" character cannot be used as an argument to an application launched by srun. If some heterogeneous job component lacks an application specification, the next application specification provided will be used for earlier components lacking one as shown below.

$ srun --label -n2 : -n1 hostname
0: nid00012
1: nid00012
2: nid00013

If multiple srun commands are executed concurrently, this may result in resource contention (e.g. memory limits preventing some job step components from being allocated resources because of two srun commands executing at the same time). If the srun --pack-group option is used to create multiple job steps (for the different components of a heterogeneous job), those job steps will be created sequentially. When multiple srun commands execute at the same time, this may result in some step allocations taking place, while others are delayed. Only after all job step allocations have been granted will the application be launched.

All components of a job step will have the same step ID value. If job steps are launched on subsets of the job components there may be gaps in the step ID values for individual job components.

$ salloc -n1 : -n2 -C beta bash
salloc: Pending job allocation 1721
salloc: Granted job allocation 1721
$ srun --pack-group=0,1 true   # Launches steps 1721.0 and 1722.0
$ srun --pack-group=0 true     # Launches step  1721.1, no 1722.1
$ srun --pack-group=0,1 true   # Launches steps 1721.2 and 1722.2

The maximum pack group specified in a job step allocation (either explicitly specified or implied by the ":" separator) must not exceed the number of pack groups in the job allocation. For example:

$ salloc -n1 -C alpha : -n2 -C beta bash
salloc: Pending job allocation 1728
salloc: Granted job allocation 1728
$ srun --pack-group=0,1 hostname
nid00001
nid00008
nid00008
$ srun hostname : date : id
error: Attempt to run a job step with pack group value of 2,
       but the job allocation has maximum value of 1

Environment Variables

Slurm environment variables will be set independently for each component of the job by appending "_PACK_GROUP_" and a sequence number to the usual name. In addition, the "SLURM_JOB_ID" environment variable will contain the job ID of the pack leader and "SLURM_PACK_SIZE" will contain the number of components in the job. For example:

$ salloc -N1 : -N2 bash
salloc: Pending job allocation 11741
salloc: job 11741 queued and waiting for resources
salloc: job 11741 has been allocated resources
$ env | grep SLURM
SLURM_JOB_ID=11741
SLURM_PACK_SIZE=2
SLURM_JOB_ID_PACK_GROUP_0=11741
SLURM_JOB_ID_PACK_GROUP_1=11742
SLURM_NNODES_PACK_GROUP_0=1
SLURM_NNODES_PACK_GROUP_1=2
SLURM_JOB_NODELIST_PACK_GROUP_0=nid00001
SLURM_JOB_NODELIST_PACK_GROUP_1=nid[00011-00012]
...
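
A script running inside the allocation can walk these per-component variables using SLURM_PACK_SIZE as the loop bound. The sketch below uses only the variable names shown in the example above; list_pack_components is a hypothetical helper name.

```shell
# list_pack_components
# Print one line per job component from the "_PACK_GROUP_<n>" variables.
# Prints nothing when SLURM_PACK_SIZE is unset (e.g. outside such a job).
list_pack_components() {
    i=0
    while [ "$i" -lt "${SLURM_PACK_SIZE:-0}" ]; do
        # Build the variable names for component $i and read their values.
        eval "id=\${SLURM_JOB_ID_PACK_GROUP_$i}"
        eval "nodes=\${SLURM_JOB_NODELIST_PACK_GROUP_$i}"
        echo "component $i: job $id on $nodes"
        i=$(( i + 1 ))
    done
}

list_pack_components
```

With the environment shown above, this would report job 11741 on nid00001 for component 0 and job 11742 on nid[00011-00012] for component 1.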

The various MPI implementations rely heavily upon Slurm environment variables for proper operation. A single MPI application executing in a single MPI_COMM_WORLD requires a uniform set of environment variables that reflect a single job allocation. The example below shows how Slurm sets environment variables for MPI.

$ salloc -N1 : -N2 bash
salloc: Pending job allocation 11751
salloc: job 11751 queued and waiting for resources
salloc: job 11751 has been allocated resources
$ env | grep SLURM
SLURM_JOB_ID=11751
SLURM_PACK_SIZE=2
SLURM_JOB_ID_PACK_GROUP_0=11751
SLURM_JOB_ID_PACK_GROUP_1=11752
SLURM_JOB_NODELIST_PACK_GROUP_0=nid00001
SLURM_JOB_NODELIST_PACK_GROUP_1=nid[00011-00012]
...
$ srun --pack-group=0,1 env | grep SLURM
SLURM_JOB_ID=11751
SLURM_JOB_NODELIST=nid[00001,00011-00012]
...

Examples

Create a heterogeneous resource allocation containing one node with 256GB of memory and a feature of "haswell" plus 2176 cores on 32 nodes with a feature of "knl". Then launch a program called "master" on the "haswell" node and "slave" on the "knl" nodes. Each application will be in its own MPI_COMM_WORLD.

salloc -N1 --mem=256GB -C haswell : \
       -n2176 -N32 --ntasks-per-core=1 -C knl bash
srun master &
srun --pack-group=1 slave &
wait

This variation of the above example launches programs "master" and "slave" in a single MPI_COMM_WORLD.

salloc -N1 --mem=256GB -C haswell : \
       -n2176 -N32 --ntasks-per-core=1 -C knl bash
srun master : slave &

The SLURM_PROCID and SLURM_TASKID environment variables will be set to reflect a global task rank (both environment variables will have the same value). Each spawned process will have a unique SLURM_PROCID.

Similarly, the SLURM_NPROCS and SLURM_NTASKS environment variables will be set to reflect a global task count (both environment variables will have the same value). SLURM_NTASKS will be set to the total count of tasks in all components. Note that the task rank and count values are needed by MPI and typically determined by examining Slurm environment variables.
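
The relationship between per-component task counts and the global rank and count values can be sketched as follows. This assumes ranks are assigned in component order, which matches the srun --label example earlier in this document but is otherwise an assumption; rank_ranges is a hypothetical helper name.

```shell
# rank_ranges NTASKS...
# Print the global rank range of each component, then the total task count.
rank_ranges() {
    first=0; comp=0
    for n in "$@"; do
        echo "component $comp: ranks $first-$(( first + n - 1 ))"
        first=$(( first + n ))
        comp=$(( comp + 1 ))
    done
    echo "total tasks: $first"
}

# One "master" task plus 2176 "slave" tasks, as in the example above.
rank_ranges 1 2176
```

For the single MPI_COMM_WORLD example, this gives rank 0 to "master", ranks 1-2176 to "slave", and a total of 2177 tasks.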

Limitations

The backfill scheduler has limitations in how it tracks usage of CPUs and memory in the future. This typically requires the backfill scheduler be able to allocate each component of a heterogeneous job on a different node in order to begin its resource allocation, even if multiple components of the job do actually get allocated resources on the same node.

In a federation of clusters, a heterogeneous job will execute entirely on the cluster from which the job is submitted. The heterogeneous job will not be eligible to migrate between clusters or to have different components of the job execute on different clusters in the federation.

Job arrays of heterogeneous jobs are not supported.

The srun command's --no-allocate option is not supported for heterogeneous jobs.

Only one job step per heterogeneous job component can be launched by a single srun command (e.g. "srun --pack-group=0 alpha : --pack-group=0 beta" is not supported).

The sattach command can only be used to attach to a single component of a heterogeneous job at a time.

Heterogeneous jobs are only scheduled by the backfill scheduler plugin. The more frequently executed scheduling logic only starts jobs on a first-in first-out (FIFO) basis and lacks logic for concurrently scheduling all components of a heterogeneous job.

Heterogeneous jobs are not supported with Slurm's select/serial plugin.

Heterogeneous jobs are not supported on Cray ALPS systems.

Heterogeneous jobs are not supported on IBM PE systems.

Slurm's PERL APIs currently do not support heterogeneous jobs.

The srun --multi-prog option cannot be used to span more than one heterogeneous job component.

The srun --open-mode option is by default set to "append".

Ancient versions of OpenMPI and their derivatives (e.g. Cray MPI) are dependent upon communication ports being assigned to them by Slurm. Such MPI jobs will experience step launch failure if any component of a heterogeneous job step is unable to acquire the allocated ports. Non-heterogeneous job steps will retry step launch using a new set of communication ports (no change in Slurm behavior).

System Administrator Information

The job submit plugin is invoked independently for each component of a heterogeneous job.

The spank_init_post_opt() function is invoked once for each component of a heterogeneous job. This permits site defined options on a per job component basis.

Scheduling of heterogeneous jobs is performed only by the sched/backfill plugin and all heterogeneous job components are either all scheduled at the same time or deferred. In order to ensure the timely initiation of both heterogeneous and non-heterogeneous jobs, the backfill scheduler alternates between two different modes on each iteration. In the first mode, if a heterogeneous job component cannot be initiated immediately, its expected start time is recorded and all subsequent components of that job will be considered for starting no earlier than the latest component's expected start time. In the second mode, all heterogeneous job components will be considered for starting no earlier than the latest component's expected start time. After completion of the second mode, all heterogeneous job expected start time data is cleared and the first mode will be used in the next backfill scheduler iteration. Regular (non-heterogeneous) jobs are scheduled independently on each iteration of the backfill scheduler.

For example, consider a heterogeneous job with three components. When considered as independent jobs, the components could be initiated at times now (component 0), now plus 2 hours (component 1), and now plus 1 hour (component 2). When the backfill scheduler runs in the first mode:

  1. Component 0 will be noted to be possible to start now, but not initiated due to the additional components to be initiated
  2. Component 1 will be noted to be possible to start in 2 hours
  3. Component 2 will not be considered for scheduling until 2 hours in the future, which leaves some additional resources available for scheduling to other jobs

When the backfill scheduler executes next, it will use the second mode and (assuming no other state changes) all three job components will be considered available for scheduling no earlier than 2 hours in the future, which may allow other jobs to be allocated resources before heterogeneous job component 0 could be initiated.

The heterogeneous job start time data will be cleared before the first mode is used in the next iteration in order to consider system status changes which might permit the heterogeneous job to be initiated at an earlier time than previously determined.
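
The start time used in the second mode of the three-component example is simply the latest of the per-component expected start times. The sketch below illustrates only that calculation, not the backfill scheduler's actual logic; latest_start is a hypothetical helper name and times are given as whole hours from now.

```shell
# latest_start HOURS...
# Print the largest expected start offset (in hours from now).
latest_start() {
    max=0
    for t in "$@"; do
        if [ "$t" -gt "$max" ]; then
            max=$t
        fi
    done
    echo "$max"
}

# Components 0, 1 and 2 could start now, in 2 hours and in 1 hour.
latest_start 0 2 1   # prints 2
```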

A resource limit test is performed when a heterogeneous job is submitted in order to immediately reject jobs that will not be able to start with current limits. The individual components of the heterogeneous job are validated, like all regular jobs. The heterogeneous job as a whole is also tested, but in a more limited fashion with respect to quality of service (QOS) limits. This is due to the complexity of each job component having up to three sets of limits (association, job QOS and partition QOS). Note that successful submission of any job (heterogeneous or otherwise) does not ensure the job will be able to start without exceeding some limit. For example a job's CPU limit test does not consider that CPUs might not be allocated individually, but resource allocations might be performed by whole core, socket or node. Each component of a heterogeneous job counts as a "job" with respect to resource limits.

For example, a user might have a limit of 2 concurrent running jobs and submit a heterogeneous job with 3 components. Such a situation will have an adverse effect upon scheduling other jobs, especially other heterogeneous jobs.

Last modified 21 November 2017