Slurm User and Administrator Guide for Cray Systems using ALPS

User Guide

If you have a Cray XC you might want to consider running Slurm natively.

This document describes the unique features of Slurm on Cray computers on top of ALPS. You should be familiar with Slurm's mode of operation on Linux clusters before studying the differences in Cray system operation described in this document.

Slurm is designed to operate as a job scheduler over Cray's Application Level Placement Scheduler (ALPS). Use Slurm's sbatch or salloc commands to create a resource allocation in ALPS, then use ALPS' aprun command to launch parallel jobs within that allocation. The resource allocation is terminated once the batch script or the salloc command terminates. Since aprun is used to launch tasks (the equivalent of a Slurm job step), the job steps will not be visible using Slurm commands. Other than Slurm's srun command being replaced by aprun and the job steps not being visible, all other Slurm commands will operate as expected.

Slurm includes a launch/aprun plugin that allows users to use srun to wrap aprun and translate srun options into the equivalent aprun options. Not all srun options can be translated, so some options are not available. The srun option --launcher-opts= can be used to specify aprun options which lack an equivalent within srun, for example: srun --launcher-opts="-a xt" -n 4 a.out. srun can also print the translated aprun command line with the --launch-cmd option.

Node naming and node geometry on Cray XT/XE systems

Slurm node names will be of the form "nid#####" where "#####" is a five-digit sequence number. Other information available about a node is its XYZ coordinate in the node's NodeAddr field and its component label in the NodeHostName field. The format of the component label is "c#-#c#s#n#" where the "#" fields represent in order: cabinet, row, cage, blade or slot, and node. For example "c0-1c2s5n3" is cabinet 0, row 1, cage 2, slot 5 and node 3.
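
The component label format described above can be split mechanically. The following is a minimal bash sketch; parse_component_label is a hypothetical helper name used for illustration and is not part of Slurm or ALPS.

```shell
#!/bin/bash
# Split a Cray component label of the form c#-#c#s#n# into its fields.
# parse_component_label is a hypothetical helper, not a Slurm or ALPS command.
parse_component_label() {
    local label=$1
    if [[ $label =~ ^c([0-9]+)-([0-9]+)c([0-9]+)s([0-9]+)n([0-9]+)$ ]]; then
        echo "cabinet=${BASH_REMATCH[1]} row=${BASH_REMATCH[2]} cage=${BASH_REMATCH[3]} slot=${BASH_REMATCH[4]} node=${BASH_REMATCH[5]}"
    else
        echo "not a component label: $label" >&2
        return 1
    fi
}

# The example label from the text: cabinet 0, row 1, cage 2, slot 5, node 3.
parse_component_label c0-1c2s5n3
```

Anything that does not match the pattern, such as a "nid#####" node name, is rejected with a non-zero return code.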

Cray XT/XE systems come with a 3D torus by default. On smaller systems the cabling in the X dimension is omitted, resulting in a two-dimensional torus (1 x Y x Z). On Gemini/XE systems, pairs of adjacent nodes (nodes 0/1 and 2/3 on each blade) share one network interface each. This causes the same Y coordinate to be assigned to those nodes, so that the number of distinct torus coordinates is half the number of total nodes.

The Slurm smap and sview tools can visualize node torus positions. Clicking on a particular node shows its NodeAddr field, which is its (X,Y,Z) torus coordinate base-36 encoded as a 3-character string. For example, a NodeAddr of '07A' corresponds to the coordinates X = 0, Y = 7, Z = 10. The NodeAddr of a node can also be shown using 'scontrol show node nid#####'.
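
The base-36 decoding of a NodeAddr can be reproduced with bash arithmetic. This is a small sketch only; decode_nodeaddr is a hypothetical helper name and not a Slurm command.

```shell
#!/bin/bash
# Decode a 3-character base-36 NodeAddr into X, Y, Z torus coordinates.
# decode_nodeaddr is a hypothetical helper name.
decode_nodeaddr() {
    local addr=$1
    if [[ ! $addr =~ ^[0-9A-Za-z]{3}$ ]]; then
        echo "expected three base-36 digits, got: $addr" >&2
        return 1
    fi
    # Bash arithmetic accepts base#digits; in base 36 the letters stand for 10..35.
    echo "$((36#${addr:0:1})) $((36#${addr:1:1})) $((36#${addr:2:1}))"
}

# The example from the text: '07A' is X = 0, Y = 7, Z = 10.
decode_nodeaddr 07A
```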

Please note that the sbatch/salloc options "--geometry" and "--no-rotate" are BlueGene-specific and have no impact on Cray systems. Topological node placement depends on what Cray makes available via the ALPS_NIDORDER configuration option (see below).

Specifying thread depth

For threaded applications, use the --cpus-per-task/-c parameter of sbatch/salloc to set the thread depth per task. This corresponds to mppdepth in PBS and to the aprun -d parameter. Please note that Slurm does not set the OMP_NUM_THREADS environment variable. Hence, if an application spawns 4 threads, an example script would look like:

 #SBATCH --comment="illustrate the use of thread depth and OMP_NUM_THREADS"
 #SBATCH --ntasks=3
 #SBATCH -c 4
 # Slurm does not set OMP_NUM_THREADS, so export it to match -c
 export OMP_NUM_THREADS=4
 aprun -n 3 -d $OMP_NUM_THREADS ./my_exe
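
As a variation, the thread count can be derived from the allocation instead of being written out twice. The sketch below assumes that sbatch exports SLURM_CPUS_PER_TASK when -c/--cpus-per-task is given; omp_threads is a hypothetical helper name, and the aprun line is only printed, not executed.

```shell
#!/bin/bash
# Assumption: sbatch exports SLURM_CPUS_PER_TASK when -c/--cpus-per-task is given.
# Fall back to 1 thread when the variable is unset or empty.
omp_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

export OMP_NUM_THREADS=$(omp_threads)
# Print the launch line that a batch script would run.
echo "aprun -n 3 -d $OMP_NUM_THREADS ./my_exe"
```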

Specifying number of tasks per node

Slurm uses the same default as ALPS, assigning each task to a single core/CPU. In order to make more resources available per task, you can reduce the number of processing elements per node (aprun -N parameter, mppnppn in PBS) with the --ntasks-per-node option of sbatch/salloc. This is particularly necessary when tasks require more memory than the per-CPU default.

Specifying per-task memory

In Cray terminology, a task is also called a "processing element" (PE), hence below we refer to the per-task memory and "per-PE" memory interchangeably. The per-PE memory requested through the batch system corresponds to the aprun -m parameter.

Due to the implicit default assumption that 1 task runs per core/CPU, the default memory available per task is the per-CPU share of node_memory / number_of_cores. For example, on an XT5 system with 16000MB per 12-core node, the per-CPU share is 1333MB.

If nothing else is specified, the --mem option to sbatch/salloc can only be used to reduce the per-PE memory below the per-CPU share. This is also the only way that the --mem-per-cpu option can be applied (besides, the --mem-per-cpu option is ignored if the user forgets to set --ntasks/-n). Thus, the preferred way of specifying memory is the more general --mem option.

Increasing the per-PE memory that can be set via the --mem option requires making more per-task resources available using the --ntasks-per-node option to sbatch/salloc. This allows --mem to request up to node_memory / ntasks_per_node megabytes.

When --ntasks-per-node is 1, the entire node memory may be requested by the application. Setting --ntasks-per-node to the number of cores per node yields the default per-CPU share, which is the minimum value.

For all cases in between these extremes, set --mem=per_task_memory or --mem-per-cpu=memory_per_cpu (node CPU count and task count may differ) and

   --ntasks-per-node=floor(node_memory / per_task_memory)

whenever per_task_memory needs to be larger than the per-CPU share.

Example: An application with 64 tasks needs 7500MB per task on a cluster with 32000MB and 24 cores per node. Hence ntasks_per_node = floor(32000/7500) = 4.

    #SBATCH --comment="requesting 7500MB per task on 32000MB/24-core nodes"
    #SBATCH --ntasks=64
    #SBATCH --ntasks-per-node=4
    #SBATCH --mem=30000
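
The arithmetic behind this example can be checked with bash integer division. This is a sketch only: tasks_per_node and nodes_needed are hypothetical helper names, and the node count is an extra illustration that does not appear in the example above.

```shell
#!/bin/bash
# Integer division in bash truncates, which equals floor() for positive values.
tasks_per_node() {   # usage: tasks_per_node NODE_MEMORY_MB PER_TASK_MEMORY_MB
    echo $(( $1 / $2 ))
}
nodes_needed() {     # usage: nodes_needed NTASKS TASKS_PER_NODE (rounds up)
    echo $(( ($1 + $2 - 1) / $2 ))
}

# 64 tasks of 7500MB each on nodes with 32000MB of memory
tpn=$(tasks_per_node 32000 7500)
echo "--ntasks-per-node=$tpn, nodes needed for 64 tasks: $(nodes_needed 64 "$tpn")"
```

With 4 tasks per node, the 64 tasks spread over 16 nodes.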

If you would like to fine-tune the memory limit of your application, you can set the same parameters in a salloc session and then check directly, using

    apstat -rvv -R $BASIL_RESERVATION_ID

to see how much memory has been requested.

Using aprun -B

CLE 3.x allows a nice aprun shortcut via the -B option, which reuses all the batch system parameters (--ntasks, --ntasks-per-node, --cpus-per-task, --mem) at application launch, as if the corresponding (-n, -N, -d, -m) parameters had been set; see the aprun(1) manpage on CLE 3.x systems for details.

Node ordering options

Slurm honors the node ordering policy set for Cray's Application Level Placement Scheduler (ALPS). Node ordering is a configurable system option (ALPS_NIDORDER in /etc/sysconfig/alps). The current setting is reported by 'apstat -svv' (look for the line starting with "nid ordering option") and cannot be changed at runtime. The resulting effective node ordering is revealed by 'apstat -no' (if no special node ordering has been configured, 'apstat -no' shows the same order as 'apstat -n').

Slurm uses exactly the same order as 'apstat -no' when selecting nodes for a job. With the --contiguous option to sbatch/salloc you can request a contiguous (relative to the current ALPS nid ordering) set of nodes. Note that on a busy system there is typically more fragmentation, hence it may take longer (or even prove impossible) to allocate contiguous sets of a larger size.

Cray/ALPS node ordering is a topic of ongoing work. Some information can be found in the CUG-2010 paper "ALPS, Topology, and Performance" by Carl Albing and Mark Baker.

Other Command Differences

On Cray systems, all signals sent to the job using the scancel command, except SIGCHLD, SIGCONT, SIGSTOP, SIGTSTP, SIGTTIN, SIGTTOU, SIGURG, or SIGWINCH, cause the ALPS reservation to be released. The job, however, will not be terminated except in the case of SIGKILL, and may then be used for post-processing.
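
The exception list can be captured in a small lookup, which may be handy in wrapper scripts that send signals with scancel. A minimal sketch follows; releases_reservation is a hypothetical helper name, and it only encodes the list given above.

```shell
#!/bin/bash
# Report whether a signal sent with scancel releases the ALPS reservation,
# following the exception list in the text. releases_reservation is a
# hypothetical helper; names are accepted with or without the SIG prefix.
releases_reservation() {
    case "${1#SIG}" in
        CHLD|CONT|STOP|TSTP|TTIN|TTOU|URG|WINCH) return 1 ;;
        *) return 0 ;;
    esac
}

for sig in SIGUSR1 SIGCONT SIGKILL; do
    if releases_reservation "$sig"; then
        echo "$sig: reservation released"
    else
        echo "$sig: reservation kept"
    fi
done
```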

The sstat command is not supported on Cray systems.


Users may specify GPU memory required per node using the --gres=gpu_mem:# option to any of the commands used to create a job allocation/reservation.

Front-End Node Use

If you want to be allocated resources on a front-end node and no compute nodes (typically used for pre- or post-processing functionality) then submit a batch job with a node count specification of zero.

sbatch -N0 pre_process.bash

Note: Job allocations with zero compute nodes can only be made in Slurm partitions explicitly configured with MinNodes=0 (the default minimum node count for a partition is one compute node).

External Node Use

Slurm interactive jobs are not supported from external nodes. However, batch job submissions and all of the other commands will work. If desired, the sbatch command can be used to submit a batch job that creates an xterm on an external node.

srun options translations on a Cray

The following srun options are translated to these aprun options. srun options not listed below have no equivalent aprun translation and while the option can be used for the allocation within Slurm it will not be propagated to aprun.

srun option              aprun option
-c, --cpus-per-task      -d
-s, --oversubscribe      -F share
--exclusive              -F exclusive
-w, --nodelist           -L
-n, --ntasks             -n
-Q, --quiet              -q
-t, --time               -t

The following aprun options have no equivalent in srun and must be specified by using the srun --launcher-opts option: -a, -b, -B, -cc, -f, -r, and -sl.
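
A wrapper script can test for these pass-through options before building an srun command line. The sketch below is illustrative only: needs_launcher_opts is a hypothetical helper name, and the srun line is printed rather than executed.

```shell
#!/bin/bash
# Succeeds if the given aprun option has no srun equivalent and therefore
# has to be passed through --launcher-opts (list taken from the text above).
# needs_launcher_opts is a hypothetical helper name.
needs_launcher_opts() {
    case "$1" in
        -a|-b|-B|-cc|-f|-r|-sl) return 0 ;;
        *) return 1 ;;
    esac
}

# Build (but do not run) an srun command line that passes "-a xt" through.
opt="-a"
if needs_launcher_opts "$opt"; then
    echo "srun --launcher-opts=\"$opt xt\" -n 4 a.out"
fi
```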

Administrator Guide

Install supporting RPMs

The build requires a few -devel RPMs listed below. You can obtain these from SUSE/Novell.

  • CLE 2.x uses SUSE SLES 10 packages (RPMs may be on the normal ISOs)
  • CLE 3.x uses SUSE SLES 11 packages (RPMs are on the SDK ISOs; there are two SDK ISO files)

You can check by logging onto the boot node and running

boot: # xtopview
default: # rpm -qa

The list of packages that should be installed is:

  • mysql-devel (this should be on the Cray ISO)

All Cray-specific PrgEnv and compiler modules should be removed and root privileges will be required to install these files.

Install Munge

Note that the Munge installation process on Cray systems differs somewhat from that described in the MUNGE Installation Guide.

Munge is the authentication daemon that Slurm requires. You can get Munge RPMs from Cray. Use the method below to install and test it. The Cray Munge RPM installs Munge in /opt/munge.

If needed copy the RPMs over to the boot node

login: # scp munge-*.rpm root@boot:/rr/current/software

Install the RPMs on the boot node. While this process creates a Munge key, it can't use the /etc/munge directory. So we make a /opt/munge/key directory instead and create a key there.

boot: # xtopview
default: # rpm -ivh /software/munge-*.x86_64.rpm
default: # mkdir /opt/munge/key
default: # dd if=/dev/urandom bs=1 count=1024 >/opt/munge/key/munge.key
default: # chmod go-rxw /opt/munge/key/munge.key
default: # chown daemon /opt/munge/key/munge.key
default: # perl -pi -e 's/#DAEMON_ARGS=/DAEMON_ARGS="--key-file \/opt\/munge\/key\/munge.key"/g' /etc/init.d/munge
default: # exit

Configure Munge

The following steps apply to each login node and the sdb, where

  • The slurmd or slurmctld daemon will run and/or
  • Users will be submitting jobs

sdb: # mkdir --mode=0711 -vp /var/lib/munge
sdb: # mkdir --mode=0700 -vp /var/log/munge
sdb: # mkdir --mode=0755 -vp /var/run/munge
sdb: # chown daemon /var/lib/munge
sdb: # chown daemon /var/log/munge
sdb: # chown daemon /var/run/munge
sdb: # /etc/init.d/munge start

With the Munge daemon started, test it.

login: # export PATH=/opt/munge/bin:$PATH
login: # munge -n
login: # munge -n | unmunge

When done, verify network connectivity by executing the following (the munged daemon must be started on the other-login-host as well):

  • munge -n | ssh other-login-host /opt/munge/bin/unmunge

Enable the Cray job service

This is a common dependency on Cray systems. ALPS relies on the Cray job service to generate cluster-unique job container IDs (PAGG IDs). These identifiers are used by ALPS to track running (aprun) job steps. The default (session IDs) is not unique across multiple login nodes. This standard procedure is described in chapter 9 of S-2393 and takes only two steps, both to be done on all 'login' class nodes (xtopview -c login):

  • make sure that the /etc/init.d/job service is enabled (chkconfig) and started
  • enable the module from /opt/cray/job/default in /etc/pam.d/common-session
    (NB: the default is very verbose, a simpler and quieter variant is provided in contribs/cray.)

The latter step is required only if you would like to run interactive salloc sessions.

boot: # xtopview -c login
login: # chkconfig job on
login: # emacs -nw /etc/pam.d/common-session
(uncomment/add the line)
session optional /opt/cray/job/default/lib64/security/
login: # exit
boot: # xtopview -n 31
node/31:# chkconfig job on
node/31:# emacs -nw /etc/pam.d/common-session
(uncomment/add the line as shown above)

Install and Configure Slurm

Slurm can be built and installed as on any other computer, as described in the Quick Start Administrator Guide. You can also get current Slurm RPMs from Cray. An installation process for the RPMs is described below. The Cray Slurm RPMs install in /opt/slurm.

NOTE: By default neither the salloc command nor the srun command can be executed as a background process. This is done for two reasons:

  1. Only one ALPS reservation can be created from each session ID. The salloc command cannot change its session ID without disconnecting itself from the terminal and its parent process, meaning the process could not later be put into the foreground or easily identified.
  2. To better identify every process spawned under the salloc process using terminal foreground process group IDs.

You can optionally enable salloc and srun to execute as background processes by using the configure option "--enable-salloc-background" (or the .rpmmacros option "%_with_salloc_background 1"). However, doing so will result in failed resource allocations (error: Failed to allocate resources: Requested reservation is in use) if the commands are not executed sequentially, and it increases the likelihood of orphaned processes. Specifically request this version when requesting RPMs from Cray, as it is not on by default.

If needed copy the RPMs over to the boot node.

login: # scp slurm-*.rpm root@boot:/rr/current/software

Install the RPMs on the boot node.

boot: # xtopview
default: # rpm -ivh /software/slurm-*.x86_64.rpm
(edit /etc/slurm/slurm.conf and /etc/slurm/cray.conf)
default: # exit

When building Slurm's slurm.conf configuration file, use the NodeName parameter to specify all batch nodes to be scheduled. If nodes are defined in ALPS, but not defined in the slurm.conf file, a complete list of all batch nodes configured in ALPS will be logged by the slurmctld daemon when it starts. One would typically use this information to modify the slurm.conf file and restart the slurmctld daemon. Note that the NodeAddr and NodeHostName fields should not be configured, but will be set by Slurm using data from ALPS. NodeAddr will be set to the node's XYZ coordinate and be used by Slurm's smap and sview commands. NodeHostName will be set to the node's component label. The format of the component label is "c#-#c#s#n#" where the "#" fields represent in order: cabinet, row, cage, blade or slot, and node. For example "c0-1c2s5n3" is cabinet 0, row 1, cage 2, slot 5 and node 3.

The slurmd daemons will not execute on the compute nodes, but will execute on one or more front end nodes. It is from here that batch scripts will execute aprun commands to launch tasks. This is specified in the slurm.conf file by using the FrontendName and optionally the FrontendAddr fields as seen in the examples below.

Note that Slurm will by default kill running jobs when a node goes DOWN, while a DOWN node in ALPS only prevents new jobs from being scheduled on the node. To help avoid confusion, we recommend that SlurmdTimeout in the slurm.conf file be set to the same value as the suspectend parameter in ALPS' nodehealth.conf file.

You need to specify the appropriate resource selection plugin (the SelectType option in Slurm's slurm.conf configuration file). Configure SelectType to select/alps. The select/alps plugin provides an interface to ALPS and issues calls to the select/linear plugin, which selects resources for jobs using a best-fit algorithm to allocate whole nodes to jobs (rather than individual sockets, cores or threads). In versions of Slurm before 14.03, use select/cray. In 14.03 the plugin name was changed to select/alps to allow for Native Slurm on a Cray.

If you are experiencing performance issues with many jobs, consider setting SchedulerParameters=inventory_interval=# in slurm.conf. On a Cray system using Slurm on top of ALPS, this limits how often a BASIL inventory call is made. Normally this call happens each time scheduling is considered, in an attempt to close the window in which Slurm's view of a node's state can differ from what ALPS has. The call is rather slow, so making it less frequently improves performance dramatically, but when a node changes state the window can be as large as this setting. In an HTC environment this setting is a must, and we advise a value of around 10 seconds.

Note that the system topology is based upon information gathered from the ALPS database and upon the ALPS_NIDORDER configuration in /etc/sysconfig/alps. Excerpts of a slurm.conf file for use on a Cray system follow:

# Slurm USER
# Slurm user on cray systems must be root
# This requirement derives from Cray ALPS:
# - ALPS reservations can only be created by the job owner or root
#   (confirmation may be done by other non-privileged users)
# - Freeing a reservation always requires root privileges

# Network topology (handled internally by ALPS)

# Scheduling

# Node selection: use the special-purpose "select/alps" plugin.
# Internally this uses select/linear, i.e. nodes are always allocated
# in units of nodes (other allocation is currently not possible, since
# ALPS does not yet allow to run more than 1 executable on the same
# node, see aprun(1), section LIMITATIONS).
# Add CR_memory as parameter to support --mem/--mem-per-cpu.
# GPU memory allocation supported as generic resource.
# NOTE: No gres/gpu_mem plugin is required, only generic Slurm GRES logic.

# Proctrack plugin: only/default option is proctrack/sgi_job
# ALPS requires cluster-unique job container IDs and thus the /etc/init.d/job
# service needs to be started on all slurmd and login nodes, as described in
# S-2393, chapter 9. Due to this requirement, ProctrackType=proctrack/sgi_job
# is the default on Cray and need not be specified explicitly.

# slurmd spool directories (using %n for Slurm front end node name)

# main logfile
# slurmd logfiles (using %n for Slurm node name)

# PIDs

# Return DOWN nodes to service when e.g. slurmd has been unresponsive

# Configure the suspectend parameter in ALPS' nodehealth.conf file to the same
# value as SlurmdTimeout for consistent behavior (e.g. "suspectend: 600")

# Controls how a node's configuration specifications in slurm.conf are
# used.
# 0 - use hardware configuration (must agree with slurm.conf)
# 1 - use slurm.conf, nodes with fewer resources are marked DOWN
# 2 - use slurm.conf, but do not mark nodes down as in (1)

# Per-node configuration for PALU AMD G34 dual-socket "Magny Cours"
# Compute Nodes. We deviate from slurm's idea of a physical socket
# here, since the Magny Cours hosts two NUMA nodes each, which is
# also visible in the ALPS inventory (4 Segments per node, each
# containing 6 'Processors'/Cores).
# Also specify that 2 GB of GPU memory is available on every node
NodeName=DEFAULT Sockets=4 CoresPerSocket=6 ThreadsPerCore=1
NodeName=DEFAULT RealMemory=32000 State=UNKNOWN
NodeName=DEFAULT Gres=gpu_mem:2g

# List the nodes of the compute partition below (service nodes are not
# allowed to appear)

# Frontend nodes: these should not be available to user logins, but
#                 have all filesystems mounted that are also
#                 available on a login node (/scratch, /home, ...).

# Enforce the use of associations: {associations, limits, wckeys}

# Do not propagate any resource limits from the user's environment to
# the slurmd

# Resource limits for memory allocation:
# * the Def/Max 'PerCPU' and 'PerNode' variants are mutually exclusive;
# * use the 'PerNode' variant for both default and maximum value, since
#   - slurm will automatically adjust this value depending on
#     --ntasks-per-node
#   - if using a higher per-cpu value than possible, salloc will just
#     block.
# XXX replace both values below with your values from 'xtprocadmin -A'

# defaults common to all partitions
PartitionName=DEFAULT Nodes=nid00[002-013,018-159,162-173,178-189]
PartitionName=DEFAULT MaxNodes=178
PartitionName=DEFAULT OverSubscribe=EXCLUSIVE State=UP DefaultTime=60

# "User Support" partition with a higher priority
PartitionName=usup Hidden=YES PriorityTier=10 MaxTime=720 AllowGroups=staff

# normal partition available to all users
PartitionName=day Default=YES PriorityTier=1 MaxTime=01:00:00

Slurm supports an optional cray.conf file containing Cray-specific configuration parameters. This file is NOT needed for production systems, but is provided for advanced configurations. If used, cray.conf must be located in the same directory as the slurm.conf file. Configuration parameters supported by cray.conf are listed below.

  • Communication protocol version number to be used between Slurm and ALPS/BASIL. The default value is BASIL's response to the ENGINE query. Use with caution: Changes in ALPS communications which are not recognized by Slurm could result in loss of jobs. Currently supported values include 1.1, 1.2.0, 1.3.0, 3.1.0, 4.0, 4.1.0, 5.0.0, 5.0.1, 5.1.0 or "latest". A value of "latest" will use the most current version of Slurm's logic and can be useful for validation with new versions of ALPS.
  • Fully qualified pathname to the apbasil command. The default value is /usr/bin/apbasil.
  • Fully qualified pathname to the apkill command. The default value is /usr/bin/apkill.
  • Name of the ALPS database. The default value is XTAdmin.
  • Hostname of the database server. The default value is based upon the contents of the 'my.cnf' file used to store default database access information and that defaults to user 'sdb'.
  • Password used to access the ALPS database. The default value is based upon the contents of the 'my.cnf' file used to store default database access information and that defaults to user 'basic'.
  • Port used to access the ALPS database. The default value is 0.
  • Name of user used to access the ALPS database. The default value is based upon the contents of the 'my.cnf' file used to store default database access information and that defaults to user 'basic'.

# Example cray.conf file

One additional configuration script can be used to ensure that the slurmd daemons execute with the highest resource limits possible, overriding default limits on SUSE systems. Depending upon what resource limits are propagated from the user's environment, lower limits may apply to user jobs, but this script will ensure that higher limits are possible. Copy the file contribs/cray/etc_sysconfig_slurm into /etc/sysconfig/slurm for these limits to take effect. This script is executed from /etc/init.d/slurm, which is typically executed to start the Slurm daemons. An excerpt of contribs/cray/etc_sysconfig_slurm is shown below.

# /etc/sysconfig/slurm for Cray XT/XE systems
# Cray is SuSe-based, which means that ulimits from
# /etc/security/limits.conf will get picked up any time Slurm is
# restarted e.g. via pdsh/ssh. Since Slurm respects configured limits,
# this can mean that for instance batch jobs get killed as a result
# of configuring CPU time limits. Set sane start limits here.
# Values were taken from pam-1.1.2 Debian package
ulimit -t unlimited	# max amount of CPU time in seconds
ulimit -d unlimited	# max size of a process's data segment in KB

Slurm will ignore any interactive jobs or nodes in interactive mode, so set all your nodes to batch from any service node. Dropping the -n option will make all nodes batch.

# xtprocadmin -k m batch -n NODEIDS

Now create the needed directories for logs and state files, then start the daemons on the sdb and login nodes as shown below.

sdb: # mkdir -p /ufs/slurm/log
sdb: # mkdir -p /ufs/slurm/spool
sdb: # module load slurm
sdb: # /etc/init.d/slurm start
login: # /etc/init.d/slurm start

Cluster Compatibility Mode

It is possible to use Slurm to allocate resources for jobs that run in Cray's Cluster Compatibility Mode (CCM). In order to set this up, first install the CCM packages according to Cray documentation. As part of the CCM packages, a prologue and epilogue script will be installed. Add/change the following flags in slurm.conf to enable the scripts:


To run jobs in CCM, a Slurm partition should be created in slurm.conf. The name of that partition should be put into the CCM_QUEUES variable in this file: /etc/opt/cray/ccm/ccm.conf

Any job that requests resources from said partition will then run in Cluster Compatibility Mode.

launch/aprun plugin configuration

By default the launch plugin on a Cray is set to launch/aprun. Nothing extra is needed to enable it.

Node State

Slurm gets node state information from ALPS. Use the Cray xtprocadmin command to set node state up or down. If a node state is down in Slurm and setting it back up using the xtprocadmin command fails, it may be necessary to destroy the node_state* files in your StateSaveLocation directory (as configured in Slurm), which will remove all of Slurm's node state information and force Slurm to rely completely upon ALPS for all node state information. Stop the slurmctld daemon, delete the files, and restart the daemon.
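
The file deletion step of this recovery procedure might be scripted as follows. This is a sketch only: purge_node_state is a hypothetical helper name, the directory argument must be the StateSaveLocation from your slurm.conf, and stopping and restarting slurmctld around the call is left to the administrator.

```shell
#!/bin/bash
# Delete Slurm's saved node state so that slurmctld falls back to ALPS.
# purge_node_state is a hypothetical helper; pass it the StateSaveLocation
# directory from slurm.conf. Stop slurmctld before calling it and restart
# the daemon afterwards.
purge_node_state() {
    local dir=$1
    if [[ ! -d $dir ]]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    # Only node_state* files are removed; job and other state files stay.
    rm -f -- "$dir"/node_state*
}
```

Restricting the glob to node_state* leaves the remaining state files in StateSaveLocation in place.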

Last modified 31 March 2016