Slurm can be configured to support topology-aware resource allocation to optimize job performance. Slurm supports several modes of operation: one optimizes performance on systems with a three-dimensional torus interconnect, and another is for hierarchical interconnects. The hierarchical mode supports both fat-tree and dragonfly networks, using slightly different algorithms.
Slurm's native mode of resource selection is to consider the nodes as a one-dimensional array. Jobs are allocated resources on a best-fit basis. For larger jobs, this minimizes the number of sets of consecutive nodes allocated to the job.
Some larger computers rely upon a three-dimensional torus interconnect. The Cray XT and XE systems also have three-dimensional torus interconnects, but do not require that jobs execute on adjacent nodes. On those systems, Slurm only needs to allocate resources to a job that are nearby on the network. Slurm accomplishes this by using a Hilbert curve to map the nodes from a three-dimensional space into a one-dimensional space. Slurm's native best-fit algorithm is thus able to achieve a high degree of locality for jobs.
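Slurm's Hilbert-curve mapping is implemented internally in C; as an illustrative sketch only, the classic two-dimensional version of the mapping can be written as follows (Slurm applies the three-dimensional analogue; the function names here are not Slurm's):

```python
def rot(n, x, y, rx, ry):
    """Rotate/flip a quadrant so each sub-curve is traversed in the right orientation."""
    if ry == 0:
        if rx == 1:
            x = n - 1 - x
            y = n - 1 - y
        x, y = y, x
    return x, y


def xy2d(n, x, y):
    """Map point (x, y) on an n-by-n grid (n a power of two) to its distance d
    along the Hilbert curve, so that points with nearby d values are also
    nearby in the grid."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)   # offset of this quadrant in curve order
        x, y = rot(n, x, y, rx, ry)    # reorient coordinates for the next level
        s //= 2
    return d


# Sorting grid points by Hilbert distance yields a one-dimensional ordering in
# which consecutive points are always adjacent in the two-dimensional grid, so
# a best-fit scan over the 1-D ordering preserves network locality.
order = sorted(range(16), key=lambda i: xy2d(4, i % 4, i // 4))
```

Running a best-fit allocation over such an ordering is what lets Slurm's one-dimensional algorithm produce compact allocations on a torus.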
Slurm can also be configured to allocate resources to jobs on a hierarchical network to minimize network contention. The basic algorithm is to identify the lowest level switch in the hierarchy that can satisfy a job's request and then allocate resources on its underlying leaf switches using a best-fit algorithm. Use of this logic requires a configuration setting of TopologyPlugin=topology/tree.
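A minimal sketch of the corresponding slurm.conf setting (the plugin name is from the text above; any surrounding settings are site-specific):

```
# slurm.conf (excerpt)
TopologyPlugin=topology/tree
```

With this set, slurmctld reads the switch hierarchy from topology.conf, described below.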
Note that Slurm uses a best-fit algorithm on the currently available resources. This may result in an allocation with more than the optimum number of switches. The user can request a maximum number of leaf switches for the job, as well as the maximum time the job is willing to wait for that number, using the --switches option with the salloc, sbatch and srun commands. The parameters can also be changed for pending jobs using the scontrol and squeue commands.
At some point in the future, Slurm code may be provided to gather network topology information directly. For now, the network topology information must be included in a topology.conf configuration file, as shown in the examples below. The first example describes a three-level switch hierarchy in which each switch has two children. Note that the SwitchName values are arbitrary and used only for bookkeeping purposes, but a name must be specified on each line. The leaf switch descriptions contain a SwitchName field plus a Nodes field to identify the nodes connected to the switch. Higher-level switch descriptions contain a SwitchName field plus a Switches field to identify the child switches. Slurm's hostlist expression parser is used, so the node and switch names need not be consecutive (e.g. "Nodes=tux[0-3,12,18-20]" and "Switches=s[0-2,4-8,12]" will parse fine).
An optional LinkSpeed option can be used to indicate the relative performance of the link. The units used are arbitrary and this information is currently not used. It may be used in the future to optimize resource allocations.
The first example shows what the topology would look like for an eight-node cluster in which all switches have only two children, as shown in the diagram (not a very realistic configuration, but useful for an example).
# topology.conf
# Switch Configuration
SwitchName=s0 Nodes=tux[0-1]
SwitchName=s1 Nodes=tux[2-3]
SwitchName=s2 Nodes=tux[4-5]
SwitchName=s3 Nodes=tux[6-7]
SwitchName=s4 Switches=s[0-1]
SwitchName=s5 Switches=s[2-3]
SwitchName=s6 Switches=s[4-5]
The next example describes a network with two levels in which each switch has four connections.
# topology.conf
# Switch Configuration
SwitchName=s0 Nodes=tux[0-3] LinkSpeed=900
SwitchName=s1 Nodes=tux[4-7] LinkSpeed=900
SwitchName=s2 Nodes=tux[8-11] LinkSpeed=900
SwitchName=s3 Nodes=tux[12-15] LinkSpeed=1800
SwitchName=s4 Switches=s[0-3] LinkSpeed=1800
SwitchName=s5 Switches=s[0-3] LinkSpeed=1800
SwitchName=s6 Switches=s[0-3] LinkSpeed=1800
SwitchName=s7 Switches=s[0-3] LinkSpeed=1800
As a practical matter, listing every switch connection results in a slower scheduling algorithm for Slurm when optimizing job placement, while application performance may gain little from such optimization. Listing the leaf switches with their nodes plus one top-level switch should result in good performance for both applications and Slurm. The previous example might be configured as follows:
# topology.conf
# Switch Configuration
SwitchName=s0 Nodes=tux[0-3]
SwitchName=s1 Nodes=tux[4-7]
SwitchName=s2 Nodes=tux[8-11]
SwitchName=s3 Nodes=tux[12-15]
SwitchName=s4 Switches=s[0-3]
Note that compute nodes on switches that lack a common parent switch can be used, but no job will span leaf switches without a common parent (unless the TopologyParam=TopoOptional option is used). For example, it is legal to remove the line "SwitchName=s4 Switches=s[0-3]" from the above topology.conf file. In that case, no job will span more than the four compute nodes connected to a single leaf switch. This configuration can be useful if one wants to schedule multiple physical clusters as a single logical cluster under the control of a single slurmctld daemon.
If you have nodes that are in separate networks and are associated with unique switches in your topology.conf file, it is possible to get into a situation where a job is unable to run. If a job requests nodes in different networks, either by requesting the nodes directly or by requesting a feature, the job will fail because the requested nodes cannot communicate with each other. We recommend placing nodes in separate network segments in disjoint partitions.
For systems with a dragonfly network, configure Slurm with TopologyPlugin=topology/tree plus TopologyParam=dragonfly. If a single job cannot be entirely placed within one network leaf switch, the job will be spread across as many leaf switches as possible in order to optimize the job's network bandwidth.
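A sketch of the dragonfly configuration described above, again as a slurm.conf excerpt (site-specific settings omitted):

```
# slurm.conf (excerpt) for a dragonfly network
TopologyPlugin=topology/tree
TopologyParam=dragonfly
```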
NOTE: When using the topology/tree plugin, Slurm identifies the network switches which provide the best fit for pending jobs. If nodes have a Weight defined, this will override the resource selection based on network topology. If optimizing resource selection by node weight is more important than optimizing network topology then do NOT use the topology/tree plugin.
NOTE: The topology.conf file for an Infiniband switch can be automatically generated using the slurmibtopology tool found here:
NOTE: The topology.conf file for an Omni-Path (OPA) switch can be automatically generated using the opa2slurm tool found here:
For use with the topology/tree plugin, users can also specify the maximum number of leaf switches to be used for their job, along with the maximum time the job should wait for this optimized configuration. The syntax for this option is "--switches=count[@time]". The system administrator can limit the maximum time that any job can wait for this optimized configuration using the SchedulerParameters configuration parameter with the max_switch_wait option.
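For example, a job submission using this option might look as follows (the job script name is hypothetical; max_switch_wait is in seconds):

```
# Request an allocation on at most 2 leaf switches, waiting up to
# 60 minutes for such a placement to become available:
sbatch --switches=2@60 my_job.sh

# Administrator-side cap on that wait, in slurm.conf:
SchedulerParameters=max_switch_wait=600
```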
If the topology/tree plugin is used, two environment variables will be set to describe that job's network topology. Note that these environment variables will contain different data for the tasks launched on each node. Use of these environment variables is at the discretion of the user.
SLURM_TOPOLOGY_ADDR: The value will be set to the names of the network switches that may be involved in the job's communications, from the system's top-level switch down to the leaf switch, ending with the node name. A period is used to separate each hardware component name.
SLURM_TOPOLOGY_ADDR_PATTERN: This is set only if the system has the topology/tree plugin configured. The value will be set to the component types listed in SLURM_TOPOLOGY_ADDR. Each component will be identified as either "switch" or "node". A period is used to separate each hardware component type.
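For the eight-node example topology shown earlier (tux0 under leaf switch s0, under s4, under top-level switch s6), a task launched on tux0 would see values along these lines (illustrative):

```
SLURM_TOPOLOGY_ADDR=s6.s4.s0.tux0
SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.switch.node
```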
Last modified 28 June 2023