Select Plugin Design Guide

Overview

The select plugin is responsible for selecting compute resources to be allocated to a job, plus allocating and deallocating those resources. The select plugin is aware of the system's topology, based upon data structures established by the topology plugin. It can also over-subscribe resources to support gang scheduling (time slicing of parallel jobs), if so configured. The select plugin is also capable of communicating with an external entity to perform these actions (the select/bluegene plugin used on an IBM BlueGene/Q and the select/cray plugin used with Cray ALPS/BASIL software are two examples). Other architectures would rely upon either the select/linear or select/cons_res plugin. The select/linear plugin allocates whole nodes to jobs and is the simplest implementation. The select/cons_res plugin (cons_res is an abbreviation for consumable resources) can allocate individual sockets, cores, threads, or CPUs within a node. The select/cons_res plugin contains far more complex logic than select/linear and is slightly slower.

Mode of Operation

The select/linear and select/cons_res plugins have similar modes of operation. The obvious difference is that data structures in select/linear are node-centric, while those in select/cons_res contain information at a finer resolution (sockets, cores, threads, or CPUs depending upon the SelectTypeParameters configuration parameter). The description below is generic and applies to both plugin implementations. Note that both plugins are able to manage memory allocations. Both plugins are also able to manage generic resource (GRES) allocations, making use of the GRES plugins.

Per node data structures include memory (configured and allocated), GRES (configured and allocated, in a List data structure), plus a flag indicating if the node has been allocated using an exclusive option (preventing other jobs from being allocated resources on that same node). The other key data structure is used to enforce the per-partition OverSubscribe configuration parameter and tracks how many jobs have been allocated each compute resource (e.g. CPU) in each partition. This data structure is different between the plugins based upon the resolution of the resource allocation (e.g. nodes or CPUs).
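The data structures described above can be pictured with a minimal Python sketch. It is not the Slurm implementation (which is written in C): the class names, field names, and the max_share limit are hypothetical and chosen only to mirror the description, with CPU-level resolution assumed as in select/cons_res.

```python
from dataclasses import dataclass, field


@dataclass
class NodeRecord:
    """Hypothetical per-node record; names are illustrative, not Slurm's."""
    name: str
    cpus: int
    memory_configured: int                                 # MB configured on the node
    memory_allocated: int = 0                              # MB currently allocated to jobs
    gres_configured: dict = field(default_factory=dict)    # e.g. {"gpu": 4}
    gres_allocated: dict = field(default_factory=dict)
    exclusive: bool = False                                # True when a job holds the node exclusively


@dataclass
class PartitionUsage:
    """Hypothetical per-partition tracker: how many jobs hold each CPU."""
    max_share: int                                         # assumed limit derived from OverSubscribe
    jobs_per_cpu: dict = field(default_factory=dict)       # node name -> list of job counts

    def can_place(self, node: NodeRecord, cpu: int) -> bool:
        """A CPU is usable if the node is not exclusive and the share limit is not reached."""
        if node.exclusive:
            return False
        counts = self.jobs_per_cpu.setdefault(node.name, [0] * node.cpus)
        return counts[cpu] < self.max_share

    def place(self, node: NodeRecord, cpu: int) -> None:
        """Record one more job on the given CPU of the given node."""
        if not self.can_place(node, cpu):
            raise ValueError(f"CPU {cpu} on {node.name} is not available")
        self.jobs_per_cpu[node.name][cpu] += 1
```

A node-centric plugin such as select/linear would keep one counter per node instead of one per CPU; the rest of the bookkeeping would look the same in this sketch.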

Most of the logic in the select plugin is dedicated to identifying resources to be allocated to a new job. Input to that function includes: a pointer to the new job, a bitmap identifying nodes which could be used, node counts (minimum, maximum, and desired), a count of how many jobs of that partition the job can share resources with, and a list of jobs which can be preempted to initiate the new job. The first phase is to determine which of the usable nodes would best satisfy the resource requirement. This consists of a best-fit algorithm that groups nodes based upon network topology (if the topology/tree plugin is configured) or based upon consecutive nodes (by default). Once the best nodes are identified, resources are accumulated for the new job until its resource requirements are satisfied.
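The default behavior (grouping by consecutive nodes) can be illustrated with a simplified Python sketch. It assumes the usable-node bitmap is a list of booleans and that a request is expressed as a CPU count plus node limits; the function names and the "fewest nodes left over" tie-breaking rule are assumptions made for illustration, not the plugin's actual heuristics. Unlike the real plugins, this sketch does not combine several runs when no single run is large enough.

```python
def consecutive_runs(usable):
    """Return a list of index lists, one per run of consecutive usable nodes."""
    runs, current = [], []
    for index, ok in enumerate(usable):
        if ok:
            current.append(index)
        elif current:
            runs.append(current)
            current = []
    if current:
        runs.append(current)
    return runs


def select_nodes(usable, free_cpus, cpus_needed, min_nodes=1, max_nodes=None):
    """Pick nodes from the tightest-fitting run of consecutive usable nodes.

    Returns a list of node indices, or None if no single run can satisfy
    the request.
    """
    best = None
    for run in consecutive_runs(usable):
        chosen, cpus = [], 0
        # Accumulate resources node by node until the request is satisfied.
        for index in run:
            if max_nodes is not None and len(chosen) >= max_nodes:
                break
            if cpus >= cpus_needed and len(chosen) >= min_nodes:
                break
            chosen.append(index)
            cpus += free_cpus[index]
        if cpus < cpus_needed or len(chosen) < min_nodes:
            continue  # this run cannot satisfy the request
        # Best fit: prefer the run that leaves the fewest usable nodes behind.
        leftover = len(run) - len(chosen)
        if best is None or leftover < best[0]:
            best = (leftover, chosen)
    return None if best is None else best[1]
```

With nodes 0-3 and 5-6 usable and 4 free CPUs on each, a request for 8 CPUs is placed on nodes 5 and 6 in this sketch, because that run is consumed exactly and the larger run stays intact for later jobs.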

If the job cannot be started with currently available resources, the plugin will attempt to identify jobs which can be preempted in order to initiate the new job. A copy of the current system state will be created, including details about all resources and active jobs. Preemptable jobs will then be removed from this simulated system state until the new job can be initiated. When sufficient resources are available for the new job, only the jobs that actually need to be preempted for its initiation will be preempted (this may be a subset of the jobs whose preemption was simulated).
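The simulate-then-preempt sequence can be sketched in Python as follows. The representation is an assumption: the system state is reduced to a dictionary of free CPUs per node, each preemptable job is a (job id, allocation) pair, and the new job asks for a number of nodes with a given number of free CPUs each. The helper names are hypothetical, and the rule used to decide which jobs really need preempting (overlap with the nodes finally chosen) is a simplification.

```python
import copy


def pick_nodes(free_cpus, nodes_needed, cpus_per_node):
    """Return nodes_needed node names with enough free CPUs, or None."""
    candidates = sorted(n for n, free in free_cpus.items() if free >= cpus_per_node)
    return candidates[:nodes_needed] if len(candidates) >= nodes_needed else None


def jobs_to_preempt(free_cpus, preemptable, nodes_needed, cpus_per_node):
    """Return the ids of jobs that must be preempted, or None if the job cannot start.

    preemptable is a list of (job_id, {node: cpus}) in the order in which
    jobs should be considered for preemption.
    """
    sim = copy.deepcopy(free_cpus)       # simulated state; the real state is untouched
    removed = []
    for job_id, alloc in preemptable:
        if pick_nodes(sim, nodes_needed, cpus_per_node) is not None:
            break                        # the new job already fits
        for node, cpus in alloc.items():
            sim[node] += cpus            # pretend this job has been removed
        removed.append((job_id, alloc))
    chosen = pick_nodes(sim, nodes_needed, cpus_per_node)
    if chosen is None:
        return None
    # Only jobs overlapping the chosen nodes need to be preempted;
    # this may be a subset of the jobs removed in the simulation.
    return [job_id for job_id, alloc in removed
            if any(node in chosen for node in alloc)]
```

In the case where job A frees one node and job B then frees two, a two-node request can end up on B's nodes alone, so only B is returned even though A was also removed in the simulation.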

Other functions exist to support suspending jobs, resuming jobs, terminating jobs, expanding/shrinking job allocations, un/packing job state information, un/packing node state information, etc. The operation of those functions is relatively straightforward and not detailed here.

Operation on IBM BlueGene/Q Systems

On IBM BlueGene systems, Slurm's slurmd daemon executes on the front-end nodes rather than the compute nodes, and IBM provides a Bridge API to manage compute nodes and jobs. The IBM BlueGene systems also have very specific topology rules for what resources can be allocated to a job. Slurm's interface to IBM's Bridge API and the topology rules are found within the select/bluegene plugin, and very little BlueGene-specific logic in Slurm is found outside of that plugin. Note that the select/bluegene plugin is required for BlueGene/Q systems.

Operation on Cray Systems

The operation of the select/cray plugin is unique in that it does not directly select resources for a job, but uses the select/linear plugin for that purpose. It also interfaces with Cray's ALPS software using the BASIL interface or directly using the database. On Cray systems, Slurm's slurmd daemon executes on the front-end nodes rather than the compute nodes and ALPS is the mechanism available for Slurm to manage compute nodes and their jobs.

           -------------------
           |   select/cray   |
           -------------------
              |           |
-----------------   --------------
| select/linear |   | BASIL/ALPS |
-----------------   --------------
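
The layering in the diagram above can be sketched as a simple delegation pattern in Python. Both classes are hypothetical stand-ins, not Slurm or Cray APIs: LinearSelector plays the role of select/linear, and the reserve callable stands in for whatever request is sent through BASIL/ALPS.

```python
class LinearSelector:
    """Stand-in for whole-node selection (hypothetical, not the Slurm API)."""

    def __init__(self, free_nodes):
        self.free_nodes = list(free_nodes)

    def select(self, node_count):
        """Allocate node_count whole nodes, or return None if too few are free."""
        if node_count > len(self.free_nodes):
            return None
        chosen = self.free_nodes[:node_count]
        self.free_nodes = self.free_nodes[node_count:]
        return chosen


class CraySelector:
    """Wrapper that delegates selection and then notifies an external system."""

    def __init__(self, inner, reserve):
        self.inner = inner        # selection is delegated to this object
        self.reserve = reserve    # hypothetical callable standing in for BASIL/ALPS

    def select(self, node_count):
        nodes = self.inner.select(node_count)
        if nodes is not None:
            self.reserve(nodes)   # tell the external system which nodes were chosen
        return nodes
```

The wrapper itself holds no selection logic in this sketch, which mirrors the statement that select/cray does not directly select resources for a job.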

Last modified 31 March 2016