Multifactor Priority Plugin

Introduction

By default, Slurm assigns job priority on a First In, First Out (FIFO) basis. FIFO scheduling should be configured when Slurm is controlled by an external scheduler.

The PriorityType parameter in the slurm.conf file selects the priority plugin. The default value for this variable is "priority/basic" which enables simple FIFO scheduling. (See Configuration below)

The Multi-factor Job Priority plugin provides a very versatile facility for ordering the queue of jobs waiting to be scheduled.

Multi-factor 'Factors'

There are six factors in the Multi-factor Job Priority plugin that influence job priority:

Age
the length of time a job has been waiting in the queue, eligible to be scheduled
Fair-share
the difference between the portion of the computing resource that has been promised and the amount of resources that has been consumed
Job size
the number of nodes or CPUs a job is allocated
Partition
a factor associated with each node partition
QOS
a factor associated with each Quality Of Service
TRES
each TRES Type has its own factor for a job, which represents the amount of that TRES Type requested/allocated in a given partition

Additionally, a weight can be assigned to each of the above factors. This provides the ability to enact a policy that blends a combination of any of the above factors in any portion desired. For example, a site could configure fair-share to be the dominant factor (say 70%), set the job size and the age factors to each contribute 15%, and set the partition and QOS influences to zero.
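
For instance, the policy just described could be expressed with weight settings along the following lines. The absolute values are only illustrative; what matters is their relative size.

PriorityWeightFairshare=7000
PriorityWeightAge=1500
PriorityWeightJobSize=1500
PriorityWeightPartition=0
PriorityWeightQOS=0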

Job Priority Factors In General

The job's priority at any given time will be a weighted sum of all the factors that have been enabled in the slurm.conf file. Job priority can be expressed as:

Job_priority =
	(PriorityWeightAge) * (age_factor) +
	(PriorityWeightFairshare) * (fair-share_factor) +
	(PriorityWeightJobSize) * (job_size_factor) +
	(PriorityWeightPartition) * (partition_factor) +
	(PriorityWeightQOS) * (QOS_factor) +
	SUM(TRES_weight_cpu * TRES_factor_cpu,
	    TRES_weight_<type> * TRES_factor_<type>,
	    ...)

All of the factors in this formula are floating point numbers that range from 0.0 to 1.0. The weights are unsigned, 32-bit integers. The job's priority is an integer that ranges between 0 and 4294967295. The larger the number, the higher the job will be positioned in the queue, and the sooner the job will be scheduled. A job's priority, and hence its order in the queue, can vary over time. For example, the longer a job sits in the queue, the higher its priority will grow when PriorityWeightAge is non-zero.

IMPORTANT: The weight values should be high enough to get a good set of significant digits since all the factors are floating point numbers from 0.0 to 1.0. For example, one job could have a fair-share factor of .59534 and another job could have a fair-share factor of .50002. If the fair-share weight is only set to 10, both jobs would have the same fair-share priority. Therefore, set the weights high enough to avoid this scenario, starting around 1000 or so for those factors you want to make predominant.
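
To make this concrete, assuming each weighted term is truncated to an integer when the composite priority is computed:

	weight    10:  10    * .59534 -> 5       10    * .50002 -> 5      (identical)
	weight 10000:  10000 * .59534 -> 5953    10000 * .50002 -> 5000   (distinct)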

Age Factor

The age factor represents the length of time a job has been sitting in the queue and eligible to run. In general, the longer a job waits in the queue, the larger its age factor grows. However, the age factor for a dependent job will not change while it waits for the job it depends on to complete. Also, the age factor will not change when scheduling is withheld for a job whose node or time limits exceed the cluster's current limits.

At some configurable length of time (PriorityMaxAge), the age factor will max out to 1.0.
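
For example, if the age factor grows linearly with eligible wait time (a reasonable reading of the description above), a job that has been eligible for 3.5 days under the default PriorityMaxAge of 7-0 would have:

	age_factor = min(1.0, 3.5 days / 7 days) = 0.5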

Job Size Factor

The job size factor correlates to the number of nodes or CPUs the job has requested. This factor can be configured to favor larger jobs or smaller jobs based on the state of the PriorityFavorSmall boolean in the slurm.conf file. When PriorityFavorSmall is NO, the larger the job, the greater its job size factor will be. A job that requests all the nodes on the machine will get a job size factor of 1.0. When PriorityFavorSmall is YES, a single-node job will receive the 1.0 job size factor.

The PriorityFlags value of SMALL_RELATIVE_TO_TIME alters this behavior as follows. The job size in CPUs is divided by the time limit in minutes. The result is divided by the total number of CPUs in the system. Thus a full-system job with a time limit of one minute will receive a job size factor of 1.0, while a tiny job with a large time limit will receive a job size factor close to 0.0.
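
As an illustration, on a hypothetical 1000-CPU cluster the factor works out as follows (the job sizes and time limits are assumptions for the example):

	job_size_factor = (job CPUs / time limit in minutes) / total CPUs

	Full-system job, 1-minute limit:  (1000 / 1)    / 1000 = 1.0
	16-CPU job, 1440-minute limit:    (16   / 1440) / 1000 ≈ 0.0000111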

Partition Factor

Each node partition can be assigned an integer priority. The larger the number, the greater the job priority will be for jobs that request to run in this partition. This priority value is then normalized to the highest priority of all the partitions to become the partition factor.
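
For example, if three partitions were assigned priorities of 10, 20 and 40 (hypothetical values), the resulting partition factors would be:

	Partition priority 10:  partition_factor = 10 / 40 = 0.25
	Partition priority 20:  partition_factor = 20 / 40 = 0.50
	Partition priority 40:  partition_factor = 40 / 40 = 1.00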

Quality of Service (QOS) Factor

Each QOS can be assigned an integer priority. The larger the number, the greater the job priority will be for jobs that request this QOS. This priority value is then normalized to the highest priority of all the QOS's to become the QOS factor.

TRES Factors

Each TRES Type has its own priority factor for a job which represents the amount of TRES Type requested/allocated in a given partition. For global TRES Types, such as Licenses and Burst Buffers, the factor represents the number of TRES Type requested/allocated in the whole system. The more a given TRES Type is requested/allocated on a job, the greater the job priority will be for that job.

Fair-share Factor

Note: Computing the fair-share factor requires the installation and operation of the Slurm Accounting Database to provide the assigned shares and the consumed computing resources described below.

The fair-share component to a job's priority influences the order in which a user's queued jobs are scheduled to run based on the portion of the computing resources they have been allocated and the resources their jobs have already consumed. The fair-share factor does not involve a fixed allotment, whereby a user's access to a machine is cut off once that allotment is reached.

Instead, the fair-share factor serves to prioritize queued jobs such that those jobs charging accounts that are under-serviced are scheduled first, while jobs charging accounts that are over-serviced are scheduled when the machine would otherwise go idle.

Slurm's fair-share factor is a floating point number between 0.0 and 1.0 that reflects the shares of a computing resource that a user has been allocated and the amount of computing resources the user's jobs have consumed. The higher the value, the higher is the placement in the queue of jobs waiting to be scheduled.

By default, the computing resource is the computing cycles delivered by a machine in the units of allocated_cpus*seconds. Other resources can be taken into account by configuring a partition's TRESBillingWeights option. The TRESBillingWeights option allows you to account for consumed resources other than just CPUs by assigning different billing weights to different Trackable Resources (TRES) such as CPUs, nodes, memory, licenses and generic resources (GRES). For example, when billing only for CPUs, if a job requests 1 CPU and 64GB of memory on a 16-CPU, 64GB node, the job will only be billed for 1 CPU even though it effectively used the whole node.

By default, when TRESBillingWeights is configured, a job is billed for each individual TRES used. The billable TRES is calculated as the sum of all TRES types multiplied by their corresponding billing weight.

For example, the following jobs on a partition configured with TRESBillingWeights=CPU=1.0,Mem=0.25G and 16CPU, 64GB nodes would be billed as:

      CPUs       Mem GB
Job1: (1 *1.0) + (60*0.25) = (1 + 15) = 16
Job2: (16*1.0) + (1 *0.25) = (16+.25) = 16.25
Job3: (16*1.0) + (60*0.25) = (16+ 15) = 31

Another method of calculating the billable TRES is by taking the MAX of the individual TRES' on a node (e.g. cpus, mem, gres) plus the SUM of the global TRES' (e.g. licenses). For example, the above jobs' billable TRES would be calculated as:

          CPUs      Mem GB
Job1: MAX((1 *1.0), (60*0.25)) = 15
Job2: MAX((16*1.0), (1 *0.25)) = 16
Job3: MAX((16*1.0), (60*0.25)) = 16

This method is turned on by defining the MAX_TRES priority flag in slurm.conf.
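
A hypothetical slurm.conf fragment enabling this behavior for the partition used above might look like the following (the partition and node names are placeholders):

# Bill the largest single node-local TRES instead of their sum
PriorityFlags=MAX_TRES
PartitionName=normal Nodes=node[01-16] TRESBillingWeights="CPU=1.0,Mem=0.25G"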

Normalized Shares

The fair-share hierarchy represents the portions of the computing resource that have been allocated to multiple projects. These allocations are assigned to an account. There can be multiple levels of allocations as the allocation of a given account is further divided into sub-accounts:


Figure 1. Machine Allocation

The chart above shows the resources of the machine allocated to four accounts, A, B, C and D. Furthermore, account A's shares are further allocated to sub accounts, A1 through A4. Users are granted permission (through sacctmgr) to submit jobs against specific accounts. If there are 10 users given equal shares in Account A3, they will each be allocated 1% of the machine.

A user's normalized share is simply

S =	(Suser / Ssiblings) *
	(Saccount / Ssibling-accounts) *
	(Sparent / Sparent-siblings) * ...
Where:
S
is the user's normalized share, between zero and one
Suser
is the number of shares of the account allocated to the user
Ssiblings
is the total number of shares allocated to all users permitted to charge the account (including Suser)
Saccount
is the number of shares of the parent account allocated to the account
Ssibling-accounts
is the total number of shares allocated to all sub-accounts of the parent account
Sparent
is the number of shares of the grandparent account allocated to the parent
Sparent-siblings
is the total number of shares allocated to all sub-accounts of the grandparent account
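
As a concrete sketch consistent with the Account A3 example above (the individual share values here are assumptions chosen to match that 1% result): a user holding 1 of the 10 equal user shares in account A3, where A3 holds 10 of the 40 sub-account shares under account A and account A holds 40 of the 100 shares at the top level, has

	S = (1 / 10) * (10 / 40) * (40 / 100) = 0.1 * 0.25 * 0.4 = 0.01

or 1% of the machine.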

Normalized Usage

The processor*seconds allocated to every job are tracked in real-time. If one only considered usage over a fixed time period, then calculating a user's normalized usage would be a simple quotient:

	UN = Uuser / Utotal
Where:
UN
is normalized usage, between zero and one
Uuser
is the processor*seconds consumed by all of a user's jobs in a given account over a fixed time period
Utotal
is the total number of processor*seconds utilized across the cluster during that same time period

However, significant real-world usage quantities span multiple time periods. Rather than treating usage over a number of weeks or months with equal importance, Slurm's fair-share priority calculation places more importance on the most recent resource usage and less importance on usage from the distant past.

The Slurm usage metric is based on a half-life formula that favors the most recent usage statistics. Usage statistics from the past decrease in importance based on a single decay factor, D:

	UH = Ucurrent_period +
	     ( D * Ulast_period) + (D * D * Uperiod-2) + ...
Where:
UH
is the historical usage subject to the half-life decay
Ucurrent_period
is the usage charged over the current measurement period
Ulast_period
is the usage charged over the last measurement period
Uperiod-2
is the usage charged over the second last measurement period
D
is a decay factor between zero and one that delivers the half-life decay based on the PriorityDecayHalfLife setting in the slurm.conf file. Without accruing additional usage, a user's UH usage will decay to half its original value after a time period of PriorityDecayHalfLife seconds.

In practice, the PriorityDecayHalfLife could be a matter of seconds or days as appropriate for each site. The decay is recalculated every PriorityCalcPeriod minutes, or 5 minutes by default. The decay factor, D, is assigned the value that will achieve the half-life decay rate specified by the PriorityDecayHalfLife parameter.
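
One way to express this relationship, assuming the decay is applied once per PriorityCalcPeriod, is:

	D = 2 ** ( -PriorityCalcPeriod / PriorityDecayHalfLife )

For example, with the default PriorityCalcPeriod of 5 minutes (300 seconds) and the default PriorityDecayHalfLife of 7-0 (604800 seconds), D is approximately 0.99966 per calculation period; applying it 2016 times (one week) halves the usage.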

The total number of processor*seconds utilized can be similarly aggregated with the same decay factor:

	RH = Rcurrent_period +
	    ( D * Rlast_period) + (D * D * Rperiod-2) + ...
Where:
RH
is the total historical usage subject to the same half-life decay as the usage formula.
Rcurrent_period
is the total usage charged over the current measurement period
Rlast_period
is the total usage charged over the last measurement period
Rperiod-2
is the total usage charged over the second last measurement period
D
is the decay factor between zero and one

A user's normalized usage that spans multiple time periods then becomes:

	U = UH / RH

Simplified Fair-Share Formula

The simplified formula for calculating the fair-share factor for usage that spans multiple time periods and is subject to a half-life decay is:

	F = 2**(-U/S/d)
Where:
F
is the fair-share factor
S
is the normalized shares
U
is the normalized usage factoring in half-life decay
d
is the FairShareDampeningFactor (a configuration parameter, default value of 1)

The fair-share factor will therefore range from zero to one, where one represents the highest priority for a job. A fair-share factor of 0.5 indicates that the user's jobs have used exactly the portion of the machine that they have been allocated. A fair-share factor of above 0.5 indicates that the user's jobs have consumed less than their allocated share while a fair-share factor below 0.5 indicates that the user's jobs have consumed more than their allocated share of the computing resources.
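
For instance, with the default dampening factor d = 1:

	U = S   (used exactly the allocated share):  F = 2**(-1)   = 0.5
	U = S/2 (used half the allocated share):     F = 2**(-0.5) ≈ 0.71
	U = 2S  (used twice the allocated share):    F = 2**(-2)   = 0.25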

The Fair-share Factor Under An Account Hierarchy

The method described above presents a system whereby the priority of a user's job is calculated based on the portion of the machine allocated to the user and the historical usage of all the jobs run by that user under a specific account.

Another layer of "fairness" is necessary however, one that factors in the usage of other users drawing from the same account. This allows a job's fair-share factor to be influenced by the computing resources delivered to jobs of other users drawing from the same account.

If there are two members of a given account, and if one of those users has run many jobs under that account, the job priority of a job submitted by the user who has not run any jobs will be negatively affected. This ensures that the combined usage charged to an account matches the portion of the machine that is allocated to that account.

In the example below, when user 3 submits their first job using account C, they will want their job's priority to reflect all the resources delivered to account B. They do not care that user 1 has been using up a significant portion of the cycles allocated to account B and user 2 has yet to run a job out of account B. If user 2 submits a job using account B and user 3 submits a job using account C, user 3 expects their job to be scheduled before the job from user 2.


Figure 2. Usage Example

The Slurm Fair-Share Formula

The Slurm fair-share formula has been designed to provide fair scheduling to users based on the allocation and usage of every account.

The actual formula used is a refinement of the formula presented above:

	F = 2**(-UE/S)

The difference is that the usage term is effective usage, which is defined as:

	UE = UAchild +
		  ((UEparent - UAchild) * Schild/Sall_siblings)
Where:
UE
is the effective usage of the child user or child account
UAchild
is the actual usage of the child user or child account
UEparent
is the effective usage of the parent account
Schild
is the shares allocated to the child user or child account
Sall_siblings
is the shares allocated to all the children of the parent account

This formula applies only from the second tier of accounts below root and downward. For the tier of accounts just under root, effective usage equals actual usage.

Because the formula for effective usage includes a term of the effective usage of the parent, the calculation for each account in the tree must start at the second tier of accounts and proceed downward: to the children accounts, then grandchildren, etc. The effective usage of the users will be the last to be calculated.

Plugging in the effective usage into the fair-share formula above yields a fair-share factor that reflects the aggregated usage charged to each of the accounts in the fair-share hierarchy.

FairShare=parent

It is possible to disable fairshare at certain levels of the fair-share hierarchy by using the FairShare=parent option of sacctmgr. For users and accounts with FairShare=parent, the normalized shares and effective usage values of the parent in the hierarchy will be used when calculating fairshare priorities.

If all users in an account are configured with FairShare=parent, the result is that all the jobs drawing from that account will get the same fairshare priority, based on the account's total usage. No additional fairness is added based on users' individual usage.
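
For example, the option could be set with an sacctmgr command along the following lines (the account name is hypothetical):

# Give every user in account "physics" the account's own fairshare values
sacctmgr modify user where account=physics set fairshare=parent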

Example

The following example demonstrates the effective usage calculations and resultant fair-share factors. (See Figure 3 below.)

The machine's computing resources are allocated to accounts A and D with 40 and 60 shares respectively. Account A is further divided into two children accounts, B with 30 shares and C with 10 shares. Account D is further divided into two children accounts, E with 25 shares and F with 35 shares.

Note: the shares at any given tier in the Account hierarchy do not need to total up to 100 shares. This example shows them totaling up to 100 to make the arithmetic easier to follow in your head.

User 1 is granted permission to submit jobs against the B account. Users 2 and 3 are granted one share each in the C account. User 4 is the sole member of the E account and User 5 is the sole member of the F account.

Note: accounts A and D do not have any user members in this example, though users could have been assigned.

The shares assigned to each account make it easy to determine normalized shares of the machine's complete resources. Account A has .4 normalized shares, B has .3 normalized shares, etc. Users who are sole members of an account have the same number of normalized shares as the account. (E.g., User 1 has .3 normalized shares). Users who share accounts have a portion of the normalized shares based on their shares. For example, if user 2 had been allocated 4 shares instead of 1, user 2 would have had .08 normalized shares. With users 2 and 3 each holding 1 share, they each have a normalized share of 0.05.

Users 1, 2, and 4 have run jobs that have consumed the machine's computing resources. User 1's actual usage is 0.2 of the machine; user 2 is 0.25, and user 4 is 0.25.

The actual usage charged to each account is represented by the solid arrows. The actual usage charged to each account is summed as one goes up the tree. Account C's usage is the sum of the usage of Users 2 and 3; account A's actual usage is the sum of its children, accounts B and C.


Figure 3. Fair-share Example
  • User 1 normalized share: 0.3
  • User 2 normalized share: 0.05
  • User 3 normalized share: 0.05
  • User 4 normalized share: 0.25
  • User 5 normalized share: 0.35

As stated above, the effective usage is computed from the formula:

	UE = UAchild +
		  ((UEparent - UAchild) * Schild/Sall_siblings)

The effective usage for all accounts at the first tier under the root allocation is always equal to the actual usage:

Account A's effective usage is therefore equal to .45 and account D's is equal to .25. Applying the effective usage formula to the accounts at the next tier yields:
  • Account B effective usage: 0.2 + ((0.45 - 0.2) * 30 / 40) = 0.3875
  • Account C effective usage: 0.25 + ((0.45 - 0.25) * 10 / 40) = 0.3
  • Account E effective usage: 0.25 + ((0.25 - 0.25) * 25 / 60) = 0.25
  • Account F effective usage: 0.0 + ((0.25 - 0.0) * 35 / 60) = 0.1458

The effective usage of each user is calculated using the same formula:

  • User 1 effective usage: 0.2 + ((0.3875 - 0.2) * 1 / 1) = 0.3875
  • User 2 effective usage: 0.25 + ((0.3 - 0.25) * 1 / 2) = 0.275
  • User 3 effective usage: 0.0 + ((0.3 - 0.0) * 1 / 2) = 0.15
  • User 4 effective usage: 0.25 + ((0.25 - 0.25) * 1 / 1) = 0.25
  • User 5 effective usage: 0.0 + ((.1458 - 0.0) * 1 / 1) = 0.1458

Using the Slurm fair-share formula,

	F = 2**(-UE/S)

the fair-share factor for each user is:

  • User 1 fair-share factor: 2**(-.3875 / .3) = 0.408479
  • User 2 fair-share factor: 2**(-.275 / .05) = 0.022097
  • User 3 fair-share factor: 2**(-.15 / .05) = 0.125000
  • User 4 fair-share factor: 2**(-.25 / .25) = 0.500000
  • User 5 fair-share factor: 2**(-.1458 / .35) = 0.749154

From this example, one can see that users 1, 2, and 3 are over-serviced while user 5 is under-serviced. Even though user 3 has yet to submit a job, their fair-share factor is negatively influenced by the jobs users 1 and 2 have run.

Based on the fair-share factor alone, if all 5 users were to submit a job charging their respective accounts, user 5's job would be granted the highest scheduling priority.

The sprio utility

The sprio command provides a summary of the six factors that comprise each job's scheduling priority. While squeue has format options (%p and %Q) that display a job's composite priority, sprio can be used to display a breakdown of the priority components for each job. In addition, the sprio -w option displays the weights (PriorityWeightAge, PriorityWeightFairshare, etc.) for each factor as it is currently configured.
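
Typical invocations look like the following sketch; see the sprio and squeue man pages for the full option lists, and note that the squeue format string is only illustrative.

# Per-factor priority breakdown for all pending jobs
sprio

# Normalized (0.0 to 1.0) factor values instead of the weighted ones
sprio -n

# The configured weights for each factor
sprio -w

# Composite priority only, via the squeue format options mentioned above
squeue -o "%.15i %.8Q %.8p"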

Configuration

The following slurm.conf (SLURM_CONFIG_FILE) parameters are used to configure the Multi-factor Job Priority Plugin. See slurm.conf(5) man page for more details.

PriorityType
Set this value to "priority/multifactor" to enable the Multi-factor Job Priority Plugin. The default value for this variable is "priority/basic" which enables simple FIFO scheduling.
PriorityDecayHalfLife
This determines the contribution of historical usage to the composite usage value. The larger the number, the longer past usage affects fair-share. If set to 0, no decay will be applied. This is helpful if you want to enforce hard time limits per association; if set to 0, PriorityUsageResetPeriod must be set to some interval. The unit is a time string (e.g. min, hr:min:00, days-hr:min:00, or days-hr). The default value is 7-0 (7 days).
PriorityCalcPeriod
The period of time in minutes in which the half-life decay will be re-calculated. The default value is 5 (minutes).
PriorityUsageResetPeriod
At this interval the usage of associations will be reset to 0. This is used if you want to enforce hard limits of time usage per association. If PriorityDecayHalfLife is set to 0, no decay will happen and this is the only way to reset the usage accumulated by running jobs. By default this is turned off; it is generally advisable to use the PriorityDecayHalfLife option instead, so that the cluster does not sit idle waiting for a reset, but if your site only grants fixed amounts of time on the system, this is the way to enforce it. Applicable only if PriorityType=priority/multifactor. Possible values are listed below; the default is NONE.
  • NONE: Never clear historic usage. The default value.
  • NOW: Clear the historic usage now. Executed at startup and reconfiguration time.
  • DAILY: Cleared every day at midnight.
  • WEEKLY: Cleared every week on Sunday at time 00:00.
  • MONTHLY: Cleared on the first day of each month at time 00:00.
  • QUARTERLY: Cleared on the first day of each quarter at time 00:00.
  • YEARLY: Cleared on the first day of each year at time 00:00.
PriorityFavorSmall
A boolean that sets the polarity of the job size factor. The default setting is NO, which results in larger jobs having a larger job size factor. Setting this parameter to YES means that the smaller the job, the greater the job size factor will be.
PriorityMaxAge
Specifies the queue wait time at which the age factor maxes out. The unit is a time string (e.g. min, hr:min:00, days-hr:min:00, or days-hr). The default value is 7-0 (7 days).
PriorityWeightAge
An unsigned integer that scales the contribution of the age factor.
PriorityWeightFairshare
An unsigned integer that scales the contribution of the fair-share factor.
PriorityWeightJobSize
An unsigned integer that scales the contribution of the job size factor.
PriorityWeightPartition
An unsigned integer that scales the contribution of the partition factor.
PriorityWeightQOS
An unsigned integer that scales the contribution of the quality of service factor.
PriorityWeightTRES
A list of TRES Types and weights that scales the contribution of each TRES Type's factor.
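
For example, a TRES weight specification might look like the following (the TRES types and values shown are only illustrative):

PriorityWeightTRES=CPU=1000,Mem=2000,GRES/gpu=3000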

Note: As stated above, the six priority factors range from 0.0 to 1.0. As such, the PriorityWeight terms may need to be set to a high enough value (say, 1000) to resolve very tiny differences in priority factors. This is especially true with the fair-share factor, where two jobs may differ in priority by as little as .001, or even less.

Configuration Example

The following are sample slurm.conf file settings for the Multi-factor Job Priority Plugin.

The first example is for running the plugin applying decay over time to reduce usage. Hard limits can be used in this configuration, but they will have less effect since usage decays over time rather than accumulating without bound.

# Activate the Multi-factor Job Priority Plugin with decay
PriorityType=priority/multifactor

# 2 week half-life
PriorityDecayHalfLife=14-0

# The larger the job, the greater its job size priority.
PriorityFavorSmall=NO

# The job's age factor reaches 1.0 after waiting in the
# queue for 2 weeks.
PriorityMaxAge=14-0

# This next group determines the weighting of each of the
# components of the Multi-factor Job Priority Plugin.
# The default value for each of the following is 1.
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=0 # don't use the qos factor

This example is for running the plugin with no decay on usage, thus making a reset of usage necessary.

# Activate the Multi-factor Job Priority Plugin without decay
PriorityType=priority/multifactor

# apply no decay
PriorityDecayHalfLife=0

# reset usage after 1 month
PriorityUsageResetPeriod=MONTHLY

# The larger the job, the greater its job size priority.
PriorityFavorSmall=NO

# The job's age factor reaches 1.0 after waiting in the
# queue for 2 weeks.
PriorityMaxAge=14-0

# This next group determines the weighting of each of the
# components of the Multi-factor Job Priority Plugin.
# The default value for each of the following is 1.
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=0 # don't use the qos factor

Last modified 1 November 2013