Slurm Troubleshooting Guide

This guide is meant as a tool to help system administrators or operators troubleshoot Slurm failures and restore services. The Frequently Asked Questions document may also prove useful.

Slurm is not responding

  1. Execute "scontrol ping" to determine if the primary and backup controllers are responding.
  2. If it responds for you, this could be a networking or configuration problem specific to some user or node in the cluster.
  3. If not responding, directly login to the machine and try again to rule out network and configuration problems.
  4. If still not responding, check if there is an active slurmctld daemon by executing "ps -el | grep slurmctld".
  5. If slurmctld is not running, restart it (typically as user root using the command "/etc/init.d/slurm start"). You should check the log file (SlurmctldLogFile in the slurm.conf file) for an indication of why it failed. If it keeps failing, you should contact the slurm team for help at slurm-dev@schedmd.com.
  6. If slurmctld is running but not responding (a very rare situation), then kill and restart it (typically as user root using the commands "/etc/init.d/slurm stop" and then "/etc/init.d/slurm start").
  7. If it hangs again, increase the verbosity of debug messages (increase SlurmctldDebug in the slurm.conf file) and restart. Again check the log file for an indication of why it failed. At this point, you should contact the slurm team for help at slurm-dev@schedmd.com.
  8. If it continues to fail without an indication as to the failure mode, restart without preserving state (typically as user root using the commands "/etc/init.d/slurm stop" and then "/etc/init.d/slurm startclean"). Note: All running jobs and other state information will be lost.
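
The steps above can be condensed into a short shell session on the control machine. This is a minimal sketch: the log file path is an assumption based on a typical SlurmctldLogFile setting, and the init script path follows the examples in this guide (systemd-based installations will use their own service commands instead).

    # 1. Is the controller answering?
    scontrol ping

    # 2. Is the slurmctld daemon running on the control machine?
    ps -el | grep slurmctld

    # 3. Check the controller log for errors (path is an assumption;
    #    see SlurmctldLogFile in your slurm.conf).
    tail -n 100 /var/log/slurmctld.log

    # 4. Restart the daemon as root, preserving state.
    /etc/init.d/slurm stop
    /etc/init.d/slurm start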

Jobs are not getting scheduled

This is dependent upon the scheduler used by Slurm. Execute the command "scontrol show config | grep SchedulerType" to determine which scheduler is in use. For any scheduler, you can check the priorities of jobs using the command "scontrol show job". A condensed check is sketched after the list below.

  • If the scheduler type is builtin, then jobs will be executed in the order of submission for a given partition. Even if resources are available to initiate a job immediately, it will be deferred until no previously submitted job is pending.
  • If the scheduler type is backfill, then jobs will generally be executed in the order of submission for a given partition with one exception: later submitted jobs will be initiated early if doing so does not delay the expected execution time of an earlier submitted job. In order for backfill scheduling to be effective, users' jobs should specify reasonable time limits. If jobs do not specify time limits, then all jobs will receive the same time limit (that associated with the partition), and the ability to backfill schedule jobs will be limited. The backfill scheduler does not alter job specifications of required or excluded nodes, so jobs which specify nodes will substantially reduce the effectiveness of backfill scheduling. See the backfill documentation for more details.
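
A quick way to check the points above (a sketch; the squeue format string is illustrative):

    # Which scheduler is configured?
    scontrol show config | grep SchedulerType

    # Show pending jobs with their time limits and priorities; jobs without
    # reasonable time limits reduce the effectiveness of backfill scheduling.
    squeue -t PENDING -o "%.18i %.9P %.8u %.10l %.10Q"

    # Full details, including priority, for a single job.
    scontrol show job <jobid>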

Jobs and nodes are stuck in COMPLETING state

This is typically due to non-killable processes associated with the job. Slurm will continue to attempt terminating the processes with SIGKILL, but some processes may be stuck performing I/O and remain non-killable. This situation is typically caused by a file system problem and may be addressed in a couple of ways.

  1. Fix the file system and/or reboot the node. -OR-
  2. Set the node to a DOWN state and then return it to service ("scontrol update NodeName=<node> State=down Reason=hung_proc" and "scontrol update NodeName=<node> State=resume"). This permits other jobs to use the node, but leaves the non-killable process in place. If the process should ever complete the I/O, the pending SIGKILL should terminate it immediately. -OR-
  3. Use the UnkillableStepProgram and UnkillableStepTimeout configuration parameters to automatically respond to processes which can not be killed, by sending email or rebooting the node. For more information, see the slurm.conf documentation.
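
As a sketch of option 3: the slurm.conf fragment and notification script below are illustrative only. The script path and its contents are hypothetical; only the UnkillableStepProgram and UnkillableStepTimeout parameter names come from the slurm.conf documentation.

    # slurm.conf fragment (illustrative values)
    UnkillableStepProgram=/usr/local/sbin/report_unkillable.sh
    UnkillableStepTimeout=120

    # Contents of /usr/local/sbin/report_unkillable.sh (hypothetical):
    #!/bin/sh
    # Notify the admin so the node can be drained, checked for file system
    # problems, or rebooted.
    echo "Unkillable job step detected on $(hostname)" | mail -s "unkillable step" root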

Nodes are getting set to a DOWN state

  1. Check the reason why the node is down using the command "scontrol show node <name>". This will show the reason why the node was set down and the time when it happened. If there is insufficient disk space, memory space, etc. compared to the parameters specified in the slurm.conf file then either fix the node or change slurm.conf.
  2. If the reason is "Not responding", then check communications between the control machine and the DOWN node using the command "ping <address>" being sure to specify the NodeAddr values configured in slurm.conf. If ping fails, then fix the network or addresses in slurm.conf.
  3. Next, log in to a node that Slurm considers to be in a DOWN state and check if the slurmd daemon is running with the command "ps -el | grep slurmd". If slurmd is not running, restart it (typically as user root using the command "/etc/init.d/slurm start"). You should check the log file (SlurmdLogFile in the slurm.conf file) for an indication of why it failed. You can get the status of the running slurmd daemon by executing the command "scontrol show slurmd" on the node of interest. Check the value of "Last slurmctld msg time" to determine if the slurmctld is able to communicate with the slurmd. If it keeps failing, you should contact the slurm team for help at slurm-dev@schedmd.com.
  4. If slurmd is running but not responding (a very rare situation), then kill and restart it (typically as user root using the commands "/etc/init.d/slurm stop" and then "/etc/init.d/slurm start").
  5. If still not responding, try again to rule out network and configuration problems.
  6. If still not responding, increase the verbosity of debug messages (increase SlurmdDebug in the slurm.conf file) and restart. Again check the log file for an indication of why it failed. At this point, you should contact the slurm team for help at slurm-dev@schedmd.com.
  7. If still not responding without an indication as to the failure mode, restart without preserving state (typically as user root using the commands "/etc/init.d/slurm stop" and then "/etc/init.d/slurm startclean"). Note: All jobs and other state information on that node will be lost.
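
The checks above condense to a few commands, some run from the control machine and some on the DOWN node itself (a sketch; the log path assumes a typical SlurmdLogFile setting):

    # From the control machine: why is the node down, and is it reachable?
    scontrol show node <name>
    ping <NodeAddr from slurm.conf>

    # On the DOWN node: is slurmd running and talking to the controller?
    ps -el | grep slurmd
    scontrol show slurmd
    tail -n 100 /var/log/slurmd.log

    # Restart slurmd as root if needed.
    /etc/init.d/slurm stop
    /etc/init.d/slurm start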

Networking and configuration problems

  1. Check the controller and/or slurmd log files (SlurmctldLogFile and SlurmdLogFile in the slurm.conf file) for an indication of why it is failing.
  2. Check for consistent slurm.conf and credential files on the node(s) experiencing problems.
  3. If this is a user-specific problem, check that the user is configured on the controller computer(s) as well as the compute nodes. The user does not need to be able to log in, but the user ID must exist.
  4. Check that compatible versions of Slurm exist on all of the nodes (execute "sinfo -V" or "rpm -qa | grep slurm"). Slurm version numbers contain three period-separated numbers, which represent the major, minor and micro releases in that order (e.g. 14.11.3 is major=14, minor=11, micro=3). Changes in the RPCs (remote procedure calls) and state files will only be made if the major and/or minor release number changes. Slurm daemons will support RPCs and state files from the two previous minor releases (e.g. a version 15.08.x SlurmDBD will support slurmctld daemons and commands with a version of 14.03.x or 14.11.x).
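
A few quick consistency checks covering the points above (a sketch; the file paths assume a default /etc/slurm layout and MUNGE authentication, which your site may have configured differently):

    # Same slurm.conf everywhere?  Compare checksums on the controller
    # and on the problem node(s).
    md5sum /etc/slurm/slurm.conf

    # Same credential key everywhere (assuming MUNGE authentication)?
    md5sum /etc/munge/munge.key

    # Does the user's ID exist on the controller and the compute nodes?
    id <username>

    # Compatible Slurm versions on every node?
    sinfo -V
    rpm -qa | grep slurm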

Bluegene: Why is a block in an error state

  1. Check the controller log file (SlurmctldLogFile in the slurm.conf file) for an indication of why it is failing. (grep for update_block:)
  2. If the reason is a system-level failure, such as a failed boot or a bad nodecard, you will need to fix the underlying problem and then manually set the block to free.

Bluegene: How to make it so no jobs will run on a block

  1. Set the block state to be in error manually.
  2. When you are ready to run jobs again on the block manually set the block to free.

Bluegene: Static blocks in bluegene.conf file not loading

  1. Run "smap -Dc"
  2. When it comes up type "load /path/to/bluegene.conf".
  3. This should report which block(s) it is having problems loading and why.
  4. Note that the blocks in the bluegene.conf file must be in the same order in which smap created them, or you may encounter problems loading the configuration.
  5. If you need help creating a loadable bluegene.conf file, see the Bluegene admin guide.

Bluegene: How to free a block(s) manually

  • Using sfree
    1. To free a specific block run "sfree -b BLOCKNAME".
    2. To free all the blocks on the system run "sfree -a".
  • Using scontrol
    1. Run "scontrol update state=FREE BlockName=BLOCKNAME".

Bluegene: How to set a block in an error state manually

  1. Run "scontrol update state=ERROR BlockName=BLOCKNAME".

Bluegene: How to set a sub base partition which doesn't have a block already created in an error state manually

  1. Run "scontrol update state=ERROR subBPName=IONODE_LIST".
  2. IONODE_LIST is a list of the ionodes you want to down within a given base partition, e.g. bg000[0-3] will down the first 4 ionodes in base partition 000.

Bluegene: How to make a bluegene.conf file that will load in Slurm

  1. See the Bluegene admin guide

Last modified 15 December 2016