Frequently Asked Questions

For Management

  1. Why should I use Slurm or other Free Open Source Software (FOSS)?

For Users

  1. Why is my job/node in COMPLETING state?
  2. Why are my resource limits not propagated?
  3. Why is my job not running?
  4. Why does the srun --overcommit option not permit multiple jobs to run on nodes?
  5. Why is my job killed prematurely?
  6. Why are my srun options ignored?
  7. Why is the Slurm backfill scheduler not starting my job?
  8. How can I run multiple jobs from within a single script?
  9. Why do I have job steps when my job has already COMPLETED?
  10. How can I run a job within an existing job allocation?
  11. How does Slurm establish the environment for my job?
  12. How can I get shell prompts in interactive mode?
  13. How can I get the task ID in the output or error file name for a batch job?
  14. Can the make command utilize the resources allocated to a Slurm job?
  15. Can tasks be launched with a remote terminal?
  16. What does "srun: Force Terminated job" indicate?
  17. What does this mean: "srun: First task exited 30s ago" followed by "srun Job Failed"?
  18. Why is my MPI job failing due to the locked memory (memlock) limit being too low?
  19. Why is my batch job that launches no job steps being killed?
  20. How do I run specific tasks on certain nodes in my allocation?
  21. How can I temporarily prevent a job from running (e.g. place it into a hold state)?
  22. Why are jobs not getting the appropriate memory limit?
  23. Is an archive available of messages posted to the slurm-dev mailing list?
  24. Can I change my job's size after it has started running?
  25. Why is my MPICH2 or MVAPICH2 job not running with Slurm? Why does the DAKOTA program not run with Slurm?
  26. Why does squeue (and "scontrol show jobid") sometimes not display a job's estimated start time?
  27. How can I run an Ansys program with Slurm?
  28. How can I run programs on an Intel Phi (MIC) processor?
  29. How can a job in complete or failed state be requeued?
  30. Slurm documentation refers to CPUs, cores and threads. What exactly is considered a CPU?
  31. What is the difference between the sbatch and srun commands?
  32. Can squeue output be color coded?

For Administrators

  1. How is job suspend/resume useful?
  2. How can I configure Slurm to use the resources actually found on a node rather than what is defined in slurm.conf?
  3. Why is a node shown in state DOWN when the node has registered for service?
  4. What happens when a node crashes?
  5. How can I control the execution of multiple jobs per node?
  6. When the Slurm daemon starts, it prints "cannot resolve X plugin operations" and exits. What does this mean?
  7. How can I exclude some users from pam_slurm?
  8. How can I dry up the workload for a maintenance period?
  9. How can PAM be used to control a user's limits on or access to compute nodes?
  10. Why are jobs allocated nodes and then unable to initiate programs on some nodes?
  11. Why does slurmctld log that some nodes are not responding even if they are not in any partition?
  12. How should I relocate the primary or backup controller?
  13. Can multiple Slurm systems be run in parallel for testing purposes?
  14. Can Slurm emulate a larger cluster?
  15. Can Slurm emulate nodes with more resources than physically exist on the node?
  16. What does a "credential replayed" error in the SlurmdLogFile indicate?
  17. What does "Warning: Note very large processing time" in the SlurmctldLogFile indicate?
  18. How can I add support for lightweight core files?
  19. Is resource limit propagation useful on a homogeneous cluster?
  20. Do I need to maintain synchronized clocks on the cluster?
  21. Why are "Invalid job credential" errors generated?
  22. Why are "Task launch failed on node ... Job credential replayed" errors generated?
  23. Can Slurm be used with Globus?
  24. What causes the error "Unable to accept new connection: Too many open files"?
  25. Why does the setting of SlurmdDebug fail to log job step information at the appropriate level?
  26. Why isn't the auth_none.so (or other file) in a Slurm RPM?
  27. Why should I use the slurmdbd instead of the regular database plugins?
  28. How can I build Slurm with debugging symbols?
  29. How can I easily preserve drained node information between major Slurm updates?
  30. Why doesn't the HealthCheckProgram execute on DOWN nodes?
  31. What is the meaning of the error "Batch JobId=# missing from master node, killing it"?
  32. What does the message "srun: error: Unable to accept connection: Resources temporarily unavailable" indicate?
  33. How could I automatically print a job's Slurm job ID to its standard output?
  34. Why are user processes and srun running even though the job is supposed to be completed?
  35. How can I prevent the slurmd and slurmstepd daemons from being killed when a node's memory is exhausted?
  36. I see my host of my calling node as 127.0.1.1 instead of the correct IP address. Why is that?
  37. How can I stop Slurm from scheduling jobs?
  38. Can I update multiple jobs with a single scontrol command?
  39. Can Slurm be used to run jobs on Amazon's EC2?
  40. If a Slurm daemon core dumps, where can I find the core file?
  41. How can TotalView be configured to operate with Slurm?
  42. How can a patch file be generated from a Slurm commit in github?
  43. Why are the resource limits set in the database not being enforced?
  44. After manually setting a job priority value, how can its priority value be returned to being managed by the priority/multifactor plugin?
  45. Does anyone have an example node health check script for Slurm?
  46. What process should I follow to add nodes to Slurm?
  47. Can Slurm be configured to manage licenses?
  48. Can the salloc command be configured to launch a shell on a node in the job's allocation?
  49. What should I be aware of when upgrading Slurm?
  50. How easy is it to switch from PBS or Torque to Slurm?
  51. I am having trouble using SSSD with Slurm.
  52. How critical is configuring high availability for my database?
  53. How can I use double quotes in MySQL queries?
  54. Why is a compute node down with the reason set to "Node unexpectedly rebooted"?
  55. How can a job which has exited with a specific exit code be requeued?
  56. Can a user's account be changed in the database?
  57. What might account for MPI performance being below the expected level?
  58. How could some jobs submitted immediately before the slurmctld daemon crashed be lost?
  59. How do I safely remove partitions?
  60. Why is Slurm unable to set the CPU frequency for jobs?
  61. How can Slurm be configured to support Intel Phi (MIC)?
  62. When adding a new cluster, how can the Slurm cluster configuration be copied from an existing cluster to the new cluster?
  63. How can I update Slurm on a Cray DVS file system without rebooting the nodes?
  64. How can I rebuild the database hierarchy?
  65. How can a routing queue be configured?
  66. How can I suspend, resume, hold or release all of the jobs belonging to a specific user, partition, etc?
  67. I had to change a user's UID and now they cannot submit jobs. How do I get the new UID to take effect?

For Management

1. Why should I use Slurm or other Free Open Source Software (FOSS)?
Free Open Source Software (FOSS) does not mean that it is without cost. It does mean that you have access to the code so that you are free to use it, study it, and/or enhance it. These reasons contribute to Slurm (and FOSS in general) being subject to active research and development worldwide, displacing proprietary software in many environments. If the software is large and complex, like Slurm or the Linux kernel, then while there is no license fee, its use is not without cost.

If your work is important, you'll want the leading Slurm experts at your disposal to keep your systems operating at peak efficiency. While Slurm has a global development community incorporating leading edge technology, SchedMD personnel have developed most of the code and can provide competitively priced commercial support. SchedMD works with various organizations to provide support options ranging from remote level-3 support to 24x7 on-site personnel. Customers switching from commercial workload managers to Slurm typically report higher scalability, better performance and lower costs.

For Users

1. Why is my job/node in COMPLETING state?
When a job is terminating, both the job and its nodes enter the COMPLETING state. As the Slurm daemon on each node determines that all processes associated with the job have terminated, that node changes state to IDLE or some other appropriate state for use by other jobs. When every node allocated to a job has determined that all processes associated with it have terminated, the job changes state to COMPLETED or some other appropriate state (e.g. FAILED). Normally, this happens within a second. However, if the job has processes that cannot be terminated with a SIGKILL signal, the job and one or more nodes can remain in the COMPLETING state for an extended period of time. This may be indicative of processes hung waiting for a core file to complete I/O or operating system failure. If this state persists, the system administrator should check for processes associated with the job that cannot be terminated then use the scontrol command to change the node's state to DOWN (e.g. "scontrol update NodeName=name State=DOWN Reason=hung_completing"), reboot the node, then reset the node's state to IDLE (e.g. "scontrol update NodeName=name State=RESUME"). Note that setting the node DOWN will terminate all running or suspended jobs associated with that node. An alternative is to set the node's state to DRAIN until all jobs associated with it terminate before setting it DOWN and re-booting.

Note that Slurm has two configuration parameters that may be used to automate some of this process. UnkillableStepProgram specifies a program to execute when non-killable processes are identified. UnkillableStepTimeout specifies how long to wait for processes to terminate. See "man slurm.conf" for more information about these parameters.
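
The manual recovery sequence described above can be sketched as a small shell helper. This is only an illustration: the function name recover_hung_node, the helper run and the DRY_RUN variable are hypothetical, and the scontrol commands are the ones quoted in the answer above. By default the helper only prints the commands so that an administrator can review them first. The reboot step is left as a comment because it is site specific.

```shell
#!/bin/bash
# Hypothetical helper: print (or run) the scontrol commands used to
# recover a node stuck in COMPLETING state.
# DRY_RUN=1 (the default) only prints the commands.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

recover_hung_node() {
    local node=$1
    if [ -z "$node" ]; then
        echo "usage: recover_hung_node NODENAME" >&2
        return 1
    fi
    # Mark the node DOWN; this terminates jobs still associated with it.
    run scontrol update NodeName="$node" State=DOWN Reason=hung_completing
    # ... reboot the node here using your site's own procedure ...
    # Return the node to service once it is back up.
    run scontrol update NodeName="$node" State=RESUME
}
```

Calling recover_hung_node with a node name prints the two commands in order. Setting DRY_RUN=0 would execute them instead, which requires administrator privileges.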

2. Why are my resource limits not propagated?
When the srun command executes, it captures the resource limits in effect at submit time on the node where srun executes. These limits are propagated to the allocated nodes before initiating the user's job. The Slurm daemons running on the allocated nodes then try to establish identical resource limits for the job being initiated. There are several possible reasons for not being able to establish those resource limits.

  • The hard resource limits applied to Slurm's slurmd daemon are lower than the user's soft resource limits on the submit host. Typically the slurmd daemon is initiated by the init daemon with the operating system default limits. This may be addressed either through use of the ulimit command in the /etc/sysconfig/slurm file or enabling PAM in Slurm.
  • The user's hard resource limits on the allocated node are lower than the same user's soft resource limits on the node from which the job was submitted. It is recommended that the system administrator establish uniform hard resource limits for users on all nodes within a cluster to prevent this from occurring.

NOTE: This may produce the error message "Can't propagate RLIMIT_...". The error message is printed only if the user explicitly specifies that the resource limit should be propagated or the srun command is running with verbose logging of actions from the slurmd daemon (e.g. "srun -d6 ...").
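
Both cases above come down to a soft limit on the submit host being higher than a hard limit on the compute node. The comparison can be sketched as below. The function name limit_exceeds is hypothetical, and the values are assumed to be in the form printed by the shell's ulimit built-in (a number or the word "unlimited"). The example compares two limits of the local shell. On a real cluster the hard limit would be read on the allocated node instead.

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit status 0) when a soft limit is higher
# than a hard limit, i.e. when the soft limit could not be established on
# a node that enforces that hard limit.
limit_exceeds() {
    local soft=$1 hard=$2
    [ "$hard" = "unlimited" ] && return 1   # nothing exceeds "unlimited"
    [ "$soft" = "unlimited" ] && return 0   # unlimited exceeds any finite limit
    [ "$soft" -gt "$hard" ]
}

# Example: compare this shell's soft and hard open-file limits.
soft=$(ulimit -S -n)
hard=$(ulimit -H -n)
if limit_exceeds "$soft" "$hard"; then
    echo "soft limit $soft exceeds hard limit $hard"
else
    echo "soft limit $soft fits within hard limit $hard"
fi
```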

3. Why is my job not running?
The answer to this question depends upon the scheduler used by Slurm. Executing the command

scontrol show config | grep SchedulerType

will supply this information. If the scheduler type is builtin, then jobs will be executed in the order of submission for a given partition. Even if resources are available to initiate your job immediately, it will be deferred until no previously submitted job is pending. If the scheduler type is backfill, then jobs will generally be executed in the order of submission for a given partition with one exception: later submitted jobs will be initiated early if doing so does not delay the expected execution time of an earlier submitted job. In order for backfill scheduling to be effective, users' jobs should specify reasonable time limits. If jobs do not specify time limits, then all jobs will receive the same time limit (that associated with the partition), and the ability to backfill schedule jobs will be limited. The backfill scheduler does not alter job specifications of required or excluded nodes, so jobs which specify nodes will substantially reduce the effectiveness of backfill scheduling. See the backfill section for more details. For any scheduler, you can check priorities of jobs using the command scontrol show job.
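
The backfill rule described above can be illustrated with a little arithmetic. This is a deliberately simplified sketch and not Slurm's actual algorithm: it assumes a single earlier pending job, measures all times in minutes, and the function name can_backfill is hypothetical. It shows why a job that inherits a long partition time limit is rarely backfilled, while a job with a short, realistic limit often is.

```shell
#!/bin/bash
# Simplified illustration of the backfill rule, NOT Slurm's real algorithm.
# A later job may start now only if, running for its full time limit, it
# would finish no later than the expected start of the earlier job.
can_backfill() {
    local now=$1 time_limit=$2 earlier_job_start=$3
    [ $((now + time_limit)) -le "$earlier_job_start" ]
}

earlier_job_start=120   # earlier job is expected to start in 120 minutes
for limit in 60 1440; do
    if can_backfill 0 "$limit" "$earlier_job_start"; then
        echo "time limit $limit min: can be backfilled"
    else
        echo "time limit $limit min: must wait"
    fi
done
```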

4. Why does the srun --overcommit option not permit multiple jobs to run on nodes?
The --overcommit option is a means of indicating that a job or job step is willing to execute more than one task per processor in the job's allocation. For example, consider a cluster of two-processor nodes. The srun execute line may be something of this sort

srun --ntasks=4 --nodes=1 a.out

This will result in not one, but two nodes being allocated so that each of the four tasks is given its own processor. Note that the srun --nodes option specifies a minimum node count and optionally a maximum node count. A command line of

srun --ntasks=4 --nodes=1-1 a.out

would result in the request being rejected. If the --overcommit option is added to either command line, then only one node will be allocated for all four tasks to use.

More than one job can execute simultaneously on the same compute resource (e.g. CPU) through the use of srun's --oversubscribe option in conjunction with the OverSubscribe parameter in Slurm's partition configuration. See the man pages for srun and slurm.conf for more information.
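
The node counts in the example above follow from simple arithmetic, sketched here under the assumptions that every task needs one processor and all nodes are identical. The function name nodes_needed is hypothetical.

```shell
#!/bin/bash
# Nodes required without --overcommit: one processor per task, rounded up.
nodes_needed() {
    local ntasks=$1 cpus_per_node=$2
    echo $(( (ntasks + cpus_per_node - 1) / cpus_per_node ))
}

# Four tasks on two-processor nodes need two nodes, which is why
# "--nodes=1-1" (maximum of one node) cannot be satisfied.
nodes_needed 4 2
```

With --overcommit the one-processor-per-task requirement is dropped, so the minimum node count of one is sufficient.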

5. Why is my job killed prematurely?
Slurm has a job purging mechanism to remove inactive jobs (resource allocations) before they reach their time limit, which could be infinite. This inactivity time limit is configurable by the system administrator. You can check its value with the command

scontrol show config | grep InactiveLimit

The value of InactiveLimit is in seconds. A zero value indicates that job purging is disabled. A job is considered inactive if it has no active job steps or if the srun command creating the job is not responding. In the case of a batch job, the srun command terminates after the job script is submitted. Therefore batch job pre- and post-processing is limited to the InactiveLimit. Contact your system administrator if you believe the InactiveLimit value should be changed.

6. Why are my srun options ignored?
Everything after the command srun is examined to determine if it is a valid option for srun. The first token that is not a valid option for srun is considered the command to execute and everything after that is treated as an option to the command. For example:

srun -N2 hostname -pdebug

srun processes "-N2" as an option to itself. "hostname" is the command to execute and "-pdebug" is treated as an option to the hostname command. This will change the name of the computer on which Slurm executes the command. This is very bad; do not run this command as user root!

7. Why is the Slurm backfill scheduler not starting my job?
The most common problem is failing to set job time limits. If all jobs have the same time limit (for example the partition's time limit), then backfill will not be effective. Note that partitions can have both default and maximum time limits, which can be helpful in configuring a system for effective backfill scheduling.

In addition, there are a multitude of backfill scheduling parameters which can impact which jobs are considered for backfill scheduling, such as the maximum number of jobs tested per user. For more information see the slurm.conf man page and check the configuration of SchedulerParameters on your system.

8. How can I run multiple jobs from within a single script?
A Slurm job is just a resource allocation. You can execute many job steps within that allocation, either in parallel or sequentially. Some jobs actually launch thousands of job steps this way. The job steps will be allocated nodes that are not already allocated to other job steps. This essentially provides a second level of resource management within the job for the job steps.
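
A minimal sketch of such a script is shown here. The file name multi_step.sh and the programs step_a, step_b and step_c are placeholders, and the node and task counts are arbitrary. Depending on the Slurm version and configuration, additional srun options may be needed for parallel job steps to divide the allocation as intended.

```shell
#!/bin/bash
# Write a hypothetical batch script that runs job steps both in parallel
# and sequentially inside one allocation.
cat > multi_step.sh <<'EOF'
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --time=00:30:00

# Two job steps run in parallel, each on two of the four nodes.
srun --nodes=2 --ntasks=2 ./step_a &
srun --nodes=2 --ntasks=2 ./step_b &
wait    # block until both background job steps have finished

# One job step then runs on all four nodes.
srun --nodes=4 --ntasks=4 ./step_c
EOF

# Submit it with: sbatch multi_step.sh
```

Each srun line is one job step. The "&" and "wait" pair is what makes the first two steps run concurrently.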

9. Why do I have job steps when my job has already COMPLETED?
NOTE: This only applies to systems configured with SwitchType=switch/nrt. All other systems will purge all job steps on job completion.

Slurm maintains switch (network interconnect) information within the job step for IBM NRT switches. This information must be maintained until we are absolutely certain that the processes associated with the switch have been terminated to avoid the possibility of re-using switch resources for other jobs (even on different nodes). Slurm considers jobs COMPLETED when all nodes allocated to the job are either DOWN or confirm termination of all its processes. This enables Slurm to purge job information in a timely fashion even when there are many failing nodes. Unfortunately the job step information may persist longer.

10. How can I run a job within an existing job allocation?
There is an srun option --jobid that can be used to specify a job's ID. For a batch job or within an existing resource allocation, the environment variable SLURM_JOB_ID has already been defined, so all job steps will run within that job allocation unless otherwise specified. The one exception to this is when submitting batch jobs. When a batch job is submitted from within an existing batch job, it is treated as a new job allocation request and will get a new job ID unless explicitly set with the --jobid option. If you specify that a batch job should use an existing allocation, that job allocation will be released upon the termination of that batch job.
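
From outside the allocation (for example from another login shell), the --jobid option can be wrapped as follows. The function name run_in_job is hypothetical; it only passes the job ID and the command through to srun.

```shell
#!/bin/bash
# Hypothetical wrapper: launch a command as a job step of an existing job.
run_in_job() {
    local jobid=$1
    if [ -z "$jobid" ] || [ $# -lt 2 ]; then
        echo "usage: run_in_job JOBID COMMAND [ARGS...]" >&2
        return 1
    fi
    shift
    srun --jobid="$jobid" "$@"
}

# Example (job ID is illustrative): run_in_job 65541 hostname
```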

11. How does Slurm establish the environment for my job?
Slurm processes are not run under a shell, but directly exec'ed by the slurmd daemon (assuming srun is used to launch the processes). The environment variables in effect at the time the srun command is executed are propagated to the spawned processes. The ~/.profile and ~/.bashrc scripts are not executed as part of the process launch.

12. How can I get shell prompts in interactive mode?
srun --pty bash -i
Srun's --pty option runs task zero in pseudo terminal mode. Bash's -i option tells it to run in interactive mode (with prompts).

13. How can I get the task ID in the output or error file name for a batch job?

If you want separate output by task, you will need to build a script containing this specification. For example:

$ cat test
#!/bin/sh
echo begin_test
srun -o out_%j_%t hostname

$ sbatch -n7 -o out_%j test
sbatch: Submitted batch job 65541

$ ls -l out*
-rw-rw-r--  1 jette jette 11 Jun 15 09:15 out_65541
-rw-rw-r--  1 jette jette  6 Jun 15 09:15 out_65541_0
-rw-rw-r--  1 jette jette  6 Jun 15 09:15 out_65541_1
-rw-rw-r--  1 jette jette  6 Jun 15 09:15 out_65541_2
-rw-rw-r--  1 jette jette  6 Jun 15 09:15 out_65541_3
-rw-rw-r--  1 jette jette  6 Jun 15 09:15 out_65541_4
-rw-rw-r--  1 jette jette  6 Jun 15 09:15 out_65541_5
-rw-rw-r--  1 jette jette  6 Jun 15 09:15 out_65541_6

$ cat out_65541
begin_test

$ cat out_65541_2
tdev2

14. Can the make command utilize the resources allocated to a Slurm job?
Yes. A patch for GNU make version 3.81 is available as part of the Slurm distribution in the file contribs/make-3.81.slurm.patch. For GNU make version 4.0 you can use the patch in the file contribs/make-4.0.slurm.patch. This patch will use Slurm to launch tasks across a job's current resource allocation. Depending upon the size of modules to be compiled, this may or may not improve performance. If most modules are thousands of lines long, the use of additional resources should more than compensate for the overhead of Slurm's task launch. Use with make's -j option within an existing Slurm allocation. Outside of a Slurm allocation, make's behavior will be unchanged.

15. Can tasks be launched with a remote terminal?
In Slurm version 1.3 or higher, use srun's --pty option. Until then, you can accomplish this by starting an appropriate program or script. In the simplest case (X11 over TCP with the DISPLAY environment already set), executing srun xterm may suffice. In the more general case, the following scripts should work. NOTE: The pathnames of the additional scripts are set in the variables BS and IS of the first script; you must change them there. Execute the first script with the desired sbatch options. For example, interactive -N2 -pdebug.

#!/bin/bash
# -*- coding: utf-8 -*-
# Author: Pär Andersson (National Supercomputer Centre, Sweden)
# Version: 0.3 2007-07-30
#
# This will submit a batch script that starts screen on a node.
# Then ssh is used to connect to the node and attach the screen.
# The result is very similar to an interactive shell in PBS
# (qsub -I)

# Batch Script that starts SCREEN
BS=/INSTALL_DIRECTORY/_interactive
# Interactive screen script
IS=/INSTALL_DIRECTORY/_interactive_screen

# Submit the job and get the job id
JOB=`sbatch --output=/dev/null --error=/dev/null "$@" $BS 2>&1 \
    | grep -E -o -e "\b[0-9]+$"`

# Make sure the job is always canceled
trap "{ /usr/bin/scancel -q $JOB; exit; }" SIGINT SIGTERM EXIT

echo "Waiting for JOBID $JOB to start"
while true;do
    sleep 5s

    # Check job status
    STATUS=`squeue -j $JOB -t PD,R -h -o %t`

    if [ "$STATUS" = "R" ];then
	# Job is running, break the while loop
	break
    elif [ "$STATUS" != "PD" ];then
	echo "Job is not Running or Pending. Aborting"
	scancel $JOB
	exit 1
    fi

    echo -n "."

done

# Determine the first node in the job:
NODE=`srun --jobid=$JOB -N1 hostname`

# SSH to the node and attach the screen
sleep 1s
ssh -X -t $NODE $IS slurm$JOB
# The trap will now cancel the job before exiting.

NOTE: The above script executes the script below, named _interactive.

#!/bin/sh
# -*- coding: utf-8 -*-
# Author: Pär Andersson  (National Supercomputer Centre, Sweden)
# Version: 0.2 2007-07-30
#
# Simple batch script that starts SCREEN.

exec screen -Dm -S slurm$SLURM_JOB_ID

The following script named _interactive_screen is also used.

#!/bin/sh
# -*- coding: utf-8 -*-
# Author: Pär Andersson  (National Supercomputer Centre, Sweden)
# Version: 0.3 2007-07-30
#

SCREENSESSION=$1

# If DISPLAY is set then set that in the screen, then create a new
# window with that environment and kill the old one.
if [ "$DISPLAY" != "" ];then
    screen -S $SCREENSESSION -X unsetenv DISPLAY
    screen -p0 -S $SCREENSESSION -X setenv DISPLAY $DISPLAY
    screen -p0 -S $SCREENSESSION -X screen
    screen -p0 -S $SCREENSESSION -X kill
fi

exec screen -S $SCREENSESSION -rd

16. What does "srun: Force Terminated job" indicate?
The srun command normally terminates when the standard output and error I/O from the spawned tasks end. This does not necessarily happen at the same time that a job step is terminated. For example, a file system problem could render a spawned task non-killable at the same time that I/O to srun is pending. Alternately a network problem could prevent the I/O from being transmitted to srun. In any event, the srun command is notified when a job step is terminated, either upon reaching its time limit or being explicitly killed. If the srun has not already terminated, the message "srun: Force Terminated job" is printed. If the job step's I/O does not terminate in a timely fashion thereafter, pending I/O is abandoned and the srun command exits.

17. What does this mean: "srun: First task exited 30s ago" followed by "srun Job Failed"?
The srun command monitors when tasks exit. By default, 30 seconds after the first task exits, the job is killed. This typically indicates some type of job failure and continuing to execute a parallel job when one of the tasks has exited is not normally productive. This behavior can be changed using srun's --wait=<time> option to either change the timeout period or disable the timeout altogether. See srun's man page for details.

18. Why is my MPI job failing due to the locked memory (memlock) limit being too low?
By default, Slurm propagates all of your resource limits at the time of job submission to the spawned tasks. This can be disabled by specifically excluding the propagation of specific limits in the slurm.conf file. For example PropagateResourceLimitsExcept=MEMLOCK might be used to prevent the propagation of a user's locked memory limit from a login node to a dedicated node used for their parallel job. If the user's resource limit is not propagated, the limit in effect for the slurmd daemon will be used for the spawned job. A simple way to control this is to ensure that user root has a sufficiently large resource limit and that slurmd takes full advantage of this limit. For example, you can set user root's locked memory limit to be unlimited on the compute nodes (see "man limits.conf") and ensure that slurmd takes full advantage of this limit (e.g. by adding something like "ulimit -l unlimited" to the /etc/init.d/slurm script used to initiate slurmd). It may also be desirable to lock the slurmd daemon's memory to help ensure that it keeps responding if memory swapping begins. A sample /etc/sysconfig/slurm file is shown below. Related information about PAM is also available.

#
# Example /etc/sysconfig/slurm
#
# Increase the memlock limit so that user tasks can get
# unlimited memlock
ulimit -l unlimited
#
# Increase the open file limit
ulimit -n 8192
#
# Memlocks the slurmd process's memory so that if a node
# starts swapping, the slurmd will continue to respond
SLURMD_OPTIONS="-M"

19. Why is my batch job that launches no job steps being killed?
Slurm has a configuration parameter InactiveLimit intended to kill jobs that do not spawn any job steps for a configurable period of time. Your system administrator may modify the InactiveLimit to satisfy your needs. Alternately, you can just spawn a job step at the beginning of your script to execute in the background. It will be purged when your script exits or your job otherwise terminates. A line of this sort near the beginning of your script should suffice:
srun -N1 -n1 sleep 999999 &

20. How do I run specific tasks on certain nodes in my allocation?
One of the distribution methods for srun's -m or --distribution option is 'arbitrary'. This means you can tell Slurm to lay out your tasks in any fashion you want. For instance, if I had an allocation of 2 nodes and wanted to run 4 tasks on the first node and 1 task on the second, and my nodes allocated from SLURM_NODELIST were tux[0-1], my srun line would look like this:

srun -n5 -m arbitrary -w tux[0,0,0,0,1] hostname

If I wanted something similar but wanted the third task to be on tux1 I could run this:

srun -n5 -m arbitrary -w tux[0,0,1,0,0] hostname

Here is a simple perl script named arbitrary.pl that can be run to easily lay out tasks on nodes as they are in SLURM_NODELIST.

#!/usr/bin/perl
my @tasks = split(',', $ARGV[0]);
# Backticks interpolate Perl variables, so read the node list from %ENV.
my @nodes = `scontrol show hostnames '$ENV{SLURM_NODELIST}'`;
my $node_cnt = $#nodes + 1;
my $task_cnt = $#tasks + 1;

if ($node_cnt < $task_cnt) {
	print STDERR "ERROR: You only have $node_cnt nodes, but requested layout on $task_cnt nodes.\n";
	$task_cnt = $node_cnt;
}

my $cnt = 0;
my $layout;
foreach my $task (@tasks) {
	my $node = $nodes[$cnt];
	last if !$node;
	chomp($node);
	for(my $i=0; $i < $task; $i++) {
		$layout .= "," if $layout;
		$layout .= "$node";
	}
	$cnt++;
}
print $layout;

We can now use this script in our srun line in this fashion.

srun -m arbitrary -n5 -w `arbitrary.pl 4,1` -l hostname

This will lay out 4 tasks on the first node in the allocation and 1 task on the second node.

21. How can I temporarily prevent a job from running (e.g. place it into a hold state)?
The easiest way to do this is to change a job's earliest begin time (optionally set at job submit time using the --begin option). The example below places a job into hold state (preventing its initiation for 30 days) and later permits it to start immediately.

$ scontrol update JobId=1234 StartTime=now+30days
... later ...
$ scontrol update JobId=1234 StartTime=now
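
The two commands above can be wrapped in small shell functions for repeated use. The names defer_job and allow_job are hypothetical, and the default of 30 days simply mirrors the example. Note that scontrol also provides hold and release subcommands, which may be more convenient.

```shell
#!/bin/bash
# Hypothetical wrappers around the scontrol commands shown above.

# Push a job's earliest begin time into the future (default: 30 days).
defer_job() {
    local jobid=$1 days=${2:-30}
    scontrol update JobId="$jobid" StartTime="now+${days}days"
}

# Allow a previously deferred job to start as soon as it can be scheduled.
allow_job() {
    local jobid=$1
    scontrol update JobId="$jobid" StartTime=now
}

# Example (job ID is illustrative):
#   defer_job 1234
#   allow_job 1234
```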

22. Why are jobs not getting the appropriate memory limit?
This is probably a variation on the locked memory limit problem described above. Use the same solution for the AS (Address Space), RSS (Resident Set Size), or other limits as needed.

23. Is an archive available of messages posted to the slurm-dev mailing list?
Yes, it is at http://groups.google.com/group/slurm-devel

24. Can I change my job's size after it has started running?
Slurm supports the ability to both increase and decrease the size of running jobs. While the size of a pending job may be changed with few restrictions, several significant restrictions apply to changing the size of a running job as noted below:

  1. Support is not available on BlueGene or Cray systems due to limitations in the software underlying Slurm.
  2. Job(s) changing size must not be in a suspended state, including jobs suspended for gang scheduling. The jobs must be in a state of pending or running. We plan to modify the gang scheduling logic in the future to concurrently schedule a job to be used for expanding another job and the job to be expanded.

Use the scontrol command to change a job's size either by specifying a new node count (NumNodes=) for the job or by identifying the specific nodes (NodeList=) that you want the job to retain. Any job steps running on the nodes which are relinquished by the job will be killed unless initiated with the --no-kill option. After the job size is changed, some environment variables created by Slurm containing information about the job's environment will no longer be valid and should either be removed or altered (e.g. SLURM_NNODES, SLURM_NODELIST and SLURM_NPROCS). The scontrol command will generate a script that can be executed to reset local environment variables. You must retain the SLURM_JOB_ID environment variable in order for the srun command to gather information about the job's current state and specify the desired node and/or task count in subsequent srun invocations. A new accounting record is generated when a job is resized, showing the job to have been resubmitted and restarted at the new size. An example is shown below.

#!/bin/bash
srun my_big_job
scontrol update JobId=$SLURM_JOB_ID NumNodes=2
. slurm_job_${SLURM_JOB_ID}_resize.sh
srun -N2 my_small_job
rm slurm_job_${SLURM_JOB_ID}_resize.*

Increasing a job's size
Directly increasing the size of a running job would adversely affect the scheduling of pending jobs. For the sake of fairness in job scheduling, expanding a running job requires the user to submit a new job, but specify the option --dependency=expand:<jobid>. This option tells Slurm that the job, when scheduled, can be used to expand the specified jobid. Other job options would be used to identify the required resources (e.g. task count, node count, node features, etc.). This new job's time limit will be automatically set to reflect the end time of the job being expanded. This new job's generic resources specification will be automatically set equal to that of the job being merged to. This is due to the current Slurm restriction of all nodes associated with a job needing to have the same generic resource specification (i.e. a job cannot have one GPU on one node and two GPUs on another node), although this restriction may be removed in the future. This restriction can pose some problems when both jobs can be allocated resources on the same node, in which case the generic resources allocated to the new job will be released. If the jobs are allocated resources on different nodes, the generic resources associated with the resulting job allocation after the merge will be consistent as expected. Any licenses associated with the new job will be added to those available in the job being merged to. Note that partition and Quality Of Service (QOS) limits will be applied independently to the new job allocation so the expanded job may exceed size limits configured for an individual job.

After the new job is allocated resources, merge that job's allocation into that of the original job by executing:
scontrol update jobid=<jobid> NumNodes=0
The jobid above is that of the job that is to relinquish its resources. To provide more control over when the job expansion occurs, the resources are not merged into the original job until explicitly requested. These resources will be transferred to the original job and the scontrol command will generate a script to reset variables in the second job's environment to reflect its modified resource allocation (which would be no resources). One would normally exit this second job at this point, since it has no associated resources. In order to generate a script to modify the environment variables for the expanded job, execute:
scontrol update jobid=<jobid> NumNodes=ALL
Then execute the generated script. Note that this command does not change the original job's size, but only generates the script to change its environment variables. Until the environment variables are modified (e.g. the job's node count, CPU count, hostlist, etc.), any srun command will only consider the resources in the original resource allocation. Note that the original job may have active job steps at the time of its expansion, but they will not be affected by the change. An example of the procedure is shown below in which the original job allocation waits until the second resource allocation request can be satisfied. The job requesting additional resources could also use the sbatch command and permit the original job to continue execution at its initial size. Note that the development of additional user tools to manage Slurm resource allocations is planned in the future to make this process both simpler and more flexible.

$ salloc -N4 -C haswell bash
salloc: Granted job allocation 65542
$ srun hostname
icrm1
icrm2
icrm3
icrm4

$ salloc -N4 -C knl,snc4,flat --dependency=expand:$SLURM_JOB_ID bash
salloc: Granted job allocation 65543
$ scontrol update jobid=$SLURM_JOB_ID NumNodes=0
To reset Slurm environment variables, execute
  For bash or sh shells:  . ./slurm_job_65543_resize.sh
  For csh shells:         source ./slurm_job_65543_resize.csh
$ exit
exit
salloc: Relinquishing job allocation 65543

$ scontrol update jobid=$SLURM_JOB_ID NumNodes=ALL
To reset Slurm environment variables, execute
  For bash or sh shells:  . ./slurm_job_65542_resize.sh
  For csh shells:         source ./slurm_job_65542_resize.csh
$ . ./slurm_job_${SLURM_JOB_ID}_resize.sh

$ srun hostname
icrm1
icrm2
icrm3
icrm4
icrm5
icrm6
icrm7
icrm8
$ exit
exit
salloc: Relinquishing job allocation 65542
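
When the name of the generated resize script is built from SLURM_JOB_ID, the variable name has to be wrapped in braces; otherwise the shell looks up a variable called SLURM_JOB_ID_resize, which is normally unset. The following minimal sketch illustrates this. The helper name resize_script is hypothetical and is not part of Slurm.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): print the path of the resize
# script that scontrol reports for a given job ID.
# $1 = job ID, $2 = script flavor: "sh" (default) or "csh"
resize_script() {
    printf './slurm_job_%s_resize.%s\n' "$1" "${2:-sh}"
}

SLURM_JOB_ID=65542   # normally set by Slurm; hard-coded here for illustration

# Braces delimit the variable name, so "_resize.sh" is treated as literal text.
echo "./slurm_job_${SLURM_JOB_ID}_resize.sh"
resize_script "$SLURM_JOB_ID" csh
```

Inside a real allocation the script would then be sourced, for example with . "$(resize_script "$SLURM_JOB_ID")".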

25. Why is my MPICH2 or MVAPICH2 job not running with Slurm? Why does the DAKOTA program not run with Slurm?
The Slurm library used to support MPICH2 or MVAPICH2 references a variety of symbols. If those symbols resolve to functions or variables in your program rather than the appropriate library, the application will fail. For example, DAKOTA versions 5.1 and older contain a function named regcomp, which will get used rather than the POSIX regex functions. Rename DAKOTA's function and references from regcomp to something else to make it work properly.

26. Why does squeue (and "scontrol show jobid") sometimes not display a job's estimated start time?
When the backfill scheduler is configured, it provides an estimated start time for jobs that are candidates for backfill. Pending jobs with dependencies will not have an estimate as it is difficult to predict what resources will be available when the jobs they are dependent on terminate. Also note that the estimate is better for jobs expected to start soon, as most running jobs end before their estimated time. There are other restrictions on backfill that may apply. See the backfill section for more details.

27. How can I run an Ansys program with Slurm?
For an interactive run of an Ansys application, you can use a simple script such as the following (written for Ansys Fluent):

$ cat ./fluent-srun.sh
#!/usr/bin/env bash
HOSTSFILE=.hostlist-job$SLURM_JOB_ID
if [ "$SLURM_PROCID" == "0" ]; then
   srun hostname -f > $HOSTSFILE
   fluent -t $SLURM_NTASKS -cnf=$HOSTSFILE -ssh 3d
   rm -f $HOSTSFILE
fi
exit 0

To run an interactive session, use srun like this:

$ srun -n <tasks> ./fluent-srun.sh

28. How can I run programs on an Intel Phi (MIC) processor?
Two programming models are supported, offloading and native mode. System administrators should see the Intel Phi configuration information below. Slurm configuration details for Intel Phi offload support are available in Slurm's Generic Resource Guide. For a good description of how to build and run applications, please see CSC MIC documentation. Note that some of the information presented in this document is configuration dependent. The mpirun-mic is included in the Slurm distribution in the contribs/mic directory. Excerpts of the CSC documentation follow.

Executable Auto-Offloading
The Phi nodes have Executable Auto-Offloading (EAO) enabled by default. This feature is developed at CSC and is not currently in the standard Xeon Phi distribution. With this feature, any executable in the K1OM (MIC) binary format that the user tries to run on the host, will transparently be executed on the Xeon Phi coprocessor card instead. The execution is performed using the /usr/bin/micrun script.

By default, all environment variables with the MIC_ prefix will be passed to the binary with the prefix stripped away. For example, MIC_LD_LIBRARY_PATH becomes LD_LIBRARY_PATH.

EAO can be disabled by setting the environment variable MICRUN_DISABLE (i.e. export MICRUN_DISABLE=1).
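
The prefix handling described above can be pictured with the following sketch. It is not the actual /usr/bin/micrun script, only an illustration of the renaming rule; the function name strip_mic_prefix and the example path are assumptions.

```shell
#!/bin/bash
# Illustration only (not the real micrun): list every exported variable whose
# name starts with MIC_, printed as NAME=value with the prefix removed.
strip_mic_prefix() {
    env | sed -n 's/^MIC_//p'
}

export MIC_LD_LIBRARY_PATH=/opt/example/mic/lib   # hypothetical path
strip_mic_prefix
```

Note that MICRUN_DISABLE does not match the MIC_ prefix, so it would not be listed by this sketch.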

Offload programming model
The Intel compilers support offload compilation automatically. This means either offloading a code section using offload pragmas or calling an offload-enabled library (e.g. MKL).

In order to run offload jobs, one needs to set the GRES (Generic Resource Scheduling) parameter '--gres=mic:1'. For example:

$ srun --gres=mic:1 ./hello

If this is not set, the user will see the following warning:

offload warning: OFFLOAD_DEVICES device number -1 does not correspond to a physical device

Native OpenMP code
To compile OpenMP code natively, you can use the -mmic flag.

$ module load intel
$ icc -mmic -openmp hello.c -o hello.mic

To run, use the srun command. You may need to explicitly specify a Slurm partition containing MIC processors, for example:

$ srun -p mic ./hello.mic

29. How can a job in complete or failed state be requeued?

Slurm supports requeuing jobs that are in the completed or failed state. Use the command:

scontrol requeue job_id

The job will be returned to the PENDING state and scheduled again. See the scontrol(1) man page.

Consider a simple job like this:

$ cat zoppo
#!/bin/sh
echo "hello, world"
exit 10

$ sbatch -o here ./zoppo
Submitted batch job 10

The job finishes in FAILED state because it exits with a nonzero value. We can requeue the job back to the PENDING state and the job will be dispatched again.

$->scontrol requeue 10
$->squeue
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        10      mira    zoppo    david PD       0:00      1 (NonZeroExitCode)
$->squeue
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        10      mira    zoppo    david  R       0:03      1 alanz1

Slurm also supports requeuing jobs into a held state with the command:

scontrol requeuehold job_id

The job can be in state RUNNING, SUSPENDED, COMPLETED or FAILED before being requeued.

$->scontrol requeuehold 10
$->squeue
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        10      mira    zoppo    david PD       0:00      1 (JobHeldUser)

30. Slurm documentation refers to CPUs, cores and threads. What exactly is considered a CPU?
If your nodes are configured with hyperthreading, then a CPU is equivalent to a hyperthread. Otherwise a CPU is equivalent to a core. You can determine if your nodes have more than one thread per core using the command "scontrol show node" and looking at the values of "ThreadsPerCore".

Note that even on systems with hyperthreading enabled, the resources will generally be allocated to jobs at the level of a core (see NOTE below). Two different jobs will not share a core except through the use of a partition OverSubscribe configuration parameter. For example, a job requesting resources for three tasks on a node with ThreadsPerCore=2 will be allocated two full cores. Note that Slurm commands contain a multitude of options to control resource allocation with respect to base boards, sockets, cores and threads.

(NOTE: An exception to this would be if the system administrator configured SelectTypeParameters=CR_CPU and each node's CPU count without its socket/core/thread specification. In that case, each thread would be independently scheduled as a CPU. This is not a typical configuration.)
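
The three-task example above is a ceiling division. A minimal sketch, assuming one CPU (thread) per task and core-level allocation as described; cores_needed and threads_per_core are hypothetical helper names, not Slurm commands.

```shell
#!/bin/bash
# cores_needed TASKS THREADS_PER_CORE
# Whole cores required when each task uses one thread and cores are not shared.
cores_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# threads_per_core: extract ThreadsPerCore values from "scontrol show node"
# output supplied on standard input.
threads_per_core() {
    sed -n 's/.*ThreadsPerCore=\([0-9][0-9]*\).*/\1/p'
}

cores_needed 3 2    # prints 2
```

On a live system the second helper could be fed with: scontrol show node | threads_per_core | sort -u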

31. What is the difference between the sbatch and srun commands?
The srun command has two different modes of operation. First, if not run within an existing job (i.e. not within a Slurm job allocation created by salloc or sbatch), then it will create a job allocation and spawn an application. If run within an existing allocation, the srun command only spawns the application. For this question, we will only address the first mode of operation and compare creating a job allocation using the sbatch and srun commands.

The srun command is designed for interactive use, with someone monitoring the output. The output of the application is seen as output of the srun command, typically at the user's terminal. The sbatch command is designed to submit a script for later execution and its output is written to a file. Command options used in the job allocation are almost identical. The most noticeable difference in options is that the sbatch command supports the concept of job arrays, while srun does not. Another significant difference is in fault tolerance. Failures involving sbatch jobs typically result in the job being requeued and executed again, while failures involving srun typically result in an error message being generated with the expectation that the user will respond in an appropriate fashion.

32. Can squeue output be color coded?
The squeue command output is not color coded, but other tools can be used to add color. One such tool is ColorWrapper (https://github.com/rrthomas/cw). A sample ColorWrapper configuration file and output are shown below.

path /bin:/usr/bin:/sbin:/usr/sbin:
usepty
base green+
match red:default (Resources)
match black:default (null)
match black:cyan N/A
regex cyan:default  PD .*$
regex red:default ^\d*\s*C .*$
regex red:default ^\d*\s*CG .*$
regex red:default ^\d*\s*NF .*$
regex white:default ^JOBID.*

For Administrators

1. How is job suspend/resume useful?
Job suspend/resume is most useful to get particularly large jobs initiated in a timely fashion with minimal overhead. Say you want to get a full-system job initiated. Normally you would need to either cancel all running jobs or wait for them to terminate. Canceling jobs results in the loss of their work to that point from either their beginning or last checkpoint. Waiting for the jobs to terminate can take hours, depending upon your system configuration. A more attractive alternative is to suspend the running jobs, run the full-system job, then resume the suspended jobs. This can easily be accomplished by configuring a special queue for full-system jobs and using a script to control the process. The script would stop the other partitions, suspend running jobs in those partitions, and start the full-system partition. The process can be reversed when desired. One can effectively gang schedule (time-slice) multiple jobs using this mechanism, although the algorithms to do so can get quite complex. Suspending and resuming a job makes use of the SIGSTOP and SIGCONT signals respectively, so swap and disk space should be sufficient to accommodate all jobs allocated to a node, either running or suspended.
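
The control script mentioned above might be sketched as follows. This is an outline under assumptions rather than a tested procedure: the partition names are placeholders, the function name is hypothetical, and the helper only prints the scontrol commands so they can be reviewed before being piped to a shell.

```shell
#!/bin/bash
# Print (do not run) the commands that stop a partition, suspend the jobs whose
# IDs arrive on standard input, and bring up the full-system partition.
# $1 = partition to stop, $2 = full-system partition (both site-specific)
emit_suspend_plan() {
    echo "scontrol update PartitionName=$1 State=DOWN"
    while read -r jobid; do
        if [ -n "$jobid" ]; then
            echo "scontrol suspend $jobid"
        fi
    done
    echo "scontrol update PartitionName=$2 State=UP"
}

# Example with two made-up job IDs:
printf '101\n102\n' | emit_suspend_plan batch fullsys
```

On a live system the job IDs could come from squeue -h -t RUNNING -p batch -o %i, and the process is reversed with scontrol resume.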

2. How can I configure Slurm to use the resources actually found on a node rather than what is defined in slurm.conf?
Slurm can either base its scheduling decisions upon the node configuration defined in slurm.conf or what each node actually returns as available resources. This is controlled using the configuration parameter FastSchedule. Set its value to zero in order to use the resources actually found on each node, but with a higher overhead for scheduling. A value of one is the default and results in the node configuration defined in slurm.conf being used. See "man slurm.conf" for more details.

3. Why is a node shown in state DOWN when the node has registered for service?
The configuration parameter ReturnToService in slurm.conf controls how DOWN nodes are handled. Set its value to one in order for DOWN nodes to automatically be returned to service once the slurmd daemon registers with a valid node configuration. A value of zero is the default and results in a node staying DOWN until an administrator explicitly returns it to service using the command "scontrol update NodeName=whatever State=RESUME". See "man slurm.conf" and "man scontrol" for more details.

4. What happens when a node crashes?
A node is set DOWN when the slurmd daemon on it stops responding for SlurmdTimeout as defined in slurm.conf. The node can also be set DOWN when certain errors occur or the node's configuration is inconsistent with that defined in slurm.conf. Any active job on that node will be killed unless it was submitted with the srun option --no-kill. Any active job step on that node will be killed. See the slurm.conf and srun man pages for more information.

5. How can I control the execution of multiple jobs per node?
There are two mechanisms to control this. If you want to allocate individual processors on a node to jobs, configure SelectType=select/cons_res. See Consumable Resources in Slurm for details about this configuration. If you want to allocate whole nodes to jobs, configure SelectType=select/linear. Each partition also has a configuration parameter OverSubscribe that enables more than one job to execute on each node. See man slurm.conf for more information about these configuration parameters.

6. When the Slurm daemon starts, it prints "cannot resolve X plugin operations" and exits. What does this mean?
This means that symbols expected in the plugin were not found by the daemon. This typically happens when the plugin was built or installed improperly or the configuration file is telling the daemon to use an old plugin (say from the previous version of Slurm). Restart the daemon in verbose mode for more information (e.g. "slurmctld -Dvvvvv").

7. How can I exclude some users from pam_slurm?
CAUTION: Please test this on a test machine/VM before you actually do this on your Slurm computers.

Step 1. Make sure pam_listfile.so exists on your system. The following command is an example on Redhat 6:

ls -la /lib64/security/pam_listfile.so

Step 2. Create user list (e.g. /etc/ssh/allowed_users):

# /etc/ssh/allowed_users
root
myadmin

Optionally, change the file mode to keep the list hidden from regular users:

chmod 600 /etc/ssh/allowed_users

NOTE: root does not need to be listed in allowed_users, but I feel somewhat safer if it is on the list.

Step 3. In /etc/pam.d/sshd, add pam_listfile.so with the sufficient flag before pam_slurm.so (e.g. my /etc/pam.d/sshd looks like this):

#%PAM-1.0
auth       required     pam_sepermit.so
auth       include      password-auth
account    sufficient   pam_listfile.so item=user sense=allow file=/etc/ssh/allowed_users onerr=fail
account    required     pam_slurm.so
account    required     pam_nologin.so
account    include      password-auth
password   include      password-auth
# pam_selinux.so close should be the first session rule
session    required     pam_selinux.so close
session    required     pam_loginuid.so
# pam_selinux.so open should only be followed by sessions to be executed in the user context
session    required     pam_selinux.so open env_params
session    optional     pam_keyinit.so force revoke
session    include      password-auth

(Information courtesy of Koji Tanaka, Indiana University)

8. How can I dry up the workload for a maintenance period?
Create a resource reservation as described in Slurm's Resource Reservation Guide.

9. How can PAM be used to control a user's limits on or access to compute nodes?
You will need to build and install Slurm including its PAM module (a slurm_pam package is provided; the code is located in the contribs/pam directory). First, enable Slurm's use of PAM by setting UsePAM=1 in slurm.conf.
Second, establish PAM configuration file(s) for Slurm in /etc/pam.conf or the appropriate files in the /etc/pam.d directory (e.g. /etc/pam.d/sshd) by adding the line "account required pam_slurm.so". A basic configuration you might use is:

account  required  pam_unix.so
account  required  pam_slurm.so
auth     required  pam_localuser.so
session  required  pam_limits.so

Third, set the desired limits in /etc/security/limits.conf. For example, to set the locked memory limit to unlimited for all users:

*   hard   memlock   unlimited
*   soft   memlock   unlimited

Finally, you need to disable Slurm's forwarding of the limits from the session from which the srun initiating the job ran. By default all resource limits are propagated from that session. For example, adding the following line to slurm.conf will prevent the locked memory limit from being propagated: PropagateResourceLimitsExcept=MEMLOCK.

We also have a PAM module for Slurm that prevents users from logging into nodes that they have not been allocated (except for user root, which can always log in). This pam_slurm module is included with the Slurm distribution. The module is built by default, but can be disabled using the .rpmmacros option "%_without_pam 1" or by entering the command line option "--without pam" when the configure program is executed. Its source code is in the directory "contribs/pam". The use of pam_slurm does not require UsePAM being set. The two uses of PAM are independent.

10. Why are jobs allocated nodes and then unable to initiate programs on some nodes?
This typically indicates that the time on some nodes is not consistent with the node on which the slurmctld daemon executes. In order to initiate a job step (or batch job), the slurmctld daemon generates a credential containing a time stamp. If the slurmd daemon receives a credential containing a time stamp later than the current time or more than a few minutes in the past, it will be rejected. If you check in the SlurmdLog on the nodes of interest, you will likely see messages of this sort: "Invalid job credential from <some IP address>: Job credential expired." Make the times consistent across all of the nodes and all should be well.
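
A quick way to look for the clock skew described above is to compare epoch timestamps taken on the controller and on a compute node. The sketch below is only illustrative: the function name and the 60-second default tolerance are assumptions, not values taken from Slurm.

```shell
#!/bin/bash
# clock_skew_ok T1 T2 [TOLERANCE]
# Succeeds when two epoch timestamps (seconds) differ by at most TOLERANCE
# seconds. The default of 60 is an arbitrary illustrative value.
clock_skew_ok() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    [ "$diff" -le "${3:-60}" ]
}

if clock_skew_ok 1700000000 1700000045; then
    echo "clocks agree within tolerance"
else
    echo "clock skew too large"
fi
```

In practice the two arguments could be gathered with date +%s on the controller and on the node of interest, for example over ssh.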

11. Why does slurmctld log that some nodes are not responding even if they are not in any partition?
The slurmctld daemon periodically pings the slurmd daemon on every configured node, even if not associated with any partition. You can control the frequency of this ping with the SlurmdTimeout configuration parameter in slurm.conf.

12. How should I relocate the primary or backup controller?
If the cluster's computers used for the primary or backup controller will be out of service for an extended period of time, it may be desirable to relocate them. In order to do so, follow this procedure:

  1. Stop all Slurm daemons
  2. Modify the ControlMachine, ControlAddr, BackupController, and/or BackupAddr in the slurm.conf file
  3. Distribute the updated slurm.conf file to all nodes
  4. Copy the StateSaveLocation directory to the new host and make sure the permissions allow the SlurmUser to read and write it.
  5. Restart all Slurm daemons

There should be no loss of any running or pending jobs. Ensure that any nodes added to the cluster have a current slurm.conf file installed. CAUTION: If two nodes are simultaneously configured as the primary controller (two nodes on which ControlMachine specifies the local host and the slurmctld daemon is executing on each), system behavior will be destructive. If a compute node has an incorrect ControlMachine or BackupController parameter, that node may be rendered unusable, but no other harm will result.

13. Can multiple Slurm systems be run in parallel for testing purposes?
Yes, this is a great way to test new versions of Slurm. Just install the test version in a different location with a different slurm.conf. The test system's slurm.conf should specify different pathnames and port numbers to avoid conflicts. The only problem is if more than one version of Slurm is configured with switch/nrt or burst_buffer/* plugins. In that case, there can be conflicting API requests from the different Slurm systems. This can be avoided by configuring the test system with switch/none and burst_buffer/none. MPI jobs started on an NRT switch system without the switch windows configured will not execute properly, but other jobs will run fine.

14. Can Slurm emulate a larger cluster?
Yes, this can be useful for testing purposes. It has also been used to partition "fat" nodes into multiple Slurm nodes. There are two ways to do this. The best method for most conditions is to run one slurmd daemon per emulated node in the cluster as follows.

  1. When executing the configure program, use the option --enable-multiple-slurmd (or add that option to your ~/.rpmmacros file).
  2. Build and install Slurm in the usual manner.
  3. In slurm.conf define the desired node names (arbitrary names used only by Slurm) as NodeName along with the actual address of the physical node in NodeHostname. Multiple NodeName values can be mapped to a single NodeHostname. Note that each NodeName on a single physical node needs to be configured to use a different port number (set Port to a unique value on each line for each node). You will also want to use the "%n" symbol in slurmd related path options in slurm.conf (SlurmdLogFile and SlurmdPidFile).
  4. When starting the slurmd daemon, include the NodeName of the node that it is supposed to serve on the execute line (e.g. "slurmd -N hostname").
  5. This is an example of the slurm.conf file with the emulated nodes and ports configuration. Any valid value for the CPUs, memory or other valid node resources can be specified.
NodeName=dummy26[1-100] NodeHostName=achille Port=[6001-6100] NodeAddr=127.0.0.1 CPUs=4 RealMemory=6000
PartitionName=mira Default=yes Nodes=dummy26[1-100]
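
With a configuration like this, one slurmd has to be started per emulated node (step 4). The following small sketch prints the start commands for a range of node names so they can be inspected before being run; the function name is hypothetical.

```shell
#!/bin/bash
# emit_slurmd_cmds PREFIX FIRST LAST
# Print one "slurmd -N <name>" line for each node name PREFIX<FIRST..LAST>.
emit_slurmd_cmds() {
    i=$2
    while [ "$i" -le "$3" ]; do
        echo "slurmd -N $1$i"
        i=$(( i + 1 ))
    done
}

emit_slurmd_cmds dummy26 1 3
```

For the configuration above, the full set would be produced by emit_slurmd_cmds dummy26 1 100.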

It is strongly recommended that Slurm version 1.2 or higher be used for this due to its improved support for multiple slurmd daemons. See the Programmers Guide for more details about configuring multiple slurmd support.

In order to emulate a really large cluster, it can be more convenient to use a single slurmd daemon. That daemon will not be able to launch many tasks, but can suffice for developing or testing scheduling software. Do not run job steps with more than a couple of tasks each or execute more than a few jobs at any given time. Doing so may result in the slurmd daemon exhausting its memory and failing. Use this method with caution.

  1. Execute the configure program with your normal options plus --enable-front-end (this will define HAVE_FRONT_END in the resulting config.h file).
  2. Build and install Slurm in the usual manner.
  3. In slurm.conf define the desired node names (arbitrary names used only by Slurm) as NodeName along with the actual name and address of the one physical node in NodeHostName and NodeAddr. Up to 64k nodes can be configured in this virtual cluster.
  4. Start your slurmctld and one slurmd daemon. It is advisable to use the "-c" option to start the daemons without trying to preserve any state files from previous executions. Be sure to use the "-c" option when switching from this mode, too.
  5. Create job allocations as desired, but do not run job steps with more than a couple of tasks.
$ ./configure --enable-debug --enable-front-end --prefix=... --sysconfdir=...
$ make install
$ grep NodeHostName slurm.conf
NodeName=dummy[1-1200] NodeHostName=localhost NodeAddr=127.0.0.1
$ slurmctld -c
$ slurmd -c
$ sinfo
PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
pdebug*      up      30:00  1200   idle dummy[1-1200]
$ cat tmp
#!/bin/bash
sleep 30
$ srun -N200 -b tmp
srun: jobid 65537 submitted
$ srun -N200 -b tmp
srun: jobid 65538 submitted
$ srun -N800 -b tmp
srun: jobid 65539 submitted
$ squeue
JOBID PARTITION  NAME   USER  ST  TIME  NODES NODELIST(REASON)
65537    pdebug   tmp  jette   R  0:03    200 dummy[1-200]
65538    pdebug   tmp  jette   R  0:03    200 dummy[201-400]
65539    pdebug   tmp  jette   R  0:02    800 dummy[401-1200]

15. Can Slurm emulate nodes with more resources than physically exist on the node?
Yes, in Slurm version 1.2 or higher. In the slurm.conf file, set FastSchedule=2 and specify any desired node resource specifications (CPUs, Sockets, CoresPerSocket, ThreadsPerCore, and/or TmpDisk). Slurm will use the resource specification for each node that is given in slurm.conf and will not check these specifications against those actually found on the node. The system would best be configured with TaskPlugin=task/none, so that launched tasks can run on any available CPU under operating system control.

16. What does a "credential replayed" error in the SlurmdLogFile indicate?
This error is indicative of the slurmd daemon not being able to respond to job initiation requests from the srun command in a timely fashion (a few seconds). Srun responds by resending the job initiation request. When the slurmd daemon finally starts to respond, it processes both requests. The second request is rejected and the event is logged with the "credential replayed" error. If you check the SlurmdLogFile and SlurmctldLogFile, you should see signs of the slurmd daemon's non-responsiveness. A variety of factors can be responsible for this problem including

  • Diskless nodes encountering network problems
  • Very slow Network Information Service (NIS)
  • The Prolog script taking a long time to complete

In Slurm version 1.2, this can be addressed with the MessageTimeout configuration parameter by setting a value higher than the default 5 seconds. In earlier versions of Slurm, the --msg-timeout option of srun serves a similar purpose.

17. What does "Warning: Note very large processing time" in the SlurmctldLogFile indicate?
This error is indicative of some operation taking an unexpectedly long time to complete, over one second to be specific. Setting the SlurmctldDebug configuration parameter to a value of six or higher should identify which operation(s) are experiencing long delays. This message typically indicates long delays in file system access (writing state information or getting user information). Another possibility is that the node on which the slurmctld daemon executes has exhausted memory and is paging. Try running the program top to check for this possibility.

18. How can I add support for lightweight core files?
Slurm supports lightweight core files by setting environment variables based upon the srun --core option. Of particular note, it sets the LD_PRELOAD environment variable to load new functions used to process a core dump. First you will need to acquire and install a shared object library with the appropriate functions. Then edit the Slurm code in src/srun/core-format.c to specify a name for the core file type, add a test for the existence of the library, and set environment variables appropriately when it is used.

19. Is resource limit propagation useful on a homogeneous cluster?
Resource limit propagation permits a user to modify resource limits and submit a job with those limits. By default, Slurm automatically propagates all resource limits in effect at the time of job submission to the tasks spawned as part of that job. System administrators can utilize the PropagateResourceLimits and PropagateResourceLimitsExcept configuration parameters to change this behavior. Users can override defaults using the srun --propagate option. See "man slurm.conf" and "man srun" for more information about these options.

20. Do I need to maintain synchronized clocks on the cluster?
In general, yes. Having inconsistent clocks may cause nodes to be unusable. Slurm log files should contain references to expired credentials. For example:

error: Munge decode failed: Expired credential
ENCODED: Wed May 12 12:34:56 2008
DECODED: Wed May 12 12:01:12 2008

21. Why are "Invalid job credential" errors generated?
This error is indicative of Slurm's job credential files being inconsistent across the cluster. All nodes in the cluster must have the matching public and private keys as defined by JobCredPrivateKey and JobCredPublicKey in the slurm configuration file slurm.conf.

22. Why are "Task launch failed on node ... Job credential replayed" errors generated?
This error indicates that a job credential generated by the slurmctld daemon corresponds to a job that the slurmd daemon has already revoked. The slurmctld daemon selects job ID values based upon the configured value of FirstJobId (the default value is 1) and each job gets a value one larger than the previous job. On job termination, the slurmctld daemon notifies the slurmd on each allocated node that all processes associated with that job should be terminated. The slurmd daemon maintains a list of the jobs which have already been terminated to avoid replay of task launch requests. If the slurmctld daemon is cold-started (with the "-c" option or "/etc/init.d/slurm startclean"), it starts job ID values over based upon FirstJobId. If the slurmd is not also cold-started, it will reject job launch requests for jobs that it considers terminated. The solution to this problem is to cold-start all slurmd daemons whenever the slurmctld daemon is cold-started.

23. Can Slurm be used with Globus?
Yes. Build and install Slurm's Torque/PBS command wrappers along with the Perl APIs from Slurm's contribs directory and configure Globus to use those PBS commands. Note there are RPMs available for both of these packages, named torque and perlapi respectively.

24. What causes the error "Unable to accept new connection: Too many open files"?
The srun command automatically increases its open file limit to the hard limit in order to process all of the standard input and output connections to the launched tasks. It is recommended that you set the open file hard limit to 8192 across the cluster.
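
To check whether a node already meets this recommendation, the hard limit can be read with the shell's ulimit builtin. A minimal sketch; the function name is hypothetical, and 8192 is the value recommended above.

```shell
#!/bin/bash
# nofile_hard_limit_ok MIN
# Succeeds when the hard open-file limit of the current shell is at least MIN
# (or is unlimited).
nofile_hard_limit_ok() {
    limit=$(ulimit -Hn)
    [ "$limit" = "unlimited" ] || [ "$limit" -ge "$1" ]
}

if nofile_hard_limit_ok 8192; then
    echo "open file hard limit is sufficient"
else
    echo "open file hard limit is below 8192 (currently $(ulimit -Hn))"
fi
```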

25. Why does the setting of SlurmdDebug fail to log job step information at the appropriate level?
There are two programs involved here. One is slurmd, which is a persistent daemon running at the desired debug level. The second program is slurmstepd, which executes the user job; its debug level is controlled by the user. Submitting the job with an option of --debug=# will result in the desired level of detail being logged in the SlurmdLogFile plus the output of the program.

26. Why isn't the auth_none.so (or other file) in a Slurm RPM?
The auth_none plugin is in a separate RPM and not built by default. Using the auth_none plugin means that Slurm communications are not authenticated, so you probably do not want to run in this mode of operation except for testing purposes. If you want to build the auth_none RPM then add --with auth_none on the rpmbuild command line or add %_with_auth_none to your ~/.rpmmacros file. See the file slurm.spec in the Slurm distribution for a list of other options.

27. Why should I use the slurmdbd instead of the regular database plugins?
While the normal storage plugins will work fine without the added layer of the slurmdbd there are some great benefits to using the slurmdbd.

  1. Added security. Using the slurmdbd you can have an authenticated connection to the database.
  2. Offloading processing from the controller. With the slurmdbd there is no slowdown of the controller due to a slow or overloaded database.
  3. Keeping enterprise wide accounting from all Slurm clusters in one database. The slurmdbd is multi-threaded and designed to handle all the accounting for the entire enterprise.
  4. With the new database plugins 1.3+ you can query with sacct accounting stats from any node Slurm is installed on. With the slurmdbd you can also query any cluster using the slurmdbd from any other cluster's nodes.

28. How can I build Slurm with debugging symbols?
Set your CFLAGS environment variable before building. You want the "-g" option to produce debugging information and "-O0" to set the optimization level to zero (off). For example:
CFLAGS="-g -O0" ./configure ...

29. How can I easily preserve drained node information between major Slurm updates?
Major Slurm updates generally have changes in the state save files and communication protocols, so a cold-start (without state) is generally required. If you have nodes in a DRAIN state and want to preserve that information, you can easily build a script to preserve that information using the sinfo command. The following command line will report the Reason field for every node in a DRAIN state and write the output in a form that can be executed later to restore state.

sinfo -t drain -h -o "scontrol update nodename='%N' state=drain reason='%E'"

30. Why doesn't the HealthCheckProgram execute on DOWN nodes?
Hierarchical communications are used for sending this message. If there are DOWN nodes in the communications hierarchy, messages will need to be re-routed. This limits Slurm's ability to tightly synchronize the execution of the HealthCheckProgram across the cluster, which could adversely impact performance of parallel applications. The use of cron or node startup scripts may be better suited to ensure that HealthCheckProgram gets executed on nodes that are DOWN in Slurm. If you still want to have Slurm try to execute HealthCheckProgram on DOWN nodes, apply the following patch:

Index: src/slurmctld/ping_nodes.c
===================================================================
--- src/slurmctld/ping_nodes.c  (revision 15166)
+++ src/slurmctld/ping_nodes.c  (working copy)
@@ -283,9 +283,6 @@
		node_ptr   = &node_record_table_ptr[i];
		base_state = node_ptr->node_state & NODE_STATE_BASE;

-               if (base_state == NODE_STATE_DOWN)
-                       continue;
-
 #ifdef HAVE_FRONT_END          /* Operate only on front-end */
		if (i > 0)
			continue;

31. What is the meaning of the error "Batch JobId=# missing from master node, killing it"?
A shell is launched on node zero of a job's allocation to execute the submitted program. The slurmd daemon executing on each compute node will periodically report to the slurmctld what programs it is executing. If a batch program is expected to be running on some node (i.e. node zero of the job's allocation) and is not found, the message above will be logged and the job canceled. This typically is associated with exhausting memory on the node or some other critical failure that cannot be recovered from. The equivalent message in earlier releases of Slurm is "Master node lost JobId=#, killing it".

32. What does the message "srun: error: Unable to accept connection: Resources temporarily unavailable" indicate?
This has been reported on some larger clusters running SUSE Linux when a user's resource limits are reached. You may need to increase limits for locked memory and stack size to resolve this problem.

33. How could I automatically print a job's Slurm job ID to its standard output?
The configured TaskProlog is the only thing that can write to the job's standard output or set extra environment variables for a job or job step. To write to the job's standard output, precede the message with "print ". To export environment variables, output a line of this form "export name=value". The example below will print a job's Slurm job ID and allocated hosts for a batch job only.

#!/bin/sh
#
# Sample TaskProlog script that will print a batch job's
# job ID and node list to the job's stdout
#

if [ X"$SLURM_STEP_ID" = "X" -a X"$SLURM_PROCID" = "X"0 ]
then
  echo "print =========================================="
  echo "print SLURM_JOB_ID = $SLURM_JOB_ID"
  echo "print SLURM_NODELIST = $SLURM_NODELIST"
  echo "print =========================================="
fi
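The same mechanism can set environment variables. The sketch below is a hypothetical TaskProlog fragment that emits one "export" line and one "print" line. The variable name MY_SCRATCH and the /tmp/scratch_ path are made up for illustration and are not part of Slurm.

```shell
#!/bin/sh
# Sketch of a TaskProlog fragment. Each line written to stdout is interpreted
# by Slurm: "export name=value" sets a variable for the task, and
# "print ..." writes text to the job's standard output.
# MY_SCRATCH and the scratch path are hypothetical names.
emit_task_prolog_lines() {
  scratch_dir="/tmp/scratch_${SLURM_JOB_ID:-unknown}"
  echo "export MY_SCRATCH=$scratch_dir"
  echo "print MY_SCRATCH set to $scratch_dir"
}

emit_task_prolog_lines
```

Note that the script itself only writes these lines. It does not create the directory, which would need its own command in the prolog.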

34. Why are user processes and srun running even though the job is supposed to be completed?
Slurm relies upon a configurable process tracking plugin to determine when all of the processes associated with a job or job step have completed. Those plugins relying upon a kernel patch can reliably identify every process. Those plugins dependent upon process group IDs or parent process IDs are not reliable. See the ProctrackType description in the slurm.conf man page for details. We rely upon the sgi_job for most systems.

35. How can I prevent the slurmd and slurmstepd daemons from being killed when a node's memory is exhausted?
You can set the value in the /proc/self/oom_adj for slurmd and slurmstepd by initiating the slurmd daemon with the SLURMD_OOM_ADJ and/or SLURMSTEPD_OOM_ADJ environment variables set to the desired values. A value of -17 typically will disable killing.

36. I see the host of my calling node as 127.0.1.1 instead of the correct IB address. Why is that?
Some systems by default will put your host in the /etc/hosts file as something like

127.0.1.1	snowflake.llnl.gov	snowflake

This will cause srun and Slurm commands to use the 127.0.1.1 address instead of the correct address and prevent communications between nodes. The solution is to either remove this line or configure a different NodeAddr that is known by your other nodes.

The TopologyParam=NoInAddrAny configuration parameter is subject to this same problem, which can also be addressed by removing the actual node name from the "127.0.1.1" as well as the "127.0.0.1" addresses in the /etc/hosts file. It is ok if they point to localhost, but not the actual name of the node.
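To find such entries, a small check along the following lines may help. It is a minimal sketch: the function name loopback_entries is made up for this example, and it only matches a node name against whole host names or the first label of a fully qualified name.

```shell
#!/bin/sh
# Print the lines of a hosts file that map a loopback address (127.*) to the
# given node name, either as a short name or as the first label of a longer name.
loopback_entries() {
  hosts_file=$1
  node_name=$2
  awk -v name="$node_name" '
    $1 ~ /^127\./ {
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^#/) break            # rest of the line is a comment
        split($i, labels, ".")
        if ($i == name || labels[1] == name) { print; break }
      }
    }' "$hosts_file"
}

# Check this machine, using the short form of the name reported by uname.
full_name=$(uname -n)
if [ -r /etc/hosts ]; then
  loopback_entries /etc/hosts "${full_name%%.*}"
fi
```

Any line printed is a candidate for removal or editing as described above. No output means the node name is not mapped to a loopback address in that file.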

37. How can I stop Slurm from scheduling jobs?
You can stop Slurm from scheduling jobs on a per partition basis by setting that partition's state to DOWN. Set its state UP to resume scheduling. For example:

$ scontrol update PartitionName=foo State=DOWN
$ scontrol update PartitionName=bar State=UP

38. Can I update multiple jobs with a single scontrol command?
No, but you can probably use squeue to build the script taking advantage of its filtering and formatting options. For example:

$ squeue -tpd -h -o "scontrol update jobid=%i priority=1000" >my.script

39. Can Slurm be used to run jobs on Amazon's EC2?

Yes, here is a description of Slurm use with Amazon's EC2 courtesy of Ashley Pittman:

I do this regularly and have no problem with it. The approach I take is to start as many instances as I want and have a wrapper around ec2-describe-instances that builds a /etc/hosts file with fixed hostnames and the actual IP addresses that have been allocated. The only other step then is to generate a slurm.conf based on how many nodes you've chosen to boot that day. I run this wrapper script on my laptop and it generates the files and then rsyncs them to all the instances automatically.

One thing I found is that Slurm refuses to start if any nodes specified in the slurm.conf file aren't resolvable. I initially tried to specify cloud[0-15] in slurm.conf, but then if I configure less than 16 nodes in /etc/hosts this doesn't work, so I dynamically generate the slurm.conf as well as the hosts file.

As a comment about EC2, I just run generic AMIs and have a persistent EBS storage device which I attach to the first instance when I start up. This contains a /usr/local which has my software like Slurm, pdsh and MPI installed, which I then copy over the /usr/local on the first instance and NFS export to all other instances. This way I have persistent home directories and a very simple first-login script that configures the virtual cluster for me.

40. If a Slurm daemon core dumps, where can I find the core file?

For slurmctld, the core file will be in the same directory as its log files (SlurmctldLogFile) if configured using a fully qualified pathname (starting with "/"). Otherwise it will be found in the directory used for saving state (StateSaveLocation).

For slurmd, the core file will be in the same directory as its log files (SlurmdLogFile) if configured using a fully qualified pathname (starting with "/"). Otherwise it will be found in the directory used for saving state (SlurmdSpoolDir).

For slurmstepd, the location of the core file depends upon when the failure occurs. It will be either in the spawned job's working directory or in the same location as that described above for the slurmd daemon.

NOTE: On some systems, the slurmstepd's will not generate core files without some system configuration changes due to its use of the setuid (set user ID) function.
Set /proc/sys/fs/suid_dumpable to 2.
This can be set permanently in sysctl.conf with:
fs.suid_dumpable = 2
or temporarily with:
sysctl fs.suid_dumpable=2
On CentOS 6, also set "ProcessUnpackaged = yes" in the file /etc/abrt/abrt-action-save-package-data.conf. On Red Hat EL6, also set "DAEMON_COREFILE_LIMIT=unlimited" in the file rc.d/init.d/functions.
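Whether the fs.suid_dumpable setting is already in place can be verified with a short check such as the sketch below. The function name is made up for this example, and its optional argument (a file to read instead of the /proc entry) exists only so the check can be exercised without touching the system.

```shell
#!/bin/sh
# Report whether fs.suid_dumpable has the value 2 recommended above.
# With no argument, the live setting under /proc is read.
suid_dumpable_status() {
  setting_file=${1:-/proc/sys/fs/suid_dumpable}
  if [ ! -r "$setting_file" ]; then
    echo "unknown: cannot read $setting_file"
    return 0
  fi
  value=$(cat "$setting_file")
  if [ "$value" = "2" ]; then
    echo "ok: fs.suid_dumpable is 2"
  else
    echo "change needed: fs.suid_dumpable is $value, expected 2"
  fi
}

suid_dumpable_status
```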

Once these configuration changes have been made and the slurmstepd aborts, you should see messages of this type in the file /var/log/messages:

Oct 15 11:31:20 knc abrt[21489]: Saved core dump of pid 21477 (/localhome/adam/slurm/16.05/knc/sbin/slurmstepd) to /var/spool/abrt/ccpp-2015-10-15-11:31:20-21477 (6639616 bytes)
Oct 15 11:31:20 knc abrtd: Directory 'ccpp-2015-10-15-11:31:20-21477' creation detected

There should be a core file inside the specified directory.

On a 3.6 kernel (Ubuntu), fs.suid_dumpable requires a fully qualified path in the core_pattern. For example:
sysctl kernel.core_pattern=/tmp/core.%e.%p

41. How can TotalView be configured to operate with Slurm?

The following lines should also be added to the global .tvdrc file for TotalView to operate with Slurm:

# Enable debug server bulk launch: Checked
dset -set_as_default TV::bulk_launch_enabled true

# Command:
# Beginning with TV 7X.1, TV supports Slurm and %J.
# Specify --mem-per-cpu=0 in case Slurm configured with default memory
# value and we want TotalView to share the job's memory limit without
# consuming any of the job's memory so as to block other job steps.
dset -set_as_default TV::bulk_launch_string {srun --mem-per-cpu=0 -N%N -n%N -w`awk -F. 'BEGIN {ORS=","} {if (NR==%N) ORS=""; print $1}' %t1` -l --input=none %B/tvdsvr%K -callback_host %H -callback_ports %L -set_pws %P -verbosity %V -working_directory %D %F}

# Temp File 1 Prototype:
# Host Lines:
# Slurm NodeNames need to be unadorned hostnames. In case %R returns
# fully qualified hostnames, list the hostnames in %t1 here, and use
# awk in the launch string above to strip away domain name suffixes.
dset -set_as_default TV::bulk_launch_tmpfile1_host_lines {%R}

42. How can a patch file be generated from a Slurm commit in github?

Find and open the commit in github then append ".patch" to the URL and save the resulting file. For an example, see: https://github.com/SchedMD/slurm/commit/91e543d433bed11e0df13ce0499be641774c99a3.patch

43. Why are the resource limits set in the database not being enforced?
In order to enforce resource limits, set the value of AccountingStorageEnforce in each cluster's slurm.conf configuration file appropriately. If AccountingStorageEnforce does not contain an option of "limits", then resource limits will not be enforced on that cluster. See Resource Limits for more information.
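A quick way to inspect a configuration file for this option is sketched below. The function name lists_limits_option is made up, the demo runs on a throwaway file rather than a real slurm.conf, and the check looks only for the literal word "limits". Any other AccountingStorageEnforce values that might imply limit enforcement are not handled here, so consult the slurm.conf man page for the full semantics.

```shell
#!/bin/sh
# Return 0 if any uncommented AccountingStorageEnforce line in the given file
# lists the literal option "limits" (compared case-insensitively), else 1.
lists_limits_option() {
  conf_file=$1
  options=$(grep -i '^[[:space:]]*AccountingStorageEnforce[[:space:]]*=' "$conf_file" \
              | cut -d= -f2- | tr ',' ' ' | tr 'A-Z' 'a-z')
  for option in $options; do
    if [ "$option" = "limits" ]; then
      return 0
    fi
  done
  return 1
}

# Demo on a throwaway file; point the function at the real slurm.conf instead.
sample_conf=$(mktemp)
printf 'ClusterName=tux\nAccountingStorageEnforce=associations,limits\n' > "$sample_conf"
if lists_limits_option "$sample_conf"; then
  echo "limits option is listed"
else
  echo "limits option is not listed"
fi
rm -f "$sample_conf"
```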

44. After manually setting a job priority value, how can its priority value be returned to being managed by the priority/multifactor plugin?
Hold and then release the job as shown below.

$ scontrol hold <jobid>
$ scontrol release <jobid>

45. Does anyone have an example node health check script for Slurm?
Probably the most comprehensive and lightweight health check tool out there is Node Health Check. It has integration with Slurm as well as Torque resource managers.

46. What process should I follow to add nodes to Slurm?
The slurmctld daemon has a multitude of bitmaps to track state of nodes and cores in the system. Adding nodes to a running system would require the slurmctld daemon to rebuild all of those bitmaps, which the developers feel would be safer to do by restarting the daemon. Communications from the slurmd daemons on the compute nodes to the slurmctld daemon include a configuration file checksum, so you probably also want to maintain a common slurm.conf file on all nodes. The following procedure is recommended:

  1. Stop the slurmctld daemon (e.g. "/etc/init.d/slurm stop" on the head node)
  2. Update the slurm.conf file on all nodes in the cluster
  3. Restart the slurmctld daemon (e.g. "/etc/init.d/slurm start" on the head node)
  4. Start the slurmd daemons on the new nodes (e.g. "/etc/init.d/slurm start" on those nodes)
  5. Have all slurmd daemons read the new configuration file (e.g. "scontrol reconfig", no need to restart the daemons)
NOTE: Jobs submitted with srun, and that are waiting for an allocation, prior to new nodes being added to the slurm.conf can fail if the job is allocated one of the new nodes.

47. Can Slurm be configured to manage licenses?
Slurm is not currently integrated with FlexLM, but it does provide for the allocation of global resources called licenses. Use the Licenses configuration parameter in your slurm.conf file (e.g. "Licenses=foo:10,bar:20"). Jobs can request licenses and be granted exclusive use of those resources (e.g. "sbatch --licenses=foo:2,bar:1 ..."). It is not currently possible to change the total number of licenses on a system without restarting the slurmctld daemon, but it is possible to dynamically reserve licenses and remove them from being available to jobs on the system (e.g. "scontrol update reservation=licenses_held licenses=foo:5,bar:2").

48. Can the salloc command be configured to launch a shell on a node in the job's allocation?
Yes, just use the SallocDefaultCommand configuration parameter in your slurm.conf file as shown below.

SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL"

For Cray systems, add --gres=craynetwork:0 to the options.

SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --gres=craynetwork:0 --pty --preserve-env --mpi=none $SHELL"

49. What should I be aware of when upgrading Slurm?
See the Quick Start Administrator Guide Upgrade section for details.

50. How easy is it to switch from PBS or Torque to Slurm?
A lot of users don't even notice the difference. Slurm has wrappers available for the mpiexec, pbsnodes, qdel, qhold, qrls, qstat, and qsub commands (see contribs/torque in the distribution and the "slurm-torque" RPM). There is also a wrapper for the showq command at https://github.com/pedmon/slurm_showq.

Slurm recognizes and translates the "#PBS" options in batch scripts. Most, but not all options are supported.

Slurm also includes a SPANK plugin that will set all of the PBS environment variables based upon the Slurm environment (e.g. PBS_JOBID, PBS_JOBNAME, PBS_WORKDIR, etc.). One environment variable not set is PBS_ENVIRONMENT, which if set would result in the failure of some MPI implementations. The plugin will be installed in
<install_directory>/lib/slurm/spank_pbs.so
See the SPANK man page for configuration details.

51. I am having trouble using SSSD with Slurm.
SSSD, or System Security Services Daemon, does not allow enumeration of group members by default. Note that enabling enumeration in large environments might not be feasible. However, as of version 16.05 Slurm does not need enumeration, except for some specific quirky configurations (multiple groups with the same GID), so it is probably safe to leave enumeration disabled. SSSD is also case sensitive by default for some configurations, which could possibly raise other issues. Add the following lines to /etc/sssd/sssd.conf on your head node to address these issues:

enumerate = True
case_sensitive = False

52. How critical is configuring high availability for my database?

  • Consider if you really need mysql failover. A short outage of slurmdbd is not a problem, because slurmctld will store all data in memory and send it to slurmdbd when it's back operating. The slurmctld daemon will also cache all user limits and fair share information.
  • You cannot use ndb, since slurmdbd/mysql uses keys on BLOB values (and maybe something more from the incompatibility list).
  • You can set up "classical" Linux HA, with heartbeat/corosync to migrate IP between master/backup mysql servers and:
    • Configure one way replication of mysql, and change master/backup roles on failure
    • Use shared storage for master/slave mysql servers database, and start backup on master mysql failure.

53. How can I use double quotes in MySQL queries?
Execute:

SET session sql_mode='ANSI_QUOTES';

This will allow double quotes in queries like this:

show columns from "tux_assoc_table" where Field='is_def';

54. Why is a compute node down with the reason set to "Node unexpectedly rebooted"?
This indicates that the slurmctld daemon on the cluster's head node and the slurmd daemon on the compute node were both running when the compute node rebooted. If you want to prevent this condition from setting the node into a DOWN state then configure ReturnToService to 2. See the slurm.conf man page for details. Otherwise use scontrol or sview to manually return the node to service.

55. How can a job which has exited with a specific exit code be requeued?
Slurm supports requeue in hold with a SPECIAL_EXIT state using the command:

scontrol requeuehold State=SpecialExit job_id

This is useful when users want to requeue and flag a job which has exited with a specific error case. See man scontrol(1) for more details.

$->scontrol requeuehold State=SpecialExit 10
$->squeue
   JOBID PARTITION  NAME     USER  ST       TIME  NODES NODELIST(REASON)
    10      mira    zoppo    david SE       0:00      1 (JobHeldUser)

The job can be later released and run again.

The requeueing of jobs which exit with a specific exit code can be automated using an EpilogSlurmctld, see man(5) slurm.conf. This is an example of a script whose exit code depends on the existence of a file.

$->cat exitme
#!/bin/sh
#
echo "hi! `date`"
if [ ! -e "/tmp/myfile" ]; then
  echo "going out with 8"
  exit 8
fi
rm /tmp/myfile
echo "going out with 0"
exit 0

This is an example of an EpilogSlurmctld that checks the job exit value by looking at the SLURM_JOB_EXIT_CODE2 environment variable and requeues a job if it exited with value 8. SLURM_JOB_EXIT_CODE2 has the format "exit:sig". The first number is the exit code, typically as set by the exit() function. The second number is the signal that caused the process to terminate, if it was terminated by a signal.

$->cat slurmctldepilog
#!/bin/sh

export PATH=/bin:/home/slurm/linux/bin
LOG=/home/slurm/linux/log/logslurmepilog

echo "Start `date`" >> $LOG 2>&1
echo "Job $SLURM_JOB_ID exitcode $SLURM_JOB_EXIT_CODE2" >> $LOG 2>&1
exitcode=`echo $SLURM_JOB_EXIT_CODE2|awk '{split($0, a, ":"); print a[1]}'` >> $LOG 2>&1
if [ "$exitcode" = "8" ]; then
   echo "Found requeue exit code: $exitcode" >> $LOG 2>&1
   scontrol requeuehold state=SpecialExit $SLURM_JOB_ID >> $LOG 2>&1
   echo $? >> $LOG 2>&1
else
   echo "Job $SLURM_JOB_ID exit all right" >> $LOG 2>&1
fi
echo "Done `date`" >> $LOG 2>&1

exit 0

Using the exitme script as an example, we have it exit with value 8 on the first run. When it is requeued in hold with the SpecialExit state, we touch the file /tmp/myfile and then release the job, which will finish in COMPLETED state.
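The "exit:sig" value can also be split without awk by using shell parameter expansion. The following is a minimal sketch, and the function name parse_exit_code2 is made up for this example.

```shell
#!/bin/sh
# Split a value of the form "exit:sig" (the format of SLURM_JOB_EXIT_CODE2)
# into its exit code and signal number.
parse_exit_code2() {
  exit_part=${1%%:*}     # text before the first colon
  signal_part=${1#*:}    # text after the first colon
  echo "exit=$exit_part signal=$signal_part"
}

parse_exit_code2 "8:0"
```

In an EpilogSlurmctld script the argument would be "$SLURM_JOB_EXIT_CODE2" rather than a literal value.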

56. Can a user's account be changed in the database?
A user's account cannot be changed directly. A new association needs to be created for the user with the new account. Then the association with the old account can be deleted.

# Assume user "adam" is initially in account "physics" and is moving to account "chemistry"
sacctmgr create user name=adam cluster=tux account=chemistry
sacctmgr delete user name=adam cluster=tux account=physics

57. What might account for MPI performance being below the expected level?
Starting the slurmd daemons with limited locked memory can account for this. Adding the line "ulimit -l unlimited" to the /etc/sysconfig/slurm file can fix this.

58. How could some jobs submitted immediately before the slurmctld daemon crashed be lost?
If the slurmctld daemon or the hardware fails before state information reaches disk, that state can be lost. Slurmctld writes state frequently (every five seconds by default), but with large numbers of jobs, the formatting and writing of records can take seconds and recent changes might not be written to disk. Another example is if the state information is written to file, but that information is cached in memory rather than written to disk when the node fails. The interval between state saves being written to disk can be configured at build time by defining SAVE_MAX_WAIT to a different value than five.

59. How do I safely remove partitions?
Partitions should be removed using the "scontrol delete PartitionName=<partition>" command. This is because scontrol will prevent any partitions from being removed that are in use. Partitions need to be removed from the slurm.conf after being removed using scontrol or they will return after a restart. An existing job's partition(s) can be updated with the "scontrol update JobId=<jobid> Partition=<partition(s)>" command. Removing a partition from the slurm.conf and restarting will cancel any existing jobs that reference the removed partitions.

60. Why is Slurm unable to set the CPU frequency for jobs?
First check that Slurm is configured to bind jobs to specific CPUs by making sure that TaskPlugin is configured to either affinity or cgroup. Next check that your processor is configured to permit frequency control by examining the values in the directory /sys/devices/system/cpu/cpu0/cpufreq, where "cpu0" represents CPU ID 0. Of particular interest is the file scaling_available_governors, which identifies the CPU governors available. If "userspace" is not an available CPU governor, this may well be due to the intel_pstate driver being installed. Information about disabling the intel_pstate driver is available from
https://bugzilla.kernel.org/show_bug.cgi?id=57141 and
http://unix.stackexchange.com/questions/121410/setting-cpu-governor-to-on-demand-or-conservative.
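The governor check described above can be scripted. The sketch below looks for "userspace" in the scaling_available_governors file for CPU 0. The function name is made up for this example, and the optional file argument is there only so the logic can be tried on a sample file.

```shell
#!/bin/sh
# Return 0 if "userspace" is among the available governors, 1 if it is not,
# and 2 if the governor list cannot be read (e.g. no cpufreq support exposed).
has_userspace_governor() {
  governors_file=${1:-/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors}
  [ -r "$governors_file" ] || return 2
  for governor in $(cat "$governors_file"); do
    if [ "$governor" = "userspace" ]; then
      return 0
    fi
  done
  return 1
}

rc=0
has_userspace_governor || rc=$?
case $rc in
  0) echo "userspace governor is available" ;;
  1) echo "userspace governor is not listed" ;;
  *) echo "could not read the governor list" ;;
esac
```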

61. How can Slurm be configured to support Intel Xeon Phi (MIC)?
Users should see the Xeon Phi use information above. Slurm configuration details for Xeon Phi offload support are available in Slurm's Generic Resource Guide.

For native mode, slurmd (built for k1om) is started inside the card when the card is booted. The Slurm configuration file, slurm.conf, is the same as on regular compute nodes (by default it is mounted on all regular nodes and all Xeon Phi "nodes" from the same place). Use the "slurmd -C" command to determine the Xeon Phi node configuration with respect to cores, threads per core, memory, etc. The Xeon Phi is by default placed on host's network and connected via the bridge to the rest of the cluster, therefore from a Slurm user's perspective the Xeon Phi can look like one more compute node with a lot of CPUs.

Therefore an administrator can configure the Xeon Phi as a regular node (with the slurmd daemon running on it), as a generic resource (for offload mode), or both. However, Slurm cannot handle the last case nicely since it does not recognize that the Xeon Phi compute node and generic resource represent the same resources. So from an administrator's perspective, the process of configuring MICs can be: install the latest MPSS and Slurm packages from yum/zypper, add new MICs (via console utility or GUI), add MICs to Slurm queues if necessary, restart the host, use MICs via Slurm.

62. When adding a new cluster, how can the Slurm cluster configuration be copied from an existing cluster to the new cluster?
Accounts need to be configured for the cluster. An easy way to copy information from an existing cluster is to use the sacctmgr command to dump that cluster's information, modify it using some editor, then load the new information using the sacctmgr command. See the sacctmgr man page for details, including an example.

63. How can I update Slurm on a Cray DVS file system without rebooting the nodes?
The problem with DVS caching is related to the fact that the dereferenced value of /opt/slurm/default symlink is cached in the DVS attribute cache, and that cache is not dropped when the rest of the VM caches are.

The Cray Native Slurm installation manual indicates that slurm should have a "default" symlink run through /etc/alternatives. As an alternative to that:

  1. Institute a policy that all changes to files which could be open persistently (i.e., .so files) are always modified by creating a new access path. I.e., installations go to a new directory.
  2. Dump the /etc/alternatives stuff, just use a regular symlink, e.g., default points to 15.8.0-1.
  3. Add a new mountpoint on all the compute nodes for /dsl/opt/slurm where the attrcache_timeout attribute is reduced from 14440s to 60s (or 15s -- whatever):
    mount -t dvs /opt/slurm /dsl/opt/slurm -o
    path=/dsl/opt/slurm,nodename=c0-0c0s0n0,loadbalance,cache,ro,attrcache_timeout=15
    In the example above, c0-0c0s0n0 is the single DVS server for the system.

Using this strategy avoids the caching problems, making upgrades simple. One just has to wait for about 20 seconds after changing the default symlinks before starting the slurmds again.

(Information courtesy of Douglas Jacobsen, NERSC, Lawrence Berkeley National Laboratory)

64. How can I rebuild the database hierarchy?
If you see errors of this sort:

error: Can't find parent id 3358 for assoc 1504, this should never happen.

in the slurmctld log file, this is indicative that the database hierarchy information has been corrupted, typically due to a hardware failure or administrator error in directly modifying the database. In order to rebuild the database information, start the slurmdbd daemon with the "-R" option followed by an optional comma separated list of cluster names to operate on.

65. How can a routing queue be configured?
A job submit plugin is designed to have access to a job request from a user, plus information about all of the available system partitions/queues. An administrator can write a C plugin or Lua script to set an incoming job's partition based upon its size, time limit, etc. See the Job Submit Plugin API guide for more information. Also see the available job submit plugins distributed with Slurm for examples (look in the "src/plugins/job_submit" directory).

66. How can I suspend, resume, hold or release all of the jobs belonging to a specific user, partition, etc?
There isn't any filtering by user, partition, etc. available in the scontrol command; however the squeue command can be used to perform the filtering and build a script which you can then execute. For example:

> squeue -u adam -h -o "scontrol hold %i" >hold_script
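The same pattern covers the other actions. As a sketch, the helper below (a made-up function, not part of Slurm) reads job IDs on standard input and prints one scontrol command per job for any of the four actions named in the question. The job IDs in the demo are invented; in practice they would come from a filtered squeue listing such as squeue -u adam -h -o "%i".

```shell
#!/bin/sh
# Read job IDs on stdin (one per line) and print one scontrol command per job.
# The action must be one of the four named in the question.
build_scontrol_script() {
  action=$1
  case "$action" in
    suspend|resume|hold|release) ;;
    *) echo "unsupported action: $action" >&2; return 1 ;;
  esac
  while read -r job_id; do
    if [ -n "$job_id" ]; then
      echo "scontrol $action $job_id"
    fi
  done
  return 0
}

# Demo with made-up job IDs standing in for squeue output.
printf '101\n102\n' | build_scontrol_script release
```

Writing the commands to a file first, as in the example above, leaves room to review them before executing the script.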

67. I had to change a user's UID and now they cannot submit jobs. How do I get the new UID to take effect?
When changing UIDs, you will also need to restart the slurmctld for the changes to take effect. Normally, when adding a new user to the system, the UID is filled in automatically and immediately. If the user isn't known on the system yet, a thread that runs every hour fills in those UIDs when they become known, but it doesn't recognize UID changes of preexisting users, which is why the restart is needed.

Last modified 12 May 2017