Network Configuration Guide
Communication for slurmctld
Communication for slurmdbd
Communication for slurmd
Communication for client commands
Communication for multiple controllers
Communication with multiple clusters
Communication in a federation
A Slurm cluster has many components that need to communicate with each other. Some sites have security requirements that prevent them from opening all communication between machines, so they need to selectively open only the ports that are necessary. This document describes what each component needs in order to talk to the others.
Below is a diagram of a fairly typical cluster, with slurmctld and slurmdbd on separate machines. In smaller clusters, MySQL can run on the same machine as the slurmdbd, but in most cases it is preferable to have it run on a dedicated machine. slurmd runs on the compute nodes and the client commands can be installed and run from machines of your choosing.
The default port used by slurmctld to listen for incoming requests is 6817. This port can be changed with the SlurmctldPort slurm.conf parameter. Slurmctld listens for incoming requests on that port and responds back on the same connection opened by the requestor.
The machine running slurmctld needs to be able to establish outbound connections as well. It needs to communicate with slurmdbd on port 6819 by default (see the slurmdbd section for information on how to change this). It also needs to communicate with slurmd on the compute nodes on port 6818 by default (see the slurmd section for information on how to change this).
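One way to confirm that a firewall allows these outbound connections is to attempt a TCP connection from the machine running slurmctld. The following is a minimal sketch rather than a Slurm tool: the host names are placeholders, and `can_connect` and `check_targets` are hypothetical helpers introduced here for illustration.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_targets(targets: dict[str, tuple[str, int]]) -> dict[str, bool]:
    """Map each named target to whether it is reachable over TCP."""
    return {name: can_connect(host, port) for name, (host, port) in targets.items()}


# Placeholder host names (assumptions) with the default ports described above.
OUTBOUND_FROM_SLURMCTLD = {
    "slurmdbd": ("dbd.example.com", 6819),
    "slurmd": ("node001.example.com", 6818),
}
```

Calling `check_targets(OUTBOUND_FROM_SLURMCTLD)` on the slurmctld machine, with real host names substituted, reports which of the two paths can be opened. A successful connection only shows that the port is reachable, not that the daemon behind it is healthy.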
The default port used by slurmdbd to listen for incoming requests is 6819. This port can be changed with the DbdPort slurmdbd.conf parameter. Slurmdbd listens for incoming requests on that port and responds back on the same connection opened by the requestor.
The machine running slurmdbd needs to be able to reach the MySQL or MariaDB server on port 3306 by default. The listening port is configured on the database side, and the StoragePort slurmdbd.conf parameter tells slurmdbd which port to connect to. It also needs to be able to initiate a connection to slurmctld on port 6817 by default (see the slurmctld section for information on how to change this).
The default port used by slurmd to listen for incoming requests from slurmctld is 6818. This port can be changed with the SlurmdPort slurm.conf parameter.
The machines running srun also use a range of ports to communicate with slurmstepd. By default these ports are chosen at random from the ephemeral port range, but you can use the SrunPortRange slurm.conf parameter to specify a range of ports from which they are chosen. This is necessary for login nodes that are behind a firewall.
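When a firewall rule has to match the configured range, it can help to expand the value and check it programmatically. The sketch below assumes the value is written as `low-high` (for example `60001-63000`); `parse_port_range` is a hypothetical helper, not part of Slurm.

```python
def parse_port_range(value: str) -> range:
    """Parse a 'low-high' string into an inclusive range of TCP ports."""
    low_text, sep, high_text = value.partition("-")
    if not sep:
        raise ValueError(f"expected 'low-high', got {value!r}")
    low, high = int(low_text), int(high_text)
    if not (1 <= low <= high <= 65535):
        raise ValueError(f"invalid port range: {value!r}")
    # range() excludes its upper bound, so add 1 to make it inclusive.
    return range(low, high + 1)


# Example value only; use the range configured at your site.
srun_ports = parse_port_range("60001-63000")
```

The resulting `range` supports membership tests such as `60500 in srun_ports`, which is convenient when comparing the configured range against the ports a firewall actually allows.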
The machines running slurmd need to be able to establish connections with slurmctld on port 6817 by default (see the slurmctld section for information on how to change this).
The majority of the client commands will communicate with slurmctld on port 6817 by default (see the slurmctld section for information on how to change this) to get the information they need. This includes the following commands:
There are also commands that communicate directly with slurmdbd on port 6819 by default (see the slurmdbd section for information on how to change this). The following commands get information from slurmdbd:
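The default paths described so far can be collected into a single table, which is a convenient starting point for writing firewall rules. This is a sketch using only the listening ports stated for each daemon in this guide. It leaves out the srun port range, and the component labels and the `inbound_rules` helper are illustrative names rather than anything defined by Slurm.

```python
# (source, destination, default destination port)
DEFAULT_PATHS = [
    ("slurmctld", "slurmdbd", 6819),
    ("slurmctld", "slurmd", 6818),
    ("slurmdbd", "database", 3306),
    ("slurmdbd", "slurmctld", 6817),
    ("slurmd", "slurmctld", 6817),
    ("client commands", "slurmctld", 6817),
    ("client commands", "slurmdbd", 6819),
]


def inbound_rules(destination: str) -> list[tuple[str, int]]:
    """List (source, port) pairs that must be allowed into `destination`."""
    return sorted(
        (src, port) for src, dst, port in DEFAULT_PATHS if dst == destination
    )
```

For example, `inbound_rules("slurmctld")` lists every component that must be able to open a connection to the controller. Sites that change SlurmctldPort, SlurmdPort, DbdPort, or StoragePort would adjust the table accordingly.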
When a user starts a job using srun, there has to be a communication path from the machine where srun is called to the node(s) allocated to the job. Communication follows the sequence outlined below:
- 1a. srun sends job allocation request to slurmctld
- 1b. slurmctld grants allocation and returns details
- 2a. srun sends step create request to slurmctld
- 2b. slurmctld responds with step credential
- 3. srun opens sockets for I/O
- 4. srun forwards credential with task info to slurmd
- 5. slurmd forwards request as needed (per fanout)
- 6. slurmd forks/execs slurmstepd
- 7. slurmstepd connects I/O and launches tasks
- 8. On task termination, slurmstepd notifies srun
- 9. srun notifies slurmctld of job termination
- 10. slurmctld verifies termination of all processes via slurmd and releases resources for next job
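The sequence above can be encoded as data to see which component initiates a connection to which. This is one reading of the steps, not an authoritative protocol description: it assumes that steps 1b and 2b are replies on the connection srun already opened, and that steps 3 and 6 are local to one machine. The `Step` type and `initiated_paths` helper are hypothetical.

```python
from typing import NamedTuple, Optional


class Step(NamedTuple):
    label: str
    sender: str
    receiver: Optional[str]  # None for steps that stay on one machine
    is_reply: bool = False   # True when sent back on an existing connection


STEPS = [
    Step("1a", "srun", "slurmctld"),
    Step("1b", "slurmctld", "srun", is_reply=True),
    Step("2a", "srun", "slurmctld"),
    Step("2b", "slurmctld", "srun", is_reply=True),
    Step("3", "srun", None),
    Step("4", "srun", "slurmd"),
    Step("5", "slurmd", "slurmd"),
    Step("6", "slurmd", None),
    Step("7", "slurmstepd", "srun"),
    Step("8", "slurmstepd", "srun"),
    Step("9", "srun", "slurmctld"),
    Step("10", "slurmctld", "slurmd"),
]


def initiated_paths(steps: list[Step]) -> set[tuple[str, str]]:
    """Return (initiator, target) pairs that need an open network path."""
    return {
        (s.sender, s.receiver)
        for s in steps
        if s.receiver is not None and not s.is_reply
    }
```

Under these assumptions, the paths that matter for a firewall are those returned by `initiated_paths(STEPS)`. The slurmstepd-to-srun entry is the one that makes a restricted srun port range relevant on login nodes.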
You can configure a secondary slurmctld and/or slurmdbd to serve as a fallback if the primary should go down. The ports involved don't change, but there are additional communication paths that need to be taken into consideration. The client commands need to be able to reach both machines running slurmctld as well as both machines running slurmdbd. Both instances of slurmctld need to be able to reach both instances of slurmdbd and each slurmdbd needs to be able to reach the MySQL server.
Fallback slurmctld and slurmdbd
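The additional paths in a fallback configuration can be enumerated mechanically. The sketch below covers only the paths named in the paragraph above and uses the default ports. All host names are placeholders, and `ha_paths` is a hypothetical helper.

```python
from itertools import product


def ha_paths(
    clients: list[str],
    controllers: list[str],
    dbds: list[str],
    database: str,
) -> set[tuple[str, str, int]]:
    """Return (source, destination, port) triples for a fallback setup."""
    paths = set()
    # Client commands must reach every slurmctld and every slurmdbd.
    for client, ctld in product(clients, controllers):
        paths.add((client, ctld, 6817))
    for client, dbd in product(clients, dbds):
        paths.add((client, dbd, 6819))
    # Every slurmctld must reach every slurmdbd.
    for ctld, dbd in product(controllers, dbds):
        paths.add((ctld, dbd, 6819))
    # Every slurmdbd must reach the database server.
    for dbd in dbds:
        paths.add((dbd, database, 3306))
    return paths


# Placeholder host names (assumptions).
example = ha_paths(["login1"], ["ctl1", "ctl2"], ["dbd1", "dbd2"], "db1")
```

With one client machine, two controllers, and two slurmdbd hosts, this yields ten distinct paths, which shows how quickly the rule set grows once fallback daemons are added.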
In environments where multiple slurmctld instances share the same slurmdbd, you can configure each cluster to stand on its own and allow users to specify a cluster to submit their jobs to. Ports used by the different daemons don't change, but all instances of slurmctld need to be able to communicate with the same instance of slurmdbd. You can read more about multi-cluster configurations in the Multi-Cluster Operation documentation.
Slurm also provides the ability to schedule jobs in a peer-to-peer fashion between multiple clusters, allowing jobs to run on the cluster that has available resources first. The difference in communication needs between this and a multi-cluster configuration is that the two instances of slurmctld need to be able to communicate with each other. There are more details about using a Federation in the documentation.
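The extra requirement in a federation, that the controllers can reach each other, can be expressed as every ordered pair of slurmctld hosts. This sketch assumes each controller listens on the default SlurmctldPort of 6817, which may not hold at every site. The host names and the `federation_paths` helper are hypothetical.

```python
from itertools import permutations


def federation_paths(
    controllers: list[str], port: int = 6817
) -> set[tuple[str, str, int]]:
    """Return (source, destination, port) for every ordered controller pair."""
    return {(src, dst, port) for src, dst in permutations(controllers, 2)}


# Placeholder host names (assumptions).
pair = federation_paths(["ctlA", "ctlB"])
```

For two clusters this produces one path in each direction. These paths are in addition to the slurmctld-to-slurmdbd paths that a multi-cluster configuration already requires.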
Last modified 21 October 2020