pam_slurm_adopt

The purpose of this module is to prevent users from sshing into nodes on which they have no running job, and to track the ssh connection and any processes it spawns, both for accounting and to ensure complete cleanup when the job ends. The module does this by determining which job originated the ssh connection. The user's connection is then "adopted" into the "extern" step of that job.

Installation

Source:

In your Slurm build directory, navigate to slurm/contribs/pam_slurm_adopt/ and run the following as root:

make && make install

This will place pam_slurm_adopt.a, pam_slurm_adopt.la, and pam_slurm_adopt.so in /lib/security/ (on Debian systems) or /lib64/security/ (on RedHat/SuSE systems).

RPM:

The included slurm.spec builds a slurm-pam_slurm RPM, which installs pam_slurm_adopt. Refer to the Quick Start Administrator Guide for instructions on managing an RPM-based install.

PAM Configuration

Add the following line to the appropriate file in /etc/pam.d, such as system-auth or sshd (you may use either the "required" or "sufficient" PAM control flag):

account    sufficient    pam_slurm_adopt.so

The last PAM module in the account stack should be pam_slurm_adopt.so. For example, you might have the following account stack in sshd:

account    required      pam_nologin.so
account    include       password-auth
account    sufficient    pam_slurm_adopt.so

pam_slurm_adopt must be used with the task/cgroup plugin. The pam_systemd module will conflict with pam_slurm_adopt, so you need to disable it. In addition, make sure a different PAM module isn't short-circuiting the account stack before it gets to pam_slurm_adopt.so. For the example above, the following two lines have been commented out in password-auth:

#account    sufficient    pam_localuser.so
#-session   optional      pam_systemd.so

Note: This may involve editing a file that is auto-generated (password-auth in this example). Do not run the config script that generates the file or your changes will be erased.
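
The layout rules above can be checked mechanically. The following is a minimal sketch, not part of Slurm: the function names active_entries and check_account_stack are hypothetical, and the parser handles only the common "type control module" line form (including bracketed controls). It does not follow "include" lines, so each included file (password-auth in the example) has to be checked separately. pam_access.so is tolerated after pam_slurm_adopt.so by default, which is an assumption chosen to match the administrative-group setup described in this section.

```python
def active_entries(pam_text):
    """Return (type, module) pairs for every uncommented PAM entry."""
    entries = []
    for raw in pam_text.splitlines():
        # Everything after "#" is a comment, so commented-out lines vanish.
        fields = raw.split("#", 1)[0].split()
        if len(fields) < 3:
            continue
        # A leading "-" on the type only silences missing-module errors.
        pam_type = fields[0].lstrip("-")
        # The control field may be bracketed and contain spaces,
        # e.g. "[success=1 default=ignore]".
        idx = 1
        if fields[1].startswith("["):
            while idx < len(fields) - 1 and not fields[idx].endswith("]"):
                idx += 1
        if idx + 1 < len(fields):
            entries.append((pam_type, fields[idx + 1]))
    return entries


def check_account_stack(pam_text, allowed_after=("pam_access.so",)):
    """List problems with one PAM file (included files are not followed)."""
    entries = active_entries(pam_text)
    problems = []
    if any(module == "pam_systemd.so" for _, module in entries):
        problems.append("pam_systemd.so is still active")
    account = [module for pam_type, module in entries if pam_type == "account"]
    if "pam_slurm_adopt.so" not in account:
        problems.append("pam_slurm_adopt.so is missing from the account stack")
    else:
        trailing = account[account.index("pam_slurm_adopt.so") + 1:]
        extra = [m for m in trailing if m not in allowed_after]
        if extra:
            problems.append("modules after pam_slurm_adopt.so: " + ", ".join(extra))
    return problems


if __name__ == "__main__":
    sshd = (
        "account    required      pam_nologin.so\n"
        "account    include       password-auth\n"
        "account    sufficient    pam_slurm_adopt.so\n"
    )
    print(check_account_stack(sshd))
```

An empty list means the file passed these particular checks. It does not prove that no earlier module short-circuits the stack, which still needs a manual review of the control flags.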

If you always want to allow access for an administrative group (e.g., wheel), stack the pam_access module after pam_slurm_adopt. A success with pam_slurm_adopt is sufficient to allow access, but the pam_access module can allow others, such as administrative staff, access even without jobs on that node:

account    sufficient    pam_slurm_adopt.so
account    required      pam_access.so

Then edit the pam_access configuration file (/etc/security/access.conf):

+:wheel:ALL
-:ALL:ALL

When access is denied, the user will receive a relevant error message.
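
To illustrate why the two access.conf rules above admit wheel members and reject everyone else, here is a simplified sketch of first-match rule evaluation as pam_access performs it. The function access_allowed is hypothetical and understands only user or group names and the ALL keyword; real pam_access supports many more forms (hosts, networks, EXCEPT, and so on). This stage is reached only when pam_slurm_adopt did not already succeed, since a "sufficient" success ends the stack.

```python
def access_allowed(rules, user, groups):
    """Evaluate access.conf-style rules; the first matching rule decides.

    Returns True or False for a matching "+" or "-" rule, and None when
    no rule matches (this sketch leaves that case undecided).
    """
    for rule in rules:
        permission, who, origin = (part.strip() for part in rule.split(":"))
        names = who.split()
        matches_who = (
            "ALL" in names
            or user in names
            or any(group in names for group in groups)
        )
        # Simplification: only the origin "ALL" is understood here.
        matches_origin = origin == "ALL"
        if matches_who and matches_origin:
            return permission == "+"
    return None


if __name__ == "__main__":
    rules = ["+:wheel:ALL", "-:ALL:ALL"]
    print(access_allowed(rules, "alice", ["wheel"]))  # member of wheel
    print(access_allowed(rules, "bob", ["users"]))    # not in wheel
```

Because evaluation stops at the first match, the order of the two lines matters: placing "-:ALL:ALL" first would deny wheel members as well.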

pam_slurm_adopt Module Options

This module is configurable. Add these options to the end of the pam_slurm_adopt line in the appropriate file in /etc/pam.d/ (e.g., sshd or system-auth):

account sufficient pam_slurm_adopt.so optionname=optionvalue

This module has the following options:

action_no_jobs
    The action to perform if the user has no jobs on the node. Configurable values are:
    ignore
        Do nothing. Fall through to the next PAM module.
    deny (default)
        Deny the connection.

action_unknown
    The action to perform when the user has multiple jobs on the node and the RPC does not locate the source job. If the RPC mechanism works properly in your environment, this option will likely be relevant only when connecting from a login node. Configurable values are:
    newest (default)
        Pick the newest job on the node. The "newest" job is chosen based on the mtime of the job's step_extern cgroup; asking Slurm would require an RPC to the controller. Thus, the memory cgroup must be in use so that the code can check mtimes of cgroup directories. The user can ssh in but may be adopted into a job that exits earlier than the job they intended to check on. The ssh connection will at least be subject to appropriate limits, and the user can be informed of better ways to accomplish their objectives if this becomes a problem.
    allow
        Let the connection through without adoption.
    deny
        Deny the connection.

action_adopt_failure
    The action to perform if the process cannot be adopted into any job for whatever reason. If the process cannot be adopted into the job identified by the callerid RPC, it will fall through to the action_unknown code and try to adopt there. A failure at that point, or a failure when there is only one job, results in this action being taken. Configurable values are:
    allow (default)
        Let the connection through without adoption.
    deny
        Deny the connection.

action_generic_failure
    The action to perform on certain failures, such as the inability to talk to the local slurmd or a kernel that does not offer the correct facilities. Configurable values are:
    ignore (default)
        Do nothing. Fall through to the next PAM module.
    allow
        Let the connection through without adoption.
    deny
        Deny the connection.

log_level
    See SlurmdDebug in slurm.conf for available options. The default log_level is info.
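
As a way to catch typos before they reach a production node, the options and defaults listed above can be encoded in a small validator. This is a sketch under stated assumptions: parse_adopt_line is a hypothetical helper, it knows only the options documented in this section, and it does not check log_level values (those are defined by SlurmdDebug in slurm.conf).

```python
# Allowed values and defaults, transcribed from the option list above.
ALLOWED = {
    "action_no_jobs": {"ignore", "deny"},
    "action_unknown": {"newest", "allow", "deny"},
    "action_adopt_failure": {"allow", "deny"},
    "action_generic_failure": {"ignore", "allow", "deny"},
}
DEFAULTS = {
    "action_no_jobs": "deny",
    "action_unknown": "newest",
    "action_adopt_failure": "allow",
    "action_generic_failure": "ignore",
    "log_level": "info",
}


def parse_adopt_line(line):
    """Return the effective options for one pam_slurm_adopt.so PAM line."""
    fields = line.split()
    if "pam_slurm_adopt.so" not in fields:
        raise ValueError("not a pam_slurm_adopt line")
    options = dict(DEFAULTS)
    # Everything after the module name is an optionname=optionvalue pair.
    for arg in fields[fields.index("pam_slurm_adopt.so") + 1:]:
        name, sep, value = arg.partition("=")
        if not sep or name not in DEFAULTS:
            raise ValueError(f"unknown option: {arg}")
        if name in ALLOWED and value not in ALLOWED[name]:
            raise ValueError(f"bad value for {name}: {value}")
        options[name] = value
    return options


if __name__ == "__main__":
    line = "account sufficient pam_slurm_adopt.so action_no_jobs=ignore log_level=debug"
    print(parse_adopt_line(line))
```

Options that are not given on the line keep the defaults shown in the list, so the returned dictionary always describes the complete effective configuration.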

Slurm Configuration

PrologFlags=contain must be set in slurm.conf. This sets up the "extern" step into which ssh-launched processes are adopted. You must also enable the task/cgroup plugin in slurm.conf. See the Slurm cgroups guide.

Important

PrologFlags=contain must be in place before using this module. The module bases its checks on local steps that have already been launched. If the user has no steps on the node, such as the extern step, the module will assume that the user has no jobs allocated to the node. Depending on your configuration of the PAM module, you might accidentally deny all user ssh attempts without PrologFlags=contain.
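
Since a missing PrologFlags=contain can lock every user out, it may be worth checking slurm.conf before enabling the PAM module. The sketch below assumes a simple slurm.conf with one Key=Value setting per line; slurm_conf_problems is a hypothetical helper, and it does not follow Include directives or handle several settings on one line.

```python
def slurm_conf_problems(conf_text):
    """Check slurm.conf text for the two settings pam_slurm_adopt needs."""
    settings = {}
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        # slurm.conf parameter names are case-insensitive.
        settings[key.strip().lower()] = value.strip()

    problems = []
    flags = [f.strip().lower() for f in settings.get("prologflags", "").split(",")]
    if "contain" not in flags:
        problems.append("PrologFlags does not include contain")
    plugins = [p.strip().lower() for p in settings.get("taskplugin", "").split(",")]
    if "task/cgroup" not in plugins:
        problems.append("TaskPlugin does not include task/cgroup")
    return problems


if __name__ == "__main__":
    example = "PrologFlags=contain\nTaskPlugin=task/affinity,task/cgroup\n"
    print(slurm_conf_problems(example))
```

An empty list means both settings were found in the text that was checked. The running daemons must also have been restarted or reconfigured with that file for the settings to take effect.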

Other Configuration

Verify that UsePAM is set to yes in /etc/ssh/sshd_config (it is usually enabled by default in distribution packages).

The UsePAM option in slurm.conf is not related to this module, and usually should not be set on your cluster.

Firewalls, IP Addresses, etc.

slurmd should be accessible on any IP address from which a user might launch ssh. The RPC to determine the source job must be able to reach the slurmd port on that particular IP address. If there is no slurmd on the source node, such as on a login node, it is better to have the RPC rejected than silently dropped. A rejected connection fails immediately, whereas a dropped one leaves the RPC initiator waiting for a timeout.
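
The difference between a rejected and a dropped connection can be observed from a compute node with a short TCP probe. This is a sketch, not a Slurm tool: probe_port is a hypothetical helper, the host name in the usage comment is a placeholder, and 6818 is the default SlurmdPort, which your site may have changed.

```python
import socket


def probe_port(host, port, timeout=2.0):
    """Classify a TCP connection attempt as 'open', 'refused', or 'timeout'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"      # something is listening (e.g. slurmd)
    except ConnectionRefusedError:
        return "refused"       # rejected at once: the caller fails fast
    except socket.timeout:
        return "timeout"       # silently dropped: the caller had to wait
    except OSError as exc:
        return f"error: {exc}"  # e.g. name resolution or routing failure


# Example usage (hypothetical host name, default slurmd port):
#   probe_port("login01", 6818)
```

A result of "timeout" for a login node suggests that a firewall is dropping packets; changing the rule to reject instead lets the source-job lookup fail quickly.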

Last modified 12 December 2017