Slurm Checkpoint/Restart with BLCR

Overview

Slurm is integrated with Berkeley Lab Checkpoint/Restart (BLCR) in order to provide automatic job checkpoint/restart support. Functionality provided includes:

  1. Checkpoint of whole batch jobs in addition to job steps
  2. Periodic checkpoint of batch jobs and job steps
  3. Restart execution of batch jobs and job steps from checkpoint files
  4. Automatically requeue and restart the execution of batch jobs upon node failure

The general mode of operation is to

  1. Start the job step using the srun_cr command as described below.
  2. Create a checkpoint of srun_cr using BLCR's cr_checkpoint command and cancel the job. srun_cr will automatically checkpoint your job.
  3. Restart srun_cr using BLCR's cr_restart command. The job will be restarted using a newly allocated jobid.
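
The cycle above can be sketched as a shell session. This is an illustrative sequence only: the program name, task count, and job ID are placeholders, and cr_checkpoint/cr_restart are BLCR commands, not Slurm commands (by default cr_checkpoint writes a file named "context.<pid>" in the current directory).

```shell
# 1. Launch the job step under the wrapper and note its PID.
srun_cr -n 4 ./my_app &
SRUN_CR_PID=$!

# 2. Checkpoint srun_cr (it forwards the request to the tasks),
#    then cancel the job. <jobid> is a placeholder.
cr_checkpoint $SRUN_CR_PID
scancel <jobid>

# 3. Restart from the saved context file; Slurm allocates a new job ID.
cr_restart context.$SRUN_CR_PID
```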

NOTE: checkpoint/blcr cannot restart interactive jobs. It can create checkpoints for both interactive and batch steps, but only batch jobs can be restarted.

NOTE: BLCR operation has been verified with MVAPICH2. Some other MPI implementations should also work.

User Commands

The following documents the Slurm changes specific to BLCR support. Basic familiarity with Slurm commands is assumed.

srun

Several options have been added to support checkpoint restart:

  • --checkpoint: Specifies the interval between creating checkpoints of the job step. By default, the job step will have no checkpoints created. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".
  • --checkpoint-dir: Specifies the directory where the checkpoint image files of a job step will be stored. The default value is the current working directory. Checkpoint files will be of the form "<job_id>.ckpt" for jobs and "<job_id>.<step_id>.ckpt" for job steps.
  • --restart-dir: Specifies the directory from which the checkpoint image files of a job step will be read.
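
For example, the options above might be combined as follows (the directory path, program name, and task count are illustrative placeholders):

```shell
# Checkpoint the step every 30 minutes, writing images under /scratch/ckpt.
srun_cr --checkpoint=30 --checkpoint-dir=/scratch/ckpt -n 8 ./my_app

# Later, restart the step from the previously written images.
srun_cr --restart-dir=/scratch/ckpt -n 8 ./my_app
```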

Environment variables are available for all of these options:

  • SLURM_CHECKPOINT is equivalent to --checkpoint
  • SLURM_CHECKPOINT_DIR is equivalent to --checkpoint-dir
  • SLURM_RESTART_DIR is equivalent to --restart-dir

The environment variable SLURM_SRUN_CR_SOCKET is used for job step logic to interact with the srun_cr command.
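
The same settings can be expressed through the environment rather than on the command line; the values and paths below are illustrative:

```shell
export SLURM_CHECKPOINT=1:00:00          # checkpoint every hour
export SLURM_CHECKPOINT_DIR=/scratch/ckpt
export SLURM_RESTART_DIR=/scratch/ckpt   # only needed when restarting
srun_cr -n 8 ./my_app
```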

srun_cr

srun_cr is a wrapper around the srun command that enables checkpoint/restart of tasks launched by srun when used with Slurm's checkpoint/blcr plugin. Its design is inspired by mpiexec_cr from MVAPICH2 and cr_restart from BLCR.

The srun_cr command line options are identical to those of the srun command. See "man srun" for details.

After initialization, srun_cr registers a thread context callback function, then forks a process and executes "cr_run --omit srun" with its arguments. cr_run is employed to exclude the srun process itself from being dumped upon checkpoint. All catchable signals except SIGCHLD sent to srun_cr are forwarded to the child srun process. SIGCHLD is captured so that srun_cr can mimic the exit status of srun when it exits. srun_cr then loops, waiting for termination of the tasks launched by srun.

The step launch logic of Slurm is augmented to check whether srun is running under srun_cr. If so, the environment variable SLURM_SRUN_CR_SOCKET should be present; its value is the address of a Unix domain socket created and listened on by srun_cr. After launching the tasks, srun tries to connect to this socket and sends the job ID, step ID, and the nodes allocated to the step to srun_cr.

Upon checkpoint, srun_cr checks to see if the tasks have been launched. If so, srun_cr first forwards the checkpoint request to the tasks by calling the Slurm API slurm_checkpoint_tasks() before dumping its own process context.

Upon restart, srun_cr checks to see if the tasks have been previously launched and checkpointed. If true, the environment variable SLURM_RESTART_DIR is set to the directory of the checkpoint image files of the tasks. Then srun is forked and executed again. The environment variable will be used by the srun command to restart execution of the tasks from the previous checkpoint.

sbatch

Several options have been added to support checkpoint restart:

  • --checkpoint: Specifies the interval between periodic checkpoints of a batch job. By default, the job will have no checkpoints created. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".
  • --checkpoint-dir: Specifies the directory where the checkpoint image files of a batch job will be stored. The default value is the current working directory. Checkpoint files will be of the form "<job_id>.ckpt" for jobs and "<job_id>.<step_id>.ckpt" for job steps.
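
A submission using these options might look like the following; the script name, interval, and directory are placeholders:

```shell
# Submit a batch job that is checkpointed every 2 hours.
sbatch --checkpoint=2:00:00 --checkpoint-dir=/scratch/ckpt job.sh
```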

Environment variables are available for all of these options:

  • SLURM_CHECKPOINT is equivalent to --checkpoint
  • SLURM_CHECKPOINT_DIR is equivalent to --checkpoint-dir

scontrol

scontrol is used to initiate checkpoint/restart requests.

  • scontrol checkpoint create jobid [ImageDir=dir] [MaxWait=seconds]
    Requests a checkpoint on a specific job. For backward compatibility, if a job id is specified, all job steps of it are checkpointed. If a batch job id is specified, the entire job is checkpointed including the batch shell and all running tasks of all job steps. Upon checkpoint, the task launch command must forward the requests to tasks it launched.
    • ImageDir specifies the directory in which to save the checkpoint image files. If specified, this takes precedence over any --checkpoint-dir option specified when the job or job step were submitted.
    • MaxWait specifies the maximum time permitted for a checkpoint request to complete. The request will be considered failed if not completed in this time period.
  • scontrol checkpoint create jobid.stepid [ImageDir=dir] [MaxWait=seconds]
    Requests a checkpoint on a specific job step.
  • scontrol checkpoint restart jobid [ImageDir=dir] [StickToNodes]
    Restarts a previously checkpointed batch job.
    • ImageDir specifies the directory from which to read the checkpoint image files.
    • StickToNodes specifies that the job should be restarted on the same set of nodes from which it was previously checkpointed.
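
Putting the scontrol subcommands together, a checkpoint/restart sequence might look like this (the job ID 1234 and the directory are placeholders):

```shell
# Checkpoint job 1234 (all of its job steps), allowing up to 60 seconds.
scontrol checkpoint create 1234 ImageDir=/scratch/ckpt MaxWait=60

# Checkpoint only step 0 of that job.
scontrol checkpoint create 1234.0

# Restart the checkpointed batch job on its original set of nodes.
scontrol checkpoint restart 1234 ImageDir=/scratch/ckpt StickToNodes
```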

Configuration

The following Slurm configuration parameter has been added:

  • JobCheckpointDir Specifies the default directory for storing or reading job checkpoint information. The data stored here is only a few thousand bytes per job and includes the information needed to resubmit the job request, not the job's memory image. The directory must be readable and writable by SlurmUser, but not writable by regular users. The job memory images may be in a different location, as specified by the --checkpoint-dir option at job submit time or scontrol's ImageDir option.
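
A possible slurm.conf fragment setting this parameter (the path is site-specific and purely illustrative):

```
# slurm.conf: where Slurm keeps per-job checkpoint/requeue metadata.
# Must be readable/writable by SlurmUser, not writable by regular users.
JobCheckpointDir=/var/spool/slurm/checkpoint
```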

Last modified 12 August 2013