SMD

Section: Slurm components (1)
Updated: February 2014

 

NAME

smd - Used to manage failures in a resource allocation.

 

SYNOPSIS

smd [OPTIONS...] [job_id]

 

DESCRIPTION

Slurm command used to manage failures in a resource allocation.

 

OPTIONS

-c, --show-config
Shows the configuration of smd.
-d, --drain-node node_name
Drains the specified host of the job allocation. A reason must be supplied with the -R option.
-D, --drop_node node_name
Drops the failed or failing host.
-e, --extend-time minutes
Extends the run time of the job by the specified number of minutes.
-f, --faulty-nodes node_name
Lists the hosts of the job that have failed or are failing.
-j, --job_info
Displays information about the specified job.
-r, --replace-node node_name
Replaces the specified failed or drained host with a new one.
-v, --verbose
Prints detailed event logging. Multiple -v options further increase the verbosity. By default only errors are displayed.
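
The job ID is normally supplied as the final argument, as in the EXAMPLES below. As a minimal illustration (output omitted, since its exact format may vary between Slurm versions), job information could be queried with increased verbosity as follows:

       $ smd -j $SLURM_JOBID
       $ smd -j -vv $SLURM_JOBID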

 

EXAMPLES

Display the smd configuration.
        > smd -c
        System Configuration:
        ConfigurationFile: /etc/nonstop.conf
        ControllerAddress: localhost
        LibraryDebug: 0
        ControllerPort: 9114
        ReadTimeout: 10000
        WriteTimeout: 10000
        HotSpareCount: "debug:0"
        MaxSpareNodeCount: 10
        TimeLimitDelay: 600
        TimeLimitDrop: 0
        TimeLimitExtend: 2
        UserDrainAllow: "alan,brenda"
        UserDrainDeny: "none"

Replace a failed node in a job allocation and extend its time limit.

       $ salloc -N4 --no-kill bash
       salloc: Granted job allocation 67
       $ squeue
       JOBID PARTITION  NAME  USER  ST  TIME  NODES NODELIST(REASON)
          67     debug  bash jette   R  0:48      4 tux[0-3]
       salloc: error: Node failure on tux2
       $ smd -f $SLURM_JOBID
       Job 67 has 1 failed or failing hosts:
         node tux2 cpu_count 1 state FAILED
       $ smd -r tux2 $SLURM_JOBID
       Job 67 got node tux2 replaced with node tux4
       $ squeue
       JOBID PARTITION  NAME  USER  ST  TIME  NODES NODELIST(REASON)
          67     debug  bash jette   R  0:48      4 tux[0-1,3-4]
       $ smd -e 2 $SLURM_JOBID
       Job 67 run time increased by 2min successfully

Identify a failing node in a job allocation, drop it from the job allocation, and extend the job time limit.

       $ salloc -N4 --no-kill bash
       salloc: Granted job allocation 69
       $ squeue
       JOBID PARTITION  NAME  USER  ST  TIME  NODES NODELIST(REASON)
          69     debug  bash jette   R  0:48      4 tux[0-3]
       $ smd -d tux2 -R "Application X hangs" $SLURM_JOBID
       Job 69 node tux2 is being drained
       $ smd -f $SLURM_JOBID
       Job 69 has 1 failed or failing hosts:
         node tux2 cpu_count 1 state FAILING
       $ smd -D tux2 $SLURM_JOBID
       Job 69 node tux2 dropped successfully
       $ squeue
       JOBID PARTITION  NAME  USER  ST  TIME  NODES NODELIST(REASON)
          69     debug  bash jette   R  0:48      4 tux[0-1,3]
       $ smd -e 2 $SLURM_JOBID
       Job 69 run time increased by 2min successfully

 

COPYING

Copyright (C) 2013-2014 SchedMD LLC. All rights reserved.

Slurm is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

 

SEE ALSO

nonstop.conf(5)


 
