Metrics Guide
Contents
- Metrics Overview
- Configuration
- HTTP Endpoints
- OpenMetrics Plugin
- Metric Categories Provided by Slurm
- Security Considerations
- Performance Impact
- Usage Examples
Metrics Overview
Slurm 25.11 introduced a comprehensive system for collecting and exposing metrics related to cluster resources, job states, and scheduler performance. The metrics system exposes real-time data about various Slurm entities through HTTP endpoints provided by the slurmctld daemon.
The metrics feature enables integration with popular monitoring systems like Prometheus, Grafana, and other observability tools.
Configuration
Prerequisites
The metrics feature requires specific configuration in slurm.conf:
- PrivateData must not be set: The metrics feature is disabled if any PrivateData parameter is configured in slurm.conf. This is a security requirement to prevent exposure of sensitive information through metrics.
- MetricsType parameter: Set the MetricsType parameter to specify which metrics plugin to use. Currently, only the OpenMetrics plugin is supported:
MetricsType=metrics/openmetrics
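As a minimal sketch, the relevant slurm.conf excerpt could look like the following. Note that no PrivateData parameter may appear anywhere in the file, or the metrics feature is disabled:
# slurm.conf excerpt (illustrative)
# No PrivateData parameter may be set anywhere in this file.
MetricsType=metrics/openmetrics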
Plugin Loading
The metrics plugin is automatically loaded by slurmctld when the MetricsType parameter is configured.
HTTP Endpoints
Slurm exposes metrics through HTTP GET endpoints on the slurmctld daemon's listening port (default 6817). The following endpoints are available:
- GET /metrics - Print available metric endpoints
- GET /metrics/jobs - Job-related metrics including counts by state, resource allocation, and job statistics
- GET /metrics/jobs-users-accts - User- and account-specific job metrics
- GET /metrics/nodes - Node-related metrics including resource counts, states, and utilization
- GET /metrics/partitions - Partition-related metrics including job counts per partition and resource allocation
- GET /metrics/scheduler - Scheduler performance metrics including cycle times, backfill statistics, and queue lengths
All endpoints return data in UTF-8 text format, making them compatible with Prometheus and other monitoring systems.
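As a quick check, the endpoint index can be queried with curl; the response lists the metric endpoints described above (the hostname and port below are placeholders for your slurmctld):
$ curl -s http://slurmctld.example.com:6817/metrics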
OpenMetrics Plugin
The OpenMetrics plugin implements the OpenMetrics 1.0 specification, ensuring compatibility with Prometheus and other monitoring systems that consume metrics in this format.
Metric Format
Each metric follows the OpenMetrics format with the following components:
- Metric name: A descriptive name prefixed with "slurm_"
- Metric type: Only "gauge" metrics are exposed
- Metric value: The actual numeric value
- Labels: Optional key-value pairs for additional context
- Help text: Human-readable description of the metric
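Putting these components together, a single exposed metric looks like the following (the metric name, label, value, and help text are illustrative):
# HELP slurm_node_cpus Number of CPUs on the node
# TYPE slurm_node_cpus gauge
slurm_node_cpus{node="node001"} 64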
Metric Categories Provided by Slurm
Each endpoint provides a set of metrics related to the same general category. Numerous metrics are provided, so not all of them are documented on this page. A few examples are given for each category in the following subsections.
Job Metrics
Job metrics provide information about job states, resource allocation, and job counts. Examples include:
- slurm_jobs - Total number of jobs
- slurm_jobs_running - Number of running jobs
- slurm_jobs_pending - Number of pending jobs (see note)
- slurm_jobs_cpus_alloc - Total CPUs allocated to jobs
- slurm_jobs_memory_alloc - Total memory allocated to jobs
NOTE: In Slurm, pending jobs include both jobs waiting for resources and held jobs. Held jobs will not be scheduled until the hold is released.
User- and Account-Specific Job Metrics
Job metrics for users and accounts provide a count of jobs in each state for each active user and account in the system. Each entity is identified by a key-value label pair. Remember that every unique key-value pair represents a new time series, which can dramatically increase the amount of data stored (a mitigation sketch follows the examples below). Examples include:
- slurm_user_jobs_pending{username="john"} - Pending jobs for user "john"
- slurm_account_jobs_pending{account="smith"} - Pending jobs for account "smith"
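Because these series are unbounded, it may be worth limiting them at scrape time. If, for example, only account-level series are wanted, per-user series can be dropped with Prometheus metric_relabel_configs, as in this sketch (the regex is an assumption based on the metric names above; see the Prometheus configuration under Usage Examples for context):
scrape_configs:
  - job_name: 'slurm_useracct'
    static_configs:
      - targets: ['slurm.example.com:6817']
    metrics_path: '/metrics/jobs-users-accts'
    metric_relabel_configs:
      # Drop per-user series; keep account-level series.
      - source_labels: [__name__]
        regex: 'slurm_user_jobs_.*'
        action: drop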
Node Metrics
Node metrics track resource availability, node states, and utilization. Examples include:
- slurm_nodes - Total number of nodes
- slurm_nodes_idle - Number of idle nodes
- slurm_nodes_alloc - Number of allocated nodes
- slurm_node_cpus{node="nodename"} - CPUs on the specified node
- slurm_node_memory_bytes{node="nodename"} - Memory on the specified node (bytes)
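Per-node series carry a node label, so a single node can be inspected with a simple filter such as this sketch (the node name is a placeholder):
$ curl -s http://slurmctld.example.com:6817/metrics/nodes | grep 'node="node001"'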
Partition Metrics
Partition metrics show job distribution and resource allocation across partitions. Examples include:
- slurm_partitions - Total number of partitions
- slurm_partition_jobs{partition="name"} - Jobs on the specified partition
- slurm_partition_nodes{partition="name"} - Nodes on the specified partition
The following metrics might be useful for Slinky or other systems with an auto-scale feature. Knowing the maximum number of nodes requested by any pending job in a partition lets such a system decide whether to extend the partition by that number of nodes (see the polling sketch after the list below). Jobs which are held are not included in these metrics.
- slurm_partition_jobs_max_job_nodes_nohold - The maximum number of nodes requested by any job, across all pending jobs in the partition that are not held
- slurm_partition_jobs_min_job_nodes_nohold - The maximum of the minimum number of nodes requested by any job, across all pending jobs in the partition that are not held
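For example, an auto-scaler could poll this value for a given partition with a shell sketch like the following (the partition name "batch" is a placeholder, and a production system would use a proper OpenMetrics parser rather than grep/awk):
$ curl -s http://slurmctld.example.com:6817/metrics/partitions \
    | grep 'slurm_partition_jobs_max_job_nodes_nohold{partition="batch"}' \
    | awk '{print $NF}'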
Scheduler Metrics
Scheduler metrics provide insights into scheduling performance and behavior. Examples include:
- slurm_sched_cycle_cnt - Scheduling cycle count
- slurm_sched_cycle_last - Last scheduling cycle time
- slurm_bf_cycle_cnt - Backfill cycle count
- slurm_bf_active - Whether the backfill scheduler is currently running
Security Considerations
The metrics system has several important security implications:
- No authentication: There is no built-in authentication mechanism for metrics endpoints. Anyone with network access to the slurmctld port can query metrics.
- PrivateData dependency: The metrics feature is automatically disabled when any PrivateData parameter is set in slurm.conf to prevent exposure of sensitive information.
- Network access: Metrics are exposed through the slurmctld network interface. Consider firewall rules and network segmentation to control access.
- Information disclosure: Metrics may reveal information about cluster utilization, job patterns, and user activity that could be considered sensitive in some environments.
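One common mitigation is to place a reverse proxy in front of the metrics endpoints and restrict access there. The following nginx sketch assumes slurmctld accepts plain HTTP on its listening port, as described above; the port, subnet, and hostname are placeholders:
# nginx sketch: restrict metrics access to a monitoring subnet
server {
    listen 9100;
    location /metrics {
        allow 10.0.0.0/24;   # monitoring hosts only (placeholder subnet)
        deny  all;
        proxy_pass http://slurmctld.example.com:6817;
    }
}
Note that the slurmctld port also carries Slurm's own client and daemon traffic, so it cannot simply be firewalled off; the proxy only helps for metrics consumers that are pointed at it instead of at slurmctld directly.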
Performance Impact
Metrics collection and exposition can impact slurmctld performance:
- Lock contention: Querying metrics requires acquiring various locks within slurmctld, which can impact scheduler performance during high-frequency queries.
- Data collection overhead: Metrics are collected in real-time from slurmctld's internal data structures, which adds computational overhead.
- Data processing overhead in the external monitoring system: Endpoints are provided for unbounded sets of entities, such as user and account metrics. A monitoring system may treat each entity as a new time series, which can dramatically increase the amount of data stored.
- Network I/O: Frequent metric queries generate network traffic and consume slurmctld's network I/O capacity, especially on systems with thousands of jobs, users, or accounts.
To minimize performance impact:
- Configure appropriate scrape intervals in monitoring systems (e.g., 60-120 seconds)
- Use caching mechanisms in monitoring systems when possible
- Monitor slurmctld performance when enabling metrics
- Avoid storing data from unbounded metric endpoints such as /metrics/jobs-users-accts in your monitoring system
Usage Examples
Basic curl Examples
Query job metrics:
$ curl http://slurmctld.example.com:6817/metrics/jobs
# HELP slurm_jobs Total number of jobs
# TYPE slurm_jobs gauge
slurm_jobs 42
# HELP slurm_jobs_running Number of jobs in Running state
# TYPE slurm_jobs_running gauge
slurm_jobs_running 15
# HELP slurm_jobs_pending Number of jobs in Pending state
# TYPE slurm_jobs_pending gauge
slurm_jobs_pending 27
...
Query node metrics:
$ curl http://slurmctld.example.com:6817/metrics/nodes
# HELP slurm_nodes Total number of nodes
# TYPE slurm_nodes gauge
slurm_nodes 100
# HELP slurm_nodes_idle Number of nodes in Idle state
# TYPE slurm_nodes_idle gauge
slurm_nodes_idle 85
# HELP slurm_nodes_alloc Number of nodes in Allocated state
# TYPE slurm_nodes_alloc gauge
slurm_nodes_alloc 15
...
Prometheus Configuration
Configure Prometheus to scrape Slurm metrics by adding the following to your prometheus.yml:
scrape_configs:
- job_name: 'slurm_jobs'
static_configs:
- targets: ['slurm.example.com:6817']
metrics_path: '/metrics/jobs'
- job_name: 'slurm_nodes'
static_configs:
- targets: ['slurm.example.com:6817']
metrics_path: '/metrics/nodes'
- job_name: 'slurm_partitions'
static_configs:
- targets: ['slurm.example.com:6817']
metrics_path: '/metrics/partitions'
- job_name: 'slurm_scheduler'
static_configs:
- targets: ['slurm.example.com:6817']
metrics_path: '/metrics/scheduler'
- job_name: 'slurm_useracct'
static_configs:
- targets: ['slurm.example.com:6817']
metrics_path: '/metrics/jobs-users-accts'
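To follow the scrape-interval advice from the Performance Impact section, a global interval can also be set at the top of the same file, for example:
global:
  scrape_interval: 60s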