Slurm User Group Meeting 2023

The Slurm User Group Meeting (SLUG'23) will be held in person this fall at Brigham Young University on September 12-13, 2023.

Registration

Registration includes the Monday evening welcome reception (more details to come) and both days of main conference activity. All meals will be provided on Tuesday, September 12th with breakfast and lunch being provided on Wednesday, September 13th. Note that coffee is not provided on campus, so be sure to get your morning caffeine before arriving.

Registration fees are:

$700 for early bird registration, ending 16 June 2023
$900 for standard registration, ending 28 July 2023
$1,100 for late registration, ending 1 September 2023

Travel

Provo, Utah has its own airport (PVU) that attendees can fly into. Given the small size of both the airport and the airlines serving PVU, we suggest attendees also consider flights into Salt Lake City (SLC).

Hotels

Residence Inn Provo North: This location has a complimentary shuttle that attendees can schedule for rides to BYU. The hotel is in a quieter part of town and has a river with a walking trail along the back of the property. A nearby shopping center is within walking distance and offers popular food options and a Neighborhood Walmart.

Provo Marriott Hotel and Conference Center: This hotel is located in the heart of downtown Provo. It does not have a shuttle, but it is an easy and beautiful quarter-mile walk to the bus station. The bus is complimentary and runs about every 8 minutes; the ride to campus takes 12-15 minutes. Being downtown, the hotel has a number of shops, restaurants, parks, a rec center, etc. nearby.

Hyatt Place Provo: The Hyatt is across the street from the Provo Marriott Hotel and Conference Center. Staying here would put you in downtown Provo, a quarter mile from the bus stop for the bus that runs to campus.

Schedule

All times are US Mountain Daylight Time (UTC-6)

The main venue for SLUG'23 is the Harman Continuing Education Building at BYU, in Room 2258/2260. Parking is available in the lot to the west; please make sure to use spaces designated for guests at the Harman Building. Lunch and snacks will be provided in the space adjacent to the conference rooms. Venues for the (optional) welcome reception on Monday night and for dinner on Tuesday night are included in the agenda below.

Monday, 11 September 2023

Time	Speaker	Title
18:00 - 20:00	Welcome Reception, Provo Marriott Hotel and Conference Center

Tuesday, 12 September 2023

Time	Speaker	Title
9:00 - 9:05	Auble – SchedMD	Welcome
9:05 - 10:00	David Jarvis – BYU	Keynote — Improving quinoa through the development of genetic and genomic resources
10:00 - 10:30	Break
10:30 - 11:00	Jacobsen and Samuel – NERSC	Never use Slurm HA again: Solve all your problems with Kubernetes
11:00 - 11:30	Younts – Guardant	Guardant Health Site Report
11:30 - 12:00	Hilton – SchedMD	Native Container Support
12:00 - 13:00	Lunch
13:00 - 13:30	Pratt and Feldman – CoreWeave	Providing the Power of Slurm on CoreWeave's Serverless Kubernetes Architecture
13:30 - 14:00	Byun – LLSC	Optimizing Diverse Workloads and Resource Usage with Slurm
14:00 - 14:30	Rini – SchedMD	State of the Slurm REST API
14:30 - 15:00	Break
15:00 - 15:30	Eyrich (Google) and Fryer (Recursion)	Build a flexible and powerful High Performance Computing foundation with Google Cloud
15:30 - 16:00	Fazio – Dow	Demand Driven Cluster Elasticity
16:00 - 17:00	Booth – SchedMD	Field Notes 7 – How to make the most of Slurm and avoid common issues
18:30 - 20:30	Dinner, The Skyroom, Ernest L. Wilkinson Student Center at BYU, 6th Floor

Wednesday, 13 September 2023

Time	Speaker	Title
9:00 - 9:30	Markuske – SDSC	Accelerating Genomics Research Machine Learning with Slurm
9:30 - 10:00	Nielsen – DTU	Saving Power with Slurm
10:00 - 10:30	Break
10:30 - 11:00	Day – LLNL	Running Flux in Slurm
11:00 - 11:30	Marani – CINECA	CINECA experience with Slurm
11:30 - 12:00	Christiansen – SchedMD	Step Management Enhancements
12:00 - 13:00	Lunch
13:00 - 13:30	Hafener – LANL	Simulation of Cluster Scheduling Behavior Using Digital Twins
13:30 - 14:00	Jezghani – Georgia Tech	PACE Site Report
14:00 - 14:30	Skjerven and Vaughn – AWS	Building Blocks in the Cloud: Scaling LEGO engineering with Slurm and AWS Parallel Cluster
14:30 - 15:00	Break
15:00 - 16:30	Wickberg – SchedMD	Slurm 23.02, 23.11, and Beyond (Roadmap); Open Forum

Abstracts

Building Blocks in the Cloud: Scaling LEGO engineering with Slurm and AWS Parallel Cluster

Brian Skjerven and Matt Vaughn, AWS

AWS ParallelCluster is a tool that enables R&D customers and their IT administrators to design and operate powerful and elastic HPC clusters on AWS. In this talk, we'll introduce ParallelCluster through the lens of LEGO engineering, who use ParallelCluster and Slurm to scale their simulations that support structural analysis and material science research. We'll discuss the overall hybrid HPC architecture that LEGO has built, with a particular focus on how Slurm works to extend their existing cluster. We'll also detail how LEGO handles the messy business of software license management for commercial applications in this hybrid environment, all with Slurm's help.

CINECA experience with Slurm

Alessandro Marani, CINECA

The Italian supercomputing center CINECA has used Slurm as its first-choice resource scheduler since 2016, deploying it on many top-tier HPC clusters, including the latest arrival, Leonardo, ranked 4th in the current Top500. In this report we discuss how we take advantage of Slurm's features to meet the needs of different communities sharing the same environment, and what customizations we implemented to resolve some complex situations. By sharing our successes and our difficulties, we hope to help inspire new features that would be useful to us and to other sites.

Providing the Power of Slurm on CoreWeave's Serverless Kubernetes Architecture

Navarre Pratt and Jacob Feldman, CoreWeave

CoreWeave is a premium cloud provider specializing in high performance GPU-powered workloads for AI/ML, batch processing, and scientific discovery. CoreWeave deploys massive scales of compute and some of the largest dedicated training clusters on the planet, all on top of Kubernetes. As the top choice for scheduling and managing HPC workloads, Slurm is a must-have solution for utilizing compute at this scale for batch workloads. In this talk, we will present the soon to be open-sourced Slurm on Kubernetes (SUNK) solution, a project in collaboration with SchedMD, that brings Kubernetes containerized deployments and Slurm together to provide the ultimate computing platform. We will discuss how SUNK was developed, its range of capabilities, and the role it played in the record-breaking MLPerf submission we completed with NVIDIA.

Demand Driven Cluster Elasticity

Mike Fazio, The Dow Chemical Company

An elastically scalable cluster can be a critical component in minimizing the time from job submission to execution. Few organizations have the resources on-premises to meet the peak demand on their supercomputer resources. Utilizing consumption-based compute to augment existing resources allows variable demand to be met while maintaining affordability. A turnkey High Performance Computing (HPC) on-demand service offers a low barrier to entry with minimal upskilling, but it ultimately made it difficult to meet the needs of our organization. Utilizing a strategic set of tools allows the delivery of an elastically scalable cluster with a unified entry point for users while maintaining control of proprietary data. This talk will cover Dow's journey into hybrid on-premises/cloud HPC to provide researchers seamless access to computational resources.

Guardant Health Site Report

Alex Younts, Guardant Health

Guardant Health is a life sciences company based in Palo Alto, CA, and we endeavor to bring our products to everyone around the world in the fight against cancer. Our proprietary bioinformatics pipeline was originally developed to run on Sun Grid Engine. We began a transition to Slurm after a successful proof-of-concept engagement with SchedMD. Our goal was to make it possible to compute anywhere by using a federation of our on-premises clusters and the cloud. We will present interesting details of our Slurm architecture, our results so far, and how we evangelized Slurm to our users and developers.

Simulation of Cluster Scheduling Behavior Using Digital Twins

Vivian Hafener, Los Alamos National Laboratory

The ability to accurately simulate the impact of changes to a system's scheduler configuration on the performance of a system is a capability that can guide decisions in the administration of HPC systems, provide recommendations to improve system performance, and validate the impact that proposed changes will have on a system prior to deployment. This presentation introduces a suite of tools based on a modified version of the open source BatSim simulation platform. These tools can be used to evaluate the scheduling performance of a system, to examine the impact of scheduling policy changes on jobs of different types, and to evaluate the impact of scheduled maintenance or other reservations on the job flow of the system. They use workload files generated from historical Slurm logs to evaluate the impact of such changes on a "digital twin" of the physical cluster, with an identical cluster configuration, job details, and scheduling policy. These tools are being used to inform LANL's production HPC operations and are under active development and enhancement. This illustration-rich presentation shows the breadth and applicability of the tools and techniques developed to date. A goal of this presentation is to solicit questions of interest which we could incorporate into this body of open-source work.

Optimizing Diverse Workloads and System Resource Usage

Chansup Byun, Lincoln Laboratory Supercomputing Center (LLSC)

At the Lincoln Laboratory Supercomputing Center (LLSC), we have very diverse workloads ranging from various machine learning and artificial intelligence applications to traditional high performance computing applications and other simulation codes, to advanced database services, to dynamic web services, and to on-demand Jupyter Notebook services running on large cluster systems. We have been using Slurm to enable and scale such diverse workloads efficiently and continue to exploit advanced Slurm features to use system resources more efficiently. Recently we have introduced the whole node scheduling approach so that only one user's job or jobs can be scheduled on a node. This scheduling approach has many benefits, and we will discuss the details in this presentation. Spot job support is another feature implemented on selected LLSC systems to improve system resource usage while minimizing any impact on normal jobs. Spot jobs are a way to improve system utilization while providing users additional capacity to meet their computing needs. We have observed some issues with Slurm scheduling performance when preempting spot jobs and will discuss how we have achieved significant improvement in the scheduling performance in the presentation.
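
As a rough illustration of the whole node scheduling idea described above, the following toy Python sketch places jobs so that a node only ever hosts jobs from a single user. It is not LLSC's implementation and does not use any Slurm API; the `Node` class, the `place_job` function, and the first-fit placement order are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Toy model of a compute node (hypothetical, not a Slurm structure)."""
    name: str
    cores: int
    owner: Optional[str] = None  # user whose jobs currently occupy the node
    used: int = 0                # cores already allocated on this node


def place_job(user: str, cores: int, nodes: List[Node]) -> Optional[str]:
    """Place a job under a one-user-per-node policy.

    Returns the name of the chosen node, or None if the job must wait.
    """
    # First try to pack the job onto a node this user already holds.
    for node in nodes:
        if node.owner == user and node.cores - node.used >= cores:
            node.used += cores
            return node.name
    # Otherwise claim a completely free node for this user.
    for node in nodes:
        if node.owner is None and node.cores >= cores:
            node.owner = user
            node.used = cores
            return node.name
    # No node is available without mixing users on a node.
    return None
```

With two 4-core nodes, a 2-core job from one user followed by a 2-core job from a second user lands on separate nodes in this model, even though both jobs would fit on one node under ordinary core-level sharing.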

Running Flux in Slurm

Ryan Day, Lawrence Livermore National Laboratory (LLNL)

Flux is a novel, open source resource management package designed to enable complex workflows on modern, heterogeneous HPC systems. Its hierarchical design allows users to elegantly subdivide their allocation and coordinate scheduling of jobs in those sub-allocations. Flux is also easy for users to run inside of allocations from other resource managers. In this talk, I will describe Flux and some example workflows, then demonstrate how to launch and run a Flux instance inside of an allocation on a Slurm-managed cluster.

Never use Slurm HA again: Solve all your problems with Kubernetes

Douglas Jacobsen and Chris Samuel, National Energy Research Scientific Computing Center (NERSC)

As part of the Perlmutter CrayEX system deployment, NERSC developed a production deployment of its Slurm controller and database on the on-system Kubernetes services cluster. This has led to both improved reliability and process improvements for managing the Slurm daemons and supporting infrastructure, especially around the Slurm database, but has also generated new options for how we interact with Slurm in general. By building "micro"-services out of the various components, the HA is now managed directly by Kubernetes, common database operations are managed by a well-known MariaDB operator, and overall reliability is higher than ever. This presents new integration options for the future that blur the lines between systems and cloud offerings.

PACE Site Report

Aaron Jezghani, PACE at Georgia Institute of Technology

Throughout FY23, the Partnership for an Advanced Computing Environment (PACE) at Georgia Institute of Technology has conducted a staggered scheduler migration to Slurm of approximately 2,000 servers across 4 clusters. Each of the 4 clusters provided unique challenges, including cost recovery via job accounting, instructional needs for a wide range of classes, and federal regulations for protected data that needed to be addressed. By treating each new requirement as an incremental change to the previous efforts and providing broad access to advanced training and testing opportunities, PACE has successfully migrated 3 clusters and is finalizing the last. We will present motivations for migrating to Slurm, challenges encountered through the migration, and experiences post migration.

Field Notes 7 — How to make the most of Slurm and avoid common issues

Jason Booth, SchedMD

Best practices and configuration advice from SchedMD's Director of Support.

Step Management Enhancements

Brian Christiansen, SchedMD

Native Container Support

Scott Hilton and Nate Rini, SchedMD

State of the Slurm REST API

Nate Rini, SchedMD

Slurm 23.02, 23.11, and Beyond (Roadmap)

Tim Wickberg, SchedMD

This presentation will focus on the upcoming Slurm 23.11 release, as well as a preview of plans for the successor Slurm 24.08 release, and beyond. Additional time will be allotted for community discussion and Q&A with the principal Slurm developers.

Accelerating Genomics Research Machine Learning with Slurm

William Markuske, SDSC

This presentation will discuss how the Research Data Services (RDS) team at the San Diego Supercomputer Center (SDSC) uses Slurm to support genomics researchers developing machine learning techniques for conducting genome-wide association studies and computational network biology at the University of California San Diego (UCSD). Genomics machine learning requires high throughput computing across heterogeneous hardware to meet the workflow demands of novel model development and training. The presentation will go through the configuration of a specially built National Resource for Network Biology (NRNB) compute cluster. The NRNB cluster consists of a heterogeneous node configuration including standard compute nodes, high memory nodes, and different GPU nodes to support about 50 genomics researchers. Slurm is used to manage resources on the cluster to reduce time to discovery for the researchers by tuning the environment for their specific needs. The presentation will discuss Slurm job throughput tuning for thousands of sub-node sized jobs, heterogeneous resource allocation and fair use, storage allocation, and deploying development Jupyter environments through Slurm. Furthermore, the presentation will demonstrate how Slurm is being used to automate sequence data ingestion and processing as part of the Institute for Genomics Medicine to support computational genomics efforts.

Saving Power with Slurm

Ole Holm Nielsen, Technical University of Denmark (DTU)

Energy costs have risen greatly in some parts of the world since mid-2022, and HPC centers are increasingly focused on reducing their electricity bills. The Slurm Power Saving Guide documents a method to turn nodes off and on automatically, both on-premises and in the cloud. Scripts for performing power actions are left up to individual sites. We report on experiences with on-premises node power saving, and present scripts based on IPMI power actions. Some challenges have been found with Slurm up to and including 22.05, and we discuss workarounds as well as solutions provided in 23.02. Hardware stability under frequent power cycles will be discussed.
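
To give a sense of what a site-specific power action script might look like, here is a minimal Python sketch that builds `ipmitool` command lines for suspending and resuming nodes. It is not the presenter's script. The BMC naming convention (`<node>-bmc`), the `admin` user, and the password file path are assumptions made for illustration, and the sketch only constructs the commands rather than running them. In a real deployment, Slurm passes the affected nodes to the configured suspend and resume programs as a hostlist expression, which `scontrol show hostnames` can expand into individual names.

```python
from typing import List

# Map the Slurm power saving event to an IPMI chassis power action.
# "off" is a hard power-off; a site might prefer "soft" for a graceful shutdown.
POWER_ACTIONS = {"suspend": "off", "resume": "on"}


def bmc_host(node: str) -> str:
    """Return the BMC hostname for a node (hypothetical naming convention)."""
    return f"{node}-bmc"


def ipmi_power_command(
    node: str,
    event: str,
    user: str = "admin",                        # assumed BMC account
    password_file: str = "/etc/slurm/ipmi.pass",  # assumed path
) -> List[str]:
    """Build the ipmitool argument list for one node and one power event."""
    action = POWER_ACTIONS[event]
    return [
        "ipmitool", "-I", "lanplus",
        "-H", bmc_host(node),
        "-U", user,
        "-f", password_file,
        "chassis", "power", action,
    ]


def commands_for(nodes: List[str], event: str) -> List[List[str]]:
    """Build one ipmitool command per node (node names already expanded)."""
    return [ipmi_power_command(node, event) for node in nodes]
```

A wrapper script could pass each returned list to `subprocess.run` and log failures, so that nodes whose BMC does not respond can be flagged for attention.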

Last modified 12 September 2023