Slurm User Group Meeting 2016

Registration

The conference cost is

  • $275 per person for early registration by 1 August 2016
  • $375 per person for standard registration by 31 August 2016
  • $650 per person for late registration starting 16 September 2016

This includes presentations, tutorials, lunch and snacks on both days, plus dinner on Monday evening.
Register here.

Hotels

Many hotels are available in and around Athens. A few options are listed below for your convenience.

Adrian Hotel ***
Closest metro station: Monastiraki (blue line, green line)
Website: https://adrian.reserve-online.net
Address: 74, Adrianou st., Plaka, 105 56 Athens, Greece
Tel.: (+30) 210 32 21 553 - 32 50 454
Fax: (+30) 210 32 50 461

Plaka Hotel ***
Closest metro station: Monastiraki (blue line, green line)
Website: http://www.plakahotel.gr
Address: 7, Kapnikareas & Mitropoleos Street, GR 10556 Athens, Greece
Reservations: Tel. +30 210 3222706 | Fax +30 210 3211800
Email: plaka@athenshotelsgroup.com

Herodion Hotel ****
Closest metro station: Acropolis (red line)
Website: http://www.herodion.gr/main.php
Address: 4, Rovertou Galli Str., Acropolis, Athens, Greece
Tel: +30 210 9236832
E-mail: herodion@herodion.gr

The Athenian Callirhoe Exclusive Hotel ****
Closest metro station: Syngrou/Fix (red line)
Website: http://www.atheniancallirhoe.athenshotels.it
Address: 32, Kallirrois Av. & Petmeza, 117 43 Athens

Agenda

Hosted by the Greek Research and Technology Network (GRNET) and SchedMD.

The 2016 Slurm User Group Meeting will be held on 26 and 27 September at the Technopolis, 100 Pireos Street, Athens, Greece. The meeting will include an assortment of tutorials, technical presentations, and site reports. The schedule and abstracts are shown below.

Schedule

26 September 2016

Time            Theme         Speaker            Title
08:00 - 08:30   Registration
08:30 - 08:45   Welcome       Floros             Welcome
08:45 - 09:30   Keynote       Cournia            Computer-aided drug design for novel anti-cancer agents
09:30 - 10:00   Tutorial      Sanchez            Slurm Overview
10:00 - 10:20   Break
10:20 - 10:50   Technical     Auble, Georgiou    Overview of Slurm Version 16.05
10:50 - 11:20   Technical     Roy                MCS (Multi-Category Security) Plugin
11:20 - 11:50   Technical     Paul               Burst Buffer Integration and Usage with Slurm
11:50 - 12:50   Lunch
12:50 - 13:20   Technical     Moríñigo           Slurm Configuration Impact on Benchmarking
13:20 - 13:50   Technical     Fenoy              Real Time Performance Monitoring
13:50 - 14:20   Technical     Alexandre          Optimising HPC resource allocation through monitoring
14:20 - 14:40   Break
14:40 - 15:10   Technical     Glesser            Simunix, a large scale platform simulator
15:10 - 15:40   Site Report   Cardo              Swiss National Supercomputer Centre (CSCS) Site Report
15:40 - 16:10   Technical     Guldmyr            Configure a Slurm cluster with Ansible
19:00           Dinner        Butchershop and Sardelles, Persefonis 19, Athina 118 54, Greece

27 September 2016

Time            Theme         Speaker                     Title
08:30 - 09:00   Technical     Rodríguez-Pascual           Checkpoint/restart in Slurm: current status and new developments
09:00 - 09:30   Technical     Wickberg                    Support for Intel Knights Landing (KNL)
09:30 - 10:00   Technical     Perry                       Support of heterogeneous resources and MPMD-MPI
10:00 - 10:20   Break
10:20 - 10:50   Technical     Rajagopal                   Improving system utilization under strict power budget using the layouts framework and RAPL
10:50 - 11:20   Technical     Cadeau                      High definition power and energy monitoring support
11:20 - 11:50   Technical     Christiansen, Bartkiewicz   Federated Cluster Scheduling
11:50 - 12:50   Lunch
12:50 - 13:20   Technical     Auble, Georgiou             Slurm Roadmap
13:20 - 13:50   Site Report   Yoshikawa                   EDF Site Report
13:50 - 14:20   Site Report   Pancorbo                    LRZ Site Report
14:20 - 14:40   Break
14:40 - 15:10   Site Report   Jacobsen                    NERSC Site Report
15:10 - 15:40   Site Report   Nikoloutsakos               GRNET Site Report
15:40 - 16:10   Closing       Wickberg                    Closing discussions


Abstracts

26 September 2016

Keynote: Computer-aided drug design for novel anti-cancer agents

Dr. Zoe Cournia (Biomedical Research Foundation, Academy of Athens)

Overview of Slurm Version 16.05

Danny Auble (SchedMD)
Yiannis Georgiou (Bull)

This presentation will describe the many new capabilities provided in Slurm version 16.05 (released May 2016) that are not covered in separate talks. These enhancements include the following (a brief usage sketch follows the list):

  • Deadline based job scheduling
  • User ability to reorder priorities of pending jobs
  • Forcing Generic Resources (GRES) and CPUs allocated to a job to be in the same NUMA domain
  • Ability to establish per-task dependencies in a job array
  • Added support for PMIx
  • Added command wrappers for the LSF/OpenLava resource managers
  • Added support for GridEngine options to the qsub command wrapper
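
As a rough illustration (not part of the original talk) of how two of the listed options might be used, assuming the option names as documented for the 16.05 release; the job name and binary below are placeholders:

    #!/bin/bash
    # Hedged sketch of two 16.05-era options:
    #   --deadline                   : remove the job if it cannot complete before the given time
    #   --gres-flags=enforce-binding : keep allocated GRES and CPUs in the same socket/NUMA domain
    #SBATCH --job-name=deadline-demo        # placeholder name
    #SBATCH --gres=gpu:1
    #SBATCH --gres-flags=enforce-binding
    #SBATCH --deadline=2016-09-30T18:00
    srun ./my_application                   # placeholder binary

Reordering the priorities of one's own pending jobs is exposed through "scontrol top <job_id>".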

MCS (Multi-Category Security) Plugin

Aline Roy (CEA)

Supercomputers are commonly shared between different populations of users. To ensure population confinement, it is necessary to enforce constraints that prevent users who do not belong to a particular population from accessing nodes and/or viewing information related to other populations. The MCS logic aims at logically confining users, their jobs and the nodes they use, following a modular approach to defining populations. Thanks to a dedicated MCS plugin, a security label is associated with every job to optionally ensure that nodes can only be shared among jobs carrying the same security label. Job and node information can optionally be filtered based on MCS labels, in coordination with the PrivateData option.

Several plugins are currently implemented, offering different methods of aggregating users into categories that can be selected at configuration time: mcs/none, mcs/user, mcs/group.

This talk will present the MCS logic, the Slurm options to use, as well as future improvements.
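
As a minimal configuration sketch (not taken from the talk; the group names are invented for illustration and the exact MCSParameters syntax should be checked against the slurm.conf man page), group-based confinement might be enabled roughly as follows:

    # slurm.conf excerpt (sketch): label jobs and nodes by Unix group
    MCSPlugin=mcs/group
    # Enforce labels, restrict node selection, and filter job/node information;
    # "chemistry" and "physics" are placeholder group names.
    MCSParameters=enforced,select,privatedata:chemistry|physics
    # PrivateData works in coordination with MCS label filtering
    PrivateData=jobs,nodes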

Burst Buffer Integration and Usage with Slurm

David Paul (NERSC)

NERSC has been one of the early adopters of Burst Buffers (NVRAM-based storage). Allocation and reservation of Burst Buffers via command-line constructs can be a daunting task. Slurm's integration with Cray's DataWarp Burst Buffer implementation has greatly simplified user and application access to this new, advanced technology. This technical presentation will cover NERSC's configuration, usage examples, burst buffer status determination, problem identification, and error recovery.
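
To make the command-line constructs concrete, a burst buffer request on a Cray/DataWarp system is typically expressed with #DW directives inside the batch script, roughly as follows (capacities and paths are placeholders, not NERSC's actual settings):

    #!/bin/bash
    #SBATCH --nodes=4
    #SBATCH --time=01:00:00
    # Request a 200 GB scratch allocation on the burst buffer for this job only
    #DW jobdw capacity=200GB access_mode=striped type=scratch
    # Optionally stage a file into the burst buffer before the job starts (placeholder paths)
    #DW stage_in source=/global/project/input.dat destination=$DW_JOB_STRIPED/input.dat type=file
    srun ./my_application $DW_JOB_STRIPED/input.dat   # placeholder binary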

Slurm Configuration Impact on Benchmarking

José A. Moríñigo, Manuel Rodríguez-Pascual, Rafael Mayo-García (CIEMAT)

The present work summarizes experiments conducted on a small, modern cluster located at CIEMAT, dedicated to research activities and presently managed by Slurm. This cluster, ACME, comprises eight dual-socket nodes with 8 cores per CPU. It is mainly oriented to supporting the development of fault-tolerance techniques, as well as to improving the computational efficiency and usage of this production computing facility.

This work discusses the results of running a set of building-block algorithms, the NAS Parallel Benchmarks (NPB). NPB was developed by NASA and is still extensively used, as it is representative of many scientific codes. It is therefore expected that this contribution will be of interest as feedback to other scientific groups that use production clusters.

This investigation explores how the setup parameters of the Slurm configuration may affect the results of executing NPB. MPI process mappings have been varied in a systematic way, partitioning the jobs into a diversity of node configurations (that is, sets of CPU cores that may or may not be placed on the same node). A key piece of information is how the execution time varies depending on how sparsely the MPI processes are spread across the cluster, and how the queued jobs perform when: sending each job to the nodes in isolation (sequentially); sharing the network; and sharing both the network and the nodes among the jobs.

Which is better, and what can be expected from each approach? The effect of several jobs sharing (or not sharing) sockets at the same time is also analysed. The major interest of clarifying this aspect of Slurm is to define criteria for Slurm configuration, optimize process mapping and improve the exploitation of HPC clusters in the context of scientific production. It is also an interesting first step towards designing new scheduling algorithms that could be integrated into Slurm.
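
As a hedged illustration of the kind of mapping experiments described (the binary name, script name and task counts are placeholders), the same MPI executable can be spread across the cluster in different ways purely through Slurm options:

    # Pack 16 MPI ranks onto a single node (dense mapping)
    srun --nodes=1 --ntasks=16 ./npb_benchmark

    # Spread the same 16 ranks over 4 nodes, 4 ranks each, with cyclic task distribution
    srun --nodes=4 --ntasks=16 --ntasks-per-node=4 --distribution=cyclic ./npb_benchmark

    # Request whole nodes so that no other job shares them (isolated execution)
    sbatch --exclusive --nodes=2 --ntasks=16 job_npb.sh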

Real Time Performance Monitoring

Carlos Fenoy (Roche)

With the increasing number of CPU cores in the compute nodes of high performance clusters, proper monitoring tools become essential to understand the usage and behavior of the applications running in the cluster. In this work a new approach to near real-time monitoring is presented, using the Slurm profiling plugin to display resource usage information for each of the processes running in the cluster. This data improves understanding of the running applications and can help highlight application-related issues to the user.
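
For context, the Slurm profiling plugin referred to here is normally enabled along the following lines; the directory and sampling interval are illustrative values only, and the near-real-time display described in the talk sits on top of data gathered this way:

    # slurm.conf excerpt (sketch)
    AcctGatherProfileType=acct_gather_profile/hdf5
    JobAcctGatherFrequency=task=5            # sample task usage every 5 seconds (example value)

    # acct_gather.conf excerpt (sketch)
    ProfileHDF5Dir=/var/spool/slurm/profile  # placeholder directory for the HDF5 profile files

    # Per job: ask for task-level profiling data (placeholder binary)
    srun --profile=Task ./my_application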

Optimising HPC resource allocation through monitoring

Beche Alexandre (EPFL)

High performance computing focuses strongly on supercomputer performance; however, it is equally important to ensure the optimal usage of allocated resources. Reserving resources through Slurm in a shared cluster does not guarantee they will be used wisely, which could prevent other users from running their workloads. To understand job behaviour, the Slurm accounting database is a rich source of information, but it lacks visibility into the compute nodes' system metrics. How can these two sources of information be correlated to provide good insights into job behaviour, optimal resource utilisation and potential bottlenecks?

In the Blue Brain Project at EPFL, we have implemented such a system based on modern open source technologies. In this talk, we will present the challenges of resource usage optimisation and describe how we implemented our solution to collect, index, correlate and visualise high-resolution accounting and monitoring data using standard tools such as collectd, ElasticSearch, Graphite and Grafana.
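
As a simple example of the accounting side of such a correlation (the job ID is a placeholder; the fields are standard sacct format fields), per-job usage can be pulled from the Slurm accounting database and then joined against node metrics collected by collectd:

    # Query requested vs. used resources for a finished job (placeholder job ID)
    sacct -j 123456 --format=JobID,JobName,AllocCPUS,Elapsed,TotalCPU,MaxRSS,ReqMem,NodeList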

Simunix, a large scale platform simulator

David Glesser, Adrien Faure, Yiannis Georgiou (BULL)

Slurm comes with a lot of scheduling features and possible configurations; as a result, installing Slurm on a cluster can be very tedious, since the configuration needs to be adjusted to fit the cluster's requirements. Furthermore, implementing new features in Slurm is difficult due to its complexity and the wide range of supported platforms. The Slurm Simulator provides a way to experiment with different scheduling configurations more quickly; however, it does not exercise the real Slurm, since core parts are modified.

Configure a Slurm cluster with Ansible

Johan Guldmyr (CSC)

Ansible is a tool for agentless configuration management of servers. It can operate in push mode: ssh to the remote server and then, in order, ensure that the remote host is configured as desired. In pull mode, for example via a cron job, the server fetches the latest branch from a remote version control system and then configures itself.
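
A hedged sketch of the two modes (the inventory, playbook and repository names are placeholders):

    # Push mode: the control machine connects to the managed hosts over ssh
    ansible-playbook -i inventory/fgci slurm.yml

    # Pull mode: each host periodically fetches the playbook repository and applies it,
    # e.g. from a cron job (the repository URL is a placeholder)
    ansible-pull -U https://git.example.org/ansible-config.git slurm.yml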

An ansible playbook is a set of roles applied in top to bottom order.
An ansible role is a set of tasks executed in top to bottom order.
An ansible task takes in principle a module name and the module’s settings.

An example task: ensure the slurmctld service is started and enabled only on hosts where slurmctld is also installed.
An example role: configure slurm
An example playbook: configure slurm, sshd and install $vendor firmware tools on one or more hosts.

In a Finnish HPC grid (FGCI, 8 clusters with different admins at each site; CSC IT Center for Science provides support) we are using ansible-role-slurm in push mode to configure slurmctld, slurmdbd and submit nodes, and in pull mode from a local git mirror to configure slurmd on compute nodes.

Other ansible roles are used to configure everything else; the idea is that ansible-role-slurm should do everything that is absolutely necessary for a Slurm cluster, such as: generate or install a munge key, install Slurm packages from a repository, and run sacctmgr to create the cluster.

Swiss National Supercomputer Centre (CSCS) Site Report

Nicholas Cardo

27 September 2016

Checkpoint/restart in Slurm: current status and new developments

Manuel Rodríguez-Pascual, José A. Moríñigo, Rafael Mayo-García (CIEMAT)

This talk will describe the existing status of checkpoint/restart technologies in Slurm and the new developments made by our group.

In this conference, the integration between Slurm and DMTCP will be presented. DMTCP is an interesting checkpoint library because it operates in user space, so there is no need to modify the machine's kernel. It also does not require modifying or recompiling the application being executed. This is important because it allows checkpointing of any application run on the cluster, including legacy and proprietary ones (not only open-source applications that can be recompiled and linked against a particular library). This integration has been made via a SPANK plugin. It adds a "--with-dmtcp" flag to the "sbatch" command. When set, Slurm will call DMTCP so that it starts monitoring the application to be checkpointed. After that, Slurm API and command calls regarding checkpointing (scontrol checkpoint create / vacate / restart) will work seamlessly.
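
Based on the flag and commands named in the abstract, usage would look roughly like this (the job script name and job ID are placeholders, and the flag spelling follows the abstract):

    # Submit a job with DMTCP monitoring enabled via the SPANK plugin
    sbatch --with-dmtcp job.sh            # job.sh is a placeholder script

    # Later, checkpoint the running job, or checkpoint it and release its resources
    scontrol checkpoint create 123456     # placeholder job ID
    scontrol checkpoint vacate 123456

    # Restart the job from its last checkpoint
    scontrol checkpoint restart 123456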

Another small yet useful development is the inclusion of checkpoint/restart calls in the Slurm simulator. Until now, checkpoint-related operations were disabled in the simulator; they have been implemented as part of this work, and any API call to the slurm_checkpoint functions now works as expected.

To round out the current status of checkpoint/restart, a brief review of the BLCR library and its checkpoint mechanism will also be included. This way the talk covers all the existing technologies and can serve as a reference for future users.

Support for Intel Knights Landing (KNL)

Morris Jette and Tim Wickberg (SchedMD)

The Intel Knights Landing (KNL) chip has a number of unusual capabilities affecting resource management. The most significant feature for Slurm to manage is the chip's ability to modify its cache and NUMA configuration at boot time. This means that Slurm must track not only the node's current configuration, but also the configurations which can be made available when the node is rebooted. Various quantities of cache can be made available as user-addressable high bandwidth memory (HBM), which Slurm treats as a generic resource (GRES) on the node. The various NUMA configurations necessitate support for dynamically changing the groupings of cores and memory. Slurm has addressed these and other challenges through the addition of a new "node_features" plugin infrastructure.

This presentation will cover an overview of the KNL architecture and the changes made to Slurm in order to support it, including work still underway.
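
A hedged usage sketch: with the node_features plugin in place, a user typically selects the desired NUMA and MCDRAM modes through feature constraints, and the node is rebooted into that configuration if necessary. The feature names below follow common KNL conventions and the binary is a placeholder; a site's plugin may advertise different feature names:

    #!/bin/bash
    #SBATCH --nodes=1
    # Request a KNL node in quadrant NUMA mode with MCDRAM configured as cache
    #SBATCH --constraint=quad,cache
    srun ./my_knl_application     # placeholder binary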

Support of heterogeneous resources and MPMD-MPI

Martin Perry, Bill Brophy, Doug Parisek, Steve Melhberg, Nancy Kritkausky, Yiannis Georgiou (BULL/CEA), Matthieu Hautreux (CEA)

This presentation will provide details about the study and development efforts to extend the job description language of Slurm in order to better handle complex jobs whose tasks have different behaviors and resource requirements. It will provide an analysis of heterogeneous resources and of support for the MPMD (Multiple Program Multiple Data) model in Slurm.

Slurm, in its current stable versions, provides support for SPMD (Single Program Multiple Data) as well as limited MPMD support. By limited MPMD support we mean that, although users can specify different binaries to be used within a parallel job, all the tasks are currently associated with the same resource requirements.
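
The limited MPMD support mentioned here is today exposed through srun's --multi-prog option, sketched below with placeholder binaries; note that every task still draws from the same resource specification:

    # mpmd.conf: map task ranks to different binaries (placeholder names);
    # rank 0 runs the master, ranks 1-15 run the worker
    0     ./master
    1-15  ./worker

    # Launch 16 tasks, all with identical per-task resource requirements
    srun --ntasks=16 --multi-prog mpmd.conf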

Hence, the current functioning of Slurm is not very well suited to managing complex jobs. For example, users wishing to leverage different types of hardware resources inside the same MPI application (part of the code running on GPUs, another part on standard CPUs, and a last part on CPUs with large memory per core) have to request the most complete set of resources for every task, wasting some of the hardware on tasks that do not need all of it. In some cases, the total configuration required to run such a job does not even exist, as not all nodes of the cluster may provide all the hardware features.

For example, applications needing real-time visualization, where a large number of compute nodes must be allocated in conjunction with a small number of GPU nodes, or complex workflows taking data locality and I/O-node allocation into account, will execute more efficiently with support for heterogeneous resources and MPMD. In more detail, the MPI processes need to exchange information between the different heterogeneous resources, which requires all processes to participate in the same MPI_COMM_WORLD environment. MPMD-MPI is supported through a tight integration of Slurm with the different MPI implementations using the PMI protocols. The related developments are part of the functionality developed for the European-funded H2020 project TANGO.

Improving system utilization under strict power budget using the layouts framework and RAPL

Dineshkumar Rajagopal, David Glesser, Yiannis Georgiou (BULL)

This presentation will present optimizations of the power adaptive scheduling technique of Slurm developed for Linux platforms. These optimizations provide the capability to redistribute the power consumed by applications, improving system utilization, based upon what they actually use (not estimations), while guaranteeing with hardware mechanisms that the maximum power budget will be respected. The new extensions are built upon RAPL power capping at the socket level, the layouts framework of Slurm for the internal representation of power consumption per component, and an automatic rebalancing of power usage between jobs based on the RAPL power computations, following the guidelines of the power plugin in Slurm. Node power consumption is updated by collecting RAPL energy consumption measurements at regular intervals and calculating the average power consumption over each interval. By keeping per-component power consumption up to date within Slurm (through the layouts framework), power can be redistributed to other running jobs that need it, or to upcoming submissions, thereby improving system utilization while respecting the determined power budget. Finally, we will show performance evaluation experiments on emulated and real HPC platforms.
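
As a rough configuration sketch (not taken from the talk), the RAPL-based energy collection that these power computations rely on is typically enabled along these lines in slurm.conf; the sampling interval shown is an arbitrary example value:

    # slurm.conf excerpt (sketch): collect per-node energy via the RAPL plugin
    AcctGatherEnergyType=acct_gather_energy/rapl
    AcctGatherNodeFreq=30        # sample node energy every 30 seconds (example value)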

High definition power and energy monitoring support

Thomas Cadeau, Yiannis Georgiou (BULL)

Slurm provides functionality that enables power monitoring per node as well as power profiling and energy accounting per job, based on the in-band IPMI and RAPL measurement interfaces. Using the in-band IPMI power measurement technique to extract the power profile and calculate the energy consumption of a job has some drawbacks, such as overhead and precision problems. This presentation will show the design and evaluation of new functionality to support vendor-specific hardware for high definition power and energy monitoring in Slurm, dealing with the overhead and improving the accuracy of the measurements. The presentation will briefly introduce the High Definition Energy Efficiency Monitoring (HDEEM) project, a sophisticated approach towards system-wide and fine-grained power measurement and a collaboration between the HPC vendor BULL and TUD (Technische Universität Dresden). It will then focus on the newly developed ipmi-raw plugin, which supports the HDEEM interface, a dedicated measurement FPGA installed on every blade that improves spatial granularity (measuring blade, CPU and DRAM power consumption separately) along with temporal granularity of up to 1 kSa/s.
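
For reference, the existing in-band IPMI measurement path mentioned above (not the new HDEEM plugin) is configured roughly as follows; slurm.conf selects the plugin and acct_gather.conf sets the polling behaviour, with illustrative values:

    # slurm.conf excerpt (sketch)
    AcctGatherEnergyType=acct_gather_energy/ipmi

    # acct_gather.conf excerpt (sketch)
    EnergyIPMIFrequency=10        # poll the BMC every 10 seconds (example value)
    EnergyIPMICalcAdjustment=yes  # attempt to adjust for sampling and communication delays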

Federated Cluster Scheduling

Brian Christiansen, Dominik Bartkiewicz (SchedMD)

Slurm has long provided support for resource management across multiple clusters, but with notable limitations. We have designed Slurm enhancements to eliminate these limitations in a scalable and reliable fashion while increasing both system utilization and responsiveness. This design allows jobs to be replicated across multiple clusters, with the job's executing cluster determined through policies and coordination among the clusters holding the same job (e.g. starting on the cluster that can run the job soonest). Unique enterprise-wide job IDs will be used to permit rapid enterprise-wide job operations such as job dependencies, status reports and cancellation. The SlurmDBD is responsible for configuring sets of clusters that will work together in a federated fashion and for reporting the configurations to the Slurm daemons. Each cluster operates with a great deal of autonomy. A limited number of inter-cluster operations are coordinated directly between the Slurm daemons managing each individual cluster. We anticipate the overhead of this design to be sufficiently low for Slurm to retain the ability to execute hundreds of jobs per second per cluster. An overview of the design will be presented along with an analysis of its capabilities.

Slurm Roadmap

Danny Auble (SchedMD)
Yiannis Georgiou (BULL)

This presentation will describe new capabilities planned in future releases of Slurm:

  • Advanced Xeon Phi Knights Landing support
  • Generic burst buffer support
  • Greater control over a computer’s power consumption
  • Control over a job’s frequency limits based upon its QOS
  • Inter-cluster job management
  • Dynamic runtime settings environment
  • Support of VM and containers management within Slurm (HPC, Cloud/Big Data)
  • Deploy Big Data workflow upon HPC infrastructure

EDF Site Report

Cecile Yoshikawa (EDF)

EDF (Electricité de France) is one of the world's largest electric utility companies. EDF covers every sector of expertise, from electricity generation to trading and transmission grids. For all of this, we make extensive use of high performance computing.

Our researchers and engineers conduct calculations in a wide range of fields: structural mechanics, fluid mechanics and, more specifically, neutronics. Most of the codes executed on our supercomputers are developed by our R&D departments. In response to their needs, we design and operate several top-class supercomputers. One characteristic of EDF is that most of our supercomputers run our in-house OS, Scibian, a Debian-based distribution dedicated to industrial engineering that we are currently turning into an open source community project.

All of our supercomputers have been using Slurm as a job scheduler since 2012. This site report will first present a brief overview of the Slurm configuration and features we use. It will then focus on the monitoring tools we have been developing to work with Slurm, SlurmWeb and JobMetrics. The final part will detail our current work to execute our Slurm jobs into containers.

Leibniz Rechen Zentrum (LRZ) Site Report

Juan Pancorbo (LRZ)

As a service provider for scientific high performance computing, Leibniz Rechen Zentrum (LRZ) operates compute systems for use by educational institutions in Munich, Bavaria, as well as on the national level.

LRZ provides its own computing resources as well as housing and managing computing resources for other institutions such as the Max Planck Institute, the Technical University of Munich, and Ludwig Maximilian University.

The tier 2 Linux cluster operated at LRZ is a heterogeneous system with different types of compute nodes, divided into 18 different clusters, each of which is managed by SLURM. The various clusters are configured for the different needs and services requested, ranging from single-node, thousand-core NUMAlink shared-memory clusters to a 28-way InfiniBand-connected cluster for parallel job execution or a 28-way InfiniBand-connected cluster for serial job execution. Currently we have 9 clusters for general use (mostly European students), 8 housed clusters for the exclusive use of different departments of the Max Planck Institute and Munich universities, and 1 small test cluster.

The management of all clusters is centralized on a single virtual machine. The required SLURM control daemons run concurrently on this VM.

With the use of a wrapper script called MSLURM, the SLURM administrator can send SLURM commands to any cluster in an easy-to-use and flexible manner, including starting or stopping the complete SLURM subsystem.

In June 2015 we decommissioned our old 4-way serial processing nodes (300 in total) and replaced them with the COOLMUC2 system (384 dual-socket, 14-core Haswell EP nodes). 60 of these nodes replace the old hardware for serial processing, and a new cluster (mpp2) was created to handle the rest of the nodes.

These nodes include the IBM Active Energy Manager (AEM), whose tasks include:

  • Monitoring power consumption data
  • Collecting power consumption data

Based on the Slurm plugins acct_gather_energy_cray and acct_gather_energy_rapl, we used the IBM AEM interface to read the energy counters on the node and created the acct_gather_energy_ibmaem plugin. This plugin is currently included with the Slurm source code. The plugin runs on each of the compute nodes to gather energy counter information during the job run; once the job has finished, it adds up the energies reported by all the nodes and shows them per step.
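
Enabling the plugin described here is essentially a one-line change in slurm.conf (a hedged sketch; the plugin name follows the usual acct_gather_energy naming scheme and the sampling interval is an example value):

    # slurm.conf excerpt (sketch): read the IBM AEM energy counters on each node
    AcctGatherEnergyType=acct_gather_energy/ibmaem
    AcctGatherNodeFreq=30          # sampling interval in seconds (example value)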

We have checked the energy measurements of the plugin against the energy reported by the node chassis in the rack and obtained consistent data between the two systems, with a maximum difference of 5% (which is also the maximum error of the energy reported by the chassis, according to the manufacturer).

NERSC Site Report

Douglas Jacobsen (NERSC)

The National Energy Research Scientific Computing Center (NERSC) recently transitioned both of its Cray XC class supercomputers to Slurm scheduling and resource management. Most notably, NERSC has been working with Cray and SchedMD to integrate HPC and data-intensive workloads on the Cori system, one of the largest open science systems in the world, delivering both Intel Xeon Haswell and Intel Xeon Phi (Knights Landing) processors, integrating DataWarp burst buffers and environment virtualization through Shifter, and orchestrated by Slurm (among many other technologies bringing it all together). This talk will focus on the special use cases, customizations, and strategies used to bring Slurm onto these systems to deliver highly productive computing for the thousands of NERSC users.

Experience using Slurm on ARIS HPC System

N. Nikoloutsakos, D. Dellis, K. Gkinis, I. Liabotis, and E. Floros (GRNET)

GRNET is the leading resource provider in Greece in the field of supercomputing infrastructures. It currently operates Greece's first national high-performance computing system, ARIS (Advanced Research Information System), to support large-scale scientific applications. The ARIS platform consists of 532 computational nodes partitioned into four architecture types (thin, fat, GPU and MIC nodes), all connected under the same InfiniBand FDR fat-tree topology. Overall, ARIS offers a theoretical peak performance of 444 TFlop/s. In addition, ARIS incorporates 2 PB of fast-access storage and 2 PB of archive storage for long-term data preservation.

More than 150 projects and 400 users have used the ARIS platform. Access is open to the Greek academic community, and GRNET publishes calls for Project Access according to a fixed schedule. Eligible projects are assessed via a transparent peer-review process, which ensures fair usage of HPC resources and high-impact results for the scientific community.

This presentation will give an overview of the ARIS system, the Slurm batch system setup, and the configuration of the fairshare priority model used to allocate resources among research groups from various disciplines, universities and research centers. We shall discuss our experience with Slurm over the past year, usage evaluation, the evolution and tuning of our Slurm deployment, and issues, solutions and open questions that could provide insight into the benefits of Slurm.
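
As a generic illustration of what a fairshare setup involves (the values and account name are arbitrary examples, not GRNET's actual configuration), the multifactor priority plugin is enabled and weighted in slurm.conf, and shares are assigned to accounts with sacctmgr:

    # slurm.conf excerpt (sketch)
    PriorityType=priority/multifactor
    PriorityDecayHalfLife=14-0          # usage decays with a two-week half-life (example)
    PriorityWeightFairshare=100000
    PriorityWeightAge=1000
    PriorityWeightJobSize=1000
    PriorityWeightPartition=1000
    PriorityWeightQOS=1000

    # Give a research group's account a share of the machine (placeholder account name)
    sacctmgr add account chemistry_lab Fairshare=10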

Last modified 23 September 2016