Rackslab announces the release of slurm-quota, an open source solution to manage CPU and GPU time quotas for Slurm users and accounts.

Context

Many HPC and AI computing centers need a simple way to budget and track compute consumption in CPU time and GPU time at the user and account levels, and to prevent jobs from being submitted when remaining quota is insufficient. Unfortunately, vanilla Slurm does not provide this type of CPU/GPU time quota accounting and enforcement by default. We have designed slurm-quota to address this operational need with a lightweight implementation and integration points that fit standard Slurm workflows.

Figure: example output of slurm-quota stats.

Features

For administrators, slurm-quota provides the controls required to operate quota policies in production. CPU and GPU minute quotas can be set at the user and account levels on the command line. On heterogeneous GPU clusters, billed GPU minutes can be weighted by GPU type using billing factors.
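
As an illustration of weighting by GPU type, here is a minimal sketch of how billed GPU minutes could be derived from a billing factor. The GPU type names, the factor values and the function name are assumptions made for this example; they are not taken from the slurm-quota configuration.

```python
# Hypothetical billing factors per GPU type (illustrative names and values).
BILLING_FACTORS = {"a100": 2.0, "v100": 1.0}


def billed_gpu_minutes(gpu_type, gpu_count, elapsed_minutes,
                       factors=BILLING_FACTORS, default_factor=1.0):
    """Return GPU minutes weighted by the billing factor of the GPU type.

    GPU types without a configured factor fall back to default_factor.
    """
    factor = factors.get(gpu_type, default_factor)
    return gpu_count * elapsed_minutes * factor


# Two GPUs with factor 2.0 used for 30 minutes are billed as 120 GPU minutes.
print(billed_gpu_minutes("a100", 2, 30))
```

With such a scheme, a job on a more expensive GPU type consumes the quota faster than the same job on a cheaper one.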

For users, slurm-quota provides transparent reporting of quota consumption and headroom. Commands display consumed and preallocated CPU/GPU minutes together with the configured limits, using visual progress bars. The statistics can be served by a JSON HTTP service that acts as a single source of truth, so the same information can be queried from every cluster node.
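
To give an idea of how consumed and preallocated minutes can be shown against a limit, here is a rough sketch of a text progress bar. The characters, the width and the function name are assumptions chosen for this example and do not reproduce the actual slurm-quota rendering.

```python
def render_bar(consumed, preallocated, limit, width=20):
    """Render consumed ('#') and preallocated ('+') minutes against a limit.

    Remaining headroom is drawn with '.'; the bar never exceeds `width` cells.
    """
    if limit <= 0:
        return "[" + "." * width + "]"
    used_cells = min(width, round(width * consumed / limit))
    held_cells = min(width - used_cells, round(width * preallocated / limit))
    free_cells = width - used_cells - held_cells
    return "[" + "#" * used_cells + "+" * held_cells + "." * free_cells + "]"


# 500 minutes consumed and 250 preallocated out of a 1000-minute quota.
print(render_bar(500, 250, 1000))
```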

Architecture

A core design goal is to keep the deployment lightweight, relying on Slurm native extension points and minimal dependencies.

The solution is composed of:

  • SQLite database to store usage, preallocations and quotas.
  • Single Python script (slurm-quota) to record the effective resource usage of completed jobs, manage quotas, serve stats through a socket-activated HTTP JSON REST API, and query stats.
  • Slurm job submission plugin (job_submit.lua) to pre-allocate CPU/GPU time at submission time.
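
To make the role of the database more concrete, the following sketch shows one possible way to store quotas, usage and preallocations in SQLite and to derive the remaining CPU minutes of a user or account. The table names, column names and functions are hypothetical; the actual slurm-quota schema may differ.

```python
import sqlite3

# Hypothetical schema: one quota and one usage row per user or account,
# plus one preallocation row per pending or running job.
SCHEMA = """
CREATE TABLE quotas (entity TEXT PRIMARY KEY, cpu_minutes INTEGER NOT NULL);
CREATE TABLE usage (entity TEXT PRIMARY KEY, cpu_minutes INTEGER NOT NULL);
CREATE TABLE preallocations (
    job_id INTEGER PRIMARY KEY,
    entity TEXT NOT NULL,
    cpu_minutes INTEGER NOT NULL
);
"""


def open_db(path=":memory:"):
    """Open a database and create the illustrative tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def remaining_cpu_minutes(conn, entity):
    """Return limit - consumed - preallocated, or None if no quota is set."""
    row = conn.execute(
        "SELECT cpu_minutes FROM quotas WHERE entity = ?", (entity,)
    ).fetchone()
    if row is None:
        return None
    used = conn.execute(
        "SELECT COALESCE(SUM(cpu_minutes), 0) FROM usage WHERE entity = ?",
        (entity,),
    ).fetchone()[0]
    held = conn.execute(
        "SELECT COALESCE(SUM(cpu_minutes), 0) FROM preallocations "
        "WHERE entity = ?",
        (entity,),
    ).fetchone()[0]
    return row[0] - used - held
```

In this sketch, a submission would be acceptable only if the requested CPU minutes do not exceed the value returned by remaining_cpu_minutes.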

Integration with Slurm happens at two key points:

  • At submission time, the submit plugin evaluates the job request and records a preallocation of CPU/GPU minutes for the user and the selected account.
  • At completion time, Slurm executes the job completion script (jobcomp/script), which calls the charge wrapper. The wrapper computes the effective usage (including GPU allocations from accounting TRES) and updates the SQLite database accordingly.
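
The completion step can be sketched as follows, assuming the allocated TRES are available as a comma-separated string such as the one Slurm accounting reports (for example cpu=4,mem=16G,node=1,billing=4,gres/gpu=2). The function names are hypothetical and the way slurm-quota actually obtains and parses this information may differ.

```python
def parse_tres(tres):
    """Parse a TRES string like 'cpu=4,mem=16G,gres/gpu=2' into a dict."""
    fields = {}
    for item in tres.split(","):
        if "=" in item:
            key, value = item.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


def effective_usage(tres, elapsed_seconds):
    """Compute CPU and GPU minutes from allocated TRES and elapsed time.

    Only the generic 'gres/gpu' count is read here; typed entries such as
    'gres/gpu:a100' are ignored in this simplified sketch.
    """
    fields = parse_tres(tres)
    cpus = int(fields.get("cpu", 0))
    gpus = int(fields.get("gres/gpu", 0))
    minutes = elapsed_seconds / 60
    return {"cpu_minutes": cpus * minutes, "gpu_minutes": gpus * minutes}


# A job that held 4 CPUs and 2 GPUs for 30 minutes.
print(effective_usage("cpu=4,mem=16G,node=1,billing=4,gres/gpu=2", 1800))
```

Once the effective usage is known, the preallocation recorded at submission time can be released and replaced by the minutes actually consumed.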

Try it

slurm-quota is released under the MIT license and available on GitHub!

To deploy and integrate it with your Slurm controller and compute nodes, follow the instructions in the project’s online installation guide.

Acknowledgements

The development of this project was funded by ISDM-Meso, part of the University of Montpellier. We warmly thank them for supporting this work and making this project possible.