The Slurm batch scheduler allows users to attach optional free-text fields to jobs, such as the job name or a comment. These fields are not interpreted by Slurm and have no influence on its behavior. They can hold any job-related information and can be manipulated through standard Slurm commands. For example, they can be useful for differentiating, filtering, classifying or annotating jobs.
In this post, we explore an approach that uses the comment field to attach structured metadata to jobs and enrich accounting reports with custom information.
The job comment can be set at submission time with the --comment argument:
$ sbatch --comment hello --wrap "sleep 60"
Submitted batch job 943
It is then visible with the scontrol show job $ID command:
$ scontrol show job 943
JobId=943 JobName=wrap
…
Comment=hello
…
It is also possible to set the comment after the job has been submitted, with the scontrol update job $ID command.
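For example, reusing the job submitted above, the comment can be replaced with an arbitrary new value:
$ scontrol update job 943 comment=goodbye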
By default, Slurm keeps the comment only as long as the job is in the scheduling queue (pending or running). Once the job is finished, the comment is lost.
However, it is possible to configure Slurm to save the comment in the SlurmDBD accounting database, by enabling this setting in the slurm.conf configuration file:
AccountingStoreFlags=job_comment
Job comments then persist after jobs end and can be retrieved with the standard Slurm accounting command sacct.
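For instance, here is a minimal sketch of how the stored comment could be retrieved for the job submitted earlier, assuming a sacct version that supports the Comment format field:
$ sacct -j 943 --format=JobID,Comment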
The comment field can store any form of text, without constraint on its format. In particular, this text can be a serialized representation of a data structure, such as JSON.
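For example, a small JSON object can be passed directly at submission time, reusing the --comment argument shown earlier (the mesh key and its value are arbitrary illustrations):
$ sbatch --comment '{"mesh": 42}' --wrap "sleep 60"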
Here is an example of a minimalist batch shell script comment.sh that sets, at the end of its execution, a comment containing an associative array with a mesh key and a random integer value:
#!/bin/sh
# compute stuff here
MESH=$(shuf -i 0-1000 -n 1)
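# save the result as JSON in the job's comment field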
scontrol update job $SLURM_JOB_ID comment="{\"mesh\": ${MESH}}"
For example, submit two jobs with this batch script:
$ sbatch comment.sh
Submitted batch job 894
$ sbatch comment.sh
Submitted batch job 895
It is then possible to extract this metadata from the accounting database, by piping the output of the sacct command into the jq utility:
$ sacct --json | jq '.jobs | map({id: .job_id, user: .user, cores: .required.CPUs, meta: .comment.job|fromjson })'
[
{
"id": 894,
"user": "remi",
"cores": 1,
"meta": {
"mesh": 760
}
},
{
"id": 895,
"user": "remi",
"cores": 1,
"meta": {
"mesh": 720
}
}
]
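This also enables simple aggregations. As a sketch, the average mesh value of the two jobs above can be computed directly with jq, restricting the job filter to jobs carrying a JSON comment (fromjson would fail otherwise):
$ sacct -j 894,895 --json | jq '[.jobs[].comment.job | fromjson | .mesh] | add / length'
740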
Here is a more advanced Python example script comment.py that updates the comment at the end of its execution with an associative array of 3 keys holding values of different types:
 1  #!/usr/bin/python3
 2  import signal
 3  import time
 4  import atexit
 5  import sys
 6  import os
 7  import subprocess
 8  import random
 9  import json
10
11  def save_metadata():
12      """Save computation metadata in Slurm job's comment."""
13      job_id = os.getenv('SLURM_JOB_ID')
14      metadata = {
15          'mesh': random.randrange(0, 1000),
16          'complexity': random.random(),
17          'tag': random.choice(['choose', 'among', 'three']),
18      }
19      cmd = ['scontrol', 'update', 'job', job_id, f"comment={json.dumps(metadata)}"]
20      print(f"Saving metadata in Slurm job {job_id} comment field")
21      subprocess.run(cmd)
22
23
24  def handle_timeout(signum, frame):
25      """Signal handler which stops the computation."""
26      signame = signal.Signals(signum).name
27      print(f"Signal {signame} ({signum}) received due to job timeout, saving "
28            "metadata and exiting properly")
29      sys.exit(0)
30
31
32  def main():
33      # Bind SIGUSR1, sent by Slurm to notify of the job's approaching time limit
34      signal.signal(signal.SIGUSR1, handle_timeout)
35      # Register save_metadata() to run just before the program exits
36      atexit.register(save_metadata)
37
38      # Start fake computation for 5 minutes
39      print("Starting computation")
40      time.sleep(300.)  # simulating long interruptible computation
41
42
43  if __name__ == '__main__':
44      main()
The script registers the save_metadata() function (l11) with the atexit module (l36) to properly handle error cases and interruptions by Slurm, typically when the time limit is reached or the job is preempted.
This script has an approximate execution time of 5 minutes. It is submitted to Slurm a first time with a 10-minute limit, giving it enough time to end normally, and a second time with a 3-minute limit and an instruction for Slurm to send the SIGUSR1 signal 60 seconds before the job's termination (the script is launched through srun so that it runs as a job step that receives the signal):
$ sbatch --time 10 --wrap "srun python3 -u comment.py"
Submitted batch job 10773
$ sbatch --time 3 --signal USR1@60 --wrap "srun python3 -u comment.py"
Submitted batch job 10774
Here are the job outputs in both cases:
$ cat slurm-10773.out
Starting computation
Saving metadata in Slurm job 10773 comment field
$ cat slurm-10774.out
Starting computation
Signal SIGUSR1 (10) received due to job timeout, saving metadata and exiting properly
Saving metadata in Slurm job 10774 comment field
In the first case, where the job ended normally, the save_metadata() function was executed at the end of the job. In the second case, where the job was interrupted because of its time limit, the script received the SIGUSR1 signal and executed the handle_timeout() signal handler, which stopped the script and eventually triggered the execution of the save_metadata() function.
The generated metadata can then be extracted from the Slurm accounting database with this command:
$ sacct --json | jq '.jobs | map({id: .job_id, user: .user, cores: .required.CPUs, meta: .comment.job|fromjson })'
[
{
"id": 10773,
"user": "remi",
"cores": 1,
"meta": {
"mesh": 376,
"complexity": 0.2924316126744422,
"tag": "among"
}
},
{
"id": 10774,
"user": "remi",
"cores": 1,
"meta": {
"mesh": 157,
"complexity": 0.7724739043178511,
"tag": "three"
}
}
]
This feature can be useful to associate various metadata with Slurm jobs, typically to generate additional custom metrics in cluster accounting reports.
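As a final sketch of such a custom metric, jq can for instance count jobs per tag value, here restricted to the two jobs above since their comments are valid JSON:
$ sacct -j 10773,10774 --json | jq '[.jobs[].comment.job | fromjson] | group_by(.tag) | map({tag: .[0].tag, jobs: length})'
[
  {
    "tag": "among",
    "jobs": 1
  },
  {
    "tag": "three",
    "jobs": 1
  }
]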