Basic Slurm Terminology

Partition (a.k.a. Queue)

A partition is a logical grouping of compute nodes, similar to a queue in other schedulers. Partitions can differ in hardware, time limits, and access policies.

To check the current status of partitions and their nodes, use the sinfo command. It provides a snapshot of which nodes are idle, allocated, or down, along with their associated partitions and time limits.

Example output:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpu384g*     up 3-00:00:00     77   idle cpusm[01-77]
cpu1500g     up 3-00:00:00      6   idle cpumd[01-06]
cpu6000g     up 3-00:00:00      3   idle cpulg[01-03]
gpu1h100     up 3-00:00:00     15  alloc gpusm[01-15]
gpu2h100     up 3-00:00:00     10  alloc gpumd[01-10]
hgxh200      up 3-00:00:00      1  alloc gpulg01
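When the listing is long, it can help to reduce it to the partitions that currently have idle nodes. The following is a minimal sketch, assuming the default six-column sinfo layout shown above (NODES in column 4, STATE in column 5). The function name idle_nodes is hypothetical and not part of Slurm.

```shell
# idle_nodes: read default-format sinfo output on stdin and print
# "<partition> <idle node count>" for every row whose STATE is idle.
# Assumption: the six-column layout from the example above.
idle_nodes() {
    awk 'NR > 1 && $5 == "idle" {
        sub(/\*$/, "", $1)   # drop the "*" that marks the default partition
        print $1, $4
    }'
}
```

Piping the real command into it (sinfo | idle_nodes) on the listing above would print cpu384g 77, cpu1500g 6, and cpu6000g 3, one per line.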

Node

A node is a physical machine in the cluster. Each node has a specific number of CPU cores, memory, and possibly GPUs.

To get detailed information about a specific node, use:

scontrol show node <nodename>

For example:

scontrol show node cpusm01

This will return detailed specs and current usage for the node:

NodeName=cpusm01 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=0 CPUEfctv=32 CPUTot=32 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=cpusm01 NodeHostName=cpusm01 Version=24.11.5
   OS=Linux 5.14.0-427.42.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Oct 31 14:01:51 UTC 2024
   RealMemory=386000 AllocMem=0 FreeMem=381353 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=cpu384g
   BootTime=2025-10-24T14:55:46 SlurmdStartTime=2025-10-27T17:25:08
   LastBusyTime=2025-11-03T17:39:52 ResumeAfterTime=None
   CfgTRES=cpu=32,mem=386000M,billing=32
   AllocTRES=
   CurrentWatts=364 AveWatts=364

Key fields to note:

  • CPUTot: Total number of CPUs on the node (cores, or hardware threads when ThreadsPerCore is greater than 1).

  • RealMemory: Total memory configured for the node, in megabytes.

  • State: Current status (e.g., IDLE, ALLOCATED, MIXED, DOWN).

  • Partitions: Indicates which queue(s) the node belongs to.
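To see only these four fields without scanning the full record, the output can be filtered. This is a minimal sketch that assumes the space-separated KEY=VALUE format shown above; node_summary is a hypothetical helper name, not a Slurm command.

```shell
# node_summary: read `scontrol show node` output on stdin and print the
# four key fields, one KEY=VALUE pair per line, in the order they appear.
node_summary() {
    # Put every space-separated token on its own line, then keep the
    # tokens whose key is one of the fields of interest.
    tr -s ' ' '\n' | grep -E '^(CPUTot|RealMemory|State|Partitions)='
}
```

For the node above, scontrol show node cpusm01 | node_summary would print CPUTot=32, RealMemory=386000, State=IDLE, and Partitions=cpu384g on separate lines.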

Job

A job is a unit of work submitted to Slurm. It can be a single task or a collection of tasks (e.g., simulations, data processing).

Jobs are submitted using sbatch (batch jobs) or srun (interactive jobs).
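As an illustration, a batch job is an ordinary shell script whose #SBATCH comment lines carry the resource request. The sketch below is a minimal example, not a site-recommended template: the job name, time limit, and output file name are hypothetical, and the partition is taken from the sinfo listing earlier in this section.

```shell
#!/bin/bash
# Hypothetical job name shown in the squeue NAME column.
#SBATCH --job-name=hello
# Partition taken from the sinfo example; adjust to your needs.
#SBATCH --partition=cpu384g
# One task, five minutes of wall time (illustrative values).
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
# %j is replaced by the job ID in the output file name.
#SBATCH --output=hello_%j.out

# The actual work: report which node the job landed on.
message="Hello from $(uname -n)"
echo "$message"
```

Saved as hello.sh (a hypothetical file name), it would be submitted with sbatch hello.sh. For an interactive shell on a compute node, a common form is srun --pty bash; the resource options you need alongside it depend on the site configuration.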

Slurm defines many job states. The five you will encounter most often are:

  • PD (Pending): Waiting for resources.

  • R (Running): Currently executing.

  • CG (Completing): Finishing up.

  • CD (Completed): Finished successfully.

  • F (Failed): Encountered an error.

To view pending and running jobs, use the squeue command.

Example output:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              3777   cpu384g matlab_p     jd01  R    4:02:04      1 cpusm01
 3852_[226-241%20]  gpu1h100 Ecthelio     jd02 PD       0:00      1 (QOSMaxNodePerUserLimit)
          3852_224  gpu1h100 Ecthelio     jd02  R       2:00      1 gpusm03
          3852_225  gpu1h100 Ecthelio     jd02  R       2:00      1 gpusm03
          3852_222  gpu1h100 Ecthelio     jd02  R       2:01      1 gpusm02
          3852_223  gpu1h100 Ecthelio     jd02  R       2:01      1 gpusm02
              3343  gpu1h100 sno_pv80     jd03  R    4:02:05      1 gpusm01

Explanation of columns:

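  • JOBID: Unique identifier for each job. Tasks of an array job appear as <jobid>_<index>; in the example, 3852_[226-241%20] stands for the pending array tasks 226 to 241, with at most 20 tasks of the array allowed to run at once.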

  • PARTITION: Queue the job was submitted to.

  • NAME: Job name.

  • USER: Submitting user.

  • ST: Job status (R = Running, PD = Pending).

  • TIME: Time the job has been running so far.

  • NODES: Number of nodes allocated to, or requested by, the job.

  • NODELIST(REASON): Node(s) assigned or reason for pending status.
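The ST column makes it easy to get a quick count of how many jobs are running versus waiting. The following is a minimal sketch, assuming the default squeue column order shown above (ST in column 5) and job names without spaces; jobs_by_state is a hypothetical helper name.

```shell
# jobs_by_state: read default-format squeue output on stdin and print
# "<state code> <job count>" per state, sorted by state code.
# Assumption: ST is the fifth whitespace-separated column.
jobs_by_state() {
    awk 'NR > 1 { count[$5]++ }
         END    { for (s in count) print s, count[s] }' | sort
}
```

Run as squeue | jobs_by_state, the example listing above would yield PD 1 and R 6.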

Job Steps

A job step is a unit of work executed within a running job. While a job defines the overall resource allocation (e.g., number of nodes, CPUs, memory, time), job steps are the actual commands or tasks that run using those resources.

Key Characteristics of a Job Step:

  • Executed with srun: Job steps are typically launched using the srun command inside a job script or interactively.

  • Shares job resources: All job steps run within the resource allocation defined by the parent job (sbatch).

  • Can run in parallel: Multiple job steps can be launched simultaneously, each using a subset of the allocated resources.

  • Useful for multi-task jobs: Ideal when you want to run several independent tasks (e.g., simulations, analyses) within a single job submission.
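The characteristics above can be sketched as a batch script that requests two tasks and launches two job steps side by side. This is an illustrative sketch under stated assumptions: the job name, resource values, and echo commands are placeholders, and the step and run_steps helpers are hypothetical. The --exact option asks srun to give each step only the resources it requests; whether it is needed depends on the Slurm version and site configuration. The fallback that runs commands directly when SLURM_JOB_ID is unset exists only so the sketch can be tried outside the cluster.

```shell
#!/bin/bash
# Hypothetical job name and illustrative resource request.
#SBATCH --job-name=steps_demo
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
#SBATCH --time=00:05:00

# step: run one command as a job step.
# Inside an allocation Slurm sets SLURM_JOB_ID, so srun is used.
# Outside Slurm (assumption: dry run), the command runs directly.
step() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        srun --ntasks=1 --exact "$@"
    else
        "$@"
    fi
}

# run_steps: launch two independent steps in parallel and wait for both.
run_steps() {
    step echo "step A done" &   # first step, sent to the background
    step echo "step B done" &   # second step, runs concurrently
    wait                        # return only after both steps finish
}

run_steps
```

Because the steps run concurrently, the order of the two output lines is not guaranteed. In a real job, each echo would be replaced by the program for that task.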