Basic Slurm Terminology
Partition (a.k.a. Queue)
A partition is a logical grouping of compute nodes, similar to a queue. Each partition may have different hardware, time limits, or access policies.
Use sinfo to list available partitions.
To check the current status of available queues and nodes, use the sinfo command.
This command provides a snapshot of which nodes are idle, allocated, or down, along with their associated partitions and time limits.
Example output:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
cpu384g* up 3-00:00:00 77 idle cpusm[01-77]
cpu1500g up 3-00:00:00 6 idle cpumd[01-06]
cpu6000g up 3-00:00:00 3 idle cpulg[01-03]
gpu1h100 up 3-00:00:00 15 alloc gpusm[01-15]
gpu2h100 up 3-00:00:00 10 alloc gpumd[01-10]
hgxh200 up 3-00:00:00 1 alloc gpulg01
Node
A node is a physical machine in the cluster. Each node has a specific number of CPU cores, memory, and possibly GPUs.
To get detailed information about a specific node, use:
scontrol show node <nodename>
For example:
scontrol show node cpusm01
This will return detailed specs and current usage for the node:
NodeName=cpusm01 Arch=x86_64 CoresPerSocket=16
CPUAlloc=0 CPUEfctv=32 CPUTot=32 CPULoad=0.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=cpusm01 NodeHostName=cpusm01 Version=24.11.5
OS=Linux 5.14.0-427.42.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Oct 31 14:01:51 UTC 2024
RealMemory=386000 AllocMem=0 FreeMem=381353 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=cpu384g
BootTime=2025-10-24T14:55:46 SlurmdStartTime=2025-10-27T17:25:08
LastBusyTime=2025-11-03T17:39:52 ResumeAfterTime=None
CfgTRES=cpu=32,mem=386000M,billing=32
AllocTRES=
CurrentWatts=364 AveWatts=364x
Key fields to note:
CPUTot: Total number of CPU cores available.
RealMemory: Total memory available on the node.
State: Current status (e.g., IDLE, ALLOC, DOWN).
Partitions: Indicates which queue(s) the node belongs to.
Job
A job is a unit of work submitted to Slurm. It can be a single task or a collection of tasks (e.g., simulations, data processing).
Jobs are submitted using sbatch (batch jobs) or srun (interactive jobs).
A Job can have 5 states:
PD (Pending): Waiting for resources.
R (Running): Currently executing.
CG (Completing): Finishing up.
CD (Completed): Finished successfully.
F (Failed): Encountered an error.
To view jobs currently in the queue or running, use: squeue.
Example output:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3777 cpu384g matlab_p jd01 R 4:02:04 1 cpusm01
3852_[226-241%20] gpu1h100 Ecthelio jd02 PD 0:00 1 (QOSMaxNodePerUserLimit)
3852_224 gpu1h100 Ecthelio jd02 R 2:00 1 gpusm03
3852_225 gpu1h100 Ecthelio jd02 R 2:00 1 gpusm03
3852_222 gpu1h100 Ecthelio jd02 R 2:01 1 gpusm02
3852_223 gpu1h100 Ecthelio jd02 R 2:01 1 gpusm02
3343 gpu1h100 sno_pv80 jd03 R 4:02:05 1 gpusm01
Explanation of columns:
JOBID: Unique identifier for each job.
PARTITION: Queue the job was submitted to.
NAME: Job name.
USER: Submitting user.
ST: Job status (R = Running, PD = Pending).
TIME: Runtime so far.
NODELIST(REASON): Node(s) assigned or reason for pending status.
Job Steps
A job step is a unit of work executed within a running job. While a job defines the overall resource allocation (e.g., number of nodes, CPUs, memory, time), job steps are the actual commands or tasks that run using those resources.
Key Characteristics of a Job Step:
Executed with srun: Job steps are typically launched using the
sruncommand inside a job script or interactively.Shares job resources: All job steps run within the resource allocation defined by the parent job (
sbatch).Can run in parallel: Multiple job steps can be launched simultaneously, each using a subset of the allocated resources.
Useful for multi-task jobs: Ideal when you want to run several independent tasks (e.g., simulations, analyses) within a single job submission.