Single Job with Multiple Job Steps
If your workload consists of many small, independent tasks (e.g., 50 simulations), and each task uses only a few cores, you can submit a single job that launches multiple parallel job steps within the script.
Example Setup:
Each node in the cpu384g queue has 32 cores.
Request 8 nodes. So, 256 cores total (8 × 32).
Divide 256 cores across 50 tasks. That is 5 cores per task (floor(256/50)).
Each node has ~384 GB of memory. That is ~12 GB per core (floor(384 GB / 32)).
Assuming each individual task takes up to 6 hours and all tasks run in parallel, the overall runtime should also be around 6 hours. However, in practice, delays or unexpected hang-ups can occur. To account for this, it’s a good idea to request additional time as a buffer.
For example, adding a 2-hour grace period brings the total requested runtime to 8 hours, which helps ensure the job completes successfully even if a few tasks take longer than expected.
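The sizing arithmetic above can be sanity-checked with plain shell arithmetic (values are the ones from this example; note that bash's `$(( ))` already does floor division on integers):

```shell
# Values from the example: 8 nodes in cpu384g, 32 cores and 384 GB each, 50 tasks
NODES=8; CORES_PER_NODE=32; TASKS=50; MEM_GB_PER_NODE=384
TOTAL_CORES=$(( NODES * CORES_PER_NODE ))                  # 8 * 32 = 256
CORES_PER_TASK=$(( TOTAL_CORES / TASKS ))                  # floor(256/50) = 5
MEM_GB_PER_CORE=$(( MEM_GB_PER_NODE / CORES_PER_NODE ))    # floor(384/32) = 12
echo "${TOTAL_CORES} cores total, ${CORES_PER_TASK} cores/task, ${MEM_GB_PER_CORE} GB/core"
```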
Sample Slurm Script:
#!/bin/bash
#SBATCH --partition=cpu384g
#SBATCH --nodes=8
#SBATCH --ntasks=50
#SBATCH --cpus-per-task=5
#SBATCH --time=08:00:00
#SBATCH --mem-per-cpu=12288M
#SBATCH --error=job.err
#SBATCH --output=job.out
#SBATCH --job-name=my_simulations
module load your_program
for i in {1..50}; do
# --exact Each job step is only given the resources it requested to avoid contention
# --exclusive Ensures srun uses distinct CPUs for each job step
# The & makes each job step run in parallel
srun --ntasks=1 --exact --exclusive --cpus-per-task=5 your_program ... &
done
# wait for all job steps to finish
wait
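In practice, each of the 50 steps usually needs its own input and log file. A hedged variant of the loop (the input_N.dat and task_N.log names are hypothetical placeholders for your own files):

```shell
for i in {1..50}; do
    # One task per step (--ntasks=1); redirect each step's output to its own log
    srun --ntasks=1 --exact --exclusive --cpus-per-task=5 \
        your_program "input_${i}.dat" > "task_${i}.log" 2>&1 &
done
wait  # block until every background job step has finished
```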
This approach:
Submits only one job, staying within the 20-job limit.
Maximizes utilization of your allowed 8 nodes.
Job Arrays
If you prefer to submit each task as a separate job, use Slurm job arrays. Each array index counts as a job, so you must limit concurrent jobs to 20.
Example Setup:
50 tasks total, so --array=0-49%20 (only 20 run at a time).
Two nodes provide 64 cores (32 each). So, with 20 array jobs running at a time, you can maximize utilization with ~3 cores per array job (floor(64/20)).
Memory per task: 3 cores * 4023 MB = 12069 MB.
Sample Slurm Script:
#!/bin/bash
#SBATCH --partition=cpu384g
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
#SBATCH --mem=12069M
#SBATCH --array=0-49%20
#SBATCH --time=08:00:00
#SBATCH --error=array_%A_%a.err
#SBATCH --output=array_%A_%a.out
#SBATCH --job-name=my_array_jobs
module load your_program
your_program ...
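Inside an array job, Slurm exports SLURM_ARRAY_TASK_ID (0 through 49 here), which is the usual way to point each array element at its own input. The file names below are hypothetical:

```shell
# Each array element selects its input by index: input_0.dat ... input_49.dat
your_program "input_${SLURM_ARRAY_TASK_ID}.dat" > "result_${SLURM_ARRAY_TASK_ID}.out"
```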
This approach:
Uses job arrays to manage many tasks.
Keeps concurrent jobs within the allowed limit.
Requesting CPU Nodes with Multiple Parallel Tasks
This is perhaps the most straightforward scenario and is ideal when you’re running an application that uses only CPUs. Below are basic templates for both batch and interactive jobs.
Batch Job Template
#!/bin/bash
#SBATCH --partition=cpu384g
#SBATCH --nodes=<nodes>
#SBATCH --ntasks-per-node=<processes>
#SBATCH --cpus-per-task=<threads>
#SBATCH --time=<walltime>
Interactive Job Template
salloc --partition=cpu384g --nodes=<nodes> --ntasks-per-node=<processes> --cpus-per-task=<threads> --time=<walltime>
Now, let’s break down how to choose values for <nodes>, <processes>, and <threads> depending on how your application behaves:
Case 1: Threaded Applications (e.g., OpenMP, TBB, pthreads)
If your application internally distributes work across threads (e.g., using OpenMP or similar), and does not spawn multiple processes, then:
nodes = 1
processes = 1
threads = 32 (assuming you’re using a full node with 32 cores)
This setup gives your application full access to all cores on a single node for threading.
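As a concrete sketch of Case 1 (the walltime and program name are placeholders), a full-node threaded job on cpu384g might look like:

```shell
#!/bin/bash
#SBATCH --partition=cpu384g
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --time=02:00:00

# Let the threading runtime match the Slurm allocation
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun your_program
```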
Case 2: Multi-Process Applications (Single Node)
If your application spawns multiple processes that should run within the same node (and does not benefit from distribution across multiple nodes), then:
nodes = 1
processes = 32
threads = 1
This configuration runs 32 independent processes, each using one core.
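Filled in for Case 2 (placeholders as before), srun then launches 32 copies of the program, one per core:

```shell
#!/bin/bash
#SBATCH --partition=cpu384g
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=1
#SBATCH --time=02:00:00

srun your_program   # starts 32 single-core processes
```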
Case 3: Hybrid Parallelism (Multiple Processes, Each with Threads)
If your application runs multiple processes, and each process uses multiple threads (e.g., VASP without GPU support), then:
nodes = 1
processes = <number of processes>
threads = <threads per process>
This allows you to fine-tune how many cores each process uses, while keeping everything within a single node.
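For example, a hybrid layout with 4 processes of 8 threads each fills the same 32-core node (the 4 × 8 split is just one possible choice; pick what suits your application):

```shell
#!/bin/bash
#SBATCH --partition=cpu384g
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --time=04:00:00

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK  # 8 threads per process
srun your_program                            # 4 processes x 8 threads = 32 cores
```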
Case 4: Multi-Node Jobs
If your workload requires multiple nodes, set:
nodes = <number of nodes>
processes >= number of nodes (at minimum)
Warning
Be mindful of per-user node limits. Requesting more nodes than allowed will result in job rejection or queuing delays.
Additional Options
You can further customize how your tasks run:
Use multiple job steps within a single job (see: Single Job with Multiple Job Steps)
Use job arrays to manage many similar tasks efficiently (see: Job Arrays)
These approaches help you stay within job submission limits while maximizing resource utilization.
Single GPU node with one task per GPU and CPU cores evenly distributed across tasks
This setup is ideal for applications that leverage GPU acceleration and can run multiple tasks in parallel.
Hardware Overview (Per GPU Node)
There are 2 GPUs per node in the gpu2h100 partition, so 2 tasks total (one task per GPU).
There are 32 CPU cores per node in the gpu2h100 partition, so 16 CPUs per task (32 CPUs / 2 tasks).
Slurm recognizes 3000000 MB of RAM on each GPU node, so CPU memory per task (i.e., RAM) = CPU memory per GPU (since we have one task per GPU) = 3000000 MB / 2 = 1500000 MB.
With these numbers in mind, here are the basic templates:
Batch Job Template
#!/bin/bash
#SBATCH --partition=gpu2h100
#SBATCH --nodes=<nodes>
#SBATCH --ntasks-per-node=2
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-gpu=1500000M
#SBATCH --time=<walltime>
Interactive Job Template
salloc --partition=gpu2h100 --nodes=1 --ntasks-per-node=2 --gpus-per-task=1 --cpus-per-task=16 --mem-per-gpu=1500000M --time=<walltime>
This configuration ensures that each task gets exclusive access to one GPU and a fair share of CPU and memory resources. It’s particularly useful for GPU-enabled applications that:
Run two independent tasks per node, each using one GPU.
Benefit from dedicated CPU cores for preprocessing, I/O, or hybrid CPU-GPU workloads.
Require large memory allocations, such as deep learning models or molecular simulations.
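To confirm that each task really sees only its own GPU, you can run a quick check inside the allocation (nvidia-smi -L lists the GPUs visible to the calling process):

```shell
srun nvidia-smi -L   # each of the two tasks should report a single H100
```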
If your application uses only one GPU per node, switch your job from the gpu2h100 queue to the gpu1h100 queue and scale the CPU and memory requests accordingly.
When using job arrays, keep in mind that each task is bound to a GPU, so you can either:
Submit 4 array jobs at a time (i.e., %4) if each array job requests one GPU.
Submit 2 array jobs at a time (i.e., %2) if each array job requests 2 GPUs.
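Putting this together, a GPU job array with one GPU per element and at most 4 elements running concurrently could be sketched as follows (the walltime and the input_N.dat naming are placeholders):

```shell
#!/bin/bash
#SBATCH --partition=gpu2h100
#SBATCH --array=0-49%4
#SBATCH --ntasks=1
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-gpu=1500000M
#SBATCH --time=08:00:00

your_program "input_${SLURM_ARRAY_TASK_ID}.dat"
```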