.. _slurm_single_job_multiple_steps:

Single Job with Multiple Job Steps
==================================

If your workload consists of many small, independent tasks (e.g., 50 simulations), and each task uses only a few cores, you can submit a single job that launches multiple parallel job steps within the script.

**Example Setup:**

- Each node in the cpu384g queue has 32 cores.
- Request 8 nodes. So, 256 cores total.
- Divide 256 cores across 50 tasks. That is ~5 cores per task (``floor(256/50)``).
- Each node has ~384 GB of memory. That is ~12 GB per core (``floor(384GB/32)``).

Assuming each individual task takes up to 6 hours and all tasks run in parallel, the overall runtime should also be around 6 hours. However, in practice, delays or unexpected hang-ups can occur. To account for this, it's a good idea to request additional time as a buffer. For example, adding a 2-hour grace period brings the total requested runtime to 8 hours, which helps ensure the job completes successfully even if a few tasks take longer than expected.

.. note:: A job step runs on a single node, so a 32-core node holds at most 6 five-core steps (48 across 8 nodes). The remaining 2 steps start only when cores free up, which can extend the runtime beyond this estimate; using 4 cores per step lets all 50 run at once.

**Sample Slurm Script:**

.. code-block:: bash

    #!/bin/bash
    #SBATCH --partition=cpu384g
    #SBATCH --nodes=8
    #SBATCH --ntasks=256
    #SBATCH --time=08:00:00
    #SBATCH --mem-per-cpu=12288M
    #SBATCH --error=job.err
    #SBATCH --output=job.out
    #SBATCH --job-name=my_simulations

    module load your_program

    for i in {1..50}; do
        # --exact      Each job step is only given the resources it requested to avoid contention
        # --exclusive  Ensures srun uses distinct CPUs for each job step
        # --ntasks=1   Launches one task per job step instead of inheriting the job's task count
        # The & makes each job step run in parallel
        srun --exact --exclusive --nodes=1 --ntasks=1 --cpus-per-task=5 your_program ... &
    done

    # wait for all job steps to finish
    wait

**This approach:**

- Submits only one job, staying within the 20-job limit.
- Maximizes utilization of your allowed 8 nodes.

.. _slurm_job_array_submission:

Job Arrays
==========

If you prefer to submit each task as a separate job, use Slurm job arrays. Each array index counts as a job, so you must limit concurrent jobs to 20.

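
Inside an array job, Slurm sets ``SLURM_ARRAY_TASK_ID`` to the index of the current array task, which is a common way to give each task its own input. The following is a minimal sketch rather than a prescribed layout: the ``input_<index>.dat`` naming scheme is a hypothetical convention, and the fallback index of ``0`` is only there so the snippet can be tried outside Slurm.

.. code-block:: bash

    # Index of this array task; defaults to 0 when not running under Slurm.
    task_id="${SLURM_ARRAY_TASK_ID:-0}"

    # Hypothetical naming convention: one input file per array index.
    input_file="input_${task_id}.dat"

    echo "Array task ${task_id} would process ${input_file}"

In a real job script, the last line would be replaced by the program invocation that reads ``${input_file}``.
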

**Example Setup:**

- 50 tasks total, so ``--array=0-49%20`` (only 20 run at a time).
- 2 nodes have 64 cores. So, with 20 array jobs running at a time, you can maximize utilization with ~3 cores per array job (``floor(64/20)``).
- Memory per task: 3 cores * 4023 MB = 12069 MB.

**Sample Slurm Script:**

.. code-block:: bash

    #!/bin/bash
    #SBATCH --partition=cpu384g
    #SBATCH --ntasks=3
    #SBATCH --mem=12069M
    #SBATCH --array=0-49%20
    #SBATCH --time=08:00:00
    #SBATCH --error=array_%A_%a.err
    #SBATCH --output=array_%A_%a.out
    #SBATCH --job-name=my_array_jobs

    module load your_program

    your_program ...

**This approach:**

- Uses job arrays to manage many tasks.
- Keeps concurrent jobs within the allowed limit.

Requesting CPU Nodes with Multiple Parallel Tasks
=================================================

This is perhaps the most straightforward scenario and is ideal when you're running an application that uses only CPUs. Below are basic templates for both batch and interactive jobs.

**Batch Job Template**

.. code-block:: bash

    #!/bin/bash
    #SBATCH --partition=cpu384g
    #SBATCH --nodes=<nodes>
    #SBATCH --ntasks-per-node=<processes>
    #SBATCH --cpus-per-task=<threads>
    #SBATCH --time=<time>

**Interactive Job Template**

.. code-block:: text

    salloc --partition=cpu384g --nodes=<nodes> --ntasks-per-node=<processes> --cpus-per-task=<threads> --time=<time>

Now, let's break down how to choose values for ``<nodes>``, ``<processes>``, and ``<threads>`` depending on how your application behaves:

Case 1: Threaded Applications (e.g., OpenMP, TBB, MPI threads)
--------------------------------------------------------------

If your application internally distributes work across threads (e.g., using OpenMP or similar), and does not spawn multiple processes, then:

- nodes = 1
- processes = 1
- threads = 32 (assuming you're using a full node with 32 cores)

This setup gives your application full access to all cores on a single node for threading.

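
For OpenMP programs, the thread count is commonly taken from the ``OMP_NUM_THREADS`` environment variable. A minimal sketch, assuming an OpenMP application: it copies the ``--cpus-per-task`` value, which Slurm exposes as ``SLURM_CPUS_PER_TASK``, into that variable. The fallback of ``1`` is an assumption for runs outside a Slurm allocation, and other threading libraries use their own settings.

.. code-block:: bash

    # Match the OpenMP thread count to the cores Slurm assigned to this task.
    # Falls back to 1 thread when SLURM_CPUS_PER_TASK is not set.
    export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

    echo "Running with ${OMP_NUM_THREADS} OpenMP threads"

Placing these lines before the program invocation in the job script keeps the thread count in step with the requested cores.
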

Case 2: Multi-Process Applications (Single Node)
------------------------------------------------

If your application spawns multiple processes that should run within the same node (and does not benefit from distribution across multiple nodes), then:

- nodes = 1
- processes = 32
- threads = 1

This configuration runs 32 independent processes, each using one core.

Case 3: Hybrid Parallelism (Multiple Processes, Each with Threads)
------------------------------------------------------------------

If your application runs multiple processes, and each process uses multiple threads (e.g., VASP without GPU support), then:

- nodes = 1
- processes = <number of processes>
- threads = <threads per process>

This allows you to fine-tune how many cores each process uses, while keeping everything within a single node.

Case 4: Multi-Node Jobs
-----------------------

If your workload requires multiple nodes, set:

- nodes = <number of nodes>
- processes >= number of nodes (at minimum)

.. warning:: Be mindful of per-user node limits. Requesting more nodes than allowed will result in job rejection or queuing delays.

Additional Options
------------------

You can further customize how your tasks run:

- Use multiple job steps within a single job (see: :ref:`Single Job with Multiple Job Steps <slurm_single_job_multiple_steps>`)
- Use job arrays to manage many similar tasks efficiently (see: :ref:`Job Arrays <slurm_job_array_submission>`)

These approaches help you stay within job submission limits while maximizing resource utilization.

Single GPU node with one task per GPU and CPU cores evenly distributed across tasks
===================================================================================

This setup is ideal for applications that leverage GPU acceleration and can run multiple tasks in parallel.

**Hardware Overview (Per GPU Node)**

- There are 2 GPUs per node in the ``gpu2h100`` partition, so 2 tasks total.
- There are 32 CPU cores per node in the ``gpu2h100`` partition, so 16 CPUs per task (``32 cpus / 2 tasks``).
- Slurm recognizes 3000000MB of RAM on each GPU node, so CPU memory per task (i.e. RAM) = CPU memory per GPU (since we have 1 task per GPU) = 3000000MB/2 = 1500000MB.

With these numbers in mind, here are the basic templates:

**Batch Job Template**

.. code-block:: bash

    #!/bin/bash
    #SBATCH --partition=gpu2h100
    #SBATCH --nodes=<nodes>
    #SBATCH --ntasks-per-node=2
    #SBATCH --gpus-per-task=1
    #SBATCH --cpus-per-task=16
    #SBATCH --mem-per-gpu=1500000M
    #SBATCH --time=<time>

**Interactive Job Template**

.. code-block:: text

    salloc --partition=gpu2h100 --nodes=1 --ntasks-per-node=2 --gpus-per-task=1 --cpus-per-task=16 --mem-per-gpu=1500000M --time=<time>

This configuration ensures that each task gets exclusive access to one GPU and a fair share of CPU and memory resources. It's particularly useful for GPU-enabled applications that:

- Run two independent tasks per node, each using one GPU.
- Benefit from dedicated CPU cores for preprocessing, I/O, or hybrid CPU-GPU workloads.
- Require large memory allocations, such as deep learning models or molecular simulations.

If your application only uses one GPU per node, you should change from the ``gpu2h100`` queue to the ``gpu1h100`` queue for your job and scale the CPU and memory requests accordingly.

When using job arrays, keep in mind that each task is bound to a GPU, so you can either:

1. Run 4 array jobs at a time (i.e. ``%4``) if each array job requests 1 GPU.
2. Run 2 array jobs at a time (i.e. ``%2``) if each array job requests 2 GPUs.
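
The two options above are consistent with a limit of 4 GPUs in use at once, so the array throttle is that limit divided by the GPUs each array job requests. A small sketch of the arithmetic follows; the value ``gpu_limit=4`` is inferred from the numbers above rather than a documented setting, and the ``0-49`` index range is only illustrative.

.. code-block:: bash

    gpu_limit=4      # assumption: inferred from the two options above
    gpus_per_job=2   # GPUs requested by each array job

    # Integer division gives the number of array jobs that can run at once.
    throttle=$(( gpu_limit / gpus_per_job ))

    echo "#SBATCH --array=0-49%${throttle}"

Setting ``gpus_per_job=1`` in the same sketch yields a throttle of 4, matching the first option.
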