PyTorch
To use PyTorch on the cluster, start by reviewing the Conda installer and how to manage Conda environments.
Note
Newer PyTorch versions are not available via Conda, but you can install them using pip within Conda environments.
There are two main ways to use PyTorch:
Using the Global Conda Environment
The cluster provides a pre-configured Conda environment with PyTorch. Only administrators can modify this environment, so you’re limited to the installed packages.
module load miniforge3/25.3.1-gcc-11.4.1
conda env list
conda activate pytorch
Creating a Custom Conda Environment
You can create your own environment and install PyTorch via pip. Note that cloning the global pytorch environment won’t include PyTorch itself, as it was installed via pip. Our GPUs support CUDA up to 12.9. Below are installation examples (each environment gets its own python so that pip installs into the environment rather than into the read-only base):
module load miniforge3/25.3.1-gcc-11.4.1
conda create --name my_pytorch_cuda11.8 python
conda activate my_pytorch_cuda11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
module load miniforge3/25.3.1-gcc-11.4.1
conda create --name my_pytorch_cuda12.6 python
conda activate my_pytorch_cuda12.6
pip install torch torchvision torchaudio
module load miniforge3/25.3.1-gcc-11.4.1
conda create --name my_pytorch_cuda12.8 python
conda activate my_pytorch_cuda12.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
Verifying GPU Availability
After installing PyTorch and activating your environment, you can verify that PyTorch detects the available GPUs by using the following Python commands on a GPU node:
import torch
print("CUDA available:", torch.cuda.is_available())
print("Number of GPUs:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("Current GPU:", torch.cuda.get_device_name(torch.cuda.current_device()))
Remember that you’ll need to request an interactive or batch job to get access to a GPU node. For example:
# Submit interactive job
srun --partition=gpu1h100 --job-name test_my_pytorch_env \
    --time=05:00 --nodes=1 --gpus=1 --ntasks-per-node=1 \
    --pty /bin/bash -i
# Create a Python file to test PyTorch
cat <<EOF > pytorch_test.py
import torch
print("CUDA available:", torch.cuda.is_available())
print("Number of GPUs:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("Current GPU:", torch.cuda.get_device_name(torch.cuda.current_device()))
EOF
# Execute the test program
module load miniforge3/25.3.1-gcc-11.4.1
conda activate my_pytorch_env  # use the name of the environment you created
python pytorch_test.py
If CUDA is available and at least one GPU is detected, you should see output similar to:
CUDA available: True
Number of GPUs: 1
Current GPU: NVIDIA H100 NVL
Note
If torch.cuda.is_available() returns False, ensure that:
You are running on a compute node with GPU access (not a login or CPU node).
Your job explicitly requests 1 or more GPUs (e.g. --gpus=2, --gpus-per-node=2).
Your environment includes a PyTorch build with CUDA support (see the check below).
The appropriate GPU drivers and CUDA libraries are available on the system.
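To verify the last two points, a minimal check like the following prints the CUDA version your PyTorch build was compiled against:
import torch

# Version of the installed PyTorch package; CPU-only wheels often carry a "+cpu" suffix
print("PyTorch version:", torch.__version__)

# CUDA version the build was compiled against; None indicates a CPU-only build
print("Built with CUDA:", torch.version.cuda)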
Using GPUs in PyTorch
Once you’ve confirmed that your custom PyTorch environment detects the GPUs, you can start using it for computations. Below are common usage patterns:
Moving Tensors to GPU
You can move tensors to the GPU using the .to() or .cuda() methods:
import torch
# Create a tensor on the CPU
x_cpu = torch.randn(3, 3)
# Move it to the GPU (if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x_gpu = x_cpu.to(device)
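# Equivalent alternative: x_gpu = x_cpu.cuda() also moves the tensor to the
# default GPU, but it raises an error when no GPU is available, so .to(device)
# is the safer, more portable choice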
print("Tensor device:", x_gpu.device)
Model Training on GPU
To train a model on the GPU, both the model and the data must be moved to the GPU:
import torch
import torch.nn as nn
import torch.optim as optim
# Dummy model
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 1)

    def forward(self, x):
        return self.linear(x)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleModel().to(device)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
# Dummy data
inputs = torch.randn(32, 10).to(device)
targets = torch.randn(32, 1).to(device)
# Training step
optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, targets)
loss.backward()
optimizer.step()
print("Training step completed on:", device)
Monitoring GPU Usage
You can monitor GPU usage with:
nvidia-smi
This command shows GPU memory usage, active processes, and more.
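Run nvidia-smi on the node where your job is executing (watch -n 1 nvidia-smi gives a live view). From inside a running script, PyTorch also exposes per-device memory counters, as in this minimal sketch:
import torch

if torch.cuda.is_available():
    dev = torch.cuda.current_device()
    # Memory held by live tensors vs. memory reserved by PyTorch's caching allocator
    print("Allocated:", torch.cuda.memory_allocated(dev) / 1024**2, "MiB")
    print("Reserved:", torch.cuda.memory_reserved(dev) / 1024**2, "MiB")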
Multi-GPU Usage in PyTorch
PyTorch supports single-node multi-GPU training. We present detailed examples below. Users are encouraged to read the torchrun (Elastic Launch) documentation for more information and use cases.
Single Node, Multi-GPU (DataParallel or DDP)
For simple use cases, you can use torch.nn.DataParallel, but for better performance and scalability, torch.nn.parallel.DistributedDataParallel (DDP) is recommended.
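For comparison, DataParallel needs only a one-line wrapper around the model. A minimal sketch (it splits each input batch along dimension 0 across all visible GPUs):
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(10, 1)
if torch.cuda.device_count() > 1:
    # Replicates the model on every visible GPU and scatters each input
    # batch across them; outputs are gathered back on the default GPU
    model = nn.DataParallel(model)
model = model.to(device)

outputs = model(torch.randn(32, 10).to(device))
print(outputs.shape)  # torch.Size([32, 1])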
Below is a template you can use to run a batch job using DDP on a single node while using 2 GPUs and all CPU cores.
Note
Keep in mind that when using the nccl backend with DDP, only 1 process per GPU is allowed. In this case,
each node in the gpu2h100 partition has 32 CPU cores and 2 GPUs. Since we are using 1 process per GPU,
we want each process to spawn multiple OpenMP threads,
so we do 32 (CPU cores) / 2 (GPU processes) = 16 threads per GPU process.
#!/bin/bash
#SBATCH --job-name=ddp_single_node
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --gpus-per-node=2
#SBATCH --cpus-per-task=16
#SBATCH --time=01:00:00
#SBATCH --partition=gpu2h100
module load miniforge3/25.3.1-gcc-11.4.1
conda activate pytorch
# Each node in the gpu2h100 queue has 32 CPU cores and 2 GPUs. Since we
# are using 1 process per GPU, we are left with 32 cores.
# We want each process to spawn multiple OpenMP threads, so we
# do 32 (CPU cores) / 2 (GPU processes) = 16 (threads per GPU process)
export OMP_NUM_THREADS=16
# These are other OpenMP options used to control placement of threads
# in CPU cores
export OMP_PLACES=cores
export OMP_PROC_BIND=close
export OMP_STACKSIZE=512m
# Launch torchrun once; it spawns one worker process per GPU on this node.
# (Launching it through srun with --ntasks-per-node=2 would start two
# competing torchrun instances.)
python3 -m torch.distributed.run \
    --standalone \
    --nnodes=1 \
    --nproc-per-node=2 \
    /path/to/train.py
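In recent PyTorch versions, torchrun is an equivalent entry point for python3 -m torch.distributed.run, so the command above can also be written with torchrun.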
In your train.py, initialize DDP like this:
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = MyModel().to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    # Training loop here...

if __name__ == "__main__":
    main()
Here is a working example of train.py:
import os
import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import Dataset, DataLoader, DistributedSampler
# Dummy dataset
class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)
        self.labels = torch.randn(length, 1)

    def __getitem__(self, index):
        return self.data[index], self.labels[index]

    def __len__(self):
        return self.len

# Simple model
class SimpleModel(nn.Module):
    def __init__(self, input_size):
        super(SimpleModel, self).__init__()
        self.linear = nn.Linear(input_size, 1)

    def forward(self, x):
        return self.linear(x)

def main():
    # Initialize the process group
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)

    # Create model and move to GPU
    model = SimpleModel(input_size=10).to(device)
    model = DDP(model, device_ids=[local_rank])

    # Create dataset and distributed sampler
    dataset = RandomDataset(size=10, length=1000)
    sampler = DistributedSampler(dataset)
    dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)

    # Loss and optimizer
    criterion = nn.MSELoss()
    optimizer = optim.SGD(model.parameters(), lr=0.01)

    # Training loop
    for epoch in range(5):
        sampler.set_epoch(epoch)
        for batch_idx, (data, target) in enumerate(dataloader):
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            if batch_idx % 10 == 0 and local_rank == 0:
                print(f"Epoch {epoch} | Batch {batch_idx} | Loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()