When analyzing dozens to thousands of omics datasets, it is essential to parallelize the compute. In this post, I showcase a couple of approaches to large-scale batch analysis on high-performance computing (HPC) clusters.
What is Slurm?
Slurm (Simple Linux Utility for Resource Management) is an open-source workload manager and job scheduler used by many research computing clusters. A good analogy here would be a large, bustling restaurant kitchen. Dozens of chefs (researchers) hand in orders (jobs), each needing different ingredients and cook times. Slurm acts as the head chef and kitchen manager. It reads every order, checks what equipment and ingredients are needed, and decides not just who cooks what, but when and in which part of the kitchen. This careful orchestration ensures efficient use of space and staff, prevents bottlenecks at the stoves, and helps every dish arrive on time.
For situations where many near-identical orders come in (like a banquet with hundreds of similar plates), Slurm can use job arrays to batch, track, and distribute the work so each plate (job) is prepared efficiently and in parallel, without mixing up the recipes (workflows).
In short, Slurm is the coordinator behind the scenes, ensuring smooth and efficient delivery of each computational dish on a shared cluster.
Resource: Official Slurm documentation lists many useful commands.
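For day-to-day work, a handful of commands cover most needs. A quick, non-exhaustive reference (job IDs below are placeholders):
sbatch job.sh               # submit a job script
squeue -u $USER             # list your queued and running jobs
scancel <jobid>             # cancel a job (scancel <jobid>_<taskid> cancels one array task)
sacct -j <jobid>            # accounting info for a running or finished job
scontrol show job <jobid>   # detailed information about a job
sinfo                       # overview of partitions and node states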
Slurm arrays: a simple way to batch
A Slurm array is an efficient way to run the same command or workflow across many samples or datasets with a single job submission.
An array is a collection of data items, like a list in Python (["sampleA", "sampleB", "sampleC"]) or a vector in R (c("sampleA", "sampleB", "sampleC")). When you use a for or while loop in R or Python, you are instructing the computer to perform the same operation repeatedly on each element in the list.
The same looping logic translates to the cluster when you use a Slurm array: essentially, Slurm executes the task in parallel for all samples in the array. Each job in the array receives a unique task index (SLURM_ARRAY_TASK_ID), which lets the script determine which sample or dataset to work on during that particular run. This allows many samples to be processed efficiently and automatically, without submitting each job individually.
Manual for-loop jobs:
sbatch job_for_sampleA
sbatch job_for_sampleB
sbatch job_for_sampleC
|
+--> Scheduler runs each job as an independent submission
Slurm array job:
sbatch --array=1-3 run_pipeline.sh
|
+--> Scheduler runs: job.1 job.2 job.3
| | |
[sampleA][sampleB][sampleC]
(each job gets a unique SLURM_ARRAY_TASK_ID)
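Inside each task, the script translates SLURM_ARRAY_TASK_ID into a sample name. A minimal sketch using a hard-coded bash array (the example script below reads the names from a file instead):
samples=(sampleA sampleB sampleC)                  # indices 0, 1, 2
sample="${samples[$((SLURM_ARRAY_TASK_ID - 1))]}"  # task IDs 1-3 map to indices 0-2
echo "Task ${SLURM_ARRAY_TASK_ID} -> ${sample}"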
Example script
The file sample_ids.txt should contain the list of your sample names (or datasets), one per line, and will be used to map each array task to a specific sample. For example:
sampleA
sampleB
sampleC
...
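If your sample names follow a consistent file-naming pattern, this list can be generated automatically. A minimal sketch, assuming (hypothetically) paired-end FASTQ files named <sample>_R1.fastq.gz; adjust the path and pattern to your data:
for f in /path/to/data/fastq/*_R1.fastq.gz; do
    basename "$f" _R1.fastq.gz
done > /path/to/data/sample_ids.txt
With the sample list in place, the array submission script (run_array_job.sh, used in the sbatch command further below) looks like this: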
#!/bin/bash
#SBATCH --job-name=sample_array_job
#SBATCH --output=slurm_logs/%A_%a.out
#SBATCH --error=slurm_logs/%A_%a.err
#SBATCH --time=HH:MM:SS
#SBATCH --mem=XXG
#SBATCH --cpus-per-task=N
#SBATCH --partition=<name of partition as defined in the HPC>
#SBATCH --mail-type=END,FAIL,ARRAY_TASKS
#SBATCH --account=my_account
# NOTE: the slurm_logs/ directory referenced above must exist before submission; Slurm will not create it
# Set paths
base_dir="/path/to/data"
sample_list="${base_dir}/sample_ids.txt"
# Get the sample for this array task (1-based index)
sample=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "${sample_list}")
# Guard against an empty line or a task index beyond the list length
if [[ -z "${sample}" ]]; then
    echo "No sample found for array task ${SLURM_ARRAY_TASK_ID}" >&2
    exit 1
fi
echo "Processing sample ${sample} (Array Task ID: ${SLURM_ARRAY_TASK_ID})"
# Make sample-specific output directory
output_dir="${base_dir}/outputs/${sample}"
mkdir -p "${output_dir}"
# Activate env
source /path/to/conda.sh
conda activate my_env
# Run the analysis
Rscript my_analysis_script.R "${sample}" > "${output_dir}/${sample}_out.txt" 2>&1
# Deactivate env
conda deactivate
The Slurm job directives used here are:
SBATCH option | Purpose |
---|---|
--job-name | Name shown when monitoring the queue/job list |
--output | File for standard output (per array task; %A = job ID, %a = task index) |
--error | File for standard error (per array task) |
--time | Maximum run time (HH:MM:SS) |
--mem | Memory per task |
--cpus-per-task | Number of CPU cores for each task |
--partition | Which partition or queue to use (HPC specific) |
--mail-type | When to send email notifications |
--account | Charge jobs to this account/project |
To keep the job submission script reusable for any number of samples, we can auto-assign the array length based on the sample list:
$ sbatch --array=1-$(wc -l < /path/to/data/sample_ids.txt) run_array_job.sh
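Once submitted, the array appears as a single entry that expands into individual tasks. A few ways to keep an eye on it (the job and task IDs below are placeholders):
$ squeue -u $USER                 # running/pending tasks appear as <jobid>_<taskid>
$ sacct -j <jobid> --format=JobID,JobName,State,Elapsed,MaxRSS    # per-task state and resource usage
$ scancel <jobid>_5               # cancel a single task; scancel <jobid> cancels the whole array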
Is there a limit to array size?
Yes! This is specific to each HPC system's configuration. The key parameter to look out for is MaxArraySize, which defines the maximum size of a single job array. This value is cluster specific, and you can use scontrol to get the actual value. For example,
$ scontrol show config | grep MaxArraySize
So the N in your job submission (--array=1-N) should always stay within the MaxArraySize parameter on your cluster; strictly, Slurm requires the highest task index to be less than MaxArraySize.
Tip: To get a glimpse of your cluster-specific variables, you can use scontrol to query the config parameters:
scontrol show config | grep -Ei 'max|limit|partition|user|account|array'
Even if you don’t see explicit user or group job limits in the cluster configuration, real parallelism is always governed by available cluster resources and scheduler policies. Submitting an array of MaxArraySize tasks does not mean they will all run at once: the number of concurrently running tasks is limited by both the physical resources available and any user/job concurrency policies (which may not be visible in scontrol show config).
For best throughput, reach out to your HPC administrators for definitive guidance on practical and policy limits for user job concurrency and resource allocation.
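Two things that help in practice: Slurm's built-in array throttle (a %limit suffix on --array caps how many tasks of the array run at once), and sacctmgr, which can report per-user association limits when the accounting database is configured (the format fields shown are common ones; availability varies by cluster):
$ sbatch --array=1-500%50 run_array_job.sh        # at most 50 tasks of this array run concurrently
$ sacctmgr show assoc where user=$USER format=User,Account,Partition,MaxJobs,MaxSubmit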
How to proceed if your sample count exceeds the array size limit?
If the total number of samples (S) is larger than the allowed MaxArraySize, you can employ chunking, i.e., split your samples into several smaller lists (chunks) and have each array task process one chunk at a time.
Full sample_ids.txt: [sample1, sample2, ... sampleS]
|
|
[split into K-sized chunks]
|
+---> chunk_manifest.txt:
chunk_00.txt # (samples 1 to K)
chunk_01.txt # (samples K+1 to 2K)
...
chunk_N.txt
Run: sbatch --array=1-$(wc -l < chunk_manifest.txt) run_chunked_array.sh
For task i (SLURM_ARRAY_TASK_ID=i):
Reads: chunk_file = $(sed -n "${SLURM_ARRAY_TASK_ID}p" chunk_manifest.txt)
Processes every sample in chunk_file (typically K samples)
So now, instead of each array task mapping to an individual sample, it maps to a chunk of K samples. You can use the cluster-level limits to calculate the chunk size. For example, if
S = total number of samples
K = desired chunk size, i.e., samples per chunk
N = number of array tasks (chunks), which must respect MaxArraySize
# Choose N to allow maximum parallelism
## The number of tasks actually running per user = min(user concurrency limit, available resources, partition policies), which is usually less than N.
## Use min(MaxArraySize, user limits) as N to get the optimal chunk size
N = MaxArraySize
K = ceil(S / N)
# Then, split the samples into N chunk files (each with up to K samples)
Note: Ensure N (sbatch --array=1-N) does not exceed MaxArraySize. K can be any positive integer; the real constraint is on N, not K.
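For instance, with S = 5000 samples and MaxArraySize = 1001 (a common default), N = 1001 array tasks and K = ceil(5000/1001) = 5, so each task handles at most five samples. If S were only 800 on the same cluster, chunking would be unnecessary and the one-sample-per-task array from the previous section would do.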
Example script
So, translating the chunking logic, we first need to generate the chunk_manifest.txt file:
#!/bin/bash
# Set paths
base_dir="/path/to/data"
sample_list="${base_dir}/sample_ids.txt"
# Get total number of samples
S=$(wc -l < "$sample_list")
# Get MaxArraySize from Slurm
MAX_ARRAY_SIZE=$(scontrol show config 2>/dev/null | awk -F= '/MaxArraySize/ {print $2}' | xargs)
if [[ -z "$MAX_ARRAY_SIZE" || "$MAX_ARRAY_SIZE" = "0" ]]; then
echo "WARNING: Could not determine MaxArraySize from Slurm config."
echo "Please check your cluster settings and manually set this value if needed."
exit 1
fi
N=$MAX_ARRAY_SIZE
# Never create more chunks than there are samples
if (( N > S )); then
    N=$S
fi
# Get optimal chunk size (ceiling division)
K=$(( (S + N - 1) / N ))
echo "Splitting $S samples into $N chunks (each with up to $K samples)..."
# Create the chunk directory under base_dir so paths stay valid for the job script
chunk_dir="${base_dir}/chunks"
mkdir -p "${chunk_dir}"
# Split the sample list into N chunks (GNU split; l/N keeps lines intact)
split -d -n l/$N "$sample_list" "${chunk_dir}/chunk_"
# Add a .txt extension to each chunk file
for f in "${chunk_dir}"/chunk_*; do
    mv "$f" "${f}.txt"
done
# Write the manifest with absolute paths so each array task can locate its chunk
ls "${chunk_dir}"/chunk_*.txt > "${base_dir}/chunk_manifest.txt"
echo "Created $(wc -l < "${base_dir}/chunk_manifest.txt") chunk files, listed in ${base_dir}/chunk_manifest.txt"
We then update the Slurm job submission script (run_chunked_array.sh) to use the chunk logic:
#!/bin/bash
#SBATCH --job-name=sample_chunk_job
#SBATCH --output=slurm_logs/%A_%a.out
#SBATCH --error=slurm_logs/%A_%a.err
#SBATCH --time=HH:MM:SS
#SBATCH --mem=XXG
#SBATCH --cpus-per-task=N
#SBATCH --partition=<name of partition as defined in the HPC>
#SBATCH --mail-type=END,FAIL,ARRAY_TASKS
#SBATCH --account=my_account
# Set paths
base_dir="/path/to/data"
chunk_manifest="${base_dir}/chunk_manifest.txt" # created from the chunking script
# Get the chunk file for this array task (1-based index)
chunk_file=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "${chunk_manifest}")
# Guard against a missing manifest entry or a task index beyond the list length
if [[ -z "${chunk_file}" || ! -f "${chunk_file}" ]]; then
    echo "No chunk file found for array task ${SLURM_ARRAY_TASK_ID}" >&2
    exit 1
fi
echo "Processing chunk file: ${chunk_file} (Array Task ID: ${SLURM_ARRAY_TASK_ID})"
# Activate env
source /path/to/conda.sh
conda activate my_env
# Loop over every sample in this chunk file
while read -r sample; do
    # Skip blank lines
    [[ -z "${sample}" ]] && continue
    echo "Processing sample: ${sample}"
    output_dir="${base_dir}/outputs/${sample}"
    mkdir -p "${output_dir}"
    # Run analysis, redirect each sample's output;
    # </dev/null keeps Rscript from reading the loop's stdin (the chunk file)
    Rscript my_analysis_script.R "${sample}" > "${output_dir}/${sample}_out.txt" 2>&1 < /dev/null
done < "${chunk_file}"
# Deactivate env
conda deactivate
Proceed with job submission as before:
$ sbatch --array=1-$(wc -l < chunk_manifest.txt) run_chunked_array.sh
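If a particular chunk fails, you can resubmit just that task index instead of the whole array (index 7 below is only an example):
$ sbatch --array=7 run_chunked_array.sh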
Takeaway
Both per-sample array tasks and chunking have their advantages. Use single-sample tasks for straightforward tracking and flexibility. Switch to chunking when you run into cluster limits or when individual jobs have short runtimes. There is no one-size-fits-all: your specific task and your cluster’s policies will almost always determine the best strategy for efficiency and parallelism.
Task | Strategy |
---|---|
Fine-grained job tracking and control needed | One array task per sample |
Sample count exceeds array size or job limits | Chunking: multiple samples per array task |
Samples have similar resource needs | Either approach works |
Jobs with short runtime | Chunking to reduce scheduler overhead |
Need to re-run or monitor specific samples easily | One array task per sample |
Cluster docs recommend avoiding many small jobs | Chunking |