Like many researchers, I work across fast hot storage (scratch or SSD) and slow, cost-effective cold archives. Managing data across these tiers often means moving files between large, tape-backed archival systems and higher-performance disks for active analysis. My institution uses Globus for these transfers, which generally offers a smooth experience, especially through its web interface.
However, I have found that certain use cases (e.g., estimating the size of nested folders, selecting just a handful of specific files or directories, or preparing batch transfers based on patterns) are not as straightforward to handle through the user interface. In these situations, I turn to globus-cli, the command-line interface for Globus, plus some basic scripting to fill the gaps and make these specialized tasks more efficient and reproducible.
In this post, I will share examples of how I use globus-cli for targeted file retrieval, size estimation, and batch script generation when a point-and-click workflow isn’t enough. If your computing environment provides Globus and you find yourself with similar needs, you might find these patterns useful (or at least adaptable).
Globus, Endpoints & CLI Overview
Globus is a tool for transferring and sharing data between storage systems of all kinds: high-performance computing (HPC) clusters, tape or archival storage, cloud storage, and personal laptops. It manages large dataset transfers, handles failures and resumes automatically, and can bridge systems that otherwise would not talk to each other.
Endpoints are the defined entry and exit points for data on the Globus network, much like designated docks in a shipping port. These might be large institutional storage servers, a lab’s scratch directory, a tape archive, or even your personal laptop. Endpoints are fixed entities: once an endpoint is registered, its identifier, a Universally Unique Identifier (UUID), does not change unless the underlying system is reconfigured or decommissioned. All registered endpoints are also discoverable through Globus, so whether you are scripting or searching interactively, you can reliably find and reference them for your transfers.
The globus-cli is a command-line tool for exploring endpoints, transferring data, and scripting batch jobs between endpoints. While the web interface is great for interactive use, the CLI is ideal for automation or more complex workflows.
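For example, once you are logged in, you can look up an endpoint’s UUID by name and list a path on it directly from the shell (the search string, UUID, and path here are just placeholders):
globus endpoint search "institution archive"
globus endpoint show ab65757f-00f5-4e5b-aa21-133187732a01
globus ls "ab65757f-00f5-4e5b-aa21-133187732a01:/group_name/project_name/data/"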
Check/install Globus CLI on any HPC
Most HPC systems that offer Globus make globus-cli available as a module. You can check using:
module spider globus-cli
or
module avail globus-cli
If you are not sure of the exact spelling or want to list all related modules, you can use:
module avail globus
If available, you can easily load it for your session:
module load globus-cli
Tip: Follow the official install instructions, which recommend pip or pipx. You can also install with conda:
conda install conda-forge::globus-cli
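Either way, a quick check that the CLI is available and that you are authenticated (globus login opens a browser-based login flow the first time):
globus version
globus login
globus whoami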
Globus in my workflow
A quick overview of some of my most common use cases:
Within one HPC cluster:
+----------------+        +-----------------+
| Active Storage | <----> | Archive Storage |
+----------------+        +-----------------+
Between HPC clusters (also via archives):
+-----------+             +-----------+
|   HPC 1   | <---------> |   HPC 2   |
+-----------+             +-----------+
      ^                         ^
      |                         |
      +---------+     +---------+
                |     |
         +-----------------+
         | Archive Storage |
         +-----------------+
Within institution, sharing with outside groups (using collections):
      +---------------------------+
      |  Lab Group A Collection   |
      +---------------------------+
       |            |            |
       |            |            |
    (read)    (read/write)  (read-only)
       |            |            |
       v            v            v
  +---------+  +---------+  +---------+
  |  Lab B  |  |  Lab C  |  |  Lab D  |
  +---------+  +---------+  +---------+
Between institutions (using collections):
         +---------------------------------------+
         |        Institution 1 Collection       |
         +---------------------------------------+
          |                  |                  |
       (read)           (read/write)       (read-only)
          |                  |                  |
          v                  v                  v
  +---------------+  +---------------+  +---------------+
  | Institution 2 |  | Institution 3 |  | Institution 4 |
  |  Collection   |  |  Collection   |  |  Collection   |
  +---------------+  +---------------+  +---------------+
Globus Collections are highly versatile: they let me grant tailored access levels (read, write, share) to each institution or individual collaborator, all without needing HPC admin privileges. This enables complex, multi-institutional data sharing workflows that I can control directly.
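As a rough sketch of what this looks like from the CLI: access rules on a guest collection can be managed with globus endpoint permission create. The collection UUID, path, and collaborator identity below are placeholders.
# Grant a collaborator read-only access to a shared folder
globus endpoint permission create "COLLECTION_UUID:/shared_data/" --identity collaborator@partner.edu --permissions r
# Review the access rules currently in place
globus endpoint permission list "COLLECTION_UUID"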
Without Globus, sharing data between institutions is tedious and error-prone, and the workarounds often limit transfer sizes and offer no guarantees of data integrity.
Without Globus (multi-step workaround):
(You)
+---------------------+
|   Institute 1 HPC   |----------------------+
+---------------------+                      |
           |                                 |
           |                                 |
           v                                 v
+---------------------+          +-----------------------+
|   Managed Device    |--------->|  Institution 1 Cloud  |
|  (Laptop/Desktop)   |          | (Box/Dropbox/S3/etc.) |
+---------------------+          +-----------------------+
                                             |
                                             v
(Collaborator)                    +---------------------+
                                  |   Managed Device    |
                                  |  (Laptop/Desktop)   |
                                  +---------------------+
                                             |
                                             v
                                  +---------------------+
                                  |   Institute 2 HPC   |
                                  +---------------------+
When I Reach for Globus CLI & Scripting
Before diving into the specific scenarios where globus-cli and some simple scripting have made my life easier, I want to highlight a small setup step that has saved me a lot of time. I keep track of all my frequently used Globus endpoints in a single, central file, and use a custom shell function to access any endpoint by name, whether I am scripting or working at the command line.
My approach to endpoint lookup
Here are general setup steps:
- Store endpoints in a central file: I generally create and place this file in my home directory, updating it whenever I start working with a new collection, cluster, or archive resource.
~/globus_endpoints.txt
Contents (example UUIDs):
archive : ab65757f-00f5-4e5b-aa21-133187732a01
hpc_1   : f1234567-89ab-4cde-bc10-234567890abc
hpc_2   : c2345678-90ab-4def-a234-bcdef0123456
- Define and export a function in your shell config file: Add this to your ~/.bashrc or preferred shell config:
# get_endpoint function for Globus
get_endpoint() {
    key="$1"
    grep -E "^${key}[[:space:]]*:" ~/globus_endpoints.txt | awk -F':' '{print $2}' | xargs
}
export -f get_endpoint
- Reload your shell so that get_endpoint becomes globally available:
source ~/.bashrc
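With this in place, I can reference any endpoint by its short name, either interactively or inside scripts:
get_endpoint archive
# prints: ab65757f-00f5-4e5b-aa21-133187732a01
globus ls "$(get_endpoint archive):/group_name/project_name/data/"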
Use case 1: Storage planning, size estimation & quota management
Before any large recall or transfer, I check whether the data will actually fit in active storage, to avoid quota issues. My approach: always archive raw data, keep only what’s needed for current analysis on fast storage, and move the rest out when it’s no longer active. This keeps costs down and workflows efficient, a data management philosophy I rely on for any big research project.
Here, I illustrate with an example of estimating FASTQ file sizes in archival storage. Assume the folder structure is:
/group_name/project_name/data/
├── Sample01/
│ └── FASTQ/
│ ├── sample01_R1.fastq.gz
│ └── sample01_R2.fastq.gz
├── Sample02/
│ └── FASTQ/
│ ├── sample02_R1.fastq.gz
│ └── sample02_R2.fastq.gz
...
Script for estimating size:
#!/bin/bash
module load globus-cli
# Assumes get_endpoint() is exported from your shell config (see setup above);
endpoint=$(get_endpoint archive)
# Set paths
remote_base_dir="/group_name/project_name/data/"
output_dir="/project_name/raw_data"
output_file="${output_dir}/fastq_sizes_human_readable.txt"
sample_list="sample_ids.txt"
# List of Sample IDs (one per line); use only to limit recall to a subset. Leave blank or omit this file to include all samples.
mkdir -p "$output_dir"
# Function to convert bytes to human-readable format
bytes_to_human() {
    b=${1:-0}; d="$b"; s=0; S=(Bytes {K,M,G,T,P,E,Z,Y}B)
    while ((b >= 1024 && s < ${#S[@]}-1)); do
        d=$(echo "scale=2; $b / 1024.0" | bc)
        b=${d%.*}
        ((s++))
    done
    echo "$d ${S[$s]}"
}
# Initialize output and total size variable
echo "File sizes (human-readable):" > "$output_file"
total_size=0
# Get recursive listing, filter for FASTQ files, and write file sizes to output
# (process substitution keeps the while loop in the current shell so total_size persists)
while IFS= read -r line; do
    size_bytes=$(echo "$line" | awk -F '|' '{print $4}' | xargs)
    file_path=$(echo "$line" | awk -F '|' '{print $7}' | xargs)
    full_path="$remote_base_dir$file_path"
    total_size=$((total_size + size_bytes))
    human_size=$(bytes_to_human "$size_bytes")
    echo "$human_size $full_path" >> "$output_file"
done < <(
    globus ls -rl "$endpoint:$remote_base_dir" | \
    if [ -f "$sample_list" ]; then
        grep -Ff "$sample_list" | grep '/FASTQ/' | grep 'file'
    else
        grep '/FASTQ/' | grep 'file'
    fi
)
total_human_size=$(bytes_to_human "$total_size")
echo -e "\nTotal size: $total_human_size" >> "$output_file"
echo "Directory size calculation complete. Results saved to $output_file."
Sample output file:
File sizes (human-readable):
2.1 GB /group_name/project_name/data/Sample01/FASTQ/sample01_R1.fastq.gz
2.0 GB /group_name/project_name/data/Sample01/FASTQ/sample01_R2.fastq.gz
1.9 GB /group_name/project_name/data/Sample02/FASTQ/sample02_R1.fastq.gz
2.2 GB /group_name/project_name/data/Sample02/FASTQ/sample02_R2.fastq.gz
Total size: 8.2 GB
Use case 2: Selective recall across many samples
After estimating file sizes, I use the summary file to prepare a batch transfer file for globus-cli. This lets me move only what I need, wherever I need it.
Script for generating batch file:
#!/bin/bash
# Set paths
project_dir="/project_name/raw_data"
input_file="${project_dir}/fastq_sizes_human_readable.txt"
output_file="${project_dir}/globus_batch.txt"
# Associative array to prevent duplicate batch lines
declare -A processed_samples
> "$output_file"
while IFS= read -r line; do
# Skip headers/summaries
[[ "$line" =~ ^(File\ sizes|Total\ size|$) ]] && continue
# Extract file path
path=$(echo "$line" | sed 's/^[^ ]* [^ ]* //')
# Get sample name from path
if [[ "$path" =~ /([A-Za-z0-9_-]+)/FASTQ/ ]]; then
sample="${BASH_REMATCH[1]}"
else
continue
fi
# Avoid redundant batch lines per sample
fastq_dir=$(echo "$path" | grep -o ".*/FASTQ")
[[ -z "$fastq_dir" || -n "${processed_samples[$fastq_dir]}" ]] && continue
processed_samples[$fastq_dir]=1
dest_dir="${project_dir}/${sample}/FASTQ/"
echo "--recursive $fastq_dir $dest_dir" >> "$output_file"
done < "$input_file"
echo "Globus batch file created at $output_file."
Sample globus_batch.txt output:
--recursive /group_name/project_name/data/Sample01/FASTQ /project_name/raw_data/Sample01/FASTQ/
--recursive /group_name/project_name/data/Sample02/FASTQ /project_name/raw_data/Sample02/FASTQ/
--recursive /group_name/project_name/data/Sample03/FASTQ /project_name/raw_data/Sample03/FASTQ/
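Before submitting, it can be worth a quick sanity check: globus transfer accepts a --dry-run flag that displays what would be transferred without actually creating a task (endpoint nicknames as defined in the setup above):
module load globus-cli
globus transfer --dry-run --batch "$(get_endpoint archive)" "$(get_endpoint hpc_1)" < /project_name/raw_data/globus_batch.txt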
Submit the recall request using globus-cli:
module load globus-cli
# Launch the transfer and capture ID
TASK_ID=$(globus transfer --batch "$(get_endpoint archive)" "$(get_endpoint hpc_1)" < /project_name/raw_data/globus_batch.txt | awk '/Task ID:/ {print $3}')
echo "Submitted Globus transfer with Task ID: $TASK_ID"
# Track transfer status
globus task show $TASK_ID
Tip: For large transfers, consider running this command as a Slurm job for reliability.
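A minimal sketch of such a job, assuming your site uses Slurm and provides globus-cli as a module; the resource requests and the endpoints-file path are placeholders:
#!/bin/bash
#SBATCH --job-name=globus_recall
#SBATCH --time=06:00:00
#SBATCH --mem=2G
#SBATCH --output=globus_recall_%j.log
module load globus-cli
# Look the endpoints up directly so the job does not depend on exported shell functions
archive_ep=$(grep -E '^archive[[:space:]]*:' ~/globus_endpoints.txt | awk -F':' '{print $2}' | xargs)
hpc1_ep=$(grep -E '^hpc_1[[:space:]]*:' ~/globus_endpoints.txt | awk -F':' '{print $2}' | xargs)
TASK_ID=$(globus transfer --batch "$archive_ep" "$hpc1_ep" < /project_name/raw_data/globus_batch.txt | awk '/Task ID:/ {print $3}')
echo "Submitted Globus transfer with Task ID: $TASK_ID"
# Block until the transfer finishes so the Slurm job's exit code reflects the outcome
globus task wait "$TASK_ID"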
Use case 3: Pattern-based file selection
Sometimes I need to recall only certain output files (like summary tables, metrics, or plots) for many samples, rather than all files in each directory. Here is how I automate that.
Assume this is the folder structure in archival storage:
/group_name/project_name/data/
├── Sample01/
│ └── pipeline_out/
│ ├── metrics_summary.csv
│ ├── sample01_feature-barcode-matrix.h5
│ ├── web_summary.html
│ ├── genome_bam.bam
│ └── other_output.txt
├── Sample02/
│ └── pipeline_out/
│ ├── metrics_summary.csv
│ ├── sample02_feature-barcode-matrix.h5
│ ├── web_summary.html
│ ├── genome_bam.bam
│ └── other_output.txt
...
Script to generate a batch file for just specific file types:
#!/bin/bash
# Set paths
project_dir="/project_name/raw_data"
input_file="${project_dir}/pipeline_file_sizes.txt"
output_file="${project_dir}/pattern_batch.txt"
# Example file patterns to recall (extended regex, used with grep -E below)
pattern="\.(csv|h5|html)"
sample_list="sample_ids.txt"
# List of Sample IDs (one per line); use only to limit recall to a subset. Leave blank or omit this file to include all samples.
# Assumes get_endpoint() is exported from your shell config (see setup above);
endpoint=$(get_endpoint archive)
remote_base_dir="/group_name/project_name/data/"
module load globus-cli
globus ls -rl "$endpoint:$remote_base_dir" > "${input_file}"
> "$output_file"
# Filter listing and generate the pattern-based batch file
## Restrict to a sample subset only if a sample list file exists
{
    if [ -f "$sample_list" ]; then
        grep -Ff "$sample_list" "$input_file"
    else
        cat "$input_file"
    fi
} | grep '/pipeline_out/' | grep -E "$pattern" | grep 'file' | \
while IFS= read -r line; do
    # Extract file path (last pipe-delimited field, whitespace trimmed)
    path=$(echo "$line" | awk -F'|' '{print $7}' | xargs)
    # Sample name is two directory levels above the file (SampleXX/pipeline_out/<file>)
    sample=$(basename "$(dirname "$(dirname "$path")")")
    dest_path="${project_dir}/${sample}/pipeline_out/"
    echo "$remote_base_dir$path $dest_path" >> "$output_file"
done
echo "Pattern-based batch file created at $output_file."
# Submit transfer request
globus transfer --batch "$(get_endpoint archive)" "$(get_endpoint hpc_1)" < "$output_file"
Sample pattern_batch.txt output:
/group_name/project_name/data/Sample01/pipeline_out/metrics_summary.csv /project_name/raw_data/Sample01/pipeline_out/
/group_name/project_name/data/Sample01/pipeline_out/web_summary.html /project_name/raw_data/Sample01/pipeline_out/
/group_name/project_name/data/Sample01/pipeline_out/sample01_feature-barcode-matrix.h5 /project_name/raw_data/Sample01/pipeline_out/
Tip: You can track your job in the Globus web UI or get a list of your tasks on the CLI:
globus task list
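If you want more detail than the default summary, the CLI can also filter tasks by status and show per-task events; the exact flags may differ by globus-cli version, and TASK_ID is a placeholder for your task's UUID:
globus task list --filter-status ACTIVE
globus task event-list TASK_ID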