
HPC

Introduction to Workload Management

Now that we have covered Linux commands and how to put them in a script, we can talk about how to manage running scripts. Some jobs take a few minutes and others take a few days, depending on the resources they need. This is where High Performance Computing (HPC) clusters come in. You typically reach a cluster either through the HPC website, which may offer a way to open a Linux environment in the browser, or through an ssh command that looks like this:

ssh yourHpcCluster

Login Node

Now that you have made it to the HPC cluster, you will be at a login node. You can think of a login node as a waiting room. DO NOT run any scripts here. To run them, you need to leave the login node and get to a compute node: either start an interactive session on a compute node, or submit a batch script from the login node. In a common cluster layout, you connect to the login node and the workload manager hands your jobs to the compute nodes.

Interactive Session

There are a few workload managers out there; in this tutorial we will demonstrate Slurm, a popular choice for HPC users. Slurm comes with commands we can use to manage our workload. The Slurm command to start an interactive session is srun. Here is an example:

srun -p batch -t 1-2:30:00 -n 1 --mem=2Gb --pty bash

Let’s break it down:

srun is our command to start an interactive session

-p batch tells Slurm to use a compute node in the batch partition

-t 1-2:30:00 is the time limit for the session, written as days-hours:minutes:seconds, so here we are asking for 1 day, 2 hours, 30 minutes and 0 seconds

-n 1 says we will run 1 task (-n is short for --ntasks; the number of nodes is set separately with -N)

--mem=2Gb says we need 2 GB of memory per node

--pty bash says we would like an interactive bash shell

To go back to the login node, just use exit
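The days-hours:minutes:seconds time limit is easy to misread, so here is a minimal sketch that converts one into seconds. The function name slurm_time_to_seconds is a hypothetical helper, not part of Slurm, and it assumes the full days-hours:minutes:seconds form only; Slurm accepts other time formats that this sketch does not handle.

```shell
#!/bin/bash
# Hypothetical helper (not a Slurm command): convert a time limit written as
# days-hours:minutes:seconds into a number of seconds.
slurm_time_to_seconds() {
    # Split on "-" and ":" so the fields are days, hours, minutes, seconds.
    printf '%s\n' "$1" | awk -F'[-:]' '{ print $1 * 86400 + $2 * 3600 + $3 * 60 + $4 }'
}

# The limit from the srun example: 1 day, 2 hours, 30 minutes, 0 seconds.
slurm_time_to_seconds 1-2:30:00   # prints 95400
```

Checking the number this way can help confirm that the limit you request matches how long you expect the session to last.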

Checking Available Nodes

We just demonstrated how to get an interactive session on a compute node in the batch partition. But what if the batch partition is being used for other jobs? You can use sinfo to list the partitions, their availability, the time limit for jobs in each, how many nodes they have, the state of those nodes, and the node names.
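As a rough sketch of putting sinfo to work, the snippet below asks for one line per partition and state and keeps only the lines whose nodes are idle. The function idle_partitions is a hypothetical helper, and the snippet assumes sinfo's -h (no header) and -o (output format) options with %P for partition, %t for state and %D for node count. The guard means it does nothing on a machine without Slurm.

```shell
#!/bin/bash
# Hypothetical helper: read "partition state node-count" lines on stdin and
# print the partition name and node count for lines in the idle state.
idle_partitions() {
    # sinfo marks the default partition with a trailing "*"; strip it.
    awk '$2 == "idle" { sub(/\*$/, "", $1); print $1, $3 }'
}

# Only call sinfo where it exists, that is, on the cluster itself.
if command -v sinfo >/dev/null 2>&1; then
    sinfo -h -o "%P %t %D" | idle_partitions
fi
```

If the batch partition does not show up in the result, picking a partition that does (with -p) may get your session started sooner.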

Using a Module

Now what if you want to use some package, like BLAST, SRA Toolkit, etc.? You’ll need the following commands to investigate/load your modules:

Checking available modules: module avail

Checking if a particular software is installed: module avail softwarename

Load software: module load softwarename

Unload software: module unload softwarename

Unload all loaded software: module purge

Check which modules are currently loaded: module list
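The module command is usually set up as a shell function by the cluster's login environment, so it can be missing in some shells. The sketch below wraps module load in a check for that case. The name load_module is a hypothetical helper, and sra/2.10.8 is simply the module name used in the batch script in this tutorial; your cluster may name it differently.

```shell
#!/bin/bash
# Hypothetical helper: load a module, but fail with a clear message when the
# module command is not defined in the current shell.
load_module() {
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not available in this shell" >&2
        return 1
    fi
    module load "$1"
}

# Try to load the SRA Toolkit module and report if that did not work.
load_module sra/2.10.8 || echo "could not load sra/2.10.8"
```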

Writing a Batch Script

Interactive sessions are great for quick jobs, but say you want to run a job that can’t be finished in an hour or two. For that we need a batch script! Let’s see an example of one:

#!/bin/bash
#SBATCH --job-name=jobID
#SBATCH --time=03-00:00:00
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --mem=64Gb
#SBATCH --output=%j.out
#SBATCH --error=%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=YourEmailAddress

module load sra/2.10.8

# dir is the directory that will hold our fastq files
dir='fastq_storage'

# make our directory and change into it, stopping if that fails
mkdir -p "$dir"
cd "$dir" || exit 1

# change the path to the SRA accession list based on where you put it;
# $(<file) expands to the file's contents, so every accession is passed to fastq-dump
fastq-dump --outdir fastq --gzip --skip-technical --readids --read-filter pass --dumpbase --split-3 --clip $(</path_to_accession_list/SraAccList.txt)

Let’s unpack this. The #SBATCH headers pass your job's settings to Slurm: the job name, the time limit, the partition to use, the number of nodes, the memory, the output file, the error file, when to send email notifications, and the address to send them to. In the file names, %j is replaced by the job ID. After the headers, we can go about writing our script! Say we save this script as sraPull.sh. We can then submit it using:

sbatch sraPull.sh
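When a submission succeeds, sbatch prints a confirmation line of the form "Submitted batch job" followed by the job ID. Here is a minimal sketch for capturing that ID so it can be reused later. The function job_id_from_sbatch is a hypothetical helper, and the snippet only submits anything if sbatch and sraPull.sh are both present.

```shell
#!/bin/bash
# Hypothetical helper: pull the job ID out of sbatch's confirmation line,
# which looks like "Submitted batch job 12345".
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Submit only where sbatch exists and the script is in the current directory.
if command -v sbatch >/dev/null 2>&1 && [ -f sraPull.sh ]; then
    job_id=$(sbatch sraPull.sh | job_id_from_sbatch)
    echo "Submitted job $job_id; look for $job_id.out and $job_id.err"
fi
```

Because the script names its output and error files with %j, knowing the job ID also tells you which files to read once the job starts.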

Checking on Your Job

Congrats, you have submitted a batch script! To check on how it’s doing we can use squeue, which lists the jobs in the queue along with their state.

We can also check per user with squeue -u your-username or per partition with squeue -p partitionOfInterest.
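If you have many jobs queued, a summary can be easier to read than the full list. The sketch below counts your jobs by state. The function count_by_state is a hypothetical helper, and the snippet assumes squeue's -h (no header) and -o (output format) options with %i for the job ID and %T for the job state.

```shell
#!/bin/bash
# Hypothetical helper: read "jobid state" lines on stdin and print how many
# jobs are in each state, sorted by state name.
count_by_state() {
    awk '{ n[$2]++ } END { for (s in n) print s, n[s] }' | sort
}

# Only call squeue where it exists, that is, on the cluster itself.
if command -v squeue >/dev/null 2>&1; then
    squeue -h -u "$(id -un)" -o "%i %T" | count_by_state
fi
```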

Cancel Your Job

If for some reason you would like to cancel your job, you can do that with the following command:

scancel jobIDYouWantToCancel
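Since scancel acts immediately, it can be worth checking the job ID before passing it along. This is a small sketch with a hypothetical cancel_job wrapper. It assumes plain numeric job IDs only, so other ID forms that Slurm accepts are rejected by this helper.

```shell
#!/bin/bash
# Hypothetical helper: refuse anything that is not a plain numeric job ID,
# then hand the ID to scancel.
cancel_job() {
    case $1 in
        ''|*[!0-9]*)
            echo "not a numeric job ID: $1" >&2
            return 1
            ;;
    esac
    scancel "$1"
}

# Usage on the cluster, with the ID reported by sbatch or squeue:
#   cancel_job 12345
```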

Learning More

If you are interested in learning more, there are a lot of great tutorials out there:

SchedMD


