Introduction to Parallel

GNU Parallel is a shell tool that runs independent jobs concurrently across multiple compute resources, such as the CPU cores allocated to a job.

Bash Loop Using Parallel

GNU Parallel can greatly speed up a task made of independent steps because it can use multiple compute resources at once. Consider the following bash loop (example taken from Yale Center For Research Computing - Parallel):

for letter in {a..f}
do
    echo "$letter"
done

output

a
b
c
d
e
f

To parallelize this task we can use the parallel module and ask for multiple CPUs per task:

salloc -c 4
module load parallel
parallel -j 4 "echo {}" ::: {a..f}

output (the order may vary between runs, because each job prints as soon as it finishes)

a
b
c
d
e
f
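
Because the jobs run concurrently, GNU Parallel prints each job's output as that job finishes, which need not match the input order. Below is a minimal sketch of the -k (--keep-order) option. The sleep durations are hypothetical stand-ins for jobs of different lengths, and the variable name ordered is only illustrative:

```shell
# Each job sleeps for a different time, so the shorter job finishes first.
# Without -k, results are printed as jobs complete (typically 1 before 2).
parallel -j 2 "sleep {}; echo {}" ::: 2 1

# With -k (--keep-order), results are printed in input order: 2, then 1.
ordered=$(parallel -k -j 2 "sleep {}; echo {}" ::: 2 1)
echo "$ordered"
```

Keeping the input order is useful when downstream steps expect results to line up with the inputs. The jobs still run concurrently; only the printing is reordered.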

Parallel In A Bash Script

Additionally, we can leverage the parallel module in a batch script:

#!/bin/bash
#SBATCH --job-name=runParallel
#SBATCH --time=01-00:00:00
#SBATCH --nodes=1
#SBATCH -c 8
#SBATCH --mem=4G
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err

# load modules
module load parallel
module load fastqc

# make an output directory
mkdir -p fastqc_output

# find all fastq files and run fastqc on them
ls *.fastq.gz | parallel -j ${SLURM_CPUS_PER_TASK} "fastqc {} -o fastqc_output"

Here we load the parallel and fastqc modules and then create an output directory (fastqc_output). In our command we list all our fastq files (ls *.fastq.gz) and pipe that list to parallel, which runs fastqc once per file (fastqc {} -o fastqc_output). The curly brackets {} are replaced with each fastq file name in turn. The -j ${SLURM_CPUS_PER_TASK} option sets how many jobs run at once to the number of CPUs Slurm allocated to the task (8 here, from #SBATCH -c 8).
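
Before submitting a long batch job, it can help to preview the commands that parallel would run. The sketch below uses the --dry-run option, which prints each command instead of executing it, so fastqc does not need to be installed. The file names sample_A.fastq.gz and sample_B.fastq.gz, the scratch directory, and the fallback of 2 jobs are assumptions for illustration only. The files are passed with ::: and a shell glob, a common alternative to piping ls:

```shell
# Hypothetical setup: empty placeholder files in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"
touch sample_A.fastq.gz sample_B.fastq.gz
mkdir -p fastqc_output

# Use the Slurm allocation when present; otherwise assume 2 jobs.
jobs=${SLURM_CPUS_PER_TASK:-2}

# --dry-run prints one fastqc command per file without running it;
# -k keeps the printed commands in input order.
preview=$(parallel --dry-run -k -j "$jobs" "fastqc {} -o fastqc_output" ::: *.fastq.gz)
echo "$preview"
```

Once the printed commands look right, removing --dry-run runs them for real. Passing the glob after ::: also avoids parsing the output of ls, which can misbehave on file names that contain spaces.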

References

https://www.gnu.org/software/parallel/
https://docs.ycrc.yale.edu/clusters-at-yale/guides/parallel/