dbGaP Downloads
Downloading FASTQ Data Using dbGaP
dbGaP is a repository of data assessing the connection between genotypes and phenotypes. Here we discuss how to access this data from the Tufts HPC cluster.
- Obtain your dbGaP repository key (an ngc file) by logging into dbGaP and clicking "get dbGAP repository key".
- Next, navigate to the dbGaP SRA Run Selector, log in with your credentials, select the files you'd like to download, and click Accession List.
- Upload the ngc file and the accession list to the desired directory on the Tufts HPC cluster. For more information on how to log in to the cluster, visit: Navigate To The Cluster
- Load the tools needed to download your data:
module load sra/2.10.8
- Configure the SRA Toolkit:
vdb-config --interactive
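Before writing the batch script, it may help to sanity-check the uploaded accession list, since a list saved on Windows can carry carriage returns or blank lines that end up in the download commands. The following is a minimal sketch, not part of the original workflow: the function name clean_accession_list is hypothetical, and the pattern assumes run accessions of the form SRR, ERR, or DRR followed by digits.

```shell
# Print a cleaned copy of an accession list (carriage returns, trailing
# whitespace, and blank lines removed). Returns non-zero if any remaining
# line does not look like a run accession (assumed form: SRR/ERR/DRR + digits).
clean_accession_list() {
    list_file=$1
    cleaned=$(tr -d '\r' < "$list_file" | sed -e 's/[[:space:]]*$//' -e '/^$/d')
    printf '%s\n' "$cleaned"
    # grep -v selects lines that do NOT match the accession pattern
    if printf '%s\n' "$cleaned" | grep -Evq '^[SED]RR[0-9]+$'; then
        echo "warning: unexpected line(s) in $list_file" >&2
        return 1
    fi
    return 0
}

# Example use on the cluster:
#   clean_accession_list /path/to/accessionList.txt > accessionList.clean.txt
```

If the function reports a warning, inspect the list by hand before submitting the job.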
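- Now set up one of the following two batch scripts. The first uses GNU parallel to download up to four accessions at a time; the second passes the whole accession list to a single fastq-dump call: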
dbGAP_download.sh
#!/bin/bash
#SBATCH --job-name=dbGap
#SBATCH --time=07-00:00:00
#SBATCH --partition=largemem
#SBATCH --nodes=1
#SBATCH -c 8
#SBATCH --mem=110Gb
#SBATCH --output=%j.out
#SBATCH --error=%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=Your.Email@tufts.edu
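# Load the SRA Toolkit and GNU parallel
module load sra/2.10.8 parallel
# Using parallel: run up to 4 fastq-dump jobs at a time, one per line of the accession list.
# For fastq-dump, -X sets the maximum spot ID to dump; a very large value keeps every read.
parallel --jobs 4 "fastq-dump -X 9999999999999 --ngc /path/to/projectNgcFile.ngc --split-files --gzip {}" < /path/to/accessionList.txt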
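In the script above, parallel replaces {} with each line of the accession list. To see which commands that produces before committing to a week-long job, a small loop can print them without running anything. This is only an illustrative sketch; preview_commands is a hypothetical helper, and GNU parallel's own --dry-run option gives a similar preview.

```shell
# Print, without running, the fastq-dump command that would be
# issued for each accession in the list.
preview_commands() {
    ngc_file=$1
    list_file=$2
    # The "|| [ -n ... ]" part handles a last line with no trailing newline
    while IFS= read -r acc || [ -n "$acc" ]; do
        [ -n "$acc" ] || continue   # skip blank lines
        printf 'fastq-dump -X 9999999999999 --ngc %s --split-files --gzip %s\n' \
            "$ngc_file" "$acc"
    done < "$list_file"
}

# Example use:
#   preview_commands /path/to/projectNgcFile.ngc /path/to/accessionList.txt
```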
dbGAP_download.sh
#!/bin/bash
#SBATCH --job-name=dbGap
#SBATCH --time=07-00:00:00
#SBATCH --partition=largemem
#SBATCH --nodes=1
#SBATCH -c 8
#SBATCH --mem=110Gb
#SBATCH --output=%j.out
#SBATCH --error=%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=Your.Email@tufts.edu
module load sra/2.10.8
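# Not using parallel: pass every accession in the list to one fastq-dump call.
# Add --split-files (as in the parallel version) to write paired reads to separate files.
fastq-dump -X 9999999999999 --ngc /path/to/projectNgcFile.ngc --gzip $(</path/to/accessionList.txt)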
- To run your script, enter the following:
sbatch dbGAP_download.sh
- To check on the status of your job, enter the following:
squeue -u $USER
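Because the script names its logs with %j (the job ID), it can be handy to capture the ID that sbatch prints on submission. The sketch below assumes sbatch's usual confirmation message, "Submitted batch job <id>"; job_id_from_sbatch is a hypothetical helper name.

```shell
# Extract the numeric job ID from sbatch's confirmation message,
# e.g. "Submitted batch job 12345".
job_id_from_sbatch() {
    printf '%s\n' "$1" | sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

# Example use on the cluster (not run here):
#   msg=$(sbatch dbGAP_download.sh)
#   id=$(job_id_from_sbatch "$msg")
#   squeue -j "$id"             # status of just this job
#   tail "$id.out" "$id.err"    # logs named by --output=%j.out and --error=%j.err
```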
- dbGaP repositories can contain a lot of data, so if your job needs more than the seven days requested above, reach out to tts-research@tufts.edu to have it extended.
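Once the job ends, or hits its time limit, it is useful to know which accessions still lack output so that only those are resubmitted. The following sketch assumes fastq-dump's default output names, ACC_1.fastq.gz when --split-files is used and ACC.fastq.gz otherwise; missing_accessions is a hypothetical helper, and a non-empty file is not proof of a complete download.

```shell
# List accessions from list_file that have no non-empty FASTQ output in out_dir.
# Assumed naming: ACC_1.fastq.gz (with --split-files) or ACC.fastq.gz (without).
missing_accessions() {
    list_file=$1
    out_dir=$2
    while IFS= read -r acc || [ -n "$acc" ]; do
        [ -n "$acc" ] || continue   # skip blank lines
        if [ ! -s "$out_dir/${acc}_1.fastq.gz" ] && [ ! -s "$out_dir/$acc.fastq.gz" ]; then
            printf '%s\n' "$acc"
        fi
    done < "$list_file"
}

# Example use: write the remaining accessions to a new list for a follow-up job
#   missing_accessions /path/to/accessionList.txt . > remaining.txt
```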
Downloading Other dbGaP Data
- Obtain your dbGaP repository key (an ngc file) by logging into dbGaP and clicking "get dbGAP repository key".
- Next, navigate to the dbGaP SRA Run Selector, log in with your credentials, select the files you'd like to download, and click Cart File.
- Upload the ngc file and the cart file (.krt) to the desired directory on the Tufts HPC cluster. For more information on how to log in to the cluster, visit: Navigate To The Cluster
- Load the tools needed to download your data:
module load sra/2.10.8
- Configure the SRA Toolkit:
vdb-config --interactive
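A job that waits in the queue only to fail on a mistyped path is frustrating, so a quick check that the uploaded key and cart file are present can save time. This is a small sketch; require_files is a hypothetical helper, and the file names in the usage comment are the placeholders used in the batch script for this section.

```shell
# Return non-zero, naming each problem on stderr, if any of the
# given files is missing or empty.
require_files() {
    rc=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Example use before submitting (substitute your own file names):
#   require_files your_file.ngc your_cart_file.krt && sbatch dbGAP_download.sh
```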
- Now set up the following batch script:
dbGAP_download.sh
#!/bin/bash
#SBATCH --job-name=dbGap
#SBATCH --time=07-00:00:00
#SBATCH --partition=largemem
#SBATCH --nodes=1
#SBATCH -c 8
#SBATCH --mem=110Gb
#SBATCH --output=%j.out
#SBATCH --error=%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=Your.Email@tufts.edu
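module load sra/2.10.8
# Download the files listed in the cart; -X raises the maximum download size
prefetch -X 9999999999999 --ngc your_file.ngc cart_prj#####_###.krt
# Decrypt a downloaded encrypted file using the repository key
vdb-decrypt --ngc your_file.ngc enc_file.xml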
Note
Note that we add the option -X 9999999999999 to prefetch. This raises the maximum download size above the 20 GB default; without this option, larger files will not download.
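The value passed to -X is simply a very large number. If you would rather set a specific cap, the arithmetic below may help. It assumes, based on the prefetch help text for this toolkit series, that -X is read as a size in KB; confirm with prefetch --help on the cluster before relying on it. The helper name gb_to_kb is hypothetical.

```shell
# Convert a whole number of GB to KB (1 GB = 1024 * 1024 KB),
# the unit prefetch's -X option is assumed to use.
gb_to_kb() {
    echo $(( $1 * 1024 * 1024 ))
}

# The 20 GB default limit expressed in KB:
gb_to_kb 20      # prints 20971520

# Example use with a 100 GB cap (not run here):
#   prefetch -X "$(gb_to_kb 100)" --ngc your_file.ngc your_cart_file.krt
```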
- To run your script, enter the following:
sbatch dbGAP_download.sh
- To check on the status of your job, enter the following:
squeue -u $USER
- dbGaP repositories can contain a lot of data, so if your job needs more than the seven days requested above, reach out to tts-research@tufts.edu to have it extended.