Setup
Setting Up The Analysis Directory
To begin we will need a space to work, let's create a directory to house all of our input data, scripts and results:
mkdir rna_seq_pipeline
cd rna_seq_pipeline
Now let's make subfolders for our data, scripts and results:
mkdir data
mkdir tools
mkdir qc_output
mkdir trimmed_output
mkdir star_index
mkdir alignment_output
mkdir featurecounts_output
Creating A Conda Environment
For reproducible research it is advisable to keep the software versions you use consistent. An easy way of ensuring this is by creating a Conda environment. For more information on how to build conda environments check out:
Here, we will enter our tools directory and create a conda environment from the following yml file:
cd tools
wget https://raw.githubusercontent.com/BioNomad/omicsTrain/main/docs/omics/transcriptomics/bulk_rna_seq/data/rnaseq_environment.yml
Now, let's create the environment and activate it!
conda env create -f rnaseq_environment.yml # create conda environment
source activate rnaseq # activate conda environment
cd .. # leave tools directory
Downloading Fastq Read Data
Today we will be working with data from Srinivasan et al. 2020 where they assessed transcriptional changes in patients with and without Alzheimer's disease. Let's create an accession list to download a few files from this study:
cd data
nano accList.txt
accList.txt
SRR8440545
SRR8440550
SRR8440537
SRR8440481
Now we will need meta data for these samples. The following data was taken from the SRA Run Selector. SRA, or sequence read archive, is a public repository for sequence data which we are pulling from for this analysis.
nano meta.txt
meta.txt
ID Diagnosis Age Sex
SRR8440545 Control 53 male
SRR8440550 Control 81 male
SRR8440537 AD 74 female
SRR8440481 AD 79 male
Before we can download our data we will need to configure the sra-toolkit with the following command:
vdb-config -i
Click X
and you are finished configuring! To download the sequence data we will use the following command:
fastq-dump -N 100000 -X 200000 --skip-technical --split-3 --clip --gzip $(<./accList.txt)
Explanation of Terms
-N
start at read 100000-X
end at read 200000--skip-technical
skip technical reads and only download biological reads--split-3
split paired end data--clip
remove adapters--gzip
compress the output
Download Reference Data
Now to actually figure out what genes are expressed we need some sort of reference to map our reads too. We will be using the following reference:
To download the fasta file complete the following steps:
- Go to Homo sapiens chromosome 17, GRCh38.p14 Primary Assembly
- Click
Send To
- Click
File
- Change the format to
FASTA
- Click
Create File
- Move the downloaded file to the
rna_seq_pipeline/data
folder
Downloading the Reference FASTA File
To download the gff3 file complete the following steps:
- Go to Homo sapiens chromosome 17, GRCh38.p14 Primary Assembly
- Click
Send To
- Click
File
- Change the format to
GFF3
- Click
Create File
- Move the downloaded file to the
rna_seq_pipeline/data
folder
Downloading the Reference GFF3 File