Differential Expression

So far we have taken sequencing reads, removed poor quality sequences/adapters, aligned these reads to a reference genome, and counted the aligned reads based on whether or not they mapped to a gene/feature. What we are now left with is a counts table:

Reads To Counts - Adapted from HBC DGE Workshop

We will now take these counts and use a differential expression tool, DESeq2, to normalize our data to ensure that our counts are not biased by things like gene length, sequencing depth, etc. DESeq2 normalizes gene experssion data using the following procedure:

DESeq2 Normalization

DESeq2 also builds a regression model to assess if the slope between two conditions is different. This essentially will tell us, is the gene expression in one group (Alzheimer's patients) higher/lower than another group of patients (Control patients):

DESeq2 Regression Modelling

DESeq2 then will test if this slope is 0 (meaning no difference between Control and AD) or not 0 (meaning there is a difference between Control and AD) with the Wald statistic. This statistic will then be compared to a normal distribution to get a p-value!

DESeq2 Wald Statistic

Loading/Cleaning Data

We will now be working with R to analyze our data! So open RStudio and navigate to the folder rna_seq_pipeline. In that folder create a script called differential_expresson.R. We will start by adding the following to the script:

# --- Load Libraries -----------------------------------------------------------
library(tidyverse)         # data manipulation/plotting
library(DESeq2)            # differential expression
library(EnhancedVolcano)   # creating volcano plots
library(ggpubr)            # publication ready plotting
library(clusterProfiler)   # enrichment analysis

Now we need to some data cleaning. Our featureCounts table has gene information from the gtf file as the first few columns. This will need to be moved to a separate data frame. Additionally, our column names are currently the full name of the bam file. We will be removing the extra characters from these column names. Finally we will make sure our counts data and meta data have a matching sample order:

# --- Load Data  ---------------------------------------------------------------

# load meta data 
meta <- read.csv(
  file = "./path/to/rna_seq_pipeline/data/meta.txt",
  sep="\t"
)

# load featureCounts output
fc_counts <- read.csv(
  file = "./path/to/rna_seq_pipeline/featurecounts_output/featurecounts_results.txt",
  sep = "\t",
  skip = 1)

# isolate the feature data and clean up gene name column
features <- fc_counts %>%
  dplyr::select(names(fc_counts)[!grepl(".bam",names(fc_counts))]) %>%
  mutate(Gene=gsub("gene-","",Geneid))

# isolate the counts data 
counts <- fc_counts %>%
  dplyr::select(names(fc_counts)[grepl(".bam",names(fc_counts))])

# clean up names to remove file name info
names(counts) <- gsub(
  "alignment_output.|_trimmed_Aligned.sortedByCoord.out.bam",
  "",
  names(counts))

# order counts columns to match meta data
counts <- counts[,meta$ID]

# make the genes the rownames:
counts <- cbind(data.frame(Gene=features$Gene),counts)

Pre-Processing

Before we get started with DESeq2 we are going to filter out genes with all zero counts and factor our condition variable (Control v. AD). By factoring we specify there is an order to our condition variable with Control being the base:

# --- Pre-Processing -----------------------------------------------------------

# remove genes with all zero counts:
counts <- counts %>%
  filter(apply(counts[,-1],1,sum) != 0)

# factor our column of interest
meta$Diagnosis <- factor(
  meta$Diagnosis,
  levels = c("Control","AD")
)

Differential Expression

We will be applying a fairly standard DESeq2 pipeline, where we:

create a DESeqDataSet from the counts and meta data
specify the model formula Control v. AD
run DESeq2
shrink the log2FoldChanges to account for differences in standard error
convert our results to a data frame

# --- Differential Expression --------------------------------------------------
# create DESeq2 object
dds <- DESeqDataSetFromMatrix(countData=counts, 
                              colData=meta, 
                              design=~Diagnosis,
                              tidy = TRUE)

# run DESeq2 on data
dds <- DESeq(dds)

message("Differential Expression Analysis Complete")

# shrink the log2 fold changes
res <- lfcShrink(dds=dds, coef=2, type="apeglm")

# convert to a data frame
res <- data.frame(res)

Volcano Plot

To plot our differentially expressed genes, or DEGs, we will use a volcano plot or a plot of log2 fold change versus the -log10 of the p-value where genes towards the top are more significant:

# --- Volcano Plot -------------------------------------------------------------

# use the EnhancedVolcano library to plot our differentially expressed genes
EnhancedVolcano::EnhancedVolcano(
  toptable = res,                                          # name of results df
  x = "log2FoldChange",                                    # name of log2FC column
  y="pvalue",                                              # name of p-value column
  pCutoff = .1,                                            # p-value cutoff
  FCcutoff = 1e-06,                                        # fold change cutoff
  xlim = c(                                                # x axis limits
    min(res$log2FoldChange)*1.25,
    max(res$log2FoldChange)*1.25),
  ylim = c(0,                                              # y axis limits
           -log10(min(res$pvalue))*1.25),                 
  lab=rownames(res),                                       # gene names                     
  title = "Volcano Plot",                                  # title
  subtitle = "Differential Expression of AD RNA-Seq Data") # subtitle

Volcano Plot

PCA Plot

PCA or Principal Component Analysis is a way of visualizing the variation in high dimensional data in just a few dimensions. For more information, check out our tutorial on PCA. Let's examine the variation in our gene expression data and color by disease status:

# --- PCA Plot -----------------------------------------------------------------

# perform variance stabilization for pca
rld <- rlog(dds, blind=FALSE)

# plot pca plot 
plotPCA(object = rld,               # variance stabilized counts data
        intgroup=c("Diagnosis")) +  # variable to color by
  ggpubr::theme_pubr(               # ggplot theme to use
    legend = "right"                # where to put the legend
  )+
  labs(
    color="Diagnosis"               # legend color title
  )

PCA Plot

Top DEG Expression

Often times we may want to visualize the expression of our top differentially expressed genes. We can do this using a violin plot!

# --- Top DEG Expression -------------------------------------------------------

# grab the top DEG
top_deg <- res %>%
  arrange(pvalue) %>%
  slice_head(n=1) %>%
  rownames()

# plot this gene's expression split by your condition variable
ggpubr::ggviolin(
  data=meta %>%
    mutate(top_deg=assay(rld)[top_deg,]),
  x="Diagnosis",
  y="top_deg",
  fill="Diagnosis",
  color="Diagnosis",
  palette = "npg",
  xlab="",
  ylab = top_deg
)

Top DEG Plot

Enrichment Analysis

Our most significant DEGs often represent some biological change. We can better understand the biology behind our gene list by first filtering by some significance threshold (usually an adjusted p-value of 0.05, however given how subsampled our data is we are setting a p-value cutoff of .2) and then running an overrepresentation test on genes that are associated with some gene ontology term. Here we use clusterProfiler to perform gene ontology enrichment!

# --- ClusterProfiler Enrichment -----------------------------------------------

# enrich significant degs
enriched <- enrichGO(
  gene = res %>%                   # significant degs that pass some threshold
    dplyr::filter(pvalue < .2) %>%
    rownames(),
  OrgDb = 'org.Hs.eg.db',          # organism your samples belong to
  keyType = "SYMBOL",              # type of gene name
  ont = "ALL",                     # enrichment ontology
  pvalueCutoff = 0.1,              # pvalue cutoff
  pAdjustMethod = "none",          # pvalue adjustment method
  universe = rownames(dds),        # what other genes did you test?
  qvalueCutoff = 0.2,              # qvalue cutoff
  minGSSize = 2,                   # minimum number of genes per term
  maxGSSize = 800                  # maxiumum number of genes per term
)

# plot enrichment
enrichplot::dotplot(
  object = enriched,           # enrichment object
  x='Count',                   # x-axis 
  showCategory = 10,           # how many terms to sho
  size='Count',                # what to size by
  color ="p.adjust")+          # what to color by
  scale_color_gradient(
    low="midnightblue",        # change colors of dotplot
    high="thistle"
  )+
  labs(                         
    size="Count",              # change size legend title
    color="Adjusted P-Value"   # change color legend title
  )

Enrichment DotPlot

Differential Expression

Differential Expression

Loading/Cleaning Data

Pre-Processing

Differential Expression

Volcano Plot

PCA Plot

Top DEG Expression

Enrichment Analysis

References