Skip to content

Data Visualization

Creating a Barplot

Now do we know how to manipulate data frames we are going to use the ggplot package to plot our different meta data categories we will start by using the tabyl function to tabulate diagnosis by sex and then melt to ensure that categorical variables are in one separate columns while all values are in one column:

# tabulate diagnosis by sex and ensure categorical variables are in separate columns:
diagnosis_sex <- tabyl(merged,diagnosis,Sex) %>%
  reshape2::melt()
diagnosis_sex

output

                diagnosis variable value
1             Control   Female     6
2 Alzheimer's Disease   Female     7
3             Control     Male    12
4 Alzheimer's Disease     Male     8
5             Control      NA_     3
6 Alzheimer's Disease      NA_     6

Here we see that each variable has it's own column - this is necessary for plotting in ggplot:

ggplot(diagnosis_sex,         # data frame to use for plotting
      aes(x=diagnosis,       # set the x axis
          y=value,           # set the y axis
          fill=variable))+   # what column do we fill the bars by
 geom_bar(stat="identity")   # keep the count values as they are

Bar Plot

Fixing ggplot Colors

Above we created a ggplot bar plot! however, it isn't publication worth yet for that we will need to play around with the formatting let's change the colors first!

ggplot(diagnosis_sex,            # data frame to use for plotting
       aes(x=diagnosis,          # set the x axis
           y=value,              # set the y axis
           fill=variable))+      # what column do we fill the bars by
  geom_bar(stat="identity")+     # keep the count values as they are
  scale_fill_manual(values = c(
    c(                           # modify the colors manually
      "Female"="mediumvioletred",
      "Male"="lightsteelblue4",
      "NA_"="grey")
  ))

Bar Plot With Colors

Fixing ggplot themes

Now let's change the theme of the plot! A plot's theme is essentially the background styling. A common theme used in publications is the black and white theme:

ggplot(diagnosis_sex,            # data frame to use for plotting
       aes(x=diagnosis,          # set the x axis
           y=value,              # set the y axis
           fill=variable))+      # what column do we fill the bars by
  geom_bar(stat="identity")+     # keep the count values as they are
  scale_fill_manual(values = c(
    c(                           # modify the colors manually
      "Female"="mediumvioletred",
      "Male"="lightsteelblue4",
      "NA_"="grey")
  ))+
  theme_bw()                     # set the black and white theme   

Bar Plot With Themes

Fixing ggplot Labels

Now let's change the labels of the plot! We will rotate the x axis labels, change the font size, and change the label text!

barplot <- ggplot(
  diagnosis_sex,                        # data frame to use for plotting
       aes(x=diagnosis,                 # set the x axis
           y=value,                     # set the y axis
           fill=variable))+             # what column do we fill the bars by
  geom_bar(stat="identity")+            # keep the count values as they are
  scale_fill_manual(values = c(
    c(                                  # modify the colors manually
      "Female"="mediumvioletred",
      "Male"="lightsteelblue4",
      "NA_"="grey")
  ))+
  theme_bw()+                           # set the black and white theme  
  theme(
    axis.text.x = element_text(         # reference just the axis x text
      angle = 45,                       # rotate the text 45 degrees
      hjust = 1                         # move the text horizontally down by one
    ),
    text =element_text(size = 14)       # increase base text size to 14
  )+
  labs(
    x="",                               # change x axis title to no text
    y="Count",                          # change y axis title 
    fill="Sex",                         # change legend title
    title = ""                          # add a figure title
  )
barplot

Bar Plot With Themes

Creating A PCA Plot

PCA or Principal Component Analysis is a way of visualizing the variation in high dimensional data in just a few dimensions. For more information, check out our tutorial on PCA. Let's examine the variation in our gene expression data and color by disease status:

# run pca on just our counts data (exclude meta data)
pca_res <- prcomp(merged %>%                  # data frame to use
                    select(-names(.)[1:7]))   # unselecting meta data columns

# create a pca plot colored by smoking status using the autoplot function
# and modify it using our ggplot modifiers from above!
pca_plot <- autoplot(pca_res,           # the pca object
         data = merged,                 # the whole data frame (with meta data)
         colour = "diagnosis")+         # what variable to color by
  theme_bw()+                           # set the theme to black & white
  scale_color_manual(                   
    values=c(                           # modify the colors manually
      "Alzheimer's Disease"="mediumvioletred",
      "Control"="lightsteelblue4"))+
  labs(
    color="Diagnosis",                  # change the legend title
    title="PCA Plot Of Expression Data" # chage the figure title
  )
pca_plot

PCA Plot

Running DESeq2

Here we want to cover how to plot differential expression results without going into too much detail. Here we provide a wrapper to run DESeq2 on our counts/meta data:

runDESeq2 <- function(meta_data,
                      counts_data,
                      formula){
  # take a counts data frame with the genes as rownames
  # and the patients as column names and confirm: data is an integer
  # and that the gene name is the first column
  counts_data <- data.frame(gene=rownames(counts_data)) %>%
    cbind.data.frame(sapply(counts_data,as.integer)) %>%
    mutate(tmp_col=gene) %>%
    column_to_rownames("tmp_col")
  # create the deseq object
  dds <- DESeqDataSetFromMatrix(countData=counts_data, 
                                colData=meta_data, 
                                design=formula,
                                tidy = TRUE)
  # run deseq
  dds <- DESeq(dds)
  # shrink log fold changes
  res <- lfcShrink(dds, coef=2, type="apeglm")
  # get/return results data frame
  res <- data.frame(res)
  return(list(
    dds=dds,
    res=res
  ))
}

Let's run DESeq2 on our data and check out the results:

# now let's run DESeq2 on our data!
dds_res <- runDESeq2(
  meta_data = meta_filt,
  counts_data = counts_filt ,
  formula = as.formula("~ diagnosis"))

res <- dds_res$res
# let's take a look at the results
head(res)

DESeq2 Results

           baseMean log2FoldChange     lfcSE      pvalue       padj
AKT3     490.709649     -0.4670004 0.2361901 0.030085339 0.08116548
NR2E3      9.598653      0.6037522 0.2356439 0.004889092 0.05707112
NAALADL1   1.999027      0.7468250 0.5435370 0.042450489 0.09012843
SIGLEC14   2.625559      0.3236496 0.5752871 0.110457222 0.13128830
MIR708     1.504679      0.5881716 0.5340170 0.092049083 0.12021453
NAV2-AS6  27.774062      0.2249761 0.1803118 0.184228395 0.19262298

Cleaning Up Results

Above we have the DESeq2 results for our data. However, it is nice to include columns for the change direction (i.e. up or down regulated) and whether or not the results are significant. Let's add that information in!

# add columns for change direction, modify this column so that only genes
# with a p-value below 0.05 are colored, drop any genes with na 
# fold changes/pvalues and add in a gene column
res <- res %>%
  mutate(change_direction=ifelse(
    log2FoldChange>0 ,
    "Upregulated",
    "Downregulated"
  )) %>%
  mutate(change_direction=ifelse(
    pvalue<0.05 &
      abs(log2FoldChange) > 0.5,
    change_direction,
    "Not Significant"
  )) %>%
  drop_na()

# isolate top genes
top_genes <- res %>%
  filter(change_direction != "Not Significant") %>%
  arrange(pvalue) %>%
  slice_head(n=10) %>%
  rownames(.)

# add in column for top genes
res <- res %>%
  mutate(top_genes=ifelse(
    rownames(.) %in% top_genes,
    rownames(.),
    NA
  ))


# now let's take a look at the new columns!
head(res)

output

           baseMean log2FoldChange     lfcSE      pvalue       padj change_direction top_genes
AKT3     490.709649     -0.4670004 0.2361901 0.030085339 0.08116548  Not Significant      <NA>
NR2E3      9.598653      0.6037522 0.2356439 0.004889092 0.05707112      Upregulated      <NA>
NAALADL1   1.999027      0.7468250 0.5435370 0.042450489 0.09012843      Upregulated      <NA>
SIGLEC14   2.625559      0.3236496 0.5752871 0.110457222 0.13128830  Not Significant      <NA>
MIR708     1.504679      0.5881716 0.5340170 0.092049083 0.12021453  Not Significant      <NA>
NAV2-AS6  27.774062      0.2249761 0.1803118 0.184228395 0.19262298  Not Significant      <NA>

Create a Volcano Plot

A popular way to visualize differential expression results is a volcano plot or a plot of log2 fold change versus the -log10 of the p-value where genes towards the top are more significant:

Volcano Plot

Isolate Normalized Expression Data

DESeq2 normalizes your expression data as such if we intend to plot expression data we will need to use this expression data instead:

# extract normalized data from deseq results
counts_norm <- counts(dds_res$dds, normalized=TRUE) %>%
  t() %>%
  data.frame() %>%
  rownames_to_column("patient_id")

# ensure patients match
meta_patient_to_col <- meta_filt %>%
  rownames_to_column("patient_id")

# merge our data
merged_norm <- meta_patient_to_col %>%
  inner_join(
    .,
    counts_norm,
    by="patient_id"
  )

Create a Expression Box Plots

Often times we may want to visualize the expression of our top differentially expressed genes. We can do this using a box plot!

exp_plot <- ggplot(merged_norm,         # data frame to use                     
       aes(
         x=diagnosis,                   # x axis
         y=LINC02615,                   # y axis
         fill=diagnosis                 # what to fill by
       ))+
  geom_jitter(alpha = .5,width = .2)+
  geom_boxplot(alpha = 0.8)+
  scale_y_log10()+
  theme_bw()+
  scale_fill_manual(values = c(
    c(                                  # modify the colors manually
      "Alzheimer's Disease"="mediumvioletred",
      "Control"="lightsteelblue4")
  )) +
  theme(
    axis.text.x = element_text(         # reference just the axis x text
      angle = 45,                       # rotate the text 45 degrees
      hjust = 1                         # move the text horizontally down by one
    ),
    text =element_text(size = 14)       # increase base text size to 14
  )+
  labs(
    x="",                               # change x axis title to no text
    y="LINC02615 Expression",           # change y axis title 
    fill="Diagnosis",                   # change legend title
  )

exp_plot

Expression Plot

Creating A Heatmap

Along with our gene expression box plots, it is often useful to create a heatmap of our top differentially expressed genes. We will create one using the pheatmap function!

# isolate the top 15 degs
top_15 <- res %>% 
  filter(pvalue <0.05)%>%
  arrange(desc(log2FoldChange)) %>% 
  head(30) %>%
  rownames(.)

# filter DESeq2 normalized counts to select top 15 genes
top_15_counts <- counts(dds_res$dds, normalized=TRUE) %>%
  data.frame() %>%
  filter(rownames(.) %in% top_15) 

# set the colors for our heatmeap annotation
anot_colors <- list(Diagnosis=c(                               
  "Alzheimer's Disease"="mediumvioletred",
  "Control"="lightsteelblue4"))

# select dianosis and make the patient id the rownames
anot_col <- merged_norm %>%
  column_to_rownames("patient_id") %>%
  select(c(diagnosis)) %>%
  arrange(diagnosis) %>%
  dplyr::rename("Diagnosis" = "diagnosis")

heat <- pheatmap::pheatmap(                     # heatmap function
  top_15_counts[,rownames(anot_col)],           # data to input 
  color=colorRampPalette(                       # set specific color range
    c("navy", "white", "red"))(50),
  scale = "row",                                # scale by row
  fontsize = 12,                                # set the font size
  annotation_col =  anot_col,                   # set the annotation variable
  annotation_colors = anot_colors,              # set annotation variables
  cluster_cols = F,                             # do not cluster by column
  cluster_rows = F,                             # do not cluster by row
  show_colnames = F,                            # do not show column names
  main="")                                      # set a main title
heat <- as.ggplot(heat)

Heatmap

Using Images in ggplot

Images can also be insterted into a ggplot let's add an image that describes the study design:

# read in image
img <- readPNG("../data/ad_overview.png", native = TRUE)
# convert to a grob
img <- rasterGrob(img, interpolate=TRUE)
# add grob to a ggplot 
gg_img <- ggplot()+annotation_custom(img, xmin=-Inf, xmax=Inf, ymin=-Inf, ymax=Inf)+theme_void()
gg_img

Study Design Overview

Combining Plots

Side By Side

We can now combine these into a combined figure with the patchwork package!

# use "|" to set plot side by side
top_row <- gg_img|exp_plot
top_row

Top Row Of Figure

Bottom/Top Rows

# define a top and bottom row using "/"
left_side <- top_row/volcano_plot
left_side

Left Side Of Figure

Putting It All Together

# define right side of figure
right_side <- heat

# combine left and right sides
combined <- left_side | right_side

combined

Combined Figure

Adding Labels

# now add at title, caption, and annotation levels starting at A
combined_with_labels <- combined+
  plot_annotation(
    tag_levels = 'A',
    title = "Alzheimer's Disease Neuronal Differential Expression",
    caption = "For Educational Purposes Only"
  ) &
  theme(plot.tag = element_text(size = 20),
        plot.title = element_text(size = 25),
        plot.caption = element_text(size = 20))

combined_with_labels

Combined Figure

Saving Figures

# we can save our new ggplot with the ggsave function:
ggsave(
  filename = "./results/combined_figures.png",
  plot = combined_with_labels
)