Hierarchical Clustering
Agglomerative Hierarchical Clustering
Agglomerative hierarchical clustering is a bottom-up approach: each observation starts as its own cluster, and clusters are then merged into larger and larger clusters until there is one root cluster.
This "hierarchical" view of these clusters is called a dendrogram. Here we will discuss Ward's Method for merging these clusters as it is one of the most popular:
\[D_{12} = \frac{||\overline{x_1} - \overline{x_2}||^2}{\frac{1}{N_1}+\frac{1}{N_2}}\]
Explanation of Terms
- \(D_{12}\): the distance between clusters 1 and 2
- \(N_1\): the number of points in cluster 1
- \(N_2\): the number of points in cluster 2
- \(\overline{x_1}\): the mean of cluster 1
- \(\overline{x_2}\): the mean of cluster 2
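To make the formula concrete, here is a minimal sketch in R that plugs two toy one-dimensional clusters (hypothetical values, purely for illustration) directly into the definition above:

# two toy clusters (hypothetical values)
x1 <- c(1, 2, 3)   # cluster 1, N1 = 3
x2 <- c(8, 9)      # cluster 2, N2 = 2
# squared distance between the cluster means,
# divided by (1/N1 + 1/N2)
D12 <- (mean(x1) - mean(x2))^2 / (1 / length(x1) + 1 / length(x2))
D12  # 50.7 - the "cost" of merging these two clusters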
Pre-Processing
Before we apply hierarchical clustering we will need to create our distance matrix:
# load the libraries
.libPaths(c("/cluster/tufts/hpc/tools/R/4.0.0"))
library(tidyverse)
library(factoextra)
# load our counts data
counts <- read.csv(
  file = "data/gbm_cptac_2021/data_mrna_seq_fpkm.txt",
  header = TRUE,
  sep = "\t")
# make the genes our rownames
rownames(counts) <- make.names(counts$Hugo_Symbol, unique = TRUE)
# remove the gene symbol column
counts <- counts %>%
  select(-c(Hugo_Symbol))
# log2 transform our data and
# transpose it so that our patients are rows
counts <- t(log2(counts + 1))
# replace non-finite values (NA/NaN/Inf introduced above) with 0
counts[!is.finite(counts)] <- 0
# generate the Pearson correlation distance matrix
dist <- get_dist(counts, method = "pearson")
# plot the correlation distance matrix
fviz_dist(dist) +
  theme(axis.text = element_text(size = 3)) +
  labs(
    title = "Pearson Correlation Distances Between Samples",
    fill = "Pearson correlation distance"
  )
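As a quick sanity check, factoextra's Pearson method is documented as one minus the correlation coefficient, so the distance object should match 1 - cor() computed between samples. A short verification, assuming counts has patients as rows as set up above:

# get_dist(counts, method = "pearson") should equal 1 - cor(t(counts)):
# highly correlated samples get distances near 0
all.equal(as.matrix(dist), 1 - cor(t(counts)), check.attributes = FALSE)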
Clustering with Ward's method
Let's apply this in R!
# apply ward's clustering
hc <- hclust(d = dist, method = "ward.D2")
# visualizing the dendrogram
# and color by k number of clusters
fviz_dend(hc,
          k = 4,
          k_colors = c("#1B9E77", "#D95F02", "#7570B3", "#E7298A"))
Info
- here we see that each sample starts as its own cluster and is gradually merged into larger clusters
- we chose to visualize 4 clusters, but this is really up to your discretion; the sketch below shows how to pull out the actual cluster assignments
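To go from the picture to actual cluster assignments, we can cut the tree at a chosen number of clusters with base R's cutree(); a minimal sketch using the hc object from above, with k = 4 mirroring the dendrogram:

# assign each sample to one of k = 4 clusters
clusters <- cutree(hc, k = 4)
# how many samples fall in each cluster?
table(clusters)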
Hierarchical Clustering Shortcomings
Hierarchical clustering does come with a few issues:
- Hierarchical clustering is computationally expensive and much slower than the k-means algorithm.
- While this method is less sensitive to the shape of the data, since it builds clusters up from individual data points, the dendrogram can be difficult to interpret, and where to draw the line for cluster membership is not well defined (one heuristic for picking the cut is sketched below).
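One common heuristic for deciding where to cut the dendrogram is to score candidate numbers of clusters by their average silhouette width. A minimal sketch using factoextra's fviz_nbclust() with hierarchical cutting (hcut) on the pre-processed counts matrix from above; note that by default this computes Euclidean distances internally rather than the Pearson distance we used earlier, so treat it as one heuristic among several:

# score k = 2..10 by average silhouette width;
# hcut wraps hclust() + cutree() for use with fviz_nbclust()
fviz_nbclust(counts, hcut, method = "silhouette", k.max = 10)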